PySpark Wholesome Tutorial - Links to refer, PDFs


✅ Step 8: Final Result Collection or Write

  • If .collect(): data is pulled to the driver
  • If .write(): data is written from executors to storage (HDFS, S3, DB)

Here's a deep dive into Step 8: Final Result Collection or Write in Apache Spark, covering:

  • What happens when results are collected or written
  • Differences in Databricks vs. on-prem BDL (Big Data Lake) Spark
  • Memory, network, disk behavior
  • Failure scenarios and performance tips

Operation  | Description
.collect() | Brings result to Driver node memory
.write()   | Writes result from Executors to external sink (e.g. file, DB)

⚠️ .collect() – Result Collection to Driver

🔍 What happens:

  1. Spark triggers a job.
  2. Tasks run on executors.
  3. Results are pulled back to the driver node via network.
  4. Returned as a list/array in local driver memory.
result = df.filter(...).collect()

🔥 Internals:

  • Uses network (not HDFS) to transfer data from executors.
  • If the result size exceeds spark.driver.memory → 💥 OutOfMemoryError
  • Can overwhelm the driver if the result is more than a few MB.

✅ Use when:

  • You need small datasets (like < 100K rows).
  • For debugging, testing, or local export.

❌ Avoid when:

  • Result is large (millions of rows or >1GB).
  • You are working in a distributed compute flow.
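
If you only need a bounded sample on the driver, prefer take() or limit().collect(); to stream rows without materializing everything at once, toLocalIterator() also works. A minimal sketch (the filter condition and the process() helper are hypothetical):

# Pull at most 100 rows back to the driver
preview = df.filter("amount > 1000").take(100)

# Stream rows to the driver one partition at a time instead of all at once
for row in df.filter("amount > 1000").toLocalIterator():
    process(row)  # process() is a placeholder for your own row handling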

βš™οΈ Common Configs:

Config Name                | Description
spark.driver.memory        | Memory available on the driver to hold .collect() results
spark.driver.maxResultSize | Max result size allowed (default: 1g)
--conf spark.driver.maxResultSize=2g
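
The same limit can also be set when the session is built. A minimal sketch (the value is illustrative; spark.driver.memory itself generally has to be set before the driver JVM starts, e.g. via spark-submit or cluster config):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("CollectLimits")
    .config("spark.driver.maxResultSize", "2g")  # cap on the total size of results returned to the driver
    .getOrCreate()
)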

✅ .write() – Write Results to External Storage

This is the most production-safe way to handle large outputs.

df.write.mode("overwrite").parquet("s3://my-bucket/output/")

πŸ” What happens:

  1. Data is partitioned across executors.
  2. Each task writes its portion of data:
    • HDFS / DBFS / S3 / ADLS
    • JDBC / Hive / Delta Lake
  3. Metadata (schema, partition info) is optionally written.
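
A partitioned write, for example, adds one directory level per distinct value of the partition column. A minimal sketch (the path and column name are hypothetical):

# One directory per country value; each task writes its own part files in parallel
(df.write
   .mode("overwrite")
   .partitionBy("country")
   .option("compression", "snappy")
   .parquet("s3://my-bucket/output/partitioned/"))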

✳️ Output Paths and Formats

Format     | File Types     | Parallel Write | Splittable | Typical Use
Parquet    | .parquet       | ✅ | ✅ | Analytics, ML
ORC        | .orc           | ✅ | ✅ | Hive + compression
Delta Lake | .parquet + txn | ✅ | ✅ | ACID tables in Databricks
CSV        | .csv           | ✅ | ❌ | Raw exports
JDBC       | -              | ❌ | ❌ | One row at a time
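
The same DataFrame can be sent to any of these sinks. A minimal sketch (paths and JDBC connection details are hypothetical, and a JDBC write assumes the driver jar is on the classpath):

df.write.mode("overwrite").parquet("hdfs:///out/parquet/")
df.write.mode("overwrite").orc("hdfs:///out/orc/")
df.write.mode("overwrite").option("header", True).csv("hdfs:///out/csv/")

# JDBC: rows are inserted into the target table via batched INSERT statements
(df.write
   .format("jdbc")
   .option("url", "jdbc:postgresql://dbhost:5432/sales")
   .option("dbtable", "public.orders_out")
   .option("user", "etl")
   .option("password", "****")
   .mode("append")
   .save())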

✅ Behavior in Databricks vs. On-Prem BDL

🟦 In Databricks

  • Uses DBFS or ADLS/S3 paths: df.write.format("delta").save("/mnt/datalake/output/")
  • Uses Delta Lake format by default for tables
  • Can write to:
    • Managed/External Delta Tables
    • Unity Catalog tables
    • SQL Warehouses

Example:

df.write.format("delta").mode("overwrite").saveAsTable("sales_summary")

Benefits:

  • Auto-commits via Delta Transaction Log
  • Can use VACUUM, OPTIMIZE, ZORDER post-write
  • Fully scalable + trackable via job UI
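
Typical post-write maintenance on a Delta table looks like the following sketch (table name taken from the example above; VACUUM's default retention is 7 days):

# Remove data files no longer referenced by the Delta transaction log
spark.sql("VACUUM sales_summary")

# Inspect the commit history recorded by the transaction log
spark.sql("DESCRIBE HISTORY sales_summary").show(truncate=False)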

🟩 In On-Prem BDL or Hadoop Cluster

  • Uses HDFS or NFS-mounted output paths: df.write.format("orc").save("/data/warehouse/output/")
  • Needs correct permissions + HDFS quotas
  • More manual cleanup and checkpointing

🧠 Behind the Scenes: Write Path

  1. Executor tasks write files in parallel (1 file per partition).
  2. Temporary files written in _temporary directory.
  3. Upon success:
    • _SUCCESS marker file is added
    • Temp files renamed and committed
  4. Driver finalizes metadata (for Hive/Delta)
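
On object stores the final rename/commit step can be slow; the Hadoop output committer behavior can be tuned at session start. A hedged sketch (this is a Hadoop-side setting passed through Spark, shown as an assumption rather than a blanket recommendation):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("CommitTuning")
    # algorithm v2 commits task output directly to the destination (fewer renames,
    # but partially written output can remain visible if the job fails mid-write)
    .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
    .getOrCreate()
)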

⚠️ Failure Points

Point of Failure            | What Happens                  | Fix
Driver crash during collect | Job fails                     | Avoid .collect() on large data
Network timeout             | File write stalls             | Use retries or increase spark.network.timeout
Partition skew              | Some partitions take too long | Use repartition() or salting
Too many small files        | Metadata overhead in storage  | Use coalesce() or repartition()
Misused overwrite           | Data loss                     | Use mode("append") for incremental writes
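
A minimal sketch of the last two fixes (the paths and the new_rows DataFrame are hypothetical):

# Fewer, larger output files: coalesce reduces the partition count without a full shuffle
df.coalesce(8).write.mode("overwrite").parquet("hdfs:///warehouse/daily_snapshot/")

# Incremental loads: append preserves what was written earlier
new_rows.write.mode("append").parquet("hdfs:///warehouse/daily_snapshot/")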

✅ Best Practices

Goal                        | Recommendation
Small sample to local       | df.limit(100).collect()
Full export to disk/S3/HDFS | .write.format(...).save(...)
Reduce number of files      | df.coalesce(1) or repartition(n)
Save with schema & ACID     | Use Delta Lake format in Databricks
SQL insert                  | df.write.insertInto("table") or saveAsTable
Avoid driver memory issues  | NEVER .collect() large DataFrames

🧪 Example: Complete Script in Databricks

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WriteExample").getOrCreate()

df = spark.read.option("header", True).csv("/mnt/raw/sales.csv")

df_clean = df.filter("amount > 1000").withColumnRenamed("amount", "total")

# Write as a Delta Table
df_clean.write.format("delta").mode("overwrite").saveAsTable("sales_summary")

# Optional: Optimize
spark.sql("OPTIMIZE sales_summary ZORDER BY (region)")

🧪 Example: BDL On-Prem Script (HDFS Output)

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("OnPremWrite") \
    .getOrCreate()

df = spark.read.json("hdfs:///data/in/orders")

df2 = df.select("order_id", "user_id", "amount")

# Repartition to reduce small files
df2.repartition(4).write \
    .mode("overwrite") \
    .parquet("hdfs:///warehouse/processed/orders/")


✅ Step 9: Job/Stage/Task Completion

Spark shows the progress in:

  • Spark UI (Driver web UI) – available at http://<driver-host>:4040
  • Databricks: Task DAG + Job metrics
  • Includes: Stages, Tasks, Shuffle Reads/Writes, GC Time, Input Size, etc.
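
From a running session you can find the UI address directly:

# uiWebUrl points at the driver web UI (typically port 4040, or 4041+ if 4040 is taken)
print(spark.sparkContext.uiWebUrl)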

✅ Step 10: Cleanup

  • Driver process exits
  • Executors shut down
  • Cluster resources are released
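
In a standalone script this is triggered explicitly:

# Stops the SparkContext, shuts down executors and releases cluster resources
spark.stop()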

🧠 Diagram: Spark Architecture (Simplified)

+---------------------+     submits job to     +----------------------+
| Your Spark Program  |----------------------->| Spark Driver         |
+---------------------+                        +----------------------+
                                                      |
              +---------------------------------------+
              |
              v
     +--------------------+
     |  DAG Scheduler     |    → divides into stages
     +--------------------+
              |
              v
     +--------------------+     → tasks created per partition
     | Task Scheduler     |
     +--------------------+
              |
              v
+--------------------+  +--------------------+   +--------------------+
|  Executor on Node1 |  | Executor on Node2  |   | Executor on NodeN  |
+--------------------+  +--------------------+   +--------------------+

✅ Example: Real Job Flow

df = spark.read.csv("data.csv")          # Stage 1: File read
df2 = df.filter("value > 100")           # Still Stage 1 (narrow)
df3 = df2.groupBy("category").count()    # Stage 2 (wide, shuffle)
df3.write.parquet("out/")                # Stage 3: File write

πŸ” Monitoring Spark Execution

  • Driver UI (port 4040): Stages, DAG, Storage, Executors
  • Databricks: "View" → "Job/Task Graph"
  • Use .explain() on DataFrames to inspect the physical plan.
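
For example, on the aggregation from the job-flow snippet above:

# Prints the physical plan; mode="formatted" (Spark 3+) gives a more readable layout
df3.explain(mode="formatted")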

⚠️ Spark Optimization Tips

Task                    | Recommendation
Shuffle joins           | Use broadcast() for small table
Partition skew          | Use salting or repartition()
Memory management       | Avoid .collect() on large datasets
Resource allocation     | Tune executor memory & cores
Caching reused datasets | Use .cache() or .persist()
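
A minimal sketch of the broadcast-join and caching tips (DataFrame and column names are hypothetical):

from pyspark.sql.functions import broadcast

# Broadcast the small dimension table so the join avoids shuffling the large side
joined = orders_df.join(broadcast(country_dim_df), on="country_code", how="left")

# Cache a DataFrame that several downstream actions will reuse
joined.cache()
joined.count()  # first action materializes the cache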

