- Uses HDFS or NFS-mounted output paths, e.g. `df.write.format("orc").save("/data/warehouse/output/")` (see the sketch below)
- Requires correct filesystem permissions and sufficient HDFS quotas
- Requires more manual cleanup and checkpointing than a managed platform
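A minimal sketch of the on-prem ORC write, assuming hypothetical input and output HDFS paths; the explicit save mode is an addition that avoids "path already exists" failures on reruns:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("OrcToHdfs").getOrCreate()

# Hypothetical input path; adjust to your cluster's layout.
df = spark.read.parquet("hdfs:///data/warehouse/input/")

# Explicit mode avoids 'path already exists' errors on reruns.
# The writing user needs write permission and free HDFS quota on the target.
df.write.format("orc").mode("overwrite").save("/data/warehouse/output/")
```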
🧠 Behind the Scenes: Write Path

- Executor tasks write output files in parallel (one file per partition).
- Files are first written to a `_temporary` directory.
- Upon success:
  - a `_SUCCESS` marker file is added,
  - the temporary files are renamed and committed to the final path,
  - the driver finalizes table metadata (for Hive/Delta).
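A minimal sketch (the path and DataFrame are illustrative) of what a committed output directory typically looks like; the `_SUCCESS` marker appears only after all task files have been renamed into place:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WritePathDemo").getOrCreate()

# Illustrative DataFrame; each of the 4 partitions becomes one output file.
df = spark.range(0, 1000).repartition(4)

# Executors write part files under a _temporary directory first,
# then the committer renames them into the final path on success.
df.write.mode("overwrite").parquet("/tmp/write_path_demo/")

# After a successful commit the directory typically contains:
#   _SUCCESS
#   part-00000-...snappy.parquet
#   part-00001-...snappy.parquet
#   ...
```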
⚠️ Failure Points

| Point of Failure | What Happens | Fix |
| --- | --- | --- |
| Driver crash during collect | Job fails | Avoid `.collect()` on large data |
| Network timeout | File writes stall | Use retries or increase `spark.network.timeout` |
| Partition skew | Some partitions take far longer than others | Use `repartition()` or key salting (see the sketch after this table) |
| Too many small files | Metadata overhead in storage | Use `coalesce()` or `repartition()` |
| Overwrite mode misuse | Data loss if `mode("overwrite")` hits the wrong path | Use `mode("append")` for incremental writes |
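A minimal sketch of key salting for a skewed aggregation, using an illustrative DataFrame with one hot `user_id`; the salt spreads the hot key across more shuffle partitions before a final combine:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("SaltingDemo").getOrCreate()

# Illustrative skewed data: most rows share the same user_id.
df = spark.createDataFrame(
    [("u1", i) for i in range(10_000)] + [("u2", 1), ("u3", 2)],
    ["user_id", "amount"],
)

# Add a random salt so the hot key is split across many shuffle partitions.
salted = df.withColumn("salt", (F.rand() * 16).cast("int"))

# Aggregate on (key, salt) first, then combine the partial results per key.
partial = salted.groupBy("user_id", "salt").agg(F.sum("amount").alias("partial_sum"))
result = partial.groupBy("user_id").agg(F.sum("partial_sum").alias("total"))

result.show()

# For flaky networks, the timeout from the table can be raised at session build time:
#   SparkSession.builder.config("spark.network.timeout", "600s")
```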
✅ Best Practices

| Goal | Recommendation |
| --- | --- |
| Small sample to local | `df.limit(100).collect()` |
| Full export to disk/S3/HDFS | `.write.format(...).save(...)` |
| Reduce the number of output files | `df.coalesce(1)` or `df.repartition(n)` |
| Save with schema enforcement & ACID guarantees | Use the Delta Lake format in Databricks |
| SQL insert | `df.write.insertInto("table")` or `saveAsTable()` (see the sketch after this table) |
| Avoid driver memory issues | Never `.collect()` large DataFrames |
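A minimal sketch contrasting the two table-oriented writes from the table above; the table name `sales_summary` and the sample row are illustrative, and `insertInto` assumes the target table already exists with a matching column order:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("TableWriteDemo").getOrCreate()

# Illustrative data and table name.
df = spark.createDataFrame([(1, "east", 1200.0)], ["order_id", "region", "total"])

# saveAsTable creates (or overwrites) a managed table using the DataFrame's schema.
df.write.mode("overwrite").saveAsTable("sales_summary")

# insertInto appends into an existing table; columns are matched by position, not name.
df.write.insertInto("sales_summary")
```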
🧪 Example: Complete Script in Databricks

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WriteExample").getOrCreate()

# Read raw CSV from a mounted path
df = spark.read.option("header", True).csv("/mnt/raw/sales.csv")

# Basic cleanup: keep large orders and rename the amount column
df_clean = df.filter("amount > 1000").withColumnRenamed("amount", "total")

# Write as a Delta table
df_clean.write.format("delta").mode("overwrite").saveAsTable("sales_summary")

# Optional: compact files and co-locate data by region
spark.sql("OPTIMIZE sales_summary ZORDER BY (region)")
```
🧪 Example: BDL On-Prem Script (HDFS Output)

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("OnPremWrite") \
    .getOrCreate()

# Read JSON input from HDFS
df = spark.read.json("hdfs:///data/in/orders")

# Keep only the columns needed downstream
df2 = df.select("order_id", "user_id", "amount")

# Repartition to reduce small files, then write Parquet to HDFS
df2.repartition(4).write \
    .mode("overwrite") \
    .parquet("hdfs:///warehouse/processed/orders/")
```
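Continuing the session from the script above, an optional read-back check (a sketch, not part of the original pipeline) confirms the committed output is readable:

```python
# Optional sanity check: read the committed output back and inspect it.
written = spark.read.parquet("hdfs:///warehouse/processed/orders/")
written.printSchema()
print(written.count())
```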
✅ Step 9: Job/Stage/Task Completion

Spark shows the progress in:
- The Spark UI (driver web UI), available at `http://<driver-host>:4040`
- Databricks: task DAG and job metrics
- Includes stages, tasks, shuffle reads/writes, GC time, input size, etc.
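A minimal sketch, assuming an active SparkSession, of reading the same progress information programmatically via PySpark's status tracker instead of the web UI:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ProgressDemo").getOrCreate()
sc = spark.sparkContext

# URL of the driver web UI (normally port 4040 for the first app on a host).
print(sc.uiWebUrl)

# Programmatic view of the same job/stage information shown in the UI.
tracker = sc.statusTracker()
print(tracker.getActiveJobsIds())   # job IDs currently running
print(tracker.getActiveStageIds())  # stage IDs currently running
```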
✅ Step 10: Cleanup

- The driver process exits
- Executors shut down
- Cluster resources are released
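In a standalone script, the cleanup above is triggered by stopping the session; a minimal sketch:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CleanupDemo").getOrCreate()
# ... job logic ...

# Stopping the session shuts down executors and releases cluster resources.
# (In Databricks notebooks, the platform manages the session lifecycle for you.)
spark.stop()
```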
🧠 Diagram: Spark Architecture (Simplified)

```
+---------------------+    submits job to     +----------------------+
| Your Spark Program  |---------------------->|     Spark Driver     |
+---------------------+                       +----------------------+
                                                          |
                                  +-----------------------+
                                  |
                                  v
                         +--------------------+
                         |   DAG Scheduler    |  ← divides the job into stages
                         +--------------------+
                                  |
                                  v
                         +--------------------+
                         |   Task Scheduler   |  ← creates tasks per partition
                         +--------------------+
                                  |
                                  v
  +--------------------+  +--------------------+  +--------------------+
  | Executor on Node1  |  | Executor on Node2  |  | Executor on NodeN  |
  +--------------------+  +--------------------+  +--------------------+
```