In PySpark, optimizing transformations is crucial for performance, especially when working with large datasets. Here's a breakdown of best practices for broadcasting, caching, partitioning, and related Spark operations, with an emphasis on correct ordering and the reasoning behind it:
Broadcast vs Cache: Which First?
Best Practice: Broadcast Before Cache
- Reason: Apply the broadcast hint before caching, because Spark caches the query plan as it exists at the moment you call cache().
- If you cache first and add the broadcast hint afterwards, Spark will not re-optimize the cached plan with the hint unless the DataFrame is recomputed.
# Correct
small_df = spark.read.parquet("small_table").hint("broadcast")
small_df.cache()
# Incorrect
small_df = spark.read.parquet("small_table").cache()
small_df = small_df.hint("broadcast") # Hint might not be effective now
Best Practices for Optimizing PySpark Code
1. Use broadcast() for small lookup tables
- Avoids large shuffles in joins
- Ideal for dimension tables (roughly 10MB–100MB or smaller)
from pyspark.sql.functions import broadcast
df_joined = large_df.join(broadcast(small_df), "key")
2. Persist/Cache only when reused
- Use .cache() or .persist() when a DataFrame is reused across multiple stages.
- Don't cache everything: Spark stores cached data in memory, which can cause eviction or spills.
df.cache() # Use only if df is used multiple times
3. Filter Early (Predicate Pushdown)
- Apply .filter() or .where() as early as possible to reduce data size.
- Especially effective when the filter can be pushed down to the source (e.g., Parquet, JDBC); see the sketch below.
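A minimal sketch of filtering right after the read, assuming a hypothetical events dataset stored as Parquet with year and status columns:
# Filtering immediately after the read lets Spark push the predicate
# down into the Parquet scan instead of reading everything first.
events = (
    spark.read.parquet("/data/events")            # hypothetical path
         .where("year = 2024 AND status = 'active'")
         .select("user_id", "event_type", "ts")   # prune columns early too
)
events.explain(True)  # pushed filters show up in the Parquet scan node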
4. Avoid UDFs where possible
- UDFs are black boxes: Spark can't optimize them.
- Prefer built-in functions (pyspark.sql.functions) or SQL expressions; see the comparison below.
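A small illustrative comparison, assuming a hypothetical DataFrame df with a string column name:
from pyspark.sql import functions as F
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# UDF version: rows are serialized to Python one by one, opaque to the optimizer
upper_udf = udf(lambda s: s.upper() if s is not None else None, StringType())
df_with_udf = df.withColumn("name_upper", upper_udf("name"))

# Built-in version: stays in the JVM and benefits from Catalyst optimizations
df_with_builtin = df.withColumn("name_upper", F.upper("name"))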
5. Repartition Intelligently
- Avoid a full .repartition() unless necessary; it triggers a full shuffle.
- Use .coalesce(n) to reduce the number of partitions without a shuffle.
- Use .repartition("key") before wide transformations like joins; a short sketch follows.
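A minimal sketch of the three patterns above (df and the output path are hypothetical):
# Reduce partition count before a final write without a full shuffle
df.coalesce(8).write.mode("overwrite").parquet("/output/final")

# Repartition by the join key so matching keys land in the same partition
df_by_key = df.repartition("key")

# Full repartition to a fixed count: only when an even reshuffle is really needed
df_even = df.repartition(200)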
6. Use Column Pruning
- Select only required columns early:
df.select("col1", "col2")
- Reduces serialization, shuffle, and memory usage.
7. Avoid Narrow-to-Wide Transformations
- Avoid code like:
df1.join(df2, "key").groupBy("key").agg(...)
- Instead, pre-aggregate before the join if possible, as sketched below.
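A hedged sketch of pre-aggregating before a join, assuming df2 holds many detail rows per key and only a per-key total is needed (column names are hypothetical):
from pyspark.sql import functions as F

# Aggregate the detail table down to one row per key first...
df2_agg = df2.groupBy("key").agg(F.sum("amount").alias("total_amount"))

# ...then join the much smaller aggregate, shrinking the shuffle
result = df1.join(df2_agg, "key")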
8. Use Bucketing and Partitioning for Hive/Delta Tables
- Helps Spark skip unnecessary data.
- Use .bucketBy() or .partitionBy() when writing large tables; see the sketch below.
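A minimal sketch, assuming a hypothetical sales DataFrame written as managed tables:
# Partition by a low-cardinality column so Spark can skip whole directories
(sales.write
      .partitionBy("year")
      .mode("overwrite")
      .format("parquet")
      .saveAsTable("sales_partitioned"))

# Bucket by the join key; bucketBy() requires saveAsTable (a managed table)
(sales.write
      .bucketBy(16, "customer_id")
      .sortBy("customer_id")
      .mode("overwrite")
      .format("parquet")
      .saveAsTable("sales_bucketed"))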
9. Materialize Intermediate Steps
- For complex pipelines, materialize intermediate results with cache(), persist(), .checkpoint(), or by writing to disk with .write.format("delta"); a sketch follows.
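One way to materialize a heavy intermediate result to storage and read it back (names and the path are hypothetical):
intermediate_path = "/tmp/pipeline/stage1"  # hypothetical location

# Persist the expensive intermediate result...
stage1_df.write.mode("overwrite").format("parquet").save(intermediate_path)

# ...and continue the pipeline from the materialized copy,
# which cuts the lineage of everything upstream
stage1_df = spark.read.parquet(intermediate_path)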
10. Avoid collect()/count() on Large Data
- collect() pulls the entire dataset to the driver and can cause OOM errors; count() triggers a full scan of the data.
- Use take(n) instead of collect() for sampling; see the example below.
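A quick illustration of driver-safe alternatives (df is hypothetical):
rows = df.take(10)           # brings only 10 rows to the driver
small = df.limit(1000)       # stays distributed; collect later only if small
df.show(20, truncate=False)  # inspect a few rows without collecting everything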
Bonus Tip: Use explain() to Analyze the DAG
df.explain(True)
- Use it to verify whether a broadcast join is actually applied
- Understand the physical plan and detect unnecessary shuffles; see the note below
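On Spark 3.0+, explain() also accepts a mode string, which is often easier to read than the full extended output:
df.explain(mode="formatted")  # sectioned physical plan; look for BroadcastHashJoin
df.explain(mode="cost")       # includes statistics when available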
Here's a clean, concise mini reference table comparing cache vs persist vs broadcast vs checkpoint in PySpark:
PySpark Optimization: Comparison Table
Feature | Purpose | Storage Level | Triggers Execution? | Use Case |
---|---|---|---|---|
cache() | Store DataFrame in memory (default) | MEMORY_AND_DISK | No (lazy; needs an action) | Reuse the same DataFrame multiple times when it fits mostly in memory |
persist() | Custom storage levels (e.g., disk-only) | Customizable | No (lazy; needs an action) | When memory is insufficient or a specific storage level is needed |
broadcast() | Send a small DataFrame to all executors | N/A | Only when the join executes | Speed up joins with small dimension tables |
checkpoint() | Break lineage, save to reliable storage | HDFS (or configured dir) | Yes (eager by default) | Long DAGs, avoiding recomputation, recovering from faults |
Detailed Notes
cache()
- Shortcut for .persist(StorageLevel.MEMORY_AND_DISK)
- Good default when using the DataFrame 2+ times
- Use after filtering or transforming to avoid redoing work
persist(StorageLevel)
from pyspark import StorageLevel
df.persist(StorageLevel.DISK_ONLY)
- Choose memory, disk, or memory+disk+serialization
- Use when memory is limited
broadcast()
from pyspark.sql.functions import broadcast
df_large.join(broadcast(df_small), "key")
- Efficient for joining a large DataFrame with a small one
- Avoids shuffle
- Should be used before caching
checkpoint()
spark.sparkContext.setCheckpointDir("/tmp/checkpoints")
df.checkpoint()
- Used to truncate long lineage chains
- Helps with stability and performance in iterative algorithms (e.g., graph processing, ML)
- Requires HDFS or a distributed file system
When to Use What
Scenario | Use |
---|---|
Reusing a DataFrame many times | cache() or persist() |
Small lookup table in a join | broadcast() |
Running into stage failure/lineage issues | checkpoint() |
Memory-constrained environment | persist(StorageLevel.DISK_ONLY) |
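Putting the table above together, a hedged end-to-end sketch (all table names, paths, and columns are hypothetical):
from pyspark.sql.functions import broadcast

spark.sparkContext.setCheckpointDir("/tmp/checkpoints")

dim_df = spark.read.parquet("/data/dim_country")           # small lookup table
fact_df = spark.read.parquet("/data/fact_sales")           # large fact table

enriched = fact_df.join(broadcast(dim_df), "country_id")   # broadcast the small side
enriched.cache()                                           # reused by both aggregations below

daily = enriched.groupBy("date").count()
by_country = enriched.groupBy("country_id").count()

stable = enriched.checkpoint()  # truncate lineage if the DAG grows too deep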
Yes, absolutely! You can and should optimize writes in both vanilla PySpark and Databricks. Below are write optimization techniques, split into general (vanilla) PySpark and Databricks-specific approaches:
WRITE OPTIMIZATIONS IN VANILLA PYSPARK
1. Repartition Before Write
- Prevents too many small files
- Example:
df.repartition(10).write.parquet("/output/path")
2. Coalesce to Reduce Output Files
- Use .coalesce(n) when you want fewer output partitions/files
- Ideal for appends or final writes
df.coalesce(1).write.mode("overwrite").csv("/output/single_file")
3. Partition By Columns
- Use .partitionBy() on low-cardinality columns that you frequently filter on; partitioning on high-cardinality columns creates an explosion of tiny files
df.write.partitionBy("year", "country").parquet("/partitioned/output")
4. Use Efficient Formats
- Prefer Parquet, ORC, or Delta over CSV/JSON
df.write.format("parquet").save("/parquet_path")
5. Avoid the Small-Files Problem
- Write with repartition() or coalesce() so each task produces reasonably sized files instead of thousands of tiny ones; a config-based sketch follows
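Another lever, available since Spark 2.2: cap the rows written per file so one huge partition does not become a single giant file (the threshold below is just an example value):
spark.conf.set("spark.sql.files.maxRecordsPerFile", 1_000_000)

(df.repartition("year")
   .write.mode("overwrite")
   .partitionBy("year")
   .parquet("/output/right_sized"))   # hypothetical path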
ADDITIONAL WRITE OPTIMIZATIONS IN DATABRICKS
Databricks offers Delta Lake and platform-level features:
1. OPTIMIZE + ZORDER (Delta Lake only)
- Compacts small files and co-locates related data for faster reads
OPTIMIZE delta.`/delta/events` ZORDER BY (event_date, user_id)
2. Auto Compaction + Optimized Writes
- Enable in Delta Lake to reduce small files
SET spark.databricks.delta.autoCompact.enabled = true;
SET spark.databricks.delta.optimizeWrite.enabled = true;
3. Schema Evolution
- Enable auto schema merge (only Delta supports it)
df.write.option("mergeSchema", "true").format("delta").mode("overwrite").save("/path")
4. Use Databricks Runtime File Writers
- Faster than open-source Spark for writes
- Databricks Runtime uses Photon, vectorized IO, and write optimizations under the hood
SUMMARY TABLE
Technique | Vanilla PySpark | Databricks | Format |
---|---|---|---|
repartition(n) / coalesce(n) | ✅ | ✅ | All |
partitionBy() | ✅ | ✅ | All |
OPTIMIZE + ZORDER | ❌ | ✅ | Delta |
Auto-compaction | ❌ | ✅ | Delta |
Optimized Write | ❌ | ✅ | Delta |
Schema Evolution | Limited | ✅ | Delta |
Format: Parquet/Delta | ✅ | ✅ | All |
Yes, absolutely! Compression is one of the most effective and most often overlooked write optimizations in both vanilla PySpark and Databricks.
COMPRESSION OPTIONS IN PYSPARK
You can specify compression using .option("compression", "<codec>") when writing data.
Compression Types by Format
Format | Supported Compression | Default Compression |
---|---|---|
Parquet | snappy, gzip, brotli, lz4, zstd, uncompressed | snappy |
CSV | gzip, bzip2, lz4, deflate, none | none |
JSON | Same as CSV | none |
ORC | zlib, snappy, none | zlib |
Delta | Uses Parquet compression (snappy by default) | snappy |
Example: Write CSV with GZIP
df.write.option("compression", "gzip").csv("/path/csv_gzip")
Example: Write Parquet with ZSTD
df.write.option("compression", "zstd").parquet("/path/parquet_zstd")
Set Global Compression via Spark Config
spark.conf.set("spark.sql.parquet.compression.codec", "zstd")
spark.conf.set("spark.sql.orc.compression.codec", "snappy")
When to Use Which Compression?
Compression | Best For | Trade-offs |
---|---|---|
snappy | Balanced: speed + size | Fastest; slightly larger files |
gzip | Maximum compression | Slower to write/read |
zstd | High compression + speed | Newer; often best overall |
brotli | Text-heavy data | Slower; best for web/text data |
none | Testing or full I/O speed | Large files; not optimized |
TIP: Delta Format
When using Delta, compression is handled via Parquet, so you can still configure it globally:
spark.conf.set("spark.sql.parquet.compression.codec", "zstd")
df.write.format("delta").save("/delta/path")
Real-World Best Practices
- Use snappy or zstd for Parquet/Delta in most workloads.
- Use gzip if file size matters more than speed.
- Set compression globally in Spark configs for consistency.
- Don't use compression "none" in production unless required.