These 10 tuning steps are proven practices that help prevent or mitigate OOM (Out of Memory) errors in PySpark, especially when working with large datasets in distributed environments like Databricks, EMR, or on-prem Spark clusters.
Let’s walk through how each step contributes to overcoming OOM issues, in simpler, clear terms:
✅ 1. Tune Spark Memory Settings
spark.conf.set("spark.executor.memory", "8g")
spark.conf.set("spark.driver.memory", "4g")
- By increasing executor memory, you’re giving more heap space to each worker to process partitions, joins, and aggregations.
- OOMs often happen when shuffles or cached data overflow the default memory limit.
- Also ensure spark.executor.memoryOverhead is set properly for PySpark (the default is the larger of 384 MB or 10% of executor memory) — Python worker processes live in this off-heap space.
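As a sketch, the memory settings above can be set together when building the session (executor sizing is a launch-time setting, so spark.conf.set on a running session won't change it). The values and app name here are illustrative, not defaults:

```python
from pyspark.sql import SparkSession

# Memory sizing must be configured before the session starts.
spark = (
    SparkSession.builder
    .appName("oom-tuning-demo")                      # hypothetical app name
    .config("spark.executor.memory", "8g")           # heap per executor
    .config("spark.driver.memory", "4g")             # driver heap
    .config("spark.executor.memoryOverhead", "2g")   # off-heap room for Python workers
    .getOrCreate()
)
```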
✅ 2. Enable Dynamic Allocation
spark.conf.set("spark.dynamicAllocation.enabled", "true")
- If your job temporarily needs more parallelism, Spark auto-provisions extra executors (if the cluster has room).
- Helps when load spikes mid-job: extra executors spread the work instead of overloading a fixed set.
- Prevents over-provisioning memory for smaller jobs.
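Dynamic allocation works best with explicit bounds, and without an external shuffle service it needs shuffle tracking enabled. The min/max values below are illustrative:

```python
# Bounds keep allocation predictable; tune to your cluster size.
spark.conf.set("spark.dynamicAllocation.enabled", "true")
spark.conf.set("spark.dynamicAllocation.minExecutors", "2")
spark.conf.set("spark.dynamicAllocation.maxExecutors", "20")
# Needed when no external shuffle service is available:
spark.conf.set("spark.dynamicAllocation.shuffleTracking.enabled", "true")
```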
✅ 3. Adaptive Query Execution (AQE)
spark.conf.set("spark.sql.adaptive.enabled", "true")
- Dynamically changes join types (e.g., broadcast if small table detected), and re-optimizes stages based on data at runtime.
- Prevents memory-heavy shuffle joins by replacing them with more efficient strategies.
- Reduces skew and large intermediate stages that may crash due to memory pressure.
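Alongside the main AQE flag, two sub-features (Spark 3.x) directly target the memory problems described above:

```python
spark.conf.set("spark.sql.adaptive.enabled", "true")
# Merge many tiny shuffle partitions into fewer right-sized ones:
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
# Split oversized skewed partitions automatically at runtime:
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
```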
✅ 4. Always Enforce Schema
df = spark.read.schema(schema).json(path)
- Avoids expensive schema inference, especially with complex nested files.
- Prevents OOMs during the metadata scan phase or malformed file loading.
- Helps Spark parallelize reads better by knowing column types upfront.
✅ 5. Repartition Wisely
df = df.repartition(200, "country")
- Too few partitions → large partitions → high memory use → OOM 💥
- Too many small partitions → overhead, more tasks to manage
- Finding a balance (like 128–256 MB per partition) reduces executor memory load.
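One way to pick a partition count is simple arithmetic against the 128–256 MB rule of thumb above. This helper is a sketch, not a Spark API:

```python
def target_partitions(dataset_bytes: int, target_mb: int = 128) -> int:
    """Rough partition count so each partition lands near target_mb."""
    target = target_mb * 1024 * 1024
    # Ceiling division, with a floor of 1 partition.
    return max(1, -(-dataset_bytes // target))

# A 50 GB dataset at ~128 MB per partition:
print(target_partitions(50 * 1024**3))  # → 400
# df = df.repartition(target_partitions(50 * 1024**3), "country")
```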
✅ 6. Handle Skewed Joins
df1.withColumn("join_key_salted", ...)
- If some keys appear too often (e.g., a “null” or “India” in millions of rows), they cause skew, leading to one massive task and OOM.
- Salting breaks up skewed keys into “subkeys” so data gets evenly distributed, reducing memory spike in any single executor.
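The effect of salting can be seen without a cluster. This plain-Python sketch splits a hot key into subkeys; in PySpark you would add the salt column with F.rand() (and replicate the small side of the join across all salt values):

```python
import random
from collections import Counter

random.seed(42)  # deterministic for the example

# Simulated skew: one hot key dominates the join column.
rows = ["India"] * 9000 + ["Nepal"] * 500 + ["Bhutan"] * 500

NUM_SALTS = 10  # hypothetical fan-out; size it to the observed skew

# Salting: append a random suffix so the hot key becomes NUM_SALTS subkeys,
# each landing on a different task instead of one giant task.
salted = [f"{key}_{random.randrange(NUM_SALTS)}" for key in rows]

print(max(Counter(rows).values()))    # largest unsalted key: 9000 rows in one task
print(max(Counter(salted).values()))  # largest salted subkey: roughly 900
```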
✅ 7. Selective Caching
df.persist(StorageLevel.MEMORY_AND_DISK)
- Only cache if:
- You use the same DataFrame more than once
- It fits in memory or can spill to disk
- Unused or oversized caching is a common OOM cause; Spark tries to keep cached data in memory unless you specify MEMORY_AND_DISK.
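How much room cached data can claim is governed by Spark's unified memory settings (launch-time settings, shown here with their defaults); releasing a cache you no longer need is just as important:

```python
# Execution and storage share spark.memory.fraction of the usable heap.
spark.conf.set("spark.memory.fraction", "0.6")         # default 0.6
spark.conf.set("spark.memory.storageFraction", "0.5")  # default 0.5: storage's protected share

# Release the cache as soon as the DataFrame is no longer reused:
# df.unpersist()
```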
✅ 8. Broadcast Joins
large_df.join(broadcast(small_df), "user_id")
- Instead of shuffling both sides of a join, you broadcast the smaller one to all executors.
- Avoids massive shuffle files, memory-heavy sort-merge joins.
- Very effective when the small table is under 10 MB, the default spark.sql.autoBroadcastJoinThreshold (which you can raise if needed).
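Raising the threshold lets larger dimension tables auto-broadcast; the 50 MB value below is illustrative. The explicit broadcast() hint bypasses the threshold entirely:

```python
# Default threshold is 10 MB; the value is in bytes.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(50 * 1024 * 1024))  # 50 MB, illustrative

# from pyspark.sql.functions import broadcast
# large_df.join(broadcast(small_df), "user_id")  # explicit hint ignores the threshold
```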
✅ 9. Use Spark UI
- Spark UI → your best friend to debug memory issues:
- Stage timelines
- Shuffle read/write stats
- Task duration spikes
- Skewed tasks
- Cached storage usage
- It shows whether tasks are spilling to disk, retrying, or failing due to GC overhead.
✅ 10. Partition When Writing Data
df.write.partitionBy("year", "month").parquet(path)
- Saves data logically organized, reducing memory & compute when later reading only a subset.
- Avoids massive file loads when only part of the data is needed.
- Especially critical in wide tables or historical event logs.
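A sketch of the read side, assuming the year/month layout above: filtering on partition columns lets Spark skip whole directories (partition pruning). The overwrite-mode setting is useful when rewriting individual partitions:

```python
# Only the year=2024/month=6 directory would be scanned; the rest are pruned.
# recent = spark.read.parquet(path).where("year = 2024 AND month = 6")

# Overwrite only the partitions present in the incoming data,
# instead of truncating the whole table:
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
```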
✅ Conclusion
👉 Yes — these steps will significantly reduce the chances of OOM errors, especially in the following scenarios:
| Scenario | Memory-safe strategy |
|---|---|
| Huge joins or aggregations | Broadcast join, AQE, salting, repartitioning |
| Large file loads | Use schema, avoid .collect() or .toPandas() |
| Reusing data multiple times | Cache with care, only if it fits |
| Skewed data | Salt join keys, monitor in Spark UI |
| Large output write | Use partitioned write to avoid large shuffles |