# REASON: Join operations can cause executor Out Of Memory (OOM) errors when:
# – One side of the join is too large to fit in executor memory (e.g. a table that gets broadcast despite not being small, or the build side of a shuffle hash join).
# – Skewed keys cause data to pile up in specific partitions.
# – Spark performs a wide transformation and shuffles too much data.
# HOW TO AVOID OOM IN JOINS:
# 1. Use Broadcast Join (if one side is small enough)
from pyspark.sql.functions import broadcast
result = large_df.join(broadcast(small_df), "join_key")
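# Note: broadcast() is only appropriate for genuinely small tables; broadcasting a
# large table can itself cause driver/executor OOM. An equivalent string-hint form:
result = large_df.join(small_df.hint("broadcast"), "join_key")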
# 2. Repartition both sides by the join key so they are partitioned the same way
#    (note: this spreads distinct keys across partitions, but a single hot key still
#    lands in one partition — see the salting step below for that case)
df1 = df1.repartition("join_key")
df2 = df2.repartition("join_key")
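# A variant (the partition count 400 is illustrative, not a recommendation):
# pass an explicit number of partitions when the default shuffle parallelism is a poor fit
df1 = df1.repartition(400, "join_key")
df2 = df2.repartition(400, "join_key")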
# 3. Select only the columns you need before the join to shrink the shuffled data
df1_trimmed = df1.select("join_key", "needed_col1")
df2_trimmed = df2.select("join_key", "needed_col2")
# 4. Handle skew by salting: add a random salt to the skewed side and replicate the
#    other side across every salt value, so each original key still matches
from pyspark.sql.functions import rand, concat_ws, explode, array, lit
num_salts = 10
# Skewed side: append a random salt in [0, num_salts)
df1 = df1.withColumn("salted_key", concat_ws("_", df1["join_key"], (rand() * num_salts).cast("int")))
# Other side: replicate each row once per salt value so every salted key has a match
df2 = df2.withColumn("salt", explode(array(*[lit(i) for i in range(num_salts)])))
df2 = df2.withColumn("salted_key", concat_ws("_", df2["join_key"], df2["salt"]))
# Then join on salted_key
result = df1.join(df2, "salted_key")
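# An aside, assuming Spark 3.x: Adaptive Query Execution can split skewed shuffle
# partitions automatically, which often removes the need for manual salting
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")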
# 5. Tune spark.sql.autoBroadcastJoinThreshold wisely (default is 10 MB; -1 disables auto-broadcast)
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 50 * 1024 * 1024)  # 50 MB
# 6. Monitor partition sizes and optimize memory settings
# Avoid shuffling massive partitions; tune spark.sql.shuffle.partitions
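# A minimal sketch of the relevant knobs (values are illustrative, not prescriptions):
spark.conf.set("spark.sql.shuffle.partitions", 400)  # default is 200; raise for very large shuffles
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")  # Spark 3.x: merge tiny shuffle partitions
# Per-task shuffle read sizes are visible in the Spark UI's Stages tab after a run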
# 7. Cache intermediate DataFrames if reused
df1.cache()
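# Caching occupies executor memory itself, so release it once the reuse is done
df1.unpersist()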
```
**Summary:**
OOM during joins typically occurs due to shuffling large or skewed data. Mitigate it using broadcast joins, repartitioning, filtering columns, salting for skew, tuning configs, and caching when needed.