How joins work internally, by default which join is used, which side bigger table should be used left or right, does it matter in spark
Excellent question! Let’s break down how joins work internally in Spark, which is the default, and whether left/right side matters—especially from a performance perspective.
🔧 1. How Joins Work Internally in Spark
When you do a .join()
in PySpark or Spark SQL, Spark can execute it using one of these physical join strategies:
🔹 1. Broadcast Hash Join (BHJ)
- Spark broadcasts the smaller table to all executors.
- Fastest join type, avoids shuffling the big table.
- Triggered automatically if one side is smaller than a threshold (
spark.sql.autoBroadcastJoinThreshold
, default ~10MB).
🔹 2. Sort-Merge Join (SMJ)
- Both tables are shuffled and sorted by the join key.
- Default for large tables with equality joins when no broadcast is possible.
- Efficient for big joins but has high shuffle cost.
🔹 3. Shuffle Hash Join
- When one table is small enough to fit in memory and SMJ is not ideal.
- Not common; usually a fallback join strategy.
🔹 4. Cartesian Join
- No join condition is given (very expensive).
- Every row from one table is joined with every row from another.
✅ 2. Default Join in PySpark
- The default logical join is:
df1.join(df2, on="id") → INNER JOIN
- The default physical plan used is:
- Broadcast Hash Join → if Spark detects a small table.
- Sort-Merge Join → otherwise.
📌 3. Which Side Should Be the Bigger Table?
Does Left vs Right Matter?
Technically:
- No, for correctness: Spark produces the correct output regardless of left/right.
- Yes, for performance: in broadcast joins, it’s better to:
🔸 Put the smaller table on the right side of the join.
Why? Because Spark tends to broadcast the right table by default.
Example:
# Good: df_big LEFT JOIN df_small
df_big.join(broadcast(df_small), "id", "left")
Spark Hint:
You can force broadcast using:
from pyspark.sql.functions import broadcast
df_big.join(broadcast(df_small), "id")
🧠 Summary Table
Concept | Explanation |
---|---|
Default Join Type | INNER JOIN |
Default Join Strategy | Broadcast if possible, else Sort-Merge |
When to use Broadcast | Small dimension or lookup table (<10MB by default) |
Left vs Right – Does it matter? | ✅ For performance (right side is broadcasted by default) |
Best Practice | Broadcast the small table, cache if reused, use join hints if needed |
Leave a Reply