How joins work internally, by default which join is used, which side bigger table should be used left or right, does it matter in spark

Excellent question! Let’s break down how joins work internally in Spark, which is the default, and whether left/right side matters—especially from a performance perspective.


🔧 1. How Joins Work Internally in Spark

When you do a .join() in PySpark or Spark SQL, Spark can execute it using one of these physical join strategies:

🔹 1. Broadcast Hash Join (BHJ)

  • Spark broadcasts the smaller table to all executors.
  • Fastest join type, avoids shuffling the big table.
  • Triggered automatically if one side is smaller than a threshold (spark.sql.autoBroadcastJoinThreshold, default ~10MB).

🔹 2. Sort-Merge Join (SMJ)

  • Both tables are shuffled and sorted by the join key.
  • Default for large tables with equality joins when no broadcast is possible.
  • Efficient for big joins but has high shuffle cost.

🔹 3. Shuffle Hash Join

  • When one table is small enough to fit in memory and SMJ is not ideal.
  • Not common; usually a fallback join strategy.

🔹 4. Cartesian Join

  • No join condition is given (very expensive).
  • Every row from one table is joined with every row from another.

✅ 2. Default Join in PySpark

  • The default logical join is: df1.join(df2, on="id") → INNER JOIN
  • The default physical plan used is:
    • Broadcast Hash Join → if Spark detects a small table.
    • Sort-Merge Join → otherwise.

📌 3. Which Side Should Be the Bigger Table?

Does Left vs Right Matter?

Technically:

  • No, for correctness: Spark produces the correct output regardless of left/right.
  • Yes, for performance: in broadcast joins, it’s better to:

🔸 Put the smaller table on the right side of the join.
Why? Because Spark tends to broadcast the right table by default.

Example:

# Good: df_big LEFT JOIN df_small
df_big.join(broadcast(df_small), "id", "left")

Spark Hint:

You can force broadcast using:

from pyspark.sql.functions import broadcast
df_big.join(broadcast(df_small), "id")

🧠 Summary Table

ConceptExplanation
Default Join TypeINNER JOIN
Default Join StrategyBroadcast if possible, else Sort-Merge
When to use BroadcastSmall dimension or lookup table (<10MB by default)
Left vs Right – Does it matter?✅ For performance (right side is broadcasted by default)
Best PracticeBroadcast the small table, cache if reused, use join hints if needed

Pages: 1 2 3 4 5 6 7


Discover more from HintsToday

Subscribe to get the latest posts sent to your email.

Posted in

Leave a Reply

Discover more from HintsToday

Subscribe now to keep reading and get access to the full archive.

Continue reading