Here are all 4 types of joins in PySpark — inner, left, right, and full outer — with duplicate key behavior clearly illustrated.
✅ Step 1: Setup Sample DataFrames
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("JoinExamples").getOrCreate()
# Duplicate join keys on both sides: id=1 appears twice in each DataFrame
data_a = [(1, "Alice"), (1, "Asha"), (2, "Bob")]
data_b = [(1, "Mumbai"), (1, "Delhi"), (3, "Bangalore")]
df_a = spark.createDataFrame(data_a, ["id", "name"])
df_b = spark.createDataFrame(data_b, ["id", "city"])
1️⃣ Inner Join (default)
df_inner = df_a.join(df_b, on="id", how="inner")
df_inner.show()
✅ Output:
+---+-----+------+
| id| name|  city|
+---+-----+------+
|  1|Alice|Mumbai|
|  1|Alice| Delhi|
|  1| Asha|Mumbai|
|  1| Asha| Delhi|
+---+-----+------+
All `id=1` rows from both sides are matched → 4 rows total.
2️⃣ Left Join
df_left = df_a.join(df_b, on="id", how="left")
df_left.show()
✅ Output:
+---+-----+------+
| id| name|  city|
+---+-----+------+
|  1|Alice|Mumbai|
|  1|Alice| Delhi|
|  1| Asha|Mumbai|
|  1| Asha| Delhi|
|  2|  Bob|  null|
+---+-----+------+
All rows from `df_a` are preserved; `id=2` gets NULL in `city`.
3️⃣ Right Join
df_right = df_a.join(df_b, on="id", how="right")
df_right.show()
✅ Output:
+---+-----+---------+
| id| name|     city|
+---+-----+---------+
|  1|Alice|   Mumbai|
|  1|Alice|    Delhi|
|  1| Asha|   Mumbai|
|  1| Asha|    Delhi|
|  3| null|Bangalore|
+---+-----+---------+
All rows from `df_b` are preserved; `id=3` gets NULL in `name`.
4️⃣ Full Outer Join
df_outer = df_a.join(df_b, on="id", how="outer")
df_outer.show()
✅ Output:
+---+-----+---------+
| id| name|     city|
+---+-----+---------+
|  1|Alice|   Mumbai|
|  1|Alice|    Delhi|
|  1| Asha|   Mumbai|
|  1| Asha|    Delhi|
|  2|  Bob|     null|
|  3| null|Bangalore|
+---+-----+---------+
All rows from both sides are included; `id=2` and `id=3` are filled with NULLs where unmatched.
🧠 Summary of Join Syntax in PySpark
df_a.join(df_b, on="id", how="inner") # Only matching
df_a.join(df_b, on="id", how="left") # Keep all from left (df_a)
df_a.join(df_b, on="id", how="right") # Keep all from right (df_b)
df_a.join(df_b, on="id", how="outer") # Keep all from both
Now consider a more complex, conditional join:

df1.join(
    df2,
    when(df1.id == df2.id, df1.id).otherwise(df1.id_null) == df2.id
)

Can the same complex join condition be written in Spark SQL?
Yes, the same complex join condition can absolutely be written in Spark SQL, but with a slightly different syntax than the PySpark DataFrame API.
🧾 What You’re Doing in PySpark
Your PySpark code:
from pyspark.sql.functions import when

df1.join(
    df2,
    when(df1.id == df2.id, df1.id).otherwise(df1.id_null) == df2.id
)
This is equivalent to a conditional join: you compare `df1.id` to `df2.id`, but if that fails, fall back to `df1.id_null == df2.id`.
✅ Equivalent Spark SQL Join
To do this in Spark SQL, you need to register your DataFrames as temporary views and then write a SQL query like:
df1.createOrReplaceTempView("df1")
df2.createOrReplaceTempView("df2")
spark.sql("""
SELECT *
FROM df1
JOIN df2
ON CASE
WHEN df1.id = df2.id THEN df1.id
ELSE df1.id_null
END = df2.id
""")
🔍 Explanation:
- The `CASE WHEN` inside the `ON` clause replicates the `when(...).otherwise(...)` logic.
- It evaluates the condition and applies the right value for the join.
⚠️ Important Notes
- Spark SQL has no `when().otherwise()` construct, but `CASE WHEN ... END` inside the `ON` clause is the correct equivalent.
- If you’re doing left joins, you can still use the same logic and write `LEFT JOIN` instead of `JOIN`.
🚀 Bonus — With LEFT JOIN
SELECT *
FROM df1
LEFT JOIN df2
ON CASE
WHEN df1.id = df2.id THEN df1.id
ELSE df1.id_null
END = df2.id
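For completeness, here is a runnable DataFrame-API version of the same fallback join. The sample rows and the `id_null` column are invented purely for illustration (reusing the `spark` session from Step 1):

from pyspark.sql.functions import when

# Hypothetical data: id is the primary join key, id_null is the fallback key
df1 = spark.createDataFrame([(1, 10, "a"), (5, 3, "b")], ["id", "id_null", "val"])
df2 = spark.createDataFrame([(1, "x"), (3, "y")], ["id", "other"])

# Join on id when it matches; otherwise fall back to id_null
cond = when(df1.id == df2.id, df1.id).otherwise(df1.id_null) == df2.id
df1.join(df2, cond, how="left").show()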
Spark SQL joins when data has duplicate values in the join columns
When you perform a join in Spark SQL (or PySpark DataFrame API), and the join keys have duplicate values in one or both DataFrames, Spark performs a cartesian multiplication of the matching rows — this is expected behavior and is standard SQL semantics.
🔍 Example: Join with Duplicates
Table A
| id | name  |
|----|-------|
| 1  | Alice |
| 1  | Asha  |
Table B
| id | city      |
|----|-----------|
| 1  | Mumbai    |
| 1  | Bangalore |
SQL Join:
SELECT * FROM A JOIN B ON A.id = B.id
🧮 Result:
This will produce 4 rows due to the 2×2 Cartesian match:
| A.id | name  | B.id | city      |
|------|-------|------|-----------|
| 1    | Alice | 1    | Mumbai    |
| 1    | Alice | 1    | Bangalore |
| 1    | Asha  | 1    | Mumbai    |
| 1    | Asha  | 1    | Bangalore |
⚠️ Why This Happens
This is not a bug — it’s how relational joins work:
- If both tables have `n` and `m` rows with the same key, the join returns `n × m` rows for that key (verified in the sketch below).
- This applies to inner join, left join, right join, and full outer join (each with its own rules).
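A quick sanity check in PySpark (a minimal sketch, reusing the `spark` session from Step 1) confirms the `n × m` behavior:

# 2 rows with id=1 on each side -> 2 x 2 = 4 rows after the join
a = spark.createDataFrame([(1, "Alice"), (1, "Asha")], ["id", "name"])
b = spark.createDataFrame([(1, "Mumbai"), (1, "Bangalore")], ["id", "city"])
print(a.join(b, on="id", how="inner").count())  # 4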
✅ How to Handle It
✅ If You Want Only One Match Per Key:
Use one of the following techniques:
1. Drop Duplicates Before Join
SELECT *
FROM (SELECT DISTINCT id, name FROM A) a
JOIN (SELECT DISTINCT id, city FROM B) b
ON a.id = b.id
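The DataFrame-API counterpart (a sketch, reusing `df_a` and `df_b` from Step 1) is a `dropDuplicates` on each side before the join:

# Deduplicate each side, then join
a_dedup = df_a.dropDuplicates(["id", "name"])
b_dedup = df_b.dropDuplicates(["id", "city"])
a_dedup.join(b_dedup, on="id", how="inner")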
2. Use Row Numbering and Select First Match
WITH ranked_b AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY city) AS rn
FROM B
)
SELECT a.*, b.city
FROM A a
JOIN ranked_b b
ON a.id = b.id AND b.rn = 1
This gives one row from B per key, joined to all matching rows in A (or vice versa).
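In the DataFrame API, the same idea is a window with `row_number` (a sketch):

from pyspark.sql import Window
from pyspark.sql.functions import row_number

# Rank B rows per id, keep only the first, then join
w = Window.partitionBy("id").orderBy("city")
ranked_b = df_b.withColumn("rn", row_number().over(w))
df_a.join(ranked_b.filter("rn = 1").drop("rn"), on="id")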
3. Aggregate before joining
If you’re just interested in a summary (e.g., count, max, first):
SELECT a.*, b.first_city
FROM A a
JOIN (
SELECT id, FIRST(city, TRUE) AS first_city
FROM B
GROUP BY id
) b
ON a.id = b.id
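The DataFrame-API version of this aggregation (a sketch) uses `groupBy` with `first`:

from pyspark.sql.functions import first

# Collapse B to one row per id, taking the first non-null city
b_agg = df_b.groupBy("id").agg(first("city", ignorenulls=True).alias("first_city"))
df_a.join(b_agg, on="id")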
✅ Summary
| Scenario                  | Result              |
|---------------------------|---------------------|
| Duplicates in both tables | Cartesian match     |
| Need 1-to-1 match         | Deduplicate / rank  |
| Need aggregated info      | Use GROUP BY        |
Now let’s break down how duplicates in join keys behave across the different types of Spark SQL joins: inner, left, right, and full outer.
🎯 Setup: Two Tables with Duplicate Join Keys
🔵 Table A (left table)
| id | name  |
|----|-------|
| 1  | Alice |
| 1  | Asha  |
| 2  | Bob   |
🟢 Table B (right table)
| id | city      |
|----|-----------|
| 1  | Mumbai    |
| 1  | Delhi     |
| 3  | Bangalore |
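To run the SQL snippets below as-is, register the sample DataFrames from Step 1 as temporary views first (a sketch):

# Expose the PySpark DataFrames to SQL as tables A and B
df_a.createOrReplaceTempView("A")
df_b.createOrReplaceTempView("B")

Each query below can then be executed via `spark.sql("...").show()`.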
1️⃣ INNER JOIN
SELECT * FROM A INNER JOIN B ON A.id = B.id
- Only rows with matching `id` values in both tables are kept.
- Matching duplicates result in cartesian multiplication (a cross match for each pairing).
✅ Output:
| id | name  | city   |
|----|-------|--------|
| 1  | Alice | Mumbai |
| 1  | Alice | Delhi  |
| 1  | Asha  | Mumbai |
| 1  | Asha  | Delhi  |
`id=2` (Bob) and `id=3` (Bangalore) are dropped — no match on the other side.
2️⃣ LEFT JOIN
SELECT * FROM A LEFT JOIN B ON A.id = B.id
- All rows from A (left side) are kept.
- Matches from B are added; if no match, B columns are NULL.
- Duplicates from A are also multiplied if multiple B matches exist.
✅ Output:
| id | name  | city   |
|----|-------|--------|
| 1  | Alice | Mumbai |
| 1  | Alice | Delhi  |
| 1  | Asha  | Mumbai |
| 1  | Asha  | Delhi  |
| 2  | Bob   | NULL   |
`Bob` has no match → gets NULL.
3️⃣ RIGHT JOIN
SELECT * FROM A RIGHT JOIN B ON A.id = B.id
- All rows from B (right side) are kept.
- Matches from A are added; if no match, A columns are NULL.
✅ Output:
| id | name  | city      |
|----|-------|-----------|
| 1  | Alice | Mumbai    |
| 1  | Alice | Delhi     |
| 1  | Asha  | Mumbai    |
| 1  | Asha  | Delhi     |
| 3  | NULL  | Bangalore |
`Bangalore` has no match → `name` is NULL.
4️⃣ FULL OUTER JOIN
SELECT * FROM A FULL OUTER JOIN B ON A.id = B.id
- Keeps all rows from both A and B.
- Where no match is found, fills the opposite side with NULL.
✅ Output:
| id | name  | city      |
|----|-------|-----------|
| 1  | Alice | Mumbai    |
| 1  | Alice | Delhi     |
| 1  | Asha  | Mumbai    |
| 1  | Asha  | Delhi     |
| 2  | Bob   | NULL      |
| 3  | NULL  | Bangalore |
Includes all matching and non-matching rows from both tables.
🧠 Summary Table
| Join Type       | Keeps Unmatched Rows from | Duplicates Cause Multiplication? |
|-----------------|---------------------------|----------------------------------|
| Inner Join      | None                      | ✅ Yes                           |
| Left Join       | Left table (A)            | ✅ Yes                           |
| Right Join      | Right table (B)           | ✅ Yes                           |
| Full Outer Join | Both tables               | ✅ Yes                           |
🛠 Tips for Managing Duplicates
- If duplicates are unwanted, deduplicate using `DISTINCT` or `ROW_NUMBER() OVER (...)`.
- To keep only one match per key, use aggregation or filtering on `ROW_NUMBER()` (a quick DataFrame-API sketch follows below).
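In the DataFrame API, the quickest dedup before a join is `dropDuplicates` on the join key (a minimal sketch; it keeps one arbitrary row per `id`):

# Collapse B to a single (arbitrary) row per id before joining
df_a.join(df_b.dropDuplicates(["id"]), on="id", how="left")

When the choice of surviving row matters, prefer the `ROW_NUMBER()` window approach shown earlier.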