Here’s the PySpark equivalent of all 4 types of joins — inner, left, right, and full outer, with duplicate-key behavior clearly illustrated.


✅ Step 1: Setup Sample DataFrames

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("JoinExamples").getOrCreate()

data_a = [(1, "Alice"), (1, "Asha"), (2, "Bob")]
data_b = [(1, "Mumbai"), (1, "Delhi"), (3, "Bangalore")]

df_a = spark.createDataFrame(data_a, ["id", "name"])
df_b = spark.createDataFrame(data_b, ["id", "city"])

1️⃣ Inner Join (default)

df_inner = df_a.join(df_b, on="id", how="inner")
df_inner.show()

✅ Output:

+---+-----+--------+
| id| name|   city |
+---+-----+--------+
|  1|Alice| Mumbai |
|  1|Alice|  Delhi |
|  1| Asha| Mumbai |
|  1| Asha|  Delhi |
+---+-----+--------+

All id=1 rows from both sides are matched → 4 rows total.
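As a quick sanity check, the 2 × 2 multiplication can be confirmed with a row count. A minimal sketch, assuming the df_inner DataFrame built above:

# id=1 occurs 2 times in df_a and 2 times in df_b,
# so the inner join yields 2 * 2 = 4 rows in total.
print(df_inner.count())  # 4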


2️⃣ Left Join

df_left = df_a.join(df_b, on="id", how="left")
df_left.show()

✅ Output:

+---+-----+--------+
| id| name|   city |
+---+-----+--------+
|  1|Alice| Mumbai |
|  1|Alice|  Delhi |
|  1| Asha| Mumbai |
|  1| Asha|  Delhi |
|  2|  Bob|    null|
+---+-----+--------+

All rows from df_a preserved. id=2 gets NULL in city.
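To pull out only the unmatched left rows (here, id=2), filter on the NULL right-side column. A small sketch, assuming the df_left DataFrame from above:

from pyspark.sql.functions import col

# Rows from df_a that found no match in df_b carry NULL in city
df_left.filter(col("city").isNull()).show()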


3️⃣ Right Join

df_right = df_a.join(df_b, on="id", how="right")
df_right.show()

✅ Output:

+---+-----+----------+
| id| name|     city |
+---+-----+----------+
|  1|Alice|   Mumbai |
|  1|Alice|    Delhi |
|  1| Asha|   Mumbai |
|  1| Asha|    Delhi |
|  3| null|Bangalore |
+---+-----+----------+

All rows from df_b preserved. id=3 gets NULL in name.


4️⃣ Full Outer Join

df_outer = df_a.join(df_b, on="id", how="outer")
df_outer.show()

✅ Output:

+---+-----+----------+
| id| name|     city |
+---+-----+----------+
|  1|Alice|   Mumbai |
|  1|Alice|    Delhi |
|  1| Asha|   Mumbai |
|  1| Asha|    Delhi |
|  2|  Bob|      null|
|  3| null|Bangalore |
+---+-----+----------+

All rows from both sides are included. id=2 and id=3 filled with NULLs where unmatched.


🧠 Summary of Join Syntax in PySpark

df_a.join(df_b, on="id", how="inner")        # Only matching
df_a.join(df_b, on="id", how="left")         # Keep all from left (df_a)
df_a.join(df_b, on="id", how="right")        # Keep all from right (df_b)
df_a.join(df_b, on="id", how="outer")        # Keep all from both
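PySpark also accepts aliases for these how values: "full" and "full_outer" are the same as "outer", and "left_outer"/"right_outer" match "left"/"right". A quick illustration, reusing df_a and df_b from Step 1:

df_a.join(df_b, on="id", how="full")          # identical to how="outer"
df_a.join(df_b, on="id", how="left_outer")    # identical to how="left"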

A more complex, conditional join can also be expressed as a column expression (note that when comes from pyspark.sql.functions, and that PySpark writes Scala’s .equalTo as the == operator):

from pyspark.sql.functions import when

df1.join(
    df2,
    when(df1.id == df2.id, df1.id).otherwise(df1.id_null) == df2.id
)

Can the same complex join condition be written in Spark SQL?

Yes, the same complex join condition can absolutely be written in Spark SQL, just with slightly different syntax from the PySpark DataFrame API.


🧾 What You’re Doing in PySpark

Your PySpark code:

from pyspark.sql.functions import when

df1.join(
    df2,
    when(df1.id == df2.id, df1.id).otherwise(df1.id_null) == df2.id
)

This is equivalent to a conditional join, where:

  • You compare df1.id to df2.id, and when the two are not equal (or the comparison is NULL), you fall back to comparing df1.id_null to df2.id.
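Because a join keeps only the rows where the condition evaluates to TRUE, the same fallback logic can also be written as a plain boolean OR. A minimal sketch under that assumption, reusing df1, df2, and the id_null column from above:

# Match on id, or fall back to matching id_null against df2.id
df1.join(
    df2,
    (df1.id == df2.id) | (df1.id_null == df2.id)
)

Be aware that a condition like this references columns from both sides non-trivially, so Spark cannot plan it as a simple equi-join and may fall back to a much slower nested-loop strategy on large inputs.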

✅ Equivalent Spark SQL Join

To do this in Spark SQL, you need to register your DataFrames as temporary views and then write a SQL query like:

df1.createOrReplaceTempView("df1")
df2.createOrReplaceTempView("df2")

spark.sql("""
SELECT *
FROM df1
JOIN df2
ON CASE 
     WHEN df1.id = df2.id THEN df1.id
     ELSE df1.id_null
   END = df2.id
""")

🔍 Explanation:

  • The CASE WHEN inside the ON clause replicates the when(...).otherwise(...) logic.
  • It is evaluated per row and decides which left-side value gets compared against df2.id.

⚠️ Important Notes

  1. Spark SQL has no counterpart for the DataFrame API’s when().otherwise() column expression inside a join, but CASE WHEN ... END is the exact equivalent.
  2. The same logic works for left joins too: simply write LEFT JOIN instead of JOIN.

🚀 Bonus — With LEFT JOIN

SELECT *
FROM df1
LEFT JOIN df2
ON CASE 
     WHEN df1.id = df2.id THEN df1.id
     ELSE df1.id_null
   END = df2.id
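For completeness, the DataFrame-API counterpart of this left join is the same column expression passed with how="left". A sketch, again assuming the df1/df2 DataFrames and the id_null column from above:

from pyspark.sql.functions import when

cond = when(df1.id == df2.id, df1.id).otherwise(df1.id_null) == df2.id
df1.join(df2, cond, how="left")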

Spark SQL Joins When Data Has Duplicate Values in the Join Columns

When you perform a join in Spark SQL (or the PySpark DataFrame API) and the join keys contain duplicate values in one or both DataFrames, Spark produces a Cartesian product of the matching rows — this is expected behavior and standard SQL semantics.


🔍 Example: Join with Duplicates

Table A

+---+-----+
| id| name|
+---+-----+
|  1|Alice|
|  1| Asha|
+---+-----+

Table B

+---+---------+
| id|     city|
+---+---------+
|  1|   Mumbai|
|  1|Bangalore|
+---+---------+

SQL Join:

SELECT * FROM A JOIN B ON A.id = B.id

🧮 Result:

This will produce 4 rows due to the 2×2 Cartesian match:

+----+-----+----+---------+
|A.id| name|B.id|     city|
+----+-----+----+---------+
|   1|Alice|   1|   Mumbai|
|   1|Alice|   1|Bangalore|
|   1| Asha|   1|   Mumbai|
|   1| Asha|   1|Bangalore|
+----+-----+----+---------+

⚠️ Why This Happens

This is not a bug — it’s how relational joins work:

  • If both tables have n and m rows with the same key, the join returns n × m rows for that key (see the quick check below).
  • This applies to inner, left, right, and full outer joins alike (the outer variants additionally keep their unmatched rows).
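A quick way to see the multiplication in PySpark, reusing the df_a and df_b DataFrames from the first section (where id=1 likewise appears twice on each side):

# For id=1: 2 rows in df_a x 2 rows in df_b -> count = 4
df_a.join(df_b, on="id").groupBy("id").count().show()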

✅ How to Handle It

✅ If You Want Only One Match Per Key:

Use one of the following techniques:

1. Drop Duplicates Before Join

SELECT * 
FROM (SELECT DISTINCT id, name FROM A) a
JOIN (SELECT DISTINCT id, city FROM B) b
ON a.id = b.id
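The PySpark analogue is dropDuplicates. One caveat worth noting: SELECT DISTINCT (like dropDuplicates() with no arguments) removes only fully identical rows, so to guarantee a single row per key you must deduplicate on the key itself. A sketch, reusing df_a and df_b:

# Keep one arbitrary row per id on each side before joining
# (which row survives is non-deterministic)
df_a.dropDuplicates(["id"]).join(df_b.dropDuplicates(["id"]), on="id").show()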

2. Use Row Numbering and Select First Match

WITH ranked_b AS (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY city) AS rn
  FROM B
)
SELECT a.*, b.city
FROM A a
JOIN ranked_b b
  ON a.id = b.id AND b.rn = 1

This gives one row from B per key, joined to all matching rows in A (or vice versa).
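The PySpark equivalent of this pattern uses a window function. A sketch with the Window API, reusing df_a and df_b:

from pyspark.sql.functions import row_number
from pyspark.sql.window import Window

# Rank df_b's rows within each id, then keep only the first match
w = Window.partitionBy("id").orderBy("city")
ranked_b = df_b.withColumn("rn", row_number().over(w))
df_a.join(ranked_b.filter(ranked_b.rn == 1), on="id").show()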

3. Aggregate Before Joining

If you’re just interested in a summary (e.g., count, max, first):

SELECT a.*, b.first_city
FROM A a
JOIN (
  SELECT id, FIRST(city, TRUE) AS first_city
  FROM B
  GROUP BY id
) b
ON a.id = b.id
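In the DataFrame API this is a groupBy aggregation before the join. A sketch using pyspark.sql.functions.first, reusing df_a and df_b:

from pyspark.sql.functions import first

# Collapse df_b to one city per id (first non-null value; order not guaranteed)
b_agg = df_b.groupBy("id").agg(first("city", ignorenulls=True).alias("first_city"))
df_a.join(b_agg, on="id").show()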

✅ Summary

+---------------------------+--------------------+
| Scenario                  | Result             |
+---------------------------+--------------------+
| Duplicates in both tables | Cartesian match    |
| Need 1-to-1 match         | Deduplicate / rank |
| Need aggregated info      | Use GROUP BY       |
+---------------------------+--------------------+

Let’s break down how duplicates in join keys behave across the different types of Spark SQL joins: inner, left, right, and full outer.


🎯 Setup: Two Tables with Duplicate Join Keys

🔵 Table A (left table)

+---+-----+
| id| name|
+---+-----+
|  1|Alice|
|  1| Asha|
|  2|  Bob|
+---+-----+

🟢 Table B (right table)

+---+---------+
| id|     city|
+---+---------+
|  1|   Mumbai|
|  1|    Delhi|
|  3|Bangalore|
+---+---------+

1️⃣ INNER JOIN

SELECT * FROM A INNER JOIN B ON A.id = B.id

  • Only rows with matching id values in both tables are returned.
  • Matching duplicates result in Cartesian multiplication (every left match pairs with every right match).

✅ Output:

+---+-----+---------+
| id| name|     city|
+---+-----+---------+
|  1|Alice|   Mumbai|
|  1|Alice|    Delhi|
|  1| Asha|   Mumbai|
|  1| Asha|    Delhi|
+---+-----+---------+

id=2 (Bob) and id=3 (Bangalore) are dropped: neither has a match on the other side.


2️⃣ LEFT JOIN

SELECT * FROM A LEFT JOIN B ON A.id = B.id

  • All rows from A (left side) are kept.
  • Matches from B are added; if no match, B columns are NULL.
  • Duplicates from A are also multiplied if multiple B matches exist.

✅ Output:

+---+-----+---------+
| id| name|     city|
+---+-----+---------+
|  1|Alice|   Mumbai|
|  1|Alice|    Delhi|
|  1| Asha|   Mumbai|
|  1| Asha|    Delhi|
|  2|  Bob|     NULL|
+---+-----+---------+

Bob has no match → gets NULL.


3️⃣ RIGHT JOIN

SELECT * FROM A RIGHT JOIN B ON A.id = B.id

  • All rows from B (right side) are kept.
  • Matches from A are added; if no match, A columns are NULL.

✅ Output:

+---+-----+---------+
| id| name|     city|
+---+-----+---------+
|  1|Alice|   Mumbai|
|  1|Alice|    Delhi|
|  1| Asha|   Mumbai|
|  1| Asha|    Delhi|
|  3| NULL|Bangalore|
+---+-----+---------+

Bangalore has no match → name is NULL.


4️⃣ FULL OUTER JOIN

SELECT * FROM A FULL OUTER JOIN B ON A.id = B.id

  • Keeps all rows from both A and B.
  • Where no match is found, the opposite side is filled with NULL.

✅ Output:

+---+-----+---------+
| id| name|     city|
+---+-----+---------+
|  1|Alice|   Mumbai|
|  1|Alice|    Delhi|
|  1| Asha|   Mumbai|
|  1| Asha|    Delhi|
|  2|  Bob|     NULL|
|  3| NULL|Bangalore|
+---+-----+---------+

Includes all matching and non-matching rows from both tables.


🧠 Summary Table

+-----------------+---------------------------+----------------------------------+
| Join Type       | Keeps Unmatched Rows From | Duplicates Cause Multiplication? |
+-----------------+---------------------------+----------------------------------+
| Inner Join      | None                      | ✅ Yes                           |
| Left Join       | Left table (A)            | ✅ Yes                           |
| Right Join      | Right table (B)           | ✅ Yes                           |
| Full Outer Join | Both tables               | ✅ Yes                           |
+-----------------+---------------------------+----------------------------------+

🛠 Tips for Managing Duplicates

  • If duplicates are unwanted, deduplicate using DISTINCT or ROW_NUMBER() OVER (...).
  • To keep only one match, use aggregation or filtering on ROW_NUMBER().
