Here are all 4 types of joins in PySpark — inner, left, right, and full outer — with duplicate key behavior clearly illustrated.
✅ Step 1: Setup Sample DataFrames
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("JoinExamples").getOrCreate()
# Duplicate join keys on both sides: id=1 appears twice in each DataFrame
data_a = [(1, "Alice"), (1, "Asha"), (2, "Bob")]
data_b = [(1, "Mumbai"), (1, "Delhi"), (3, "Bangalore")]
df_a = spark.createDataFrame(data_a, ["id", "name"])
df_b = spark.createDataFrame(data_b, ["id", "city"])
1️⃣ Inner Join (default)
df_inner = df_a.join(df_b, on="id", how="inner")
df_inner.show()
✅ Output:
+---+-----+------+
| id| name|  city|
+---+-----+------+
|  1|Alice|Mumbai|
|  1|Alice| Delhi|
|  1| Asha|Mumbai|
|  1| Asha| Delhi|
+---+-----+------+
All `id=1` rows from both sides are matched → 4 rows total.
2️⃣ Left Join
df_left = df_a.join(df_b, on="id", how="left")
df_left.show()
✅ Output:
+---+-----+------+
| id| name|  city|
+---+-----+------+
|  1|Alice|Mumbai|
|  1|Alice| Delhi|
|  1| Asha|Mumbai|
|  1| Asha| Delhi|
|  2|  Bob|  null|
+---+-----+------+
All rows from `df_a` are preserved; `id=2` gets NULL in `city`.
3️⃣ Right Join
df_right = df_a.join(df_b, on="id", how="right")
df_right.show()
✅ Output:
+---+-----+---------+
| id| name|     city|
+---+-----+---------+
|  1|Alice|   Mumbai|
|  1|Alice|    Delhi|
|  1| Asha|   Mumbai|
|  1| Asha|    Delhi|
|  3| null|Bangalore|
+---+-----+---------+
All rows from `df_b` are preserved; `id=3` gets NULL in `name`.
4️⃣ Full Outer Join
df_outer = df_a.join(df_b, on="id", how="outer")
df_outer.show()
✅ Output:
+---+-----+---------+
| id| name|     city|
+---+-----+---------+
|  1|Alice|   Mumbai|
|  1|Alice|    Delhi|
|  1| Asha|   Mumbai|
|  1| Asha|    Delhi|
|  2|  Bob|     null|
|  3| null|Bangalore|
+---+-----+---------+
All rows from both sides are included; `id=2` and `id=3` are filled with NULLs where unmatched.
🧠 Summary of Join Syntax in PySpark
df_a.join(df_b, on="id", how="inner") # Only matching
df_a.join(df_b, on="id", how="left") # Keep all from left (df_a)
df_a.join(df_b, on="id", how="right") # Keep all from right (df_b)
df_a.join(df_b, on="id", how="outer") # Keep all from both
Now consider a more complex, conditional join:

df1.join(
    df2,
    when(df1.id == df2.id, df1.id).otherwise(df1.id_null) == df2.id
)

Can the same complex join condition be written in Spark SQL?
Yes, the same complex join condition can absolutely be written in Spark SQL, but with a slightly different syntax than the PySpark DataFrame API.
🧾 What You’re Doing in PySpark
Your PySpark code:
from pyspark.sql.functions import when

df1.join(
    df2,
    when(df1.id == df2.id, df1.id).otherwise(df1.id_null) == df2.id
)
This is equivalent to a conditional join: you compare `df1.id` to `df2.id`, but if that fails, fall back to `df1.id_null == df2.id`.
✅ Equivalent Spark SQL Join
To do this in Spark SQL, you need to register your DataFrames as temporary views and then write a SQL query like:
df1.createOrReplaceTempView("df1")
df2.createOrReplaceTempView("df2")
spark.sql("""
SELECT *
FROM df1
JOIN df2
ON CASE
WHEN df1.id = df2.id THEN df1.id
ELSE df1.id_null
END = df2.id
""")
🔍 Explanation:
- The `CASE WHEN` inside the `ON` clause replicates the `when(...).otherwise(...)` logic.
- It evaluates the condition and applies the right value for the join.
⚠️ Important Notes
- Spark SQL has no `when().otherwise()` construct, but `CASE WHEN ... END` inside the `ON` clause is the correct equivalent.
- If you’re doing left joins, you can still use the same logic and write `LEFT JOIN` instead of `JOIN`.
🚀 Bonus — With LEFT JOIN
SELECT *
FROM df1
LEFT JOIN df2
ON CASE
WHEN df1.id = df2.id THEN df1.id
ELSE df1.id_null
END = df2.id
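For completeness, here is a runnable DataFrame-API version of the same fallback join. The sample rows and the `id_null` column are invented purely for illustration (reusing the `spark` session from Step 1):

from pyspark.sql.functions import when

# Hypothetical data: id is the primary join key, id_null is the fallback key
df1 = spark.createDataFrame([(1, 10, "a"), (5, 3, "b")], ["id", "id_null", "val"])
df2 = spark.createDataFrame([(1, "x"), (3, "y")], ["id", "other"])

# Join on id when it matches; otherwise fall back to id_null
cond = when(df1.id == df2.id, df1.id).otherwise(df1.id_null) == df2.id
df1.join(df2, cond, how="left").show()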
Spark SQL joins when data has duplicate values in the join columns
When you perform a join in Spark SQL (or PySpark DataFrame API), and the join keys have duplicate values in one or both DataFrames, Spark performs a cartesian multiplication of the matching rows — this is expected behavior and is standard SQL semantics.
🔍 Example: Join with Duplicates
Table A
| id | name  |
|----|-------|
| 1  | Alice |
| 1  | Asha  |
Table B
| id | city      |
|----|-----------|
| 1  | Mumbai    |
| 1  | Bangalore |
SQL Join:
SELECT * FROM A JOIN B ON A.id = B.id
🧮 Result:
This will produce 4 rows due to the 2×2 Cartesian match:
| A.id | name  | B.id | city      |
|------|-------|------|-----------|
| 1    | Alice | 1    | Mumbai    |
| 1    | Alice | 1    | Bangalore |
| 1    | Asha  | 1    | Mumbai    |
| 1    | Asha  | 1    | Bangalore |
⚠️ Why This Happens
This is not a bug — it’s how relational joins work:
- If both tables have `n` and `m` rows with the same key, the join returns `n × m` rows for that key (verified in the sketch below).
- This applies to inner join, left join, right join, and full outer join (each with its own rules).
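A quick sanity check in PySpark (a minimal sketch, reusing the `spark` session from Step 1) confirms the `n × m` behavior:

# 2 rows with id=1 on each side -> 2 x 2 = 4 rows after the join
a = spark.createDataFrame([(1, "Alice"), (1, "Asha")], ["id", "name"])
b = spark.createDataFrame([(1, "Mumbai"), (1, "Bangalore")], ["id", "city"])
print(a.join(b, on="id", how="inner").count())  # 4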
✅ How to Handle It
✅ If You Want Only One Match Per Key:
Use one of the following techniques:
1. Drop Duplicates Before Join
SELECT *
FROM (SELECT DISTINCT id, name FROM A) a
JOIN (SELECT DISTINCT id, city FROM B) b
ON a.id = b.id
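The DataFrame-API counterpart (a sketch, reusing `df_a` and `df_b` from Step 1) is a `dropDuplicates` on each side before the join:

# Deduplicate each side, then join
a_dedup = df_a.dropDuplicates(["id", "name"])
b_dedup = df_b.dropDuplicates(["id", "city"])
a_dedup.join(b_dedup, on="id", how="inner")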
2. Use Row Numbering and Select First Match
WITH ranked_b AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY city) AS rn
FROM B
)
SELECT a.*, b.city
FROM A a
JOIN ranked_b b
ON a.id = b.id AND b.rn = 1
This gives one row from B per key, joined to all matching rows in A (or vice versa).
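In the DataFrame API, the same idea is a window with `row_number` (a sketch):

from pyspark.sql import Window
from pyspark.sql.functions import row_number

# Rank B rows per id, keep only the first, then join
w = Window.partitionBy("id").orderBy("city")
ranked_b = df_b.withColumn("rn", row_number().over(w))
df_a.join(ranked_b.filter("rn = 1").drop("rn"), on="id")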
3. Aggregate before joining
If you’re just interested in a summary (e.g., count, max, first):
SELECT a.*, b.first_city
FROM A a
JOIN (
SELECT id, FIRST(city, TRUE) AS first_city
FROM B
GROUP BY id
) b
ON a.id = b.id
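The DataFrame-API version of this aggregation (a sketch) uses `groupBy` with `first`:

from pyspark.sql.functions import first

# Collapse B to one row per id, taking the first non-null city
b_agg = df_b.groupBy("id").agg(first("city", ignorenulls=True).alias("first_city"))
df_a.join(b_agg, on="id")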
✅ Summary
| Scenario                  | Result              |
|---------------------------|---------------------|
| Duplicates in both tables | Cartesian match     |
| Need 1-to-1 match         | Deduplicate / rank  |
| Need aggregated info      | Use GROUP BY        |
Now let’s break down how duplicates in join keys behave across the different types of Spark SQL joins: inner, left, right, and full outer.
🎯 Setup: Two Tables with Duplicate Join Keys
🔵 Table A (left table)
| id | name  |
|----|-------|
| 1  | Alice |
| 1  | Asha  |
| 2  | Bob   |
🟢 Table B (right table)
| id | city      |
|----|-----------|
| 1  | Mumbai    |
| 1  | Delhi     |
| 3  | Bangalore |
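To run the SQL snippets below as-is, register the sample DataFrames from Step 1 as temporary views first (a sketch):

# Expose the PySpark DataFrames to SQL as tables A and B
df_a.createOrReplaceTempView("A")
df_b.createOrReplaceTempView("B")

Each query below can then be executed via `spark.sql("...").show()`.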
1️⃣ INNER JOIN
SELECT * FROM A INNER JOIN B ON A.id = B.id
- Only rows with matching `id` values in both tables are kept.
- Matching duplicates result in cartesian multiplication (a cross match for each pairing).
✅ Output:
| id | name  | city   |
|----|-------|--------|
| 1  | Alice | Mumbai |
| 1  | Alice | Delhi  |
| 1  | Asha  | Mumbai |
| 1  | Asha  | Delhi  |
`id=2` (Bob) and `id=3` (Bangalore) are dropped — no match on the other side.
2️⃣ LEFT JOIN
SELECT * FROM A LEFT JOIN B ON A.id = B.id
- All rows from A (left side) are kept.
- Matches from B are added; if no match, B columns are NULL.
- Duplicates from A are also multiplied if multiple B matches exist.
✅ Output:
| id | name  | city   |
|----|-------|--------|
| 1  | Alice | Mumbai |
| 1  | Alice | Delhi  |
| 1  | Asha  | Mumbai |
| 1  | Asha  | Delhi  |
| 2  | Bob   | NULL   |
`Bob` has no match → gets NULL.
3️⃣ RIGHT JOIN
SELECT * FROM A RIGHT JOIN B ON A.id = B.id
- All rows from B (right side) are kept.
- Matches from A are added; if no match, A columns are NULL.
✅ Output:
| id | name  | city      |
|----|-------|-----------|
| 1  | Alice | Mumbai    |
| 1  | Alice | Delhi     |
| 1  | Asha  | Mumbai    |
| 1  | Asha  | Delhi     |
| 3  | NULL  | Bangalore |
`Bangalore` has no match → `name` is NULL.
4️⃣ FULL OUTER JOIN
SELECT * FROM A FULL OUTER JOIN B ON A.id = B.id
- Keeps all rows from both A and B.
- Where no match is found, fills the opposite side with NULL.
✅ Output:
| id | name  | city      |
|----|-------|-----------|
| 1  | Alice | Mumbai    |
| 1  | Alice | Delhi     |
| 1  | Asha  | Mumbai    |
| 1  | Asha  | Delhi     |
| 2  | Bob   | NULL      |
| 3  | NULL  | Bangalore |
Includes all matching and non-matching rows from both tables.
🧠 Summary Table
| Join Type       | Keeps Unmatched Rows from | Duplicates Cause Multiplication? |
|-----------------|---------------------------|----------------------------------|
| Inner Join      | None                      | ✅ Yes                           |
| Left Join       | Left table (A)            | ✅ Yes                           |
| Right Join      | Right table (B)           | ✅ Yes                           |
| Full Outer Join | Both tables               | ✅ Yes                           |
🛠 Tips for Managing Duplicates
- If duplicates are unwanted, deduplicate using `DISTINCT` or `ROW_NUMBER() OVER (...)`.
- To keep only one match per key, use aggregation or filtering on `ROW_NUMBER()` (a quick DataFrame-API sketch follows below).
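In the DataFrame API, the quickest dedup before a join is `dropDuplicates` on the join key (a minimal sketch; it keeps one arbitrary row per `id`):

# Collapse B to a single (arbitrary) row per id before joining
df_a.join(df_b.dropDuplicates(["id"]), on="id", how="left")

When the choice of surviving row matters, prefer the `ROW_NUMBER()` window approach shown earlier.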