Spark SQL supports several types of joins, each suited to different use cases. Below is a detailed explanation of each join type, including syntax examples and comparisons.
Types of Joins in Spark SQL
- Inner Join
- Left (Outer) Join
- Right (Outer) Join
- Full (Outer) Join
- Left Semi Join
- Left Anti Join
- Cross Join
1. Inner Join
An inner join returns only the rows that have matching values in both tables.
Syntax:
SELECT a.*, b.*
FROM tableA a
INNER JOIN tableB b
ON a.id = b.id;
Example:
SELECT employees.emp_id, employees.emp_name, departments.dept_name
FROM employees
INNER JOIN departments
ON employees.dept_id = departments.dept_id;
2. Left (Outer) Join
A left join returns all rows from the left table and the matched rows from the right table. If no match is found, NULLs are returned for columns from the right table.
Syntax:
SELECT a.*, b.*
FROM tableA a
LEFT JOIN tableB b
ON a.id = b.id;
Example:
SELECT employees.emp_id, employees.emp_name, departments.dept_name
FROM employees
LEFT JOIN departments
ON employees.dept_id = departments.dept_id;
3. Right (Outer) Join
A right join returns all rows from the right table and the matched rows from the left table. If no match is found, NULLs are returned for columns from the left table.
Syntax:
SELECT a.*, b.*
FROM tableA a
RIGHT JOIN tableB b
ON a.id = b.id;
Example:
SELECT employees.emp_id, employees.emp_name, departments.dept_name
FROM employees
RIGHT JOIN departments
ON employees.dept_id = departments.dept_id;
4. Full (Outer) Join
A full outer join returns all rows when there is a match in either left or right table. Rows without a match in one of the tables will have NULLs in the columns of the non-matching table.
Syntax:
SELECT a.*, b.*
FROM tableA a
FULL OUTER JOIN tableB b
ON a.id = b.id;
Example:
SELECT employees.emp_id, employees.emp_name, departments.dept_name
FROM employees
FULL OUTER JOIN departments
ON employees.dept_id = departments.dept_id;
5. Left Semi Join
A left semi join returns only the rows from the left table for which there is a match in the right table. It is equivalent to using an IN
clause.
Syntax:
SELECT a.*
FROM tableA a
LEFT SEMI JOIN tableB b
ON a.id = b.id;
Example:
SELECT employees.emp_id, employees.emp_name
FROM employees
LEFT SEMI JOIN departments
ON employees.dept_id = departments.dept_id;
6. Left Anti Join
A left anti join returns only the rows from the left table for which there is no match in the right table. It is equivalent to using a NOT IN
clause.
Syntax:
SELECT a.*
FROM tableA a
LEFT ANTI JOIN tableB b
ON a.id = b.id;
Example:
SELECT employees.emp_id, employees.emp_name
FROM employees
LEFT ANTI JOIN departments
ON employees.dept_id = departments.dept_id;
7. Cross Join
A cross join returns the Cartesian product of the two tables, meaning every row from the left table is joined with every row from the right table.
Syntax:
SELECT a.*, b.*
FROM tableA a
CROSS JOIN tableB b;
Example:
SELECT employees.emp_id, employees.emp_name, departments.dept_name
FROM employees
CROSS JOIN departments;
Comparison
Join Type | Returns Rows from Left Table | Returns Rows from Right Table | Returns Matched Rows | Returns Unmatched Rows with NULLs |
---|---|---|---|---|
Inner Join | Yes | Yes | Yes | No |
Left Join | Yes | No | Yes | Yes (for left table) |
Right Join | No | Yes | Yes | Yes (for right table) |
Full Outer Join | Yes | Yes | Yes | Yes (for both tables) |
Left Semi Join | Yes | No | Yes | No |
Left Anti Join | Yes | No | No | Yes (only for non-matching rows) |
Cross Join | Yes | Yes | N/A | N/A |
When you perform a join in Spark SQL (or PySpark DataFrame API), and the join keys have duplicate values in one or both DataFrames, Spark performs a cartesian multiplication of the matching rows — this is expected behavior and is standard SQL semantics.
🔍 Example: Join with Duplicates
Table A
id | name |
---|---|
1 | Alice |
1 | Asha |
Table B
id | city |
---|---|
1 | Mumbai |
1 | Bangalore |
SQL Join:
SELECT * FROM A JOIN B ON A.id = B.id
🧮 Result:
This will produce 4 rows due to the 2×2 Cartesian match:
A.id | name | B.id | city |
---|---|---|---|
1 | Alice | 1 | Mumbai |
1 | Alice | 1 | Bangalore |
1 | Asha | 1 | Mumbai |
1 | Asha | 1 | Bangalore |
⚠️ Why This Happens
This is not a bug — it’s how relational joins work:
- If both tables have
n
andm
rows with the same key, the join returnsn × m
rows for that key. - This applies to inner join, left join, right join, and full outer join (each with its own rules).
✅ How to Handle It
✅ If You Want Only One Match Per Key:
Use one of the following techniques:
1. Drop Duplicates Before Join
SELECT *
FROM (SELECT DISTINCT id, name FROM A) a
JOIN (SELECT DISTINCT id, city FROM B) b
ON a.id = b.id
2. Use Row Numbering and Select First Match
WITH ranked_b AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY city) AS rn
FROM B
)
SELECT a.*, b.city
FROM A a
JOIN ranked_b b
ON a.id = b.id AND b.rn = 1
This gives one row from B per key, joined to all matching rows in A (or vice versa).
3. Aggregate before joining
If you’re just interested in a summary (e.g., count, max, first):
SELECT a.*, b.first_city
FROM A a
JOIN (
SELECT id, FIRST(city, TRUE) AS first_city
FROM B
GROUP BY id
) b
ON a.id = b.id
✅ Summary
Scenario | Result |
---|---|
Duplicates in both tables | Cartesian match |
Need 1-to-1 match | Deduplicate / rank |
Need aggregated info | Use GROUP BY |
Let’s break down how duplicates in join keys behave across different types of Spark SQL joins: inner, left, right, and full outer join — especially when duplicates are present.
🎯 Setup: Two Tables with Duplicate Join Keys
🔵 Table A (left table)
id | name |
---|---|
1 | Alice |
1 | Asha |
2 | Bob |
🟢 Table B (right table)
id | city |
---|---|
1 | Mumbai |
1 | Delhi |
3 | Bangalore |
1️⃣ INNER JOIN
SELECT * FROM A INNER JOIN B ON A.id = B.id
- Only rows with matching
id
values in both tables. - Matching duplicates result in cartesian multiplication (cross join for each match).
✅ Output:
id | name | city |
---|---|---|
1 | Alice | Mumbai |
1 | Alice | Delhi |
1 | Asha | Mumbai |
1 | Asha | Delhi |
id=2
(Bob) andid=3
(Bangalore) are ignored — no match in both.
2️⃣ LEFT JOIN
SELECT * FROM A LEFT JOIN B ON A.id = B.id
- All rows from A (left side) are kept.
- Matches from B are added; if no match, B columns are NULL.
- Duplicates from A are also multiplied if multiple B matches exist.
✅ Output:
id | name | city |
---|---|---|
1 | Alice | Mumbai |
1 | Alice | Delhi |
1 | Asha | Mumbai |
1 | Asha | Delhi |
2 | Bob | NULL |
Bob
has no match → gets NULL.
3️⃣ RIGHT JOIN
SELECT * FROM A RIGHT JOIN B ON A.id = B.id
- All rows from B (right side) are kept.
- Matches from A are added; if no match, A columns are NULL.
✅ Output:
id | name | city |
---|---|---|
1 | Alice | Mumbai |
1 | Alice | Delhi |
1 | Asha | Mumbai |
1 | Asha | Delhi |
3 | NULL | Bangalore |
Bangalore
has no match →name
is NULL.
4️⃣ FULL OUTER JOIN
SELECT * FROM A FULL OUTER JOIN B ON A.id = B.id
- Keeps all rows from both A and B.
- Where no match is found, fills the opposite side with NULL.
✅ Output:
id | name | city |
---|---|---|
1 | Alice | Mumbai |
1 | Alice | Delhi |
1 | Asha | Mumbai |
1 | Asha | Delhi |
2 | Bob | NULL |
3 | NULL | Bangalore |
Includes all matching and non-matching rows from both tables.
🧠 Summary Table
Join Type | Keeps Unmatched Rows from | Duplicates Cause Multiplication? |
---|---|---|
Inner Join | None | ✅ Yes |
Left Join | Left Table (A) | ✅ Yes |
Right Join | Right Table (B) | ✅ Yes |
Full Outer Join | Both Tables | ✅ Yes |
🛠 Tips for Managing Duplicates
- If duplicates are unwanted, deduplicate using
DISTINCT
orROW_NUMBER() OVER (...)
. - To keep only one match, use aggregation or filtering on
ROW_NUMBER()
.
Leave a Reply