Spark SQL supports several types of joins, each suited to different use cases. Below is a detailed explanation of each join type, including syntax examples and comparisons.

Types of Joins in Spark SQL

  1. Inner Join
  2. Left (Outer) Join
  3. Right (Outer) Join
  4. Full (Outer) Join
  5. Left Semi Join
  6. Left Anti Join
  7. Cross Join

1. Inner Join

An inner join returns only the rows that have matching values in both tables.

Syntax:

SELECT a.*, b.*
FROM tableA a
INNER JOIN tableB b
ON a.id = b.id;

Example:

SELECT employees.emp_id, employees.emp_name, departments.dept_name
FROM employees
INNER JOIN departments
ON employees.dept_id = departments.dept_id;

2. Left (Outer) Join

A left join returns all rows from the left table and the matched rows from the right table. If no match is found, NULLs are returned for columns from the right table.

Syntax:

SELECT a.*, b.*
FROM tableA a
LEFT JOIN tableB b
ON a.id = b.id;

Example:

SELECT employees.emp_id, employees.emp_name, departments.dept_name
FROM employees
LEFT JOIN departments
ON employees.dept_id = departments.dept_id;

3. Right (Outer) Join

A right join returns all rows from the right table and the matched rows from the left table. If no match is found, NULLs are returned for columns from the left table.

Syntax:

SELECT a.*, b.*
FROM tableA a
RIGHT JOIN tableB b
ON a.id = b.id;

Example:

SELECT employees.emp_id, employees.emp_name, departments.dept_name
FROM employees
RIGHT JOIN departments
ON employees.dept_id = departments.dept_id;

4. Full (Outer) Join

A full outer join returns all rows when there is a match in either left or right table. Rows without a match in one of the tables will have NULLs in the columns of the non-matching table.

Syntax:

SELECT a.*, b.*
FROM tableA a
FULL OUTER JOIN tableB b
ON a.id = b.id;

Example:

SELECT employees.emp_id, employees.emp_name, departments.dept_name
FROM employees
FULL OUTER JOIN departments
ON employees.dept_id = departments.dept_id;

5. Left Semi Join

A left semi join returns only the rows from the left table for which there is a match in the right table. It is equivalent to using an IN clause.

Syntax:

SELECT a.*
FROM tableA a
LEFT SEMI JOIN tableB b
ON a.id = b.id;

Example:

SELECT employees.emp_id, employees.emp_name
FROM employees
LEFT SEMI JOIN departments
ON employees.dept_id = departments.dept_id;

6. Left Anti Join

A left anti join returns only the rows from the left table for which there is no match in the right table. It is equivalent to using a NOT IN clause.

Syntax:

SELECT a.*
FROM tableA a
LEFT ANTI JOIN tableB b
ON a.id = b.id;

Example:

SELECT employees.emp_id, employees.emp_name
FROM employees
LEFT ANTI JOIN departments
ON employees.dept_id = departments.dept_id;

7. Cross Join

A cross join returns the Cartesian product of the two tables, meaning every row from the left table is joined with every row from the right table.

Syntax:

SELECT a.*, b.*
FROM tableA a
CROSS JOIN tableB b;

Example:

SELECT employees.emp_id, employees.emp_name, departments.dept_name
FROM employees
CROSS JOIN departments;

Comparison

Join TypeReturns Rows from Left TableReturns Rows from Right TableReturns Matched RowsReturns Unmatched Rows with NULLs
Inner JoinYesYesYesNo
Left JoinYesNoYesYes (for left table)
Right JoinNoYesYesYes (for right table)
Full Outer JoinYesYesYesYes (for both tables)
Left Semi JoinYesNoYesNo
Left Anti JoinYesNoNoYes (only for non-matching rows)
Cross JoinYesYesN/AN/A

When you perform a join in Spark SQL (or PySpark DataFrame API), and the join keys have duplicate values in one or both DataFrames, Spark performs a cartesian multiplication of the matching rows — this is expected behavior and is standard SQL semantics.


🔍 Example: Join with Duplicates

Table A

idname
1Alice
1Asha

Table B

idcity
1Mumbai
1Bangalore

SQL Join:

SELECT * FROM A JOIN B ON A.id = B.id

🧮 Result:

This will produce 4 rows due to the 2×2 Cartesian match:

A.idnameB.idcity
1Alice1Mumbai
1Alice1Bangalore
1Asha1Mumbai
1Asha1Bangalore

⚠️ Why This Happens

This is not a bug — it’s how relational joins work:

  • If both tables have n and m rows with the same key, the join returns n × m rows for that key.
  • This applies to inner join, left join, right join, and full outer join (each with its own rules).

✅ How to Handle It

✅ If You Want Only One Match Per Key:

Use one of the following techniques:

1. Drop Duplicates Before Join

SELECT * 
FROM (SELECT DISTINCT id, name FROM A) a
JOIN (SELECT DISTINCT id, city FROM B) b
ON a.id = b.id

2. Use Row Numbering and Select First Match

WITH ranked_b AS (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY city) AS rn
  FROM B
)
SELECT a.*, b.city
FROM A a
JOIN ranked_b b
  ON a.id = b.id AND b.rn = 1

This gives one row from B per key, joined to all matching rows in A (or vice versa).

3. Aggregate before joining

If you’re just interested in a summary (e.g., count, max, first):

SELECT a.*, b.first_city
FROM A a
JOIN (
  SELECT id, FIRST(city, TRUE) AS first_city
  FROM B
  GROUP BY id
) b
ON a.id = b.id

✅ Summary

ScenarioResult
Duplicates in both tablesCartesian match
Need 1-to-1 matchDeduplicate / rank
Need aggregated infoUse GROUP BY

Let’s break down how duplicates in join keys behave across different types of Spark SQL joins: inner, left, right, and full outer join — especially when duplicates are present.


🎯 Setup: Two Tables with Duplicate Join Keys

🔵 Table A (left table)

idname
1Alice
1Asha
2Bob

🟢 Table B (right table)

idcity
1Mumbai
1Delhi
3Bangalore

1️⃣ INNER JOIN

SELECT * FROM A INNER JOIN B ON A.id = B.id
  • Only rows with matching id values in both tables.
  • Matching duplicates result in cartesian multiplication (cross join for each match).

✅ Output:

idnamecity
1AliceMumbai
1AliceDelhi
1AshaMumbai
1AshaDelhi

id=2 (Bob) and id=3 (Bangalore) are ignored — no match in both.


2️⃣ LEFT JOIN

SELECT * FROM A LEFT JOIN B ON A.id = B.id
  • All rows from A (left side) are kept.
  • Matches from B are added; if no match, B columns are NULL.
  • Duplicates from A are also multiplied if multiple B matches exist.

✅ Output:

idnamecity
1AliceMumbai
1AliceDelhi
1AshaMumbai
1AshaDelhi
2BobNULL

Bob has no match → gets NULL.


3️⃣ RIGHT JOIN

SELECT * FROM A RIGHT JOIN B ON A.id = B.id
  • All rows from B (right side) are kept.
  • Matches from A are added; if no match, A columns are NULL.

✅ Output:

idnamecity
1AliceMumbai
1AliceDelhi
1AshaMumbai
1AshaDelhi
3NULLBangalore

Bangalore has no match → name is NULL.


4️⃣ FULL OUTER JOIN

SELECT * FROM A FULL OUTER JOIN B ON A.id = B.id
  • Keeps all rows from both A and B.
  • Where no match is found, fills the opposite side with NULL.

✅ Output:

idnamecity
1AliceMumbai
1AliceDelhi
1AshaMumbai
1AshaDelhi
2BobNULL
3NULLBangalore

Includes all matching and non-matching rows from both tables.


🧠 Summary Table

Join TypeKeeps Unmatched Rows fromDuplicates Cause Multiplication?
Inner JoinNone✅ Yes
Left JoinLeft Table (A)✅ Yes
Right JoinRight Table (B)✅ Yes
Full Outer JoinBoth Tables✅ Yes

🛠 Tips for Managing Duplicates

  • If duplicates are unwanted, deduplicate using DISTINCT or ROW_NUMBER() OVER (...).
  • To keep only one match, use aggregation or filtering on ROW_NUMBER().


Pages: 1 2

Posted in

Leave a Reply

Your email address will not be published. Required fields are marked *