Spark SQL Join Types- Syntax examples, Comparision

Spark SQL supports several types of joins, each suited to different use cases. Below is a detailed explanation of each join type, including syntax examples and comparisons.

Types of Joins in Spark SQL

Inner Join
Left (Outer) Join
Right (Outer) Join
Full (Outer) Join
Left Semi Join
Left Anti Join
Cross Join

1. Inner Join

An inner join returns only the rows that have matching values in both tables.

Syntax:

SELECT a.*, b.*
FROM tableA a
INNER JOIN tableB b
ON a.id = b.id;

Example:

SELECT employees.emp_id, employees.emp_name, departments.dept_name
FROM employees
INNER JOIN departments
ON employees.dept_id = departments.dept_id;

2. Left (Outer) Join

A left join returns all rows from the left table and the matched rows from the right table. If no match is found, NULLs are returned for columns from the right table.

Syntax:

SELECT a.*, b.*
FROM tableA a
LEFT JOIN tableB b
ON a.id = b.id;

Example:

SELECT employees.emp_id, employees.emp_name, departments.dept_name
FROM employees
LEFT JOIN departments
ON employees.dept_id = departments.dept_id;

3. Right (Outer) Join

A right join returns all rows from the right table and the matched rows from the left table. If no match is found, NULLs are returned for columns from the left table.

Syntax:

SELECT a.*, b.*
FROM tableA a
RIGHT JOIN tableB b
ON a.id = b.id;

Example:

SELECT employees.emp_id, employees.emp_name, departments.dept_name
FROM employees
RIGHT JOIN departments
ON employees.dept_id = departments.dept_id;

4. Full (Outer) Join

A full outer join returns all rows when there is a match in either left or right table. Rows without a match in one of the tables will have NULLs in the columns of the non-matching table.

Syntax:

SELECT a.*, b.*
FROM tableA a
FULL OUTER JOIN tableB b
ON a.id = b.id;

Example:

SELECT employees.emp_id, employees.emp_name, departments.dept_name
FROM employees
FULL OUTER JOIN departments
ON employees.dept_id = departments.dept_id;

5. Left Semi Join

A left semi join returns only the rows from the left table for which there is a match in the right table. It is equivalent to using an IN clause.

Syntax:

SELECT a.*
FROM tableA a
LEFT SEMI JOIN tableB b
ON a.id = b.id;

Example:

SELECT employees.emp_id, employees.emp_name
FROM employees
LEFT SEMI JOIN departments
ON employees.dept_id = departments.dept_id;

6. Left Anti Join

A left anti join returns only the rows from the left table for which there is no match in the right table. It is equivalent to using a NOT IN clause.

Syntax:

SELECT a.*
FROM tableA a
LEFT ANTI JOIN tableB b
ON a.id = b.id;

Example:

SELECT employees.emp_id, employees.emp_name
FROM employees
LEFT ANTI JOIN departments
ON employees.dept_id = departments.dept_id;

7. Cross Join

A cross join returns the Cartesian product of the two tables, meaning every row from the left table is joined with every row from the right table.

Syntax:

SELECT a.*, b.*
FROM tableA a
CROSS JOIN tableB b;

Example:

SELECT employees.emp_id, employees.emp_name, departments.dept_name
FROM employees
CROSS JOIN departments;

Comparison

Join Type	Returns Rows from Left Table	Returns Rows from Right Table	Returns Matched Rows	Returns Unmatched Rows with NULLs
Inner Join	Yes	Yes	Yes	No
Left Join	Yes	No	Yes	Yes (for left table)
Right Join	No	Yes	Yes	Yes (for right table)
Full Outer Join	Yes	Yes	Yes	Yes (for both tables)
Left Semi Join	Yes	No	Yes	No
Left Anti Join	Yes	No	No	Yes (only for non-matching rows)
Cross Join	Yes	Yes	N/A	N/A

When you perform a join in Spark SQL (or PySpark DataFrame API), and the join keys have duplicate values in one or both DataFrames, Spark performs a cartesian multiplication of the matching rows — this is expected behavior and is standard SQL semantics.

🔍 Example: Join with Duplicates

Table A

id	name
1	Alice
1	Asha

Table B

id	city
1	Mumbai
1	Bangalore

SQL Join:

SELECT * FROM A JOIN B ON A.id = B.id

🧮 Result:

This will produce 4 rows due to the 2×2 Cartesian match:

A.id	name	B.id	city
1	Alice	1	Mumbai
1	Alice	1	Bangalore
1	Asha	1	Mumbai
1	Asha	1	Bangalore

⚠️ Why This Happens

This is not a bug — it’s how relational joins work:

If both tables have n and m rows with the same key, the join returns n × m rows for that key.
This applies to inner join, left join, right join, and full outer join (each with its own rules).

✅ How to Handle It

✅ If You Want Only One Match Per Key:

Use one of the following techniques:

1. Drop Duplicates Before Join

SELECT * 
FROM (SELECT DISTINCT id, name FROM A) a
JOIN (SELECT DISTINCT id, city FROM B) b
ON a.id = b.id

2. Use Row Numbering and Select First Match

WITH ranked_b AS (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY city) AS rn
  FROM B
)
SELECT a.*, b.city
FROM A a
JOIN ranked_b b
  ON a.id = b.id AND b.rn = 1

This gives one row from B per key, joined to all matching rows in A (or vice versa).

3. Aggregate before joining

If you’re just interested in a summary (e.g., count, max, first):

SELECT a.*, b.first_city
FROM A a
JOIN (
  SELECT id, FIRST(city, TRUE) AS first_city
  FROM B
  GROUP BY id
) b
ON a.id = b.id

✅ Summary

Scenario	Result
Duplicates in both tables	Cartesian match
Need 1-to-1 match	Deduplicate / rank
Need aggregated info	Use `GROUP BY`

Let’s break down how duplicates in join keys behave across different types of Spark SQL joins: inner, left, right, and full outer join — especially when duplicates are present.

🎯 Setup: Two Tables with Duplicate Join Keys

🔵 Table A (left table)

id	name
1	Alice
1	Asha
2	Bob

🟢 Table B (right table)

id	city
1	Mumbai
1	Delhi
3	Bangalore

1️⃣ INNER JOIN

SELECT * FROM A INNER JOIN B ON A.id = B.id

Only rows with matching id values in both tables.
Matching duplicates result in cartesian multiplication (cross join for each match).

✅ Output:

id	name	city
1	Alice	Mumbai
1	Alice	Delhi
1	Asha	Mumbai
1	Asha	Delhi

id=2 (Bob) and id=3 (Bangalore) are ignored — no match in both.

2️⃣ LEFT JOIN

SELECT * FROM A LEFT JOIN B ON A.id = B.id

All rows from A (left side) are kept.
Matches from B are added; if no match, B columns are NULL.
Duplicates from A are also multiplied if multiple B matches exist.

✅ Output:

id	name	city
1	Alice	Mumbai
1	Alice	Delhi
1	Asha	Mumbai
1	Asha	Delhi
2	Bob	NULL

Bob has no match → gets NULL.

3️⃣ RIGHT JOIN

SELECT * FROM A RIGHT JOIN B ON A.id = B.id

All rows from B (right side) are kept.
Matches from A are added; if no match, A columns are NULL.

✅ Output:

id	name	city
1	Alice	Mumbai
1	Alice	Delhi
1	Asha	Mumbai
1	Asha	Delhi
3	NULL	Bangalore

Bangalore has no match → name is NULL.

4️⃣ FULL OUTER JOIN

SELECT * FROM A FULL OUTER JOIN B ON A.id = B.id

Keeps all rows from both A and B.
Where no match is found, fills the opposite side with NULL.

✅ Output:

id	name	city
1	Alice	Mumbai
1	Alice	Delhi
1	Asha	Mumbai
1	Asha	Delhi
2	Bob	NULL
3	NULL	Bangalore

Includes all matching and non-matching rows from both tables.

🧠 Summary Table

Join Type	Keeps Unmatched Rows from	Duplicates Cause Multiplication?
Inner Join	None	✅ Yes
Left Join	Left Table (A)	✅ Yes
Right Join	Right Table (B)	✅ Yes
Full Outer Join	Both Tables	✅ Yes

🛠 Tips for Managing Duplicates

If duplicates are unwanted, deduplicate using DISTINCT or ROW_NUMBER() OVER (...).
To keep only one match, use aggregation or filtering on ROW_NUMBER().

HintsToday

recent posts

about