Let me explain why `NOT IN` can give incorrect results in SQL/Spark SQL when `NULL` is involved, and why `LEFT ANTI JOIN` is preferred in such cases, with an example.
🔥 Problem: `NOT IN` + `NULL` = Unexpected behavior
In SQL, when you write:
```sql
SELECT user_id FROM this_week
WHERE user_id NOT IN (SELECT user_id FROM last_week);
```
This behaves differently if any value in `last_week.user_id` is `NULL`.
❌ What happens:
If even a single `NULL` exists in the `last_week` subquery, the entire `NOT IN` condition fails to return any results, because `user_id <> NULL` is unknown under three-valued logic (`TRUE`, `FALSE`, `UNKNOWN`).
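You can see three-valued logic directly in Spark SQL; a minimal sketch (any `SparkSession` works):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Comparing anything with NULL yields NULL (UNKNOWN), not TRUE or FALSE;
# a WHERE clause then treats that UNKNOWN as FALSE.
spark.sql("SELECT 1 <> NULL AS ne_null, NULL = NULL AS eq_null").show()
# Both columns come back as NULL.
```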
🧪 Example:
Table: `this_week`

| user_id |
|---|
| 1 |
| 2 |
| 3 |
Table: `last_week`

| user_id |
|---|
| 2 |
| NULL |
Now, this query:
```sql
SELECT user_id FROM this_week
WHERE user_id NOT IN (SELECT user_id FROM last_week);
```
Expected output: `1, 3` (because only `2` is common)
Actual output: No rows returned!
Why? Because:

- `NOT IN (2, NULL)` becomes `user_id <> 2 AND user_id <> NULL`
- The part `user_id <> NULL` is `UNKNOWN`, which invalidates the whole condition
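To reproduce this end to end in Spark, you can register the two example tables as temp views and run the exact query; a sketch (the view names mirror the tables above):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.createDataFrame([(1,), (2,), (3,)], ["user_id"]).createOrReplaceTempView("this_week")
spark.createDataFrame([(2,), (None,)], ["user_id"]).createOrReplaceTempView("last_week")

# The NULL in last_week makes NOT IN evaluate to UNKNOWN for users 1 and 3,
# so the query returns zero rows instead of the expected 1 and 3.
spark.sql("""
    SELECT user_id FROM this_week
    WHERE user_id NOT IN (SELECT user_id FROM last_week)
""").show()
```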
✅ Fix: Use `LEFT ANTI JOIN`
```sql
SELECT tw.user_id
FROM this_week tw
LEFT ANTI JOIN last_week lw
  ON tw.user_id = lw.user_id;
```
This gives the correct and null-safe result:
| user_id |
|---|
| 1 |
| 3 |
Spark SQL's `LEFT ANTI JOIN` filters out only the exact matches from the right side and ignores NULLs, so you don't run into the three-valued-logic trap.
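The same anti join can be written in Spark SQL as well; a sketch, assuming the `this_week` and `last_week` temp views registered in the snippet above:

```python
# LEFT ANTI JOIN keeps only left rows with no match on the right; the NULL
# on the right never equals anything, so it simply never matches.
spark.sql("""
    SELECT tw.user_id
    FROM this_week tw
    LEFT ANTI JOIN last_week lw
      ON tw.user_id = lw.user_id
""").show()
# Returns 1 and 3, as expected.
```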
🧠 Summary
| Method | Behavior with NULLs | Safe? |
|---|---|---|
| `NOT IN` | Returns nothing if any NULL in subquery | ❌ No |
| `LEFT ANTI JOIN` | Ignores NULLs, only filters exact matches | ✅ Yes |
Excellent follow-up! Let’s break this down clearly.
🧠 Is the NULL issue only with `NOT IN`, or does it happen with others?
✅ Short Answer:
The NULL issue happens with both `NOT IN` and `=` / `!=` comparisons when NULLs are involved, but the biggest practical issue arises with `NOT IN`.
1. 🔥 `NOT IN` + NULL ⇒ Fatal filtering problem
As covered earlier:
```sql
SELECT user_id FROM this_week
WHERE user_id NOT IN (SELECT user_id FROM last_week);
```
- If `last_week` contains even one `NULL`, no user_id will match, due to three-valued logic.
- This is specific to `NOT IN` and a very common source of bugs.
2. 😐 `IN` + NULL ⇒ Works fine (ignores NULL)
```sql
SELECT user_id FROM this_week
WHERE user_id IN (SELECT user_id FROM last_week);
```
- If `last_week` has a `NULL`, it's effectively ignored, and the `IN` works with the non-null values.
- So `IN` is usually safe with NULLs.
✅ `x IN (1, NULL)` expands to `x = 1 OR x = NULL`; when `x = 1`, the `TRUE` short-circuits the `UNKNOWN`, so matching rows still come through.
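A quick check of the short-circuit; a minimal sketch:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# 1 IN (1, NULL): TRUE OR UNKNOWN  -> TRUE    (row would be kept)
# 2 IN (1, NULL): FALSE OR UNKNOWN -> UNKNOWN (row filtered, but only that row)
spark.sql("SELECT 1 IN (1, NULL) AS hit, 2 IN (1, NULL) AS miss").show()
```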
3. ❗ `=` or `!=` comparisons with NULL ⇒ Always UNKNOWN
Examples:
```sql
SELECT * FROM users WHERE user_id = NULL;   -- Returns nothing!
SELECT * FROM users WHERE user_id != NULL;  -- Returns nothing!
```
Why? Because:

- `NULL = NULL` is not true, it's `UNKNOWN`
- `5 != NULL` is also `UNKNOWN`
- In a `WHERE` clause, `UNKNOWN` is treated as `FALSE`
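A runnable version of the claim, using an inline `VALUES` table in place of a real `users` table (an assumption for the sketch):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Both queries return zero rows -- even the row whose user_id really is NULL,
# because UNKNOWN in a WHERE clause behaves like FALSE.
spark.sql("SELECT * FROM VALUES (1), (NULL) AS users(user_id) WHERE user_id = NULL").show()
spark.sql("SELECT * FROM VALUES (1), (NULL) AS users(user_id) WHERE user_id != NULL").show()
```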
4. ✅ How to check for NULL safely?
Use:

```sql
IS NULL      -- to test for nulls
IS NOT NULL  -- to exclude nulls
```
Example:
```sql
SELECT * FROM users WHERE user_id IS NULL;
SELECT * FROM users WHERE user_id IS NOT NULL;
```
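The PySpark DataFrame API mirrors these with `isNull()` / `isNotNull()`, and also offers `Column.eqNullSafe` (SQL's `<=>`), which treats two NULLs as equal; a sketch:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
users = spark.createDataFrame([(1,), (None,)], ["user_id"])

users.filter(col("user_id").isNull()).show()     # only the NULL row
users.filter(col("user_id").isNotNull()).show()  # only user_id = 1

# Null-safe equality: NULL <=> NULL is TRUE, unlike NULL = NULL (UNKNOWN)
users.filter(col("user_id").eqNullSafe(None)).show()  # matches the NULL row
```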
🧾 Summary Table
| Expression | NULL-safe? | Behavior |
|---|---|---|
| `NOT IN (...)` | ❌ No | Returns no rows if any NULL present |
| `IN (...)` | ✅ Yes | Ignores NULL, works with other values |
| `= NULL` | ❌ No | Always UNKNOWN |
| `!= NULL` | ❌ No | Always UNKNOWN |
| `IS NULL` | ✅ Yes | Matches only NULLs |
| `IS NOT NULL` | ✅ Yes | Excludes NULLs safely |
| `LEFT ANTI JOIN` | ✅ Yes | Filters exact matches, safe from NULL issue |
Perfect! Let's demonstrate this with a real PySpark demo showing how `NOT IN`, `IN`, and `NULL` comparisons behave.
🧪 Scenario Setup
We'll simulate `this_week` and `last_week` as two DataFrames:
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# this_week users
this_week = spark.createDataFrame([
    (1,), (2,), (3,)
], ["user_id"])

# last_week users (contains NULL)
last_week = spark.createDataFrame([
    (2,), (None,)
], ["user_id"])
```
🧨 1. `NOT IN` with NULL: Problematic
```python
# "NOT IN" via filter + isin + negation; the collected list here is [2, None]
result_not_in = this_week.filter(
    ~col("user_id").isin([row.user_id for row in last_week.collect()])
)
result_not_in.show()
```
Output:

```
+-------+
|user_id|
+-------+
+-------+
```
⚠️ Because `last_week` contains `None` (NULL), the predicate becomes `NOT (user_id IN (2, NULL))`, which is `UNKNOWN` for every non-matching user, so no rows are returned at all.
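If you must stay with `.isin()`, one workaround is to drop `None` from the collected list first; a sketch reusing the DataFrames above:

```python
# Drop NULLs from the right-side keys so the negated predicate stays TRUE/FALSE
last_week_ids = [row.user_id for row in last_week.collect() if row.user_id is not None]
this_week.filter(~col("user_id").isin(last_week_ids)).show()
# Now returns 1 and 3 as expected.
```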
✅ 2. `LEFT ANTI JOIN`: NULL-safe and correct
```python
result_anti = this_week.join(last_week, on="user_id", how="left_anti")
result_anti.show()
```
Output:

```
+-------+
|user_id|
+-------+
|      1|
|      3|
+-------+
```
✅ Correctly returns users in `this_week` not present in `last_week`, ignoring the NULL safely.
✅ 3. `IN`: Safe, NULL is ignored
```python
result_in = this_week.filter(
    col("user_id").isin([row.user_id for row in last_week.collect()])
)
result_in.show()
```
Output:

```
+-------+
|user_id|
+-------+
|      2|
+-------+
```
✅ Correctly returns user 2. The NULL is ignored.
❌ 4. `!= None`: Fails silently
```python
# Comparing directly with NULL (not safe)
result_not_equal = this_week.filter(col("user_id") != None)
result_not_equal.show()
```
Output:

```
+-------+
|user_id|
+-------+
+-------+
```
No rows at all! `user_id != NULL` evaluates to `UNKNOWN` for every row, so the filter silently drops everything. Use the null-safe test instead:

```python
# Safe alternative: returns all three rows, since this_week has no NULLs
this_week.filter(col("user_id").isNotNull()).show()
```
🔁 Summary of PySpark Results
| Expression | Safe? | Works with NULL? | Notes |
|---|---|---|---|
| `NOT IN` via `~isin()` | ❌ No | ❌ Problematic | A NULL in the list breaks the logic |
| `IN` via `.isin()` | ✅ Yes | ✅ Ignores NULL | Returns matches |
| `LEFT ANTI JOIN` | ✅ Yes | ✅ NULL-safe | Recommended |
| `!= None` | ❌ No | ❌ Returns no rows | Use `isNotNull()` |
| `IS NOT NULL` | ✅ Yes | ✅ Correct way | Always preferred |
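One more null-safe option worth noting is `DataFrame.exceptAll` (SQL's `EXCEPT ALL`), which uses set semantics, so the right-side NULL matches nothing on the left and causes no trouble; a sketch with the same DataFrames:

```python
# Subtracts last_week's user_ids from this_week's; the right-side NULL
# has no counterpart on the left, so it is simply ignored here.
this_week.select("user_id").exceptAll(last_week.select("user_id")).show()
# Returns 1 and 3, like the LEFT ANTI JOIN.
```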