Let me explain why `NOT IN` can give incorrect results in SQL/Spark SQL when `NULL` is involved, and why `LEFT ANTI JOIN` is preferred in such cases, with an example.
### 🔥 Problem: `NOT IN` + `NULL` = Unexpected behavior
In SQL, when you write:
```sql
SELECT user_id FROM this_week
WHERE user_id NOT IN (SELECT user_id FROM last_week);
```
This behaves differently if any value in `last_week.user_id` is `NULL`.
**❌ What happens:** If even a single `NULL` exists in the `last_week` subquery, the entire `NOT IN` condition returns no results, because `user_id <> NULL` is unknown (3-valued logic: `TRUE`, `FALSE`, `UNKNOWN`).
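You can see the three-valued logic directly by selecting a comparison against `NULL` instead of filtering on it. A minimal sketch (runs as-is in Spark SQL; some other engines may want an explicit `CAST(NULL AS INT)`):

```sql
-- Comparing anything with NULL yields NULL (UNKNOWN), never TRUE or FALSE
SELECT 1 <> NULL AS not_equal,   -- NULL
       1 =  NULL AS equal;       -- NULL
```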
### 🧪 Example:
Table: `this_week`

| user_id |
|---------|
| 1 |
| 2 |
| 3 |
Table: `last_week`

| user_id |
|---------|
| 2 |
| NULL |
Now, this query:
```sql
SELECT user_id FROM this_week
WHERE user_id NOT IN (SELECT user_id FROM last_week);
```
**Expected output:** `1, 3` (because only `2` is common)

**Actual output:** no rows returned!
Why? Because:

- `NOT IN (2, NULL)` becomes `user_id <> 2 AND user_id <> NULL`
- The part `user_id <> NULL` is UNKNOWN, so the whole `AND` condition is never TRUE for any row (see the sketch below)
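To confirm the expansion is really the culprit, you can run the rewritten predicate by hand against the tables above; a quick sketch:

```sql
-- Hand-expanded version of the NOT IN query: the second conjunct is
-- UNKNOWN for every row, so the WHERE clause is never TRUE
SELECT user_id FROM this_week
WHERE user_id <> 2 AND user_id <> NULL;   -- returns no rows
```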
### ✅ Fix: Use `LEFT ANTI JOIN`
```sql
SELECT tw.user_id
FROM this_week tw
LEFT ANTI JOIN last_week lw
  ON tw.user_id = lw.user_id;
```
This gives the correct and null-safe result:
| user_id |
|---------|
| 1 |
| 3 |
Spark SQL's `LEFT ANTI JOIN` keeps only the left-side rows that have no match on the right. A `NULL` on the right side can never satisfy the equality condition, so it can never eliminate a row, and you don't run into the 3-valued logic trap.
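If you prefer to stay in subquery form rather than rewrite as a join, two other null-safe variants are worth knowing; a sketch in standard SQL (both also work in Spark SQL):

```sql
-- Variant 1: strip NULLs from the subquery before NOT IN sees them
SELECT user_id FROM this_week
WHERE user_id NOT IN (
    SELECT user_id FROM last_week WHERE user_id IS NOT NULL
);

-- Variant 2: NOT EXISTS is null-safe by construction, because a NULL
-- on the right can never satisfy the correlated equality
SELECT tw.user_id FROM this_week tw
WHERE NOT EXISTS (
    SELECT 1 FROM last_week lw WHERE lw.user_id = tw.user_id
);
```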
### 🧠 Summary
| Method | Behavior with NULLs | Safe? |
|---|---|---|
| `NOT IN` | Returns nothing if any NULL in subquery | ❌ No |
| `LEFT ANTI JOIN` | Ignores NULLs, only filters exact matches | ✅ Yes |
Excellent follow-up! Let’s break this down clearly.
### 🧠 Is the NULL issue only with `NOT IN`, or does it happen with others?
**✅ Short Answer:** The NULL issue happens with both `NOT IN` and `=` / `!=` when used with NULLs, but the biggest practical issue arises with `NOT IN`.
### 1. 🔥 `NOT IN` + NULL → Fatal filtering problem
As covered earlier:
```sql
SELECT user_id FROM this_week
WHERE user_id NOT IN (SELECT user_id FROM last_week);
```
- If `last_week` contains even one `NULL`, no user_id will match, due to 3-valued logic.
- This is specific to `NOT IN` and a very common source of bugs.
### 2. 👍 `IN` + NULL → Works fine (ignores NULL)
```sql
SELECT user_id FROM this_week
WHERE user_id IN (SELECT user_id FROM last_week);
```
- If `last_week` has a `NULL`, it's ignored, and the `IN` works with the non-null values.
- So `IN` is usually safe with NULLs.
✅ `x IN (1, NULL)` expands to `x = 1 OR x = NULL`; for a matching row the first disjunct is TRUE, and `TRUE OR UNKNOWN` is TRUE, so the match succeeds.
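One subtlety is worth seeing explicitly: for a non-matching value the result of `IN` is UNKNOWN rather than FALSE. That is harmless for `IN` (a `WHERE` clause drops UNKNOWN rows anyway) but is exactly what breaks `NOT IN`. A sketch:

```sql
SELECT 1 IN (1, NULL) AS matching,       -- TRUE:  1 = 1 short-circuits the OR
       2 IN (1, NULL) AS non_matching;   -- NULL:  FALSE OR UNKNOWN = UNKNOWN
-- NOT of UNKNOWN is still UNKNOWN, so negating the second result
-- can never become TRUE: that is the NOT IN trap in miniature
```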
### 3. ❌ `=` or `!=` comparisons with NULL → Always UNKNOWN
Examples:
```sql
SELECT * FROM users WHERE user_id = NULL;
-- Returns nothing!

SELECT * FROM users WHERE user_id != NULL;
-- Returns nothing!
```
Why? Because:

- `NULL = NULL` is not true, it's `UNKNOWN`
- `5 != NULL` is also `UNKNOWN`
- In a `WHERE` clause, `UNKNOWN` is treated as `FALSE` (demonstrated below)
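To make this concrete, select the comparisons themselves instead of filtering on them; a sketch against the `this_week` table from earlier:

```sql
SELECT user_id,
       user_id =  NULL AS eq_null,         -- NULL for every row
       user_id != NULL AS neq_null,        -- NULL for every row
       NOT (user_id = NULL) AS negated     -- still NULL: NOT UNKNOWN = UNKNOWN
FROM this_week;
```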
### 4. ✅ How to check for NULL safely?
Use:

```sql
IS NULL       -- to test for nulls
IS NOT NULL   -- to exclude nulls
```
Example:

```sql
SELECT * FROM users WHERE user_id IS NULL;
SELECT * FROM users WHERE user_id IS NOT NULL;
```
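For comparing two nullable values to each other (rather than testing a single column), Spark SQL also provides the null-safe equality operator `<=>`, which treats two NULLs as equal and always returns TRUE or FALSE, never UNKNOWN. The ANSI spelling is `IS NOT DISTINCT FROM`, supported by some other engines. A sketch:

```sql
SELECT NULL <=> NULL AS both_null,   -- true  (plain = would give NULL)
       1    <=> NULL AS one_null,    -- false (plain = would give NULL)
       1    <=> 1    AS same_value;  -- true
```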
### 🧾 Summary Table
| Expression | NULL-safe? | Behavior |
|---|---|---|
| `NOT IN (...)` | ❌ No | Returns no rows if any NULL present |
| `IN (...)` | ✅ Yes | Ignores NULL, works with other values |
| `= NULL` | ❌ No | Always UNKNOWN |
| `!= NULL` | ❌ No | Always UNKNOWN |
| `IS NULL` | ✅ Yes | Matches only NULLs |
| `IS NOT NULL` | ✅ Yes | Excludes NULLs safely |
| `LEFT ANTI JOIN` | ✅ Yes | Filters exact matches, safe from NULL issue |
Perfect! Let's demonstrate this with a real PySpark demo showing how `NOT IN`, `IN`, and `NULL` comparisons behave.
### 🧪 Scenario Setup
We'll simulate `this_week` and `last_week` as two DataFrames:
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# this_week users
this_week = spark.createDataFrame([
    (1,), (2,), (3,)
], ["user_id"])

# last_week users (contains NULL)
last_week = spark.createDataFrame([
    (2,), (None,)
], ["user_id"])
```
### 🧨 1. `NOT IN` with NULL → Problematic
```python
# Simulate NOT IN: collect last_week's ids (includes None) and negate isin()
ids_last_week = [row.user_id for row in last_week.collect()]   # [2, None]
result_not_in = this_week.filter(~col("user_id").isin(ids_last_week))
result_not_in.show()
```
Output:

```
+-------+
|user_id|
+-------+
+-------+
```
⚠️ If `last_week` contains `None` (NULL), the negated `isin()` filters nothing through: for non-matching ids the predicate is UNKNOWN, `~UNKNOWN` is still UNKNOWN, and the filter drops those rows, so the result comes back empty. Dropping the NULLs first, e.g. `[r.user_id for r in last_week.collect() if r.user_id is not None]`, restores the expected `1, 3`.
### ✅ 2. `LEFT ANTI JOIN` → NULL-safe & correct
```python
result_anti = this_week.join(last_week, on="user_id", how="left_anti")
result_anti.show()
```
Output:

```
+-------+
|user_id|
+-------+
|      1|
|      3|
+-------+
```
✅ Correctly returns users in `this_week` not present in `last_week`, handling the NULL safely.
### ✅ 3. `IN` → Safe, NULL is ignored
```python
# IN via isin(): the None in the list never matches, the 2 does
result_in = this_week.filter(col("user_id").isin([row.user_id for row in last_week.collect()]))
result_in.show()
```
Output:

```
+-------+
|user_id|
+-------+
|      2|
+-------+
```
✅ Correctly returns user `2`; the NULL is ignored.
### ❌ 4. `!= None` → Fails silently
```python
# Compare directly with NULL (not safe)
result_not_equal = this_week.filter(col("user_id") != None)
result_not_equal.show()
```
Output:

```
+-------+
|user_id|
+-------+
+-------+
```
No rows come back even though every `user_id` is non-null: `user_id != NULL` evaluates to UNKNOWN for every row, and the filter silently drops them all. Use the explicit null test instead:

```python
# Safe alternative
this_week.filter(col("user_id").isNotNull())
```
### 📊 Summary of PySpark Results
| Expression | Safe? | Works with NULL? | Notes |
|---|---|---|---|
| `NOT IN` via `~isin()` | ❌ No | ❌ Problematic | NULL breaks logic |
| `IN` via `.isin()` | ✅ Yes | ✅ Ignores NULL | Returns matches |
| `LEFT ANTI JOIN` | ✅ Yes | ✅ NULL-safe | Recommended |
| `!= None` | ❌ No | ❌ Unsafe | Use `isNotNull()` |
| `IS NOT NULL` | ✅ Yes | ✅ Correct way | Always preferred |