Let me explain why NOT IN can give incorrect results in SQL/Spark SQL when NULL is involved, and why LEFT ANTI JOIN is preferred in such cases, with an example.


🔥 Problem: NOT IN + NULL = Unexpected behavior

In SQL, when you write:

SELECT user_id FROM this_week
WHERE user_id NOT IN (SELECT user_id FROM last_week);

This query behaves unexpectedly if any value in last_week.user_id is NULL.

❌ What happens:

If even a single NULL exists in the last_week subquery, the entire NOT IN condition fails to return any results, because user_id <> NULL is UNKNOWN (3-valued logic: TRUE, FALSE, UNKNOWN).


🧪 Example:

Table: this_week

user_id
1
2
3

Table: last_week

user_id
2
NULL

Now, this query:

SELECT user_id FROM this_week
WHERE user_id NOT IN (SELECT user_id FROM last_week);

Expected output: 1, 3 (because only 2 is common)

Actual output: No rows returned!

Why? Because:

  • NOT IN (2, NULL) expands to user_id <> 2 AND user_id <> NULL
  • The part user_id <> NULL evaluates to UNKNOWN, so the whole condition is never TRUE, and a WHERE clause keeps only rows that evaluate to TRUE (the runnable sketch below shows this end to end)
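
To see it happen, here is a minimal runnable PySpark sketch (the table names match the example above; the SparkSession setup and temp views are my own scaffolding, not part of the original query):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Register the two example tables as temp views
spark.createDataFrame([(1,), (2,), (3,)], ["user_id"]).createOrReplaceTempView("this_week")
spark.createDataFrame([(2,), (None,)], ["user_id"]).createOrReplaceTempView("last_week")

# The NOT IN query from above: prints an empty table,
# because every row evaluates to UNKNOWN, never TRUE
spark.sql("""
    SELECT user_id FROM this_week
    WHERE user_id NOT IN (SELECT user_id FROM last_week)
""").show()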

✅ Fix: Use LEFT ANTI JOIN

SELECT tw.user_id
FROM this_week tw
LEFT ANTI JOIN last_week lw
ON tw.user_id = lw.user_id;

This gives the correct and null-safe result:

user_id
1
3

Spark SQL's LEFT ANTI JOIN keeps only the rows from the left side that have no match on the right. A NULL on the right side never equals anything, so it simply never matches, and you avoid the 3-valued logic trap entirely.
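
Continuing the sketch above (same SparkSession and temp views assumed), the anti-join version returns the expected rows:

# Null-safe rewrite: only unmatched left-side rows survive
spark.sql("""
    SELECT tw.user_id
    FROM this_week tw
    LEFT ANTI JOIN last_week lw
      ON tw.user_id = lw.user_id
""").show()  # returns users 1 and 3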


🧠 Summary

Method | Behavior with NULLs | Safe?
NOT IN | Returns nothing if any NULL in subquery | ❌ No
LEFT ANTI JOIN | Ignores NULLs, only filters exact matches | ✅ Yes



🧠 Is the NULL issue only with NOT IN, or does it happen with others?

✅ Short Answer:

The NULL issue affects both NOT IN and direct = / != comparisons with NULL, but the biggest practical problems arise with NOT IN.


1. 🔥 NOT IN + NULL ⇒ Fatal filtering problem

As covered earlier:

SELECT user_id FROM this_week
WHERE user_id NOT IN (SELECT user_id FROM last_week);
  • If last_week contains even one NULL, no user_id will match, due to 3-valued logic.
  • This is specific to NOT IN and a very common source of bugs.

2. 😐 IN + NULL ⇒ Works fine (ignores NULL)

SELECT user_id FROM this_week
WHERE user_id IN (SELECT user_id FROM last_week);
  • If last_week has NULL, it's ignored, and the IN works with non-null values.
  • So IN is usually safe with NULLs.

✅ user_id IN (1, NULL) expands to user_id = 1 OR user_id = NULL. For a matching row that is TRUE OR UNKNOWN, which evaluates to TRUE, so the NULL never matters; see the sketch below.
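
You can watch the 3-valued logic directly with literal expressions; a minimal sketch (any active SparkSession will do):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
    SELECT 1 IN (1, NULL)     AS in_match,     -- TRUE  (TRUE OR UNKNOWN = TRUE)
           3 IN (2, NULL)     AS in_no_match,  -- NULL  (FALSE OR UNKNOWN = UNKNOWN)
           3 NOT IN (2, NULL) AS not_in        -- NULL  (NOT UNKNOWN = UNKNOWN)
""").show()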


3. ❗ = or != comparisons with NULL ⇒ Always UNKNOWN

Examples:

SELECT * FROM users WHERE user_id = NULL;
-- Returns nothing!

SELECT * FROM users WHERE user_id != NULL;
-- Returns nothing!

Why? Because:

  • NULL = NULL is not TRUE; it's UNKNOWN
  • 5 != NULL is also UNKNOWN
  • In a WHERE clause, UNKNOWN is treated as FALSE, so the row is filtered out
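
These UNKNOWN results are easy to observe directly; a minimal sketch:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Both columns print as NULL: neither comparison is TRUE or FALSE,
# so a WHERE clause would reject the row in both cases
spark.sql("SELECT NULL = NULL AS eq_null, 5 != NULL AS neq_null").show()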

4. ✅ How to check for NULL safely?

Use:

IS NULL       -- to test for nulls
IS NOT NULL   -- to exclude nulls

Example:

SELECT * FROM users WHERE user_id IS NULL;
SELECT * FROM users WHERE user_id IS NOT NULL;
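
The DataFrame-API equivalents are isNull() and isNotNull(); a quick sketch with a hypothetical users DataFrame:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
users = spark.createDataFrame([(1,), (None,)], ["user_id"])  # hypothetical data

users.filter(col("user_id").isNull()).show()     # only the NULL row
users.filter(col("user_id").isNotNull()).show()  # only user_id = 1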

🧾 Summary Table

Expression | NULL-safe? | Behavior
NOT IN (...) | ❌ No | Returns no rows if any NULL present
IN (...) | ✅ Yes | Ignores NULL, works with other values
= NULL | ❌ No | Always UNKNOWN
!= NULL | ❌ No | Always UNKNOWN
IS NULL | ✅ Yes | Matches only NULLs
IS NOT NULL | ✅ Yes | Excludes NULLs safely
LEFT ANTI JOIN | ✅ Yes | Filters exact matches, safe from NULL issue

Let's demonstrate all of this with a real PySpark demo showing how NOT IN, IN, and NULL comparisons behave.


🧪 Scenario Setup

We'll simulate this_week and last_week as two DataFrames:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# this_week users
this_week = spark.createDataFrame([
    (1,), (2,), (3,)
], ["user_id"])

# last_week users (contains NULL)
last_week = spark.createDataFrame([
    (2,), (None,)
], ["user_id"])

🧨 1. NOT IN with NULL: Problematic

# Not in using filter + isin + negation
result_not_in = this_week.filter(~col("user_id").isin([row.user_id for row in last_week.collect()]))
result_not_in.show()

Output:

+-------+
|user_id|
+-------+
+-------+

⚠️ An empty result. The collected list is [2, None], so the filter becomes NOT (user_id IN (2, NULL)): UNKNOWN for users 1 and 3, FALSE for user 2, so no row survives. A workaround is shown below, but the anti join in the next step is the better fix.
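
If you do need the isin() pattern, one minimal workaround (using the DataFrames defined above) is to drop the NULLs before collecting:

# Remove NULLs from the lookup list before negating isin()
last_ids = [row.user_id
            for row in last_week.filter(col("user_id").isNotNull()).collect()]
this_week.filter(~col("user_id").isin(last_ids)).show()  # returns users 1 and 3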


✅ 2. LEFT ANTI JOIN: NULL-safe & correct

result_anti = this_week.join(last_week, on="user_id", how="left_anti")
result_anti.show()

Output:

+-------+
|user_id|
+-------+
|      1|
|      3|
+-------+

✅ Correctly returns users in this_week not present in last_week, ignoring the NULL safely.


✅ 3. IN: Safe, NULL is ignored

result_in = this_week.filter(col("user_id").isin([row.user_id for row in last_week.collect()]))
result_in.show()

Output:

+-------+
|user_id|
+-------+
|      2|
+-------+

✅ Correctly returns user 2. The NULL is ignored.


❌ 4. != NULL: Fails silently

# Compare directly with NULL (not safe)
result_not_equal = this_week.filter(col("user_id") != None)
result_not_equal.show()

Output:

+-------+
|user_id|
+-------+
+-------+

No rows are returned. In PySpark, col("user_id") != None compiles to a comparison with NULL, which is UNKNOWN for every row, so the filter silently drops everything. Use the null-safe test instead:

# Safe alternative
this_week.filter(col("user_id").isNotNull()).show()  # returns users 1, 2, 3

πŸ” Summary of PySpark Results

Expression | Safe? | Works with NULL? | Notes
NOT IN via ~isin() | ❌ No | ❌ Problematic | NULL breaks logic
IN via .isin() | ✅ Yes | ✅ Ignores NULL | Returns matches
LEFT ANTI JOIN | ✅ Yes | ✅ NULL-safe | Recommended
!= None | ❌ No | ❌ Unsafe | Use isNotNull()
IS NOT NULL | ✅ Yes | ✅ Correct way | Always preferred
