Let me explain why `NOT IN` can give incorrect results in SQL/Spark SQL when `NULL` is involved, and why `LEFT ANTI JOIN` is preferred in such cases, with an example.
### 🔥 Problem: `NOT IN` + `NULL` = Unexpected behavior
In SQL, when you write:
```sql
SELECT user_id FROM this_week
WHERE user_id NOT IN (SELECT user_id FROM last_week);
```
This behaves differently if any value in `last_week.user_id` is `NULL`.
**❌ What happens:** If even a single `NULL` exists in the `last_week` subquery, the entire `NOT IN` condition returns no results, because `user_id <> NULL` is unknown (3-valued logic: `TRUE`, `FALSE`, `UNKNOWN`).
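You can see the three-valued logic directly by selecting a comparison against `NULL` instead of filtering on it. A minimal sketch (runs as-is in Spark SQL; some other engines may want an explicit `CAST(NULL AS INT)`):

```sql
-- Comparing anything with NULL yields NULL (UNKNOWN), never TRUE or FALSE
SELECT 1 <> NULL AS not_equal,   -- NULL
       1 =  NULL AS equal;       -- NULL
```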
### 🧪 Example:
Table: `this_week`

| user_id |
|---------|
| 1 |
| 2 |
| 3 |
Table: `last_week`

| user_id |
|---------|
| 2 |
| NULL |
Now, this query:
```sql
SELECT user_id FROM this_week
WHERE user_id NOT IN (SELECT user_id FROM last_week);
```
**Expected output:** `1, 3` (because only `2` is common)

**Actual output:** no rows returned!
Why? Because:

- `NOT IN (2, NULL)` becomes `user_id <> 2 AND user_id <> NULL`
- The part `user_id <> NULL` is UNKNOWN, so the whole `AND` condition is never TRUE for any row (see the sketch below)
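To confirm the expansion is really the culprit, you can run the rewritten predicate by hand against the tables above; a quick sketch:

```sql
-- Hand-expanded version of the NOT IN query: the second conjunct is
-- UNKNOWN for every row, so the WHERE clause is never TRUE
SELECT user_id FROM this_week
WHERE user_id <> 2 AND user_id <> NULL;   -- returns no rows
```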
### ✅ Fix: Use `LEFT ANTI JOIN`
```sql
SELECT tw.user_id
FROM this_week tw
LEFT ANTI JOIN last_week lw
  ON tw.user_id = lw.user_id;
```
This gives the correct and null-safe result:
| user_id |
|---------|
| 1 |
| 3 |
Spark SQL's `LEFT ANTI JOIN` keeps only the left-side rows that have no match on the right. A `NULL` on the right side can never satisfy the equality condition, so it can never eliminate a row, and you don't run into the 3-valued logic trap.
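If you prefer to stay in subquery form rather than rewrite as a join, two other null-safe variants are worth knowing; a sketch in standard SQL (both also work in Spark SQL):

```sql
-- Variant 1: strip NULLs from the subquery before NOT IN sees them
SELECT user_id FROM this_week
WHERE user_id NOT IN (
    SELECT user_id FROM last_week WHERE user_id IS NOT NULL
);

-- Variant 2: NOT EXISTS is null-safe by construction, because a NULL
-- on the right can never satisfy the correlated equality
SELECT tw.user_id FROM this_week tw
WHERE NOT EXISTS (
    SELECT 1 FROM last_week lw WHERE lw.user_id = tw.user_id
);
```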
### 🧠 Summary
| Method | Behavior with NULLs | Safe? |
|---|---|---|
| `NOT IN` | Returns nothing if any NULL in subquery | ❌ No |
| `LEFT ANTI JOIN` | Ignores NULLs, only filters exact matches | ✅ Yes |
Excellent follow-up! Let’s break this down clearly.
### 🧠 Is the NULL issue only with `NOT IN`, or does it happen with others?
**✅ Short Answer:** The NULL issue happens with both `NOT IN` and `=` / `!=` when used with NULLs, but the biggest practical issue arises with `NOT IN`.
### 1. 🔥 `NOT IN` + NULL → Fatal filtering problem
As covered earlier:
```sql
SELECT user_id FROM this_week
WHERE user_id NOT IN (SELECT user_id FROM last_week);
```
- If `last_week` contains even one `NULL`, no user_id will match, due to 3-valued logic.
- This is specific to `NOT IN` and a very common source of bugs.
### 2. 👍 `IN` + NULL → Works fine (ignores NULL)
```sql
SELECT user_id FROM this_week
WHERE user_id IN (SELECT user_id FROM last_week);
```
- If `last_week` has a `NULL`, it's ignored, and the `IN` works with the non-null values.
- So `IN` is usually safe with NULLs.
✅ `x IN (1, NULL)` expands to `x = 1 OR x = NULL`; for a matching row the first disjunct is TRUE, and `TRUE OR UNKNOWN` is TRUE, so the match succeeds.
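One subtlety is worth seeing explicitly: for a non-matching value the result of `IN` is UNKNOWN rather than FALSE. That is harmless for `IN` (a `WHERE` clause drops UNKNOWN rows anyway) but is exactly what breaks `NOT IN`. A sketch:

```sql
SELECT 1 IN (1, NULL) AS matching,       -- TRUE:  1 = 1 short-circuits the OR
       2 IN (1, NULL) AS non_matching;   -- NULL:  FALSE OR UNKNOWN = UNKNOWN
-- NOT of UNKNOWN is still UNKNOWN, so negating the second result
-- can never become TRUE: that is the NOT IN trap in miniature
```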
### 3. ❌ `=` or `!=` comparisons with NULL → Always UNKNOWN
Examples:
```sql
SELECT * FROM users WHERE user_id = NULL;
-- Returns nothing!

SELECT * FROM users WHERE user_id != NULL;
-- Returns nothing!
```
Why? Because:

- `NULL = NULL` is not true, it's `UNKNOWN`
- `5 != NULL` is also `UNKNOWN`
- In a `WHERE` clause, `UNKNOWN` is treated as `FALSE` (demonstrated below)
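To make this concrete, select the comparisons themselves instead of filtering on them; a sketch against the `this_week` table from earlier:

```sql
SELECT user_id,
       user_id =  NULL AS eq_null,         -- NULL for every row
       user_id != NULL AS neq_null,        -- NULL for every row
       NOT (user_id = NULL) AS negated     -- still NULL: NOT UNKNOWN = UNKNOWN
FROM this_week;
```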
### 4. ✅ How to check for NULL safely?
Use:

```sql
IS NULL       -- to test for nulls
IS NOT NULL   -- to exclude nulls
```
Example:

```sql
SELECT * FROM users WHERE user_id IS NULL;
SELECT * FROM users WHERE user_id IS NOT NULL;
```
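For comparing two nullable values to each other (rather than testing a single column), Spark SQL also provides the null-safe equality operator `<=>`, which treats two NULLs as equal and always returns TRUE or FALSE, never UNKNOWN. The ANSI spelling is `IS NOT DISTINCT FROM`, supported by some other engines. A sketch:

```sql
SELECT NULL <=> NULL AS both_null,   -- true  (plain = would give NULL)
       1    <=> NULL AS one_null,    -- false (plain = would give NULL)
       1    <=> 1    AS same_value;  -- true
```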
### 🧾 Summary Table
| Expression | NULL-safe? | Behavior |
|---|---|---|
| `NOT IN (...)` | ❌ No | Returns no rows if any NULL present |
| `IN (...)` | ✅ Yes | Ignores NULL, works with other values |
| `= NULL` | ❌ No | Always UNKNOWN |
| `!= NULL` | ❌ No | Always UNKNOWN |
| `IS NULL` | ✅ Yes | Matches only NULLs |
| `IS NOT NULL` | ✅ Yes | Excludes NULLs safely |
| `LEFT ANTI JOIN` | ✅ Yes | Filters exact matches, safe from NULL issue |
Perfect! Let's demonstrate this with a real PySpark demo showing how `NOT IN`, `IN`, and `NULL` comparisons behave.
### 🧪 Scenario Setup
We'll simulate `this_week` and `last_week` as two DataFrames:
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# this_week users
this_week = spark.createDataFrame([
    (1,), (2,), (3,)
], ["user_id"])

# last_week users (contains NULL)
last_week = spark.createDataFrame([
    (2,), (None,)
], ["user_id"])
```
### 🧨 1. `NOT IN` with NULL → Problematic
```python
# Simulate NOT IN: collect last_week's ids (includes None) and negate isin()
ids_last_week = [row.user_id for row in last_week.collect()]   # [2, None]
result_not_in = this_week.filter(~col("user_id").isin(ids_last_week))
result_not_in.show()
```
Output:

```
+-------+
|user_id|
+-------+
+-------+
```
⚠️ If `last_week` contains `None` (NULL), the negated `isin()` filters nothing through: for non-matching ids the predicate is UNKNOWN, `~UNKNOWN` is still UNKNOWN, and the filter drops those rows, so the result comes back empty. Dropping the NULLs first, e.g. `[r.user_id for r in last_week.collect() if r.user_id is not None]`, restores the expected `1, 3`.
### ✅ 2. `LEFT ANTI JOIN` → NULL-safe & correct
```python
result_anti = this_week.join(last_week, on="user_id", how="left_anti")
result_anti.show()
```
Output:

```
+-------+
|user_id|
+-------+
|      1|
|      3|
+-------+
```
✅ Correctly returns users in `this_week` not present in `last_week`, handling the NULL safely.
### ✅ 3. `IN` → Safe, NULL is ignored
```python
# IN via isin(): the None in the list never matches, the 2 does
result_in = this_week.filter(col("user_id").isin([row.user_id for row in last_week.collect()]))
result_in.show()
```
Output:

```
+-------+
|user_id|
+-------+
|      2|
+-------+
```
✅ Correctly returns user `2`; the NULL is ignored.
### ❌ 4. `!= None` → Fails silently
```python
# Compare directly with NULL (not safe)
result_not_equal = this_week.filter(col("user_id") != None)
result_not_equal.show()
```
Output:

```
+-------+
|user_id|
+-------+
+-------+
```
No rows come back even though every `user_id` is non-null: `user_id != NULL` evaluates to UNKNOWN for every row, and the filter silently drops them all. Use the explicit null test instead:

```python
# Safe alternative
this_week.filter(col("user_id").isNotNull())
```
### 📊 Summary of PySpark Results
| Expression | Safe? | Works with NULL? | Notes |
|---|---|---|---|
| `NOT IN` via `~isin()` | ❌ No | ❌ Problematic | NULL breaks logic |
| `IN` via `.isin()` | ✅ Yes | ✅ Ignores NULL | Returns matches |
| `LEFT ANTI JOIN` | ✅ Yes | ✅ NULL-safe | Recommended |
| `!= None` | ❌ No | ❌ Unsafe | Use `isNotNull()` |
| `IS NOT NULL` | ✅ Yes | ✅ Correct way | Always preferred |