Let’s analyze the join behavior in Spark SQL when dealing with NULLs in join keys.
✅ Input Tables:
Table1
| col1 |
|------|
| 1    |
| 0    |
| null |
| null |
Table2
| col1 |
|------|
| 1    |
| 0    |
| null |
⚠️ Important Concept in SQL and Spark:
- NULL is not equal to NULL in SQL (and Spark SQL).
- Therefore, NULL = NULL evaluates to NULL (unknown), which a join condition treats as false, so joins on NULL values won’t match unless special handling is used (the <=> null-safe equality operator in Spark SQL).
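A quick way to see this (a minimal sketch; it should run as-is in a Spark SQL session):
-- Plain equality vs. null-safe equality
SELECT NULL = NULL   AS plain_equals,  -- NULL (unknown), treated as false in a join
       NULL <=> NULL AS null_safe;     -- true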
🔁 INNER JOIN (ON t1.col1 = t2.col1)
This only matches values that are equal and not NULL.
Matching keys:
- 1 matches 1
- 0 matches 0
- null doesn’t match null
✅ Output: 2 matching rows (for 1 and 0)
👉 Count: 2
🔁 LEFT JOIN (ON t1.col1 = t2.col1)
This returns all rows from Table1, and matched rows from Table2.
Matching:
- 1 → matches 1 in Table2 ✅
- 0 → matches 0 in Table2 ✅
- null → does not match null in Table2 ❌
- null → does not match null in Table2 ❌
✅ Output:
| t1.col1 | t2.col1 |
|---------|---------|
| 1       | 1       |
| 0       | 0       |
| null    | NULL (no match) |
| null    | NULL (no match) |
👉 Count: 4 total rows
✅ Summary of Counts:
| Join Type  | Row Count |
|------------|-----------|
| INNER JOIN | 2         |
| LEFT JOIN  | 4         |
If you use a null-safe join in Spark SQL (t1.col1 <=> t2.col1), then even NULLs will match; a sketch of that case follows.
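A minimal sketch of the null-safe case, assuming the two sample tables above are registered as Table1 and Table2:
-- Null-safe inner join: the NULL keys now match
SELECT t1.col1, t2.col1
FROM Table1 t1
JOIN Table2 t2 ON t1.col1 <=> t2.col1;
-- Expected: 4 rows (1, 0, and one row per NULL in Table1)
The LEFT JOIN variant should also return 4 rows, since every Table1 row now finds exactly one match.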
Here are 20+ tricky and real-world SQL questions like the NULL join behavior we discussed above — focusing on JOINs, NULLs, GROUP BY, WHERE vs HAVING, aggregates, and edge cases. These are interview favorites and Spark SQL-safe.
🔥 TRICKY SQL QUESTIONS FOR INTERVIEWS (WITH EXPLANATIONS)
1. How many rows does this INNER JOIN return?
SELECT *
FROM t1
JOIN t2 ON t1.col = t2.col;
🧠 Trick: If both tables contain NULL, they won’t match.
👉 Expected: Matches only on non-null values.
2. What happens if you do COUNT(col) vs COUNT(*)?
SELECT COUNT(col), COUNT(*) FROM my_table;
🧠 COUNT(col) excludes NULLs; COUNT(*) counts all rows.
👉 Easy to confuse in metrics/aggregates.
3. Difference between WHERE and HAVING with aggregates
SELECT col, COUNT(*) FROM t
GROUP BY col
HAVING COUNT(*) > 1;
🧠 HAVING filters after aggregation.
👉 Using WHERE COUNT(*) > 1 gives an error.
4. What does this return?
SELECT *
FROM t1
LEFT JOIN t2 ON t1.col = t2.col
WHERE t2.col IS NULL;
🧠 This gives anti-join behavior — rows from t1 with no match in t2.
5. What’s the output of this?
SELECT col1 = NULL FROM my_table;
🧠 Always returns NULL (not true/false).
👉 Use IS NULL instead.
6. NULLs in an IN clause
SELECT * FROM t WHERE col IN (1, 2, NULL);
🧠 If col is NULL, it won’t match.
👉 IN with NULL behaves like col = NULL — dangerous.
7. Can you GROUP BY NULL? What happens?
SELECT col, COUNT(*) FROM t GROUP BY col;
🧠 All NULLs are grouped together into one NULL group.
8. What does this return?
SELECT DISTINCT NULL FROM my_table;
🧠 Only one row with NULL — duplicates are eliminated even for NULL.
9. Using NOT IN with NULLs
SELECT * FROM t1
WHERE col NOT IN (SELECT col FROM t2);
🧠 If t2.col contains even one NULL, the whole query returns no rows.
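A small illustration of why, as a minimal sketch runnable in Spark SQL:
SELECT 1 NOT IN (2, NULL) AS result;
-- result is NULL, not TRUE: 1 <> 2 is true, but 1 <> NULL is unknown,
-- so the whole NOT IN predicate is unknown and the row gets filtered out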
10. LEFT JOIN with filter in WHERE vs ON
-- Version A
SELECT * FROM t1 LEFT JOIN t2 ON t1.id = t2.id WHERE t2.status = 'ACTIVE';
-- Version B
SELECT * FROM t1 LEFT JOIN t2 ON t1.id = t2.id AND t2.status = 'ACTIVE';
🧠 A = behaves like INNER JOIN
🧠 B = proper LEFT JOIN with condition inside JOIN
11. GROUP BY without SELECT columns
SELECT COUNT(*) FROM t GROUP BY col;
🧠 Valid — but you won’t see what col was grouped on. Returns just counts.
12. Using IS NOT DISTINCT FROM
SELECT * FROM t1 JOIN t2 ON t1.col IS NOT DISTINCT FROM t2.col;
🧠 Behaves like Spark SQL’s <=> — a null-safe join.
13. What is the result of NULL + 5 or NULL || 'abc'?
🧠 Always returns NULL. Any arithmetic or string concatenation with NULL yields NULL (Oracle’s || is the notable exception: it treats NULL as an empty string).
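For example, a minimal sketch in Spark SQL:
SELECT NULL + 5      AS add_result,    -- NULL
       NULL || 'abc' AS concat_result; -- NULL in Spark SQL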
14. Using COALESCE vs IFNULL vs NVL
SELECT COALESCE(col, 'default') FROM t;
🧠 All work similarly; some are dialect-specific (e.g., NVL in Oracle).
15. Case-insensitive vs case-sensitive joins
SELECT * FROM t1 JOIN t2 ON t1.name = t2.name;
🧠 Depends on collation — e.g., in Postgres it’s case-sensitive, in MySQL it may not be.
16. Can we ORDER BY a column not in SELECT?
SELECT name FROM t ORDER BY age;
🧠 ✅ Allowed in most DBs but not allowed in some strict modes.
17. Can we GROUP BY alias?
SELECT salary * 0.1 AS tax FROM t GROUP BY tax;
🧠 ❌ Many engines (e.g., SQL Server, Oracle) don’t allow an alias in GROUP BY; repeat the expression instead. Others (e.g., Postgres, MySQL, Spark SQL) accept it.
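The portable fix is to repeat the expression, as in this minimal sketch:
SELECT salary * 0.1 AS tax, COUNT(*) AS cnt
FROM t
GROUP BY salary * 0.1;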
18. Difference between COUNT(*) OVER() and COUNT(col) OVER()
SELECT name, COUNT(*) OVER() FROM t;
🧠 COUNT(*) OVER() counts all rows, even NULLs.
🧠 COUNT(col) OVER() ignores NULLs.
19. How to remove duplicates but keep one?
SELECT DISTINCT ON (col) * FROM t;
🧠 DISTINCT ON is supported only in Postgres. Other DBs use ROW_NUMBER(), as sketched below.
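A hedged sketch of the portable ROW_NUMBER() pattern (the id tiebreaker column is an assumption, not part of the original example):
SELECT *
FROM (
  SELECT t.*,
         ROW_NUMBER() OVER (PARTITION BY col ORDER BY id) AS rn  -- id is an assumed tiebreaker column
  FROM t
) x
WHERE rn = 1;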
20. Default NULL sorting behavior in ORDER BY
SELECT * FROM t ORDER BY col;
🧠 NULLS FIRST or NULLS LAST behavior varies:
- Postgres and Oracle: NULLs come last in ASC (NULL sorts as largest)
- MySQL and SQL Server: NULLs come first in ASC (NULL sorts as smallest)
- Spark SQL: NULLs come first by default in ASC
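To avoid relying on engine defaults, you can spell the ordering out where the syntax is supported (e.g., Postgres, Oracle, Spark SQL):
SELECT * FROM t ORDER BY col ASC NULLS LAST;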
21. What happens with CROSS JOIN with NULLs?
SELECT * FROM t1 CROSS JOIN t2;
🧠 Produces Cartesian product, regardless of NULLs.
22. Filter rows where col has at least one NULL duplicate
SELECT * FROM t
WHERE col IN (
SELECT col FROM t GROUP BY col HAVING COUNT(*) > 1
);
🧠 Doesn’t include NULLs unless explicitly filtered using IS NULL.
🧠 Want more?
Here’s a powerful collection of SQL puzzles and interview-level scenarios with a focus on traps, Spark behavior, and hands-on examples.
🔥 SECTION 1: Window Function Puzzles (Ranking, Aggregates, Traps)
🧩 1. Top 2 Salaries Per Department (DENSE_RANK trap)
SELECT *
FROM (
SELECT *, DENSE_RANK() OVER (PARTITION BY dept_id ORDER BY salary DESC) as rnk
FROM employees
) tmp
WHERE rnk <= 2;
🧠 Trap: RANK() and DENSE_RANK() behave differently for ties.
🧩 2. Running Total of Sales Per Region
SELECT region, sales_date, amount,
SUM(amount) OVER (PARTITION BY region ORDER BY sales_date) AS running_total
FROM sales;
🧠 Tracks how sales build up over time.
🧩 3. First Purchase Date Per Customer
SELECT *,
FIRST_VALUE(purchase_date) OVER (PARTITION BY customer_id ORDER BY purchase_date) AS first_purchase
FROM purchases;
🧠 FIRST_VALUE with ordering — always include ORDER BY for the expected result.
🧩 4. Detect Consecutive Repeats (LEAD/LAG)
SELECT *,
CASE WHEN status = LAG(status) OVER (PARTITION BY user_id ORDER BY ts) THEN 1 ELSE 0 END AS repeat
FROM user_activity;
🧠 Use LAG() to compare the current row with the previous one.
🔁 SECTION 2: Top-N Per Group — Real Interview Puzzle
🧩 5. Get Latest 2 Orders Per Customer
SELECT * FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date DESC) as rn
FROM orders
) t
WHERE rn <= 2;
🧠 Always remember (see the sketch below):
- ROW_NUMBER() → unique per row
- RANK() → skips ranks after ties
- DENSE_RANK() → no skipping
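A minimal sketch showing the three functions on a tie (inline sample values, not taken from the tables above):
SELECT salary,
       ROW_NUMBER() OVER (ORDER BY salary DESC) AS row_num,
       RANK()       OVER (ORDER BY salary DESC) AS rnk,
       DENSE_RANK() OVER (ORDER BY salary DESC) AS dense_rnk
FROM VALUES (300), (200), (200), (100) AS t(salary)
ORDER BY salary DESC;
-- row_num: 1,2,3,4   rnk: 1,2,2,4   dense_rnk: 1,2,2,3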
🧩 6. Find Product With 2nd Highest Revenue Globally
SELECT product_id, total_revenue
FROM (
SELECT product_id, SUM(revenue) as total_revenue,
RANK() OVER (ORDER BY SUM(revenue) DESC) as rnk
FROM sales
GROUP BY product_id
) x
WHERE rnk = 2;
🧠 You can use RANK() when there could be ties.
🔄 SECTION 3: Complex Joins and Self-Joins
🧩 7. Self Join to Find Direct Manager
SELECT e.emp_id, e.name, m.name AS manager_name
FROM employees e
LEFT JOIN employees m ON e.manager_id = m.emp_id;
🧠 Self-join trick: alias both instances clearly.
🧩 8. Get Employees Who Have Same Salary As Someone Else
SELECT DISTINCT a.emp_id, a.salary
FROM employees a
JOIN employees b ON a.salary = b.salary AND a.emp_id != b.emp_id;
🧠 Use self-join to detect non-unique attribute (salary here).
🧩 9. Anti-Join (NOT EXISTS vs NOT IN vs LEFT JOIN)
-- Option 1: LEFT JOIN
SELECT a.*
FROM table1 a
LEFT JOIN table2 b ON a.id = b.id
WHERE b.id IS NULL;
-- Option 2: NOT EXISTS
SELECT * FROM table1 a
WHERE NOT EXISTS (SELECT 1 FROM table2 b WHERE a.id = b.id);
🧠 NOT IN fails if b.id has even one NULL. Use NOT EXISTS instead.
⚖️ SECTION 4: Set Operations (INTERSECT, EXCEPT)
🧩 10. Customers Who Bought A but Not B
SELECT customer_id FROM orders WHERE product = 'A'
EXCEPT
SELECT customer_id FROM orders WHERE product = 'B';
🧠 EXCEPT removes rows that appear in the second query, and it also removes duplicates.
🧩 11. Customers Who Bought Both A and B
SELECT customer_id FROM orders WHERE product = 'A'
INTERSECT
SELECT customer_id FROM orders WHERE product = 'B';
🧠 INTERSECT = set intersection; it returns distinct rows.
🧩 12. Spark SQL Equivalent Using NOT EXISTS or LEFT ANTI JOIN
-- Customers who bought A but not B
SELECT DISTINCT a.customer_id
FROM orders a
LEFT ANTI JOIN orders b
ON a.customer_id = b.customer_id AND b.product = 'B'
WHERE a.product = 'A';
🧠 Spark SQL does support EXCEPT, but the DataFrame API has no EXCEPT operator — use exceptAll()/subtract() or a left anti join.
🧪 SECTION 5: Spark SQL NULL Behavior Puzzles
🧩 13. Will NULL = NULL return true in Spark SQL?
SELECT * FROM t1 JOIN t2 ON t1.col = t2.col;
🧠 NO — Spark SQL behaves like ANSI SQL. NULL ≠ NULL.
🧩 14. Use a null-safe join (<=>)
SELECT * FROM t1 JOIN t2 ON t1.col <=> t2.col;
🧠 In Spark SQL, <=> compares NULLs safely; the ANSI equivalent is IS NOT DISTINCT FROM.
🧩 15. Count rows where col IN (1, NULL)
SELECT * FROM t WHERE col IN (1, NULL);
🧠 Trap: comparisons against the NULL in the list yield UNKNOWN; rows where col = 1 still match, but nothing ever matches the NULL, and the same list used with NOT IN filters out every row.
🧩 16. LEFT JOIN with NULL filter in WHERE clause
SELECT * FROM A LEFT JOIN B ON A.id = B.id
WHERE B.id IS NULL;
🧠 This gives anti-join — all rows in A not matched in B.
🔥 Here’s a comprehensive collection of Spark-flavored traps, hands-on DataFrame operations, window wizardry, and performance tuning secrets — all with real code and explanations, just like you’d face in real projects or interviews.
⚡️ SECTION 1: Spark-Specific Traps & Conceptual Gotchas
🧨 1. NULL Join Trap (again but worse)
df1 = spark.createDataFrame([(1,), (None,)], ["id"])
df2 = spark.createDataFrame([(None,), (1,), (2,)], ["id"])
df1.join(df2, "id", "inner").show()
🧠 ✅ Joins on NULL fail unless you use:
df1.join(df2, df1["id"] <=> df2["id"], "inner").show()
🧨 2. Filter after Aggregation Trap
df.groupBy("id").agg(F.count("*").alias("cnt")).filter("cnt > 1")
🧠 Works ✅
But this fails if you’re thinking in SQL mode:
SELECT * FROM df WHERE COUNT(*) > 1 -- ❌ Invalid in SQL
🧨 3. Filtering NULL using !=
df.filter(df["col"] != "X").show()
🧠 Rows where col is NULL are also dropped, because NULL != 'X' evaluates to UNKNOWN and is filtered out. Use .isNull() explicitly if you want to keep them.
🧨 4. explode() on a NULL column
df = spark.createDataFrame([(1, None), (2, ["a", "b"])], ["id", "tags"])
df.select("id", F.explode("tags")).show()
🧠 Row with NULL list disappears. Add a safe guard:
F.explode(F.coalesce("tags", F.array().cast("array<string>")))
🧨 5. Caching Trap
df = df.filter(...).cache()
df.show() # triggers caching
df.unpersist()
🧠 Don’t call .unpersist() too early, or you’ll lose the performance gains.
🧪 SECTION 2: Hands-On DataFrame APIs (Tricks with withColumn, rank, etc.)
🎯 6. withColumn: Derive new column with logic
df.withColumn("discounted_price", F.col("price") * 0.9)
🎯 7. Create flag if column value changed from previous row
window = Window.partitionBy("user_id").orderBy("timestamp")
df.withColumn("prev_action", F.lag("action").over(window)) \
.withColumn("changed", F.when(F.col("action") != F.col("prev_action"), 1).otherwise(0))
🎯 8. Multiple Explodes (flattening nested arrays)
df.select("id", F.explode("items").alias("item")) \
.select("id", F.explode("item.tags").alias("tag"))
🧠 Chain explode() calls to flatten deeply nested JSON.
🎯 9. rank() to get top-N per group
window = Window.partitionBy("dept").orderBy(F.desc("salary"))
df.withColumn("rnk", F.rank().over(window)).filter("rnk <= 2")
🎯 10. pivot() to transform row → column
df.groupBy("dept").pivot("month").agg(F.sum("salary"))
🧠 SECTION 3: Complex Window Function Patterns
🌀 11. Gaps & Islands – Consecutive Login Sessions
from pyspark.sql.functions import col, lag, sum as sum_, when, datediff
from pyspark.sql.window import Window
w = Window.partitionBy("user_id").orderBy("login_date")
df = df.withColumn("prev_date", lag("login_date").over(w)) \
.withColumn("gap", datediff("login_date", "prev_date")) \
.withColumn("is_new_session", when(col("gap") > 1, 1).otherwise(0)) \
.withColumn("session_id", sum_("is_new_session").over(w.rowsBetween(Window.unboundedPreceding, 0)))
🌀 12. Rolling Average (moving window)
w = Window.partitionBy("user_id").orderBy("ts").rowsBetween(-2, 0)
df.withColumn("rolling_avg", F.avg("score").over(w))
🌀 13. Detect Streak of Same Values
# Assign a group ID whenever the value changes (incrementally)
df = df.withColumn("change", F.when(F.lag("val").over(w) != F.col("val"), 1).otherwise(0))
df = df.withColumn("streak_id", F.sum("change").over(w))
⚙️ SECTION 4: Partitioning, Repartitioning & Optimization
🚀 14. Repartition Before Join (avoid skew)
dfA = dfA.repartition("join_key")
dfB = dfB.repartition("join_key")
dfA.join(dfB, "join_key")
🧠 Ensures balanced shuffle and parallelism.
🚀 15. Broadcast Join (manual + hint)
# Manual
df_large.join(F.broadcast(df_small), "id")
# SQL hint
spark.sql("SELECT /*+ BROADCAST(t2) */ * FROM t1 JOIN t2 ON t1.id = t2.id")
🚀 16. Skew Handling (Salting)
dfA = dfA.withColumn("salt", F.expr("CAST(rand()*10 AS INT)"))
dfB = dfB.withColumn("salt", F.expr("CAST(rand()*10 AS INT)"))
dfA.join(dfB, ["key", "salt"])
🧠 Distributes skewed key across multiple partitions.
🚀 17. Write output partitioned on disk
df.write.partitionBy("country", "year").mode("overwrite").parquet("/path")
🚀 18. Use explain() to trace the execution plan
df.join(df2, "id").select("name").explain(True)
🧠 Helps track join strategy, partitioning, broadcast usage, etc.
🎓 SECTION 5: Realistic Spark Interview Brain-Teasers
🧩 19. Second Highest Salary Per Department (with tie)
w = Window.partitionBy("dept").orderBy(F.desc("salary"))
df.withColumn("rnk", F.rank().over(w)).filter("rnk = 2")
🧩 20. User Sessions with Duration & Gaps
w = Window.partitionBy("user_id").orderBy("timestamp")
df = df.withColumn("prev_ts", F.lag("timestamp").over(w)) \
.withColumn("gap", F.unix_timestamp("timestamp") - F.unix_timestamp("prev_ts")) \
.withColumn("new_session", F.when(col("gap") > 900, 1).otherwise(0)) \
.withColumn("session_id", F.sum("new_session").over(w))
🧩 21. Find Rows with Duplicate AND Unique Columns
df.groupBy("email").count().filter("count = 1") # unique emails
df.groupBy("email").count().filter("count > 1") # duplicate emails
🧠 Wrap-up: What You Just Got
| Area | What’s Covered |
|------|----------------|
| Spark Traps | NULL behavior, join quirks, filter gotchas |
| DataFrame Mastery | withColumn, explode, rank, pivot, lag, lead |
| Window Functions | Streaks, gaps/islands, rolling avg, LAG-based change detection |
| Performance Optimization | Broadcast join, salted-key join, partitioning, caching |
| Interview Brain-Teasers | Top-N per group, skewed joins, user sessions, realistic ranking problems |
Here’s your all‑in‑one SQL & Spark mega‑sheet, combining everything we’ve discussed plus extra tricky LeetCode questions—all in one go!
🧩 Part 1: Spark‑Flavored SQL Traps (with code & explanation)
| # | Question | Spark Code or SQL | Explanation |
|---|----------|-------------------|-------------|
| 1 | Will NULL = NULL match in a join? | df1.join(df2, df1.id.eqNullSafe(df2.id), "inner") | Spark treats NULL = NULL as unknown, so it never matches; <=> / eqNullSafe() is null-safe equality |
| 2 | Why are some joined columns always null after a join? | — | One documented cause: non-serializable logic in a UDF producing unexpected nulls |
| 3 | How to include rows with null join keys in the result? | Use the null-safe join <=>, add salting, or coalesce keys before the join | |
| 4 | How to replace the null result of a left join with a default? | .join(..., "left").fillna(0).fillna("0") | |
🧪 Part 2: Hands‑On Spark DataFrame Code (with DataFrame APIs)
- withColumn, conditional, ranking, explode:
from pyspark.sql import functions as F
(df.withColumn("discount", F.col("price") * 0.9)
   .withColumn("prev", F.lag("action").over(w))
   .withColumn("changed", F.when(F.col("action") != F.col("prev"), 1).otherwise(0)))
- Explode with null-safe fallback:
df.select("id", F.explode(F.coalesce("tags", F.array().cast("array<string>"))).alias("tag"))
- Window rank:
df.withColumn("rank", F.rank().over(Window.partitionBy("dept").orderBy(F.desc("salary")))) .filter("rank <= 2")
🌊 Part 3: Complex Window Patterns (Gaps, Rolling Avg, Streaks)
- Consecutive sessions (gaps/islands):
df.withColumn("prev", F.lag("login_date").over(w)) \ .withColumn("gap", F.datediff("login_date", "prev")) \ .withColumn("new_session", F.when(F.col("gap") > 1, 1).otherwise(0)) \ .withColumn("session_id", F.sum("new_session").over(w))
- Rolling average (last 3 scores):
w2 = Window.partitionBy("user").orderBy("ts").rowsBetween(-2, 0) df.withColumn("rolling_avg", F.avg("score").over(w2))
- Streak detection (same val runs):
df.withColumn("change", F.when(F.lag("val").over(w) != F.col("val"), 1).otherwise(0)) \ .withColumn("streak_id", F.sum("change").over(w))
🚀 Part 4: Partitioning & Optimization Techniques
- Repartition by join key:
dfA.repartition("key").join(dfB.repartition("key"), "key")
- Broadcast small side:
df_large.join(F.broadcast(df_small), "id")
- Salting skewed keys: add a salt column (e.g., CAST(rand()*10 AS INT)) to the skewed side, replicate the other side per salt value, then join on ["key", "salt"]
- Inspect plan:
df.join(df2,...).explain(True)
🎓 Part 5: Spark + SQL Tricky/LeetCode Brain-Teasers
| # | Puzzle | Answer / Explanation |
|---|--------|----------------------|
| A | Top-2 per group with ties | Use rank(), then filter rank <= 2; use dense_rank() if you want no gaps |
| B | 2nd highest salary per dept | RANK() OVER (PARTITION BY dept ORDER BY salary DESC) = 2 |
| C | Anti-join in Spark | df1.join(df2, cond, "leftanti") or NOT EXISTS |
| D | Full outer with coalesce | df1.join(df2, "id", "fullouter").select(coalesce(df1.id, df2.id)) |
| E | COUNT(*) vs COUNT(col) | COUNT(*) counts all rows; COUNT(col) skips NULLs |
| F | WHERE vs HAVING | Filter groups via HAVING after aggregation; WHERE COUNT(*) > 1 is invalid |
| G | NULL in an IN (...) list | The NULL causes unknown comparisons and may drop rows unexpectedly |
| H | DISTINCT NULLs | SELECT DISTINCT NULL returns only one row |
| I | ORDER BY a column not in SELECT | Allowed in many engines, but fails in strict SQL modes |
🧠 Part 6: Classic LeetCode SQL Brain‑Teasers (conceptual & tricky)
1. Rising Temperature (LeetCode)
SELECT w2.id
FROM Weather w1
JOIN Weather w2
ON datediff(w2.recordDate, w1.recordDate) = 1
AND w2.temperature > w1.temperature;
🧠 Join with date logic and inequality condition. Multiple solutions exist.
2. Second Highest Salary Per Department (cf. LeetCode 185, Department Top Three Salaries)
SELECT Dept, MIN(salary) AS SecondHighest
FROM (
SELECT Dept, salary,
DENSE_RANK() OVER (PARTITION BY Dept ORDER BY salary DESC) AS rnk
FROM Employee
) t
WHERE rnk = 2
GROUP BY Dept;
🧠 Uses DENSE_RANK() to handle ties.
3. Customers who ordered A but not B
SELECT customer_id
FROM Orders
WHERE product = 'A'
EXCEPT
SELECT customer_id FROM Orders WHERE product = 'B';
🧠 Removes any customer with B. Spark equivalent: left anti join.
4. Confirmation Rate Per User
SELECT s.user_id,
ROUND(AVG(CASE WHEN c.action = 'confirmed' THEN 1 ELSE 0 END), 2) AS confirmation_rate
FROM signups s
LEFT JOIN confirmations c ON s.user_id = c.user_id
GROUP BY s.user_id;
🧠 Aggregates mixed case logic to compute rate.
5. Students attendance cross join trick
SELECT s.student_id, sb.subject_name,
COUNT(e.subject_name) AS attended
FROM students s
CROSS JOIN subjects sb
LEFT JOIN examinations e
ON e.student_id = s.student_id
AND e.subject_name = sb.subject_name
GROUP BY s.student_id, sb.subject_name;
🧠 Generates the full matrix first, then left joins attendance.
🧭 Recap & Quick Reference Table
- Spark NULL behavior: use
<=>
, overflow null-safe logic - Window functions:
rank()
,dense_rank()
,lag()
,lead()
, rolling avg, streaks - Joins: anti, null-safe, salting/skew, repartitioning, broadcast
- Aggregation traps: COUNT, GROUP BY, HAVING vs WHERE
- Set ops:
EXCEPT
,INTERSECT
, or equivalent Spark anti/join logic - LeetCode puzzles: rising temperature, second highest salary, confirmation rate, cross‑join attendance