Let’s analyze the join behavior in Spark SQL when dealing with NULLs in join keys.
✅ Input Tables:
Table1
| col1 |
|------|
| 1    |
| 0    |
| null |
| null |
Table2
| col1 |
|------|
| 1    |
| 0    |
| null |
⚠️ Important Concept in SQL and Spark:
- NULL is not equal to NULL in SQL (and Spark SQL).
- Therefore, NULL = NULL evaluates to NULL (unknown), which a join condition treats as false, so joins on NULL values won’t match unless special handling is used (the <=> null-safe equality operator in Spark SQL).
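A quick way to see this (a minimal sketch; it should run as-is in a Spark SQL session):
-- Plain equality vs. null-safe equality
SELECT NULL = NULL   AS plain_equals,  -- NULL (unknown), treated as false in a join
       NULL <=> NULL AS null_safe;     -- true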
🔁 INNER JOIN (ON t1.col1 = t2.col1)
This only matches values that are equal and not NULL.
Matching keys:
- 1 matches 1
- 0 matches 0
- null doesn’t match null
✅ Output: 2 matching rows (for 1 and 0)
👉 Count: 2
🔁 LEFT JOIN (ON t1.col1 = t2.col1)
This returns all rows from Table1, and matched rows from Table2.
Matching:
- 1 → matches 1 in Table2 ✅
- 0 → matches 0 in Table2 ✅
- null → does not match null in Table2 ❌
- null → does not match null in Table2 ❌
✅ Output:
| t1.col1 | t2.col1 |
|---------|---------|
| 1       | 1       |
| 0       | 0       |
| null    | NULL (no match) |
| null    | NULL (no match) |
👉 Count: 4 total rows
✅ Summary of Counts:
| Join Type  | Row Count |
|------------|-----------|
| INNER JOIN | 2         |
| LEFT JOIN  | 4         |
If you use a null-safe join in Spark SQL (t1.col1 <=> t2.col1), then even NULLs will match; a sketch of that case follows.
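A minimal sketch of the null-safe case, assuming the two sample tables above are registered as Table1 and Table2:
-- Null-safe inner join: the NULL keys now match
SELECT t1.col1, t2.col1
FROM Table1 t1
JOIN Table2 t2 ON t1.col1 <=> t2.col1;
-- Expected: 4 rows (1, 0, and one row per NULL in Table1)
The LEFT JOIN variant should also return 4 rows, since every Table1 row now finds exactly one match.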
Here are 20+ tricky and real-world SQL questions like the NULL join behavior we discussed above — focusing on JOINs, NULLs, GROUP BY, WHERE vs HAVING, aggregates, and edge cases. These are interview favorites and Spark SQL-safe.
🔥 TRICKY SQL QUESTIONS FOR INTERVIEWS (WITH EXPLANATIONS)
1. How many rows does this INNER JOIN return?
SELECT *
FROM t1
JOIN t2 ON t1.col = t2.col;
🧠 Trick: If both tables contain NULL, they won’t match.
👉 Expected: Matches only on non-null values.
2. What happens if you do COUNT(col) vs COUNT(*)?
SELECT COUNT(col), COUNT(*) FROM my_table;
🧠 COUNT(col) excludes NULLs; COUNT(*) counts all rows.
👉 Easy to confuse in metrics/aggregates.
3. Difference between WHERE and HAVING with aggregates
SELECT col, COUNT(*) FROM t
GROUP BY col
HAVING COUNT(*) > 1;
🧠 HAVING filters after aggregation.
👉 Using WHERE COUNT(*) > 1 gives an error.
4. What does this return?
SELECT *
FROM t1
LEFT JOIN t2 ON t1.col = t2.col
WHERE t2.col IS NULL;
🧠 This gives anti-join behavior — rows from t1 with no match in t2.
5. What’s the output of this?
SELECT col1 = NULL FROM my_table;
🧠 Always returns NULL (not true/false).
👉 Use IS NULL instead.
6. NULLs in an IN clause
SELECT * FROM t WHERE col IN (1, 2, NULL);
🧠 If col is NULL, it won’t match.
👉 IN with NULL behaves like col = NULL — dangerous.
7. Can you GROUP BY NULL? What happens?
SELECT col, COUNT(*) FROM t GROUP BY col;
🧠 All NULLs are grouped together into one NULL group.
8. What does this return?
SELECT DISTINCT NULL FROM my_table;
🧠 Only one row with NULL — duplicates are eliminated even for NULL.
9. Using NOT IN with NULLs
SELECT * FROM t1
WHERE col NOT IN (SELECT col FROM t2);
🧠 If t2.col contains even one NULL, the whole query returns no rows.
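A small illustration of why, as a minimal sketch runnable in Spark SQL:
SELECT 1 NOT IN (2, NULL) AS result;
-- result is NULL, not TRUE: 1 <> 2 is true, but 1 <> NULL is unknown,
-- so the whole NOT IN predicate is unknown and the row gets filtered out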
10. LEFT JOIN with filter in WHERE vs ON
-- Version A
SELECT * FROM t1 LEFT JOIN t2 ON t1.id = t2.id WHERE t2.status = 'ACTIVE';
-- Version B
SELECT * FROM t1 LEFT JOIN t2 ON t1.id = t2.id AND t2.status = 'ACTIVE';
🧠 A = behaves like INNER JOIN
🧠 B = proper LEFT JOIN with condition inside JOIN
11. GROUP BY without SELECT columns
SELECT COUNT(*) FROM t GROUP BY col;
🧠 Valid — but you won’t see what col was grouped on. Returns just counts.
12. Using IS NOT DISTINCT FROM
SELECT * FROM t1 JOIN t2 ON t1.col IS NOT DISTINCT FROM t2.col;
🧠 Behaves like Spark SQL’s <=> — a null-safe join.
13. What is the result of NULL + 5 or NULL || 'abc'?
🧠 Always returns NULL. Any arithmetic or string concatenation with NULL yields NULL (Oracle’s || is the notable exception: it treats NULL as an empty string).
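For example, a minimal sketch in Spark SQL:
SELECT NULL + 5      AS add_result,    -- NULL
       NULL || 'abc' AS concat_result; -- NULL in Spark SQL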
14. Using COALESCE vs IFNULL vs NVL
SELECT COALESCE(col, 'default') FROM t;
🧠 All work similarly; some are dialect-specific (e.g., NVL in Oracle).
15. Case-insensitive vs case-sensitive joins
SELECT * FROM t1 JOIN t2 ON t1.name = t2.name;
🧠 Depends on collation — e.g., in Postgres it’s case-sensitive, in MySQL it may not be.
16. Can we ORDER BY a column not in SELECT?
SELECT name FROM t ORDER BY age;
🧠 ✅ Allowed in most DBs but not allowed in some strict modes.
17. Can we GROUP BY alias?
SELECT salary * 0.1 AS tax FROM t GROUP BY tax;
🧠 ❌ Many engines (e.g., SQL Server, Oracle) don’t allow an alias in GROUP BY; repeat the expression instead. Others (e.g., Postgres, MySQL, Spark SQL) accept it.
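The portable fix is to repeat the expression, as in this minimal sketch:
SELECT salary * 0.1 AS tax, COUNT(*) AS cnt
FROM t
GROUP BY salary * 0.1;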
18. Difference between COUNT(*) OVER() and COUNT(col) OVER()
SELECT name, COUNT(*) OVER() FROM t;
🧠 COUNT(*) OVER() counts all rows, even NULLs.
🧠 COUNT(col) OVER() ignores NULLs.
19. How to remove duplicates but keep one?
SELECT DISTINCT ON (col) * FROM t;
🧠 DISTINCT ON is supported only in Postgres. Other DBs use ROW_NUMBER(), as sketched below.
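A hedged sketch of the portable ROW_NUMBER() pattern (the id tiebreaker column is an assumption, not part of the original example):
SELECT *
FROM (
  SELECT t.*,
         ROW_NUMBER() OVER (PARTITION BY col ORDER BY id) AS rn  -- id is an assumed tiebreaker column
  FROM t
) x
WHERE rn = 1;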
20. Default NULL sorting behavior in ORDER BY
SELECT * FROM t ORDER BY col;
🧠 NULLS FIRST or NULLS LAST behavior varies:
- Postgres and Oracle: NULLs come last in ASC (NULL sorts as largest)
- MySQL and SQL Server: NULLs come first in ASC (NULL sorts as smallest)
- Spark SQL: NULLs come first by default in ASC
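To avoid relying on engine defaults, you can spell the ordering out where the syntax is supported (e.g., Postgres, Oracle, Spark SQL):
SELECT * FROM t ORDER BY col ASC NULLS LAST;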
21. What happens with CROSS JOIN with NULLs?
SELECT * FROM t1 CROSS JOIN t2;
🧠 Produces Cartesian product, regardless of NULLs.
22. Filter rows where col has at least one NULL duplicate
SELECT * FROM t
WHERE col IN (
SELECT col FROM t GROUP BY col HAVING COUNT(*) > 1
);
🧠 Doesn’t include NULLs unless explicitly filtered using IS NULL.
🧠 Want more?
Here’s a powerful collection of SQL puzzles and interview-level scenarios with a focus on traps, Spark behavior, and hands-on examples.
🔥 SECTION 1: Window Function Puzzles (Ranking, Aggregates, Traps)
🧩 1. Top 2 Salaries Per Department (DENSE_RANK trap)
SELECT *
FROM (
SELECT *, DENSE_RANK() OVER (PARTITION BY dept_id ORDER BY salary DESC) as rnk
FROM employees
) tmp
WHERE rnk <= 2;
🧠 Trap: RANK() and DENSE_RANK() behave differently for ties.
🧩 2. Running Total of Sales Per Region
SELECT region, sales_date, amount,
SUM(amount) OVER (PARTITION BY region ORDER BY sales_date) AS running_total
FROM sales;
🧠 Tracks how sales build up over time.
🧩 3. First Purchase Date Per Customer
SELECT *,
FIRST_VALUE(purchase_date) OVER (PARTITION BY customer_id ORDER BY purchase_date) AS first_purchase
FROM purchases;
🧠 FIRST_VALUE with ordering — always include ORDER BY for the expected result.
🧩 4. Detect Consecutive Repeats (LEAD/LAG)
SELECT *,
CASE WHEN status = LAG(status) OVER (PARTITION BY user_id ORDER BY ts) THEN 1 ELSE 0 END AS repeat
FROM user_activity;
🧠 Use LAG() to compare the current row with the previous one.
🔁 SECTION 2: Top-N Per Group — Real Interview Puzzle
🧩 5. Get Latest 2 Orders Per Customer
SELECT * FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date DESC) as rn
FROM orders
) t
WHERE rn <= 2;
🧠 Always remember (see the sketch below):
- ROW_NUMBER() → unique per row
- RANK() → skips ranks after ties
- DENSE_RANK() → no skipping
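A minimal sketch showing the three functions on a tie (inline sample values, not taken from the tables above):
SELECT salary,
       ROW_NUMBER() OVER (ORDER BY salary DESC) AS row_num,
       RANK()       OVER (ORDER BY salary DESC) AS rnk,
       DENSE_RANK() OVER (ORDER BY salary DESC) AS dense_rnk
FROM VALUES (300), (200), (200), (100) AS t(salary)
ORDER BY salary DESC;
-- row_num: 1,2,3,4   rnk: 1,2,2,4   dense_rnk: 1,2,2,3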
🧩 6. Find Product With 2nd Highest Revenue Globally
SELECT product_id, total_revenue
FROM (
SELECT product_id, SUM(revenue) as total_revenue,
RANK() OVER (ORDER BY SUM(revenue) DESC) as rnk
FROM sales
GROUP BY product_id
) x
WHERE rnk = 2;
🧠 You can use RANK() when there could be ties.
🔄 SECTION 3: Complex Joins and Self-Joins
🧩 7. Self Join to Find Direct Manager
SELECT e.emp_id, e.name, m.name AS manager_name
FROM employees e
LEFT JOIN employees m ON e.manager_id = m.emp_id;
🧠 Self-join trick: alias both instances clearly.
🧩 8. Get Employees Who Have Same Salary As Someone Else
SELECT DISTINCT a.emp_id, a.salary
FROM employees a
JOIN employees b ON a.salary = b.salary AND a.emp_id != b.emp_id;
🧠 Use self-join to detect non-unique attribute (salary here).
🧩 9. Anti-Join (NOT EXISTS vs NOT IN vs LEFT JOIN)
-- Option 1: LEFT JOIN
SELECT a.*
FROM table1 a
LEFT JOIN table2 b ON a.id = b.id
WHERE b.id IS NULL;
-- Option 2: NOT EXISTS
SELECT * FROM table1 a
WHERE NOT EXISTS (SELECT 1 FROM table2 b WHERE a.id = b.id);
🧠 NOT IN fails if b.id has even one NULL. Use NOT EXISTS instead.
⚖️ SECTION 4: Set Operations (INTERSECT, EXCEPT)
🧩 10. Customers Who Bought A but Not B
SELECT customer_id FROM orders WHERE product = 'A'
EXCEPT
SELECT customer_id FROM orders WHERE product = 'B';
🧠 EXCEPT removes rows that appear in the second query, and it also removes duplicates.
🧩 11. Customers Who Bought Both A and B
SELECT customer_id FROM orders WHERE product = 'A'
INTERSECT
SELECT customer_id FROM orders WHERE product = 'B';
🧠 INTERSECT = set intersection; it returns distinct rows.
🧩 12. Spark SQL Equivalent Using NOT EXISTS or LEFT ANTI JOIN
-- Customers who bought A but not B
SELECT DISTINCT a.customer_id
FROM orders a
LEFT ANTI JOIN orders b
ON a.customer_id = b.customer_id AND b.product = 'B'
WHERE a.product = 'A';
🧠 Spark SQL does support EXCEPT, but the DataFrame API has no EXCEPT operator — use exceptAll()/subtract() or a left anti join.
🧪 SECTION 5: Spark SQL NULL Behavior Puzzles
🧩 13. Will NULL = NULL return true in Spark SQL?
SELECT * FROM t1 JOIN t2 ON t1.col = t2.col;
🧠 NO — Spark SQL behaves like ANSI SQL. NULL ≠ NULL.
🧩 14. Use a null-safe join (<=>)
SELECT * FROM t1 JOIN t2 ON t1.col <=> t2.col;
🧠 In Spark SQL, <=> compares NULLs safely; the ANSI equivalent is IS NOT DISTINCT FROM.
🧩 15. Count rows where col IN (1, NULL)
SELECT * FROM t WHERE col IN (1, NULL);
🧠 Trap: comparisons against the NULL in the list yield UNKNOWN; rows where col = 1 still match, but nothing ever matches the NULL, and the same list used with NOT IN filters out every row.
🧩 16. LEFT JOIN with NULL filter in WHERE clause
SELECT * FROM A LEFT JOIN B ON A.id = B.id
WHERE B.id IS NULL;
🧠 This gives anti-join — all rows in A not matched in B.
🔥 Here’s a comprehensive collection of Spark-flavored traps, hands-on DataFrame operations, window wizardry, and performance tuning secrets — all with real code and explanations, just like you’d face in real projects or interviews.
⚡️ SECTION 1: Spark-Specific Traps & Conceptual Gotchas
🧨 1. NULL Join Trap (again but worse)
df1 = spark.createDataFrame([(1,), (None,)], ["id"])
df2 = spark.createDataFrame([(None,), (1,), (2,)], ["id"])
df1.join(df2, "id", "inner").show()
🧠 ✅ Joins on NULL fail unless you use:
df1.join(df2, df1["id"] <=> df2["id"], "inner").show()
🧨 2. Filter after Aggregation Trap
df.groupBy("id").agg(F.count("*").alias("cnt")).filter("cnt > 1")
🧠 Works ✅
But this fails if you’re thinking in SQL mode:
SELECT * FROM df WHERE COUNT(*) > 1 -- ❌ Invalid in SQL
🧨 3. Filtering NULL using !=
df.filter(df["col"] != "X").show()
🧠 Rows where col is NULL are also dropped, because NULL != 'X' evaluates to UNKNOWN and is filtered out. Use .isNull() explicitly if you want to keep them.
🧨 4. explode() on a NULL column
df = spark.createDataFrame([(1, None), (2, ["a", "b"])], ["id", "tags"])
df.select("id", F.explode("tags")).show()
🧠 Row with NULL list disappears. Add a safe guard:
F.explode(F.coalesce("tags", F.array().cast("array<string>")))
🧨 5. Caching Trap
df = df.filter(...).cache()
df.show() # triggers caching
df.unpersist()
🧠 Don’t call .unpersist() too early, or you’ll lose the performance gains.
🧪 SECTION 2: Hands-On DataFrame APIs (Tricks with withColumn, rank, etc.)
🎯 6. withColumn: Derive new column with logic
df.withColumn("discounted_price", F.col("price") * 0.9)
🎯 7. Create flag if column value changed from previous row
window = Window.partitionBy("user_id").orderBy("timestamp")
df.withColumn("prev_action", F.lag("action").over(window)) \
.withColumn("changed", F.when(F.col("action") != F.col("prev_action"), 1).otherwise(0))
🎯 8. Multiple Explodes (flattening nested arrays)
df.select("id", F.explode("items").alias("item")) \
.select("id", F.explode("item.tags").alias("tag"))
🧠 Chain explode() calls to flatten deeply nested JSON.
🎯 9. rank() to get top-N per group
window = Window.partitionBy("dept").orderBy(F.desc("salary"))
df.withColumn("rnk", F.rank().over(window)).filter("rnk <= 2")
🎯 10. pivot() to transform row → column
df.groupBy("dept").pivot("month").agg(F.sum("salary"))
🧠 SECTION 3: Complex Window Function Patterns
🌀 11. Gaps & Islands – Consecutive Login Sessions
from pyspark.sql.functions import col, lag, sum as sum_, when, datediff
from pyspark.sql.window import Window
w = Window.partitionBy("user_id").orderBy("login_date")
df = df.withColumn("prev_date", lag("login_date").over(w)) \
.withColumn("gap", datediff("login_date", "prev_date")) \
.withColumn("is_new_session", when(col("gap") > 1, 1).otherwise(0)) \
.withColumn("session_id", sum_("is_new_session").over(w.rowsBetween(Window.unboundedPreceding, 0)))
🌀 12. Rolling Average (moving window)
w = Window.partitionBy("user_id").orderBy("ts").rowsBetween(-2, 0)
df.withColumn("rolling_avg", F.avg("score").over(w))
🌀 13. Detect Streak of Same Values
# Assign a group ID whenever the value changes (incrementally)
df = df.withColumn("change", F.when(F.lag("val").over(w) != F.col("val"), 1).otherwise(0))
df = df.withColumn("streak_id", F.sum("change").over(w))
⚙️ SECTION 4: Partitioning, Repartitioning & Optimization
🚀 14. Repartition Before Join (avoid skew)
dfA = dfA.repartition("join_key")
dfB = dfB.repartition("join_key")
dfA.join(dfB, "join_key")
🧠 Ensures balanced shuffle and parallelism.
🚀 15. Broadcast Join (manual + hint)
# Manual
df_large.join(F.broadcast(df_small), "id")
# SQL hint
spark.sql("SELECT /*+ BROADCAST(t2) */ * FROM t1 JOIN t2 ON t1.id = t2.id")
🚀 16. Skew Handling (Salting)
dfA = dfA.withColumn("salt", F.expr("CAST(rand()*10 AS INT)"))
dfB = dfB.withColumn("salt", F.expr("CAST(rand()*10 AS INT)"))
dfA.join(dfB, ["key", "salt"])
🧠 Distributes skewed key across multiple partitions.
🚀 17. Write output partitioned on disk
df.write.partitionBy("country", "year").mode("overwrite").parquet("/path")
🚀 18. Use explain() to trace the execution plan
df.join(df2, "id").select("name").explain(True)
🧠 Helps track join strategy, partitioning, broadcast usage, etc.
🎓 SECTION 5: Realistic Spark Interview Brain-Teasers
🧩 19. Second Highest Salary Per Department (with tie)
w = Window.partitionBy("dept").orderBy(F.desc("salary"))
df.withColumn("rnk", F.rank().over(w)).filter("rnk = 2")
🧩 20. User Sessions with Duration & Gaps
w = Window.partitionBy("user_id").orderBy("timestamp")
df = df.withColumn("prev_ts", F.lag("timestamp").over(w)) \
.withColumn("gap", F.unix_timestamp("timestamp") - F.unix_timestamp("prev_ts")) \
.withColumn("new_session", F.when(col("gap") > 900, 1).otherwise(0)) \
.withColumn("session_id", F.sum("new_session").over(w))
🧩 21. Find Rows with Duplicate AND Unique Columns
df.groupBy("email").count().filter("count = 1") # unique emails
df.groupBy("email").count().filter("count > 1") # duplicate emails
🧠 Wrap-up: What You Just Got
| Area | What’s Covered |
|------|----------------|
| Spark Traps | NULL behavior, join quirks, filter gotchas |
| DataFrame Mastery | withColumn, explode, rank, pivot, lag, lead |
| Window Functions | Streaks, gaps/islands, rolling avg, LAG-based change detection |
| Performance Optimization | Broadcast join, salted-key join, partitioning, caching |
| Interview Brain-Teasers | Top-N per group, skewed joins, user sessions, realistic ranking problems |
Here’s your all‑in‑one SQL & Spark mega‑sheet, combining everything we’ve discussed plus extra tricky LeetCode questions—all in one go!
🧩 Part 1: Spark‑Flavored SQL Traps (with code & explanation)
| # | Question | Spark Code or SQL | Explanation |
|---|----------|-------------------|-------------|
| 1 | Will NULL = NULL match in a join? | df1.join(df2, df1.id.eqNullSafe(df2.id), "inner") | Spark treats NULL = NULL as unknown, so it never matches; <=> / eqNullSafe() is null-safe equality |
| 2 | Why are some joined columns always null after a join? | — | One documented cause: non-serializable logic in a UDF producing unexpected nulls |
| 3 | How to include rows with null join keys in the result? | Use the null-safe join <=>, add salting, or coalesce keys before the join | |
| 4 | How to replace the null result of a left join with a default? | .join(..., "left").fillna(0).fillna("0") | |
🧪 Part 2: Hands‑On Spark DataFrame Code (with DataFrame APIs)
- withColumn, conditional, ranking, explode:
from pyspark.sql import functions as F
(df.withColumn("discount", F.col("price") * 0.9)
   .withColumn("prev", F.lag("action").over(w))
   .withColumn("changed", F.when(F.col("action") != F.col("prev"), 1).otherwise(0)))
- Explode with null-safe fallback:
df.select("id", F.explode(F.coalesce("tags", F.array().cast("array<string>"))).alias("tag"))
- Window rank:
df.withColumn("rank", F.rank().over(Window.partitionBy("dept").orderBy(F.desc("salary")))) .filter("rank <= 2")
🌊 Part 3: Complex Window Patterns (Gaps, Rolling Avg, Streaks)
- Consecutive sessions (gaps/islands):
df.withColumn("prev", F.lag("login_date").over(w)) \ .withColumn("gap", F.datediff("login_date", "prev")) \ .withColumn("new_session", F.when(F.col("gap") > 1, 1).otherwise(0)) \ .withColumn("session_id", F.sum("new_session").over(w))
- Rolling average (last 3 scores):
w2 = Window.partitionBy("user").orderBy("ts").rowsBetween(-2, 0) df.withColumn("rolling_avg", F.avg("score").over(w2))
- Streak detection (same val runs):
df.withColumn("change", F.when(F.lag("val").over(w) != F.col("val"), 1).otherwise(0)) \ .withColumn("streak_id", F.sum("change").over(w))
🚀 Part 4: Partitioning & Optimization Techniques
- Repartition by join key:
dfA.repartition("key").join(dfB.repartition("key"), "key")
- Broadcast small side:
df_large.join(F.broadcast(df_small), "id")
- Salting skewed keys: add a salt column (e.g., CAST(rand()*10 AS INT)) to the skewed side, replicate the other side per salt value, then join on ["key", "salt"]
- Inspect plan:
df.join(df2,...).explain(True)
🎓 Part 5: Spark + SQL Tricky/LeetCode Brain-Teasers
| # | Puzzle | Answer / Explanation |
|---|--------|----------------------|
| A | Top-2 per group with ties | Use rank(), then filter rank <= 2; use dense_rank() if you want no gaps |
| B | 2nd highest salary per dept | RANK() OVER (PARTITION BY dept ORDER BY salary DESC) = 2 |
| C | Anti-join in Spark | df1.join(df2, cond, "leftanti") or NOT EXISTS |
| D | Full outer with coalesce | df1.join(df2, "id", "fullouter").select(coalesce(df1.id, df2.id)) |
| E | COUNT(*) vs COUNT(col) | COUNT(*) counts all rows; COUNT(col) skips NULLs |
| F | WHERE vs HAVING | Filter groups via HAVING after aggregation; WHERE COUNT(*) > 1 is invalid |
| G | NULL in an IN (...) list | The NULL causes unknown comparisons and may drop rows unexpectedly |
| H | DISTINCT NULLs | SELECT DISTINCT NULL returns only one row |
| I | ORDER BY a column not in SELECT | Allowed in many engines, but fails in strict SQL modes |
🧠 Part 6: Classic LeetCode SQL Brain‑Teasers (conceptual & tricky)
1. Rising Temperature (LeetCode)
SELECT w2.id
FROM Weather w1
JOIN Weather w2
ON datediff(w2.recordDate, w1.recordDate) = 1
AND w2.temperature > w1.temperature;
🧠 Join with date logic and inequality condition. Multiple solutions exist.
2. Second Highest Salary Per Department (cf. LeetCode 185, Department Top Three Salaries)
SELECT Dept, MIN(salary) AS SecondHighest
FROM (
SELECT Dept, salary,
DENSE_RANK() OVER (PARTITION BY Dept ORDER BY salary DESC) AS rnk
FROM Employee
) t
WHERE rnk = 2
GROUP BY Dept;
🧠 Uses DENSE_RANK() to handle ties.
3. Customers who ordered A but not B
SELECT customer_id
FROM Orders
WHERE product = 'A'
EXCEPT
SELECT customer_id FROM Orders WHERE product = 'B';
🧠 Removes any customer with B. Spark equivalent: left anti join.
4. Confirmation Rate Per User
SELECT s.user_id,
ROUND(AVG(CASE WHEN c.action = 'confirmed' THEN 1 ELSE 0 END), 2) AS confirmation_rate
FROM signups s
LEFT JOIN confirmations c ON s.user_id = c.user_id
GROUP BY s.user_id;
🧠 Aggregates mixed case logic to compute rate.
5. Students attendance cross join trick
SELECT s.student_id, sb.subject_name,
COUNT(e.subject_name) AS attended
FROM students s
CROSS JOIN subjects sb
LEFT JOIN examinations e
ON e.student_id = s.student_id
AND e.subject_name = sb.subject_name
GROUP BY s.student_id, sb.subject_name;
🧠 Generates the full matrix first, then left joins attendance.
🧭 Recap & Quick Reference Table
- Spark NULL behavior: use
<=>
, overflow null-safe logic - Window functions:
rank()
,dense_rank()
,lag()
,lead()
, rolling avg, streaks - Joins: anti, null-safe, salting/skew, repartitioning, broadcast
- Aggregation traps: COUNT, GROUP BY, HAVING vs WHERE
- Set ops:
EXCEPT
,INTERSECT
, or equivalent Spark anti/join logic - LeetCode puzzles: rising temperature, second highest salary, confirmation rate, cross‑join attendance