Step-by-Step Roadmap tailored to the Data Engineer target stack (SQL, PySpark, Python, AWS S3, Hadoop, Bitbucket)

Perfect. We'll tackle both in sync:


✅ Practice Set: SQL Deep Dive Challenges

(Time-based, Grouping Sets, Recursive CTEs)

🚀 In Parallel: PySpark + S3 Project (Next message)


🧪 Practice Set (3 Advanced Challenges)


🔹 Challenge 1: Time-Based Analysis (Rolling Average)

📘 Problem:
You have a table daily_sales(date, product_id, amount).
Find the 7-day rolling average of sales per product, with the window ending on each day.


✅ SQL (Window):

SELECT 
  date,
  product_id,
  -- AVG over the frame handles the first 6 days correctly (SUM/7 would understate them)
  AVG(amount) OVER (
    PARTITION BY product_id 
    ORDER BY date 
    ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
  ) AS rolling_avg_7d
FROM daily_sales;

🔁 PySpark:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Frame: current row plus the 6 preceding rows per product, ordered by date
w = Window.partitionBy("product_id").orderBy("date").rowsBetween(-6, 0)

# avg() handles partial windows at the start of each series, unlike sum()/7
df = df.withColumn("rolling_avg_7d", F.avg("amount").over(w))
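
One caveat: ROWS BETWEEN counts rows, not calendar days, so gaps in a product's history stretch the window. A gap-tolerant sketch (assuming date is a DATE column; SECONDS_PER_DAY and w7d are illustrative names) orders by epoch seconds and uses rangeBetween:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

SECONDS_PER_DAY = 86400

# Frame spans the last 7 calendar days (inclusive), even when some days are missing
w7d = (Window.partitionBy("product_id")
       .orderBy(F.col("date").cast("timestamp").cast("long"))
       .rangeBetween(-6 * SECONDS_PER_DAY, 0))

df = df.withColumn("rolling_avg_7d", F.avg("amount").over(w7d))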

🔹 Challenge 2: GROUPING SETS (Multi-level Aggregation)

📘 Problem:
You have sales(region, category, amount).
Create a single query to output:

  • Sales by region-category
  • Sales by region
  • Sales by category
  • Grand total

✅ SQL:

SELECT 
  region,
  category,
  SUM(amount) AS total_sales
FROM sales
GROUP BY GROUPING SETS (
  (region, category),
  (region),
  (category),
  ()
);
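
For reference, a single result set interleaves all four levels; NULL marks a rolled-up column (illustrative shape, not real data):

region | category | total_sales
-------+----------+------------
East   | Gadgets  | …            → (region, category) detail row
East   | NULL     | …            → (region) subtotal
NULL   | Gadgets  | …            → (category) subtotal
NULL   | NULL     | …            → grand total ()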

Optional → drop the grand-total row. GROUPING() returns 1 when a column is rolled up, so only the () set sums to 2; append after the GROUP BY:

HAVING GROUPING(region) + GROUPING(category) < 2

🔁 PySpark: the DataFrame API has no GROUPING SETS method (Spark SQL does accept the syntax via spark.sql), but you can simulate it with union():

from pyspark.sql import functions as F

# Aggregate at each level, padding the missing dimension(s) with typed NULLs
df_rc = df.groupBy("region", "category").agg(F.sum("amount").alias("total_sales"))
df_r  = df.groupBy("region").agg(F.sum("amount").alias("total_sales")).withColumn("category", F.lit(None).cast("string"))
df_c  = df.groupBy("category").agg(F.sum("amount").alias("total_sales")).withColumn("region", F.lit(None).cast("string"))
df_g  = df.agg(F.sum("amount").alias("total_sales")).withColumn("region", F.lit(None).cast("string")).withColumn("category", F.lit(None).cast("string"))

# unionByName matches columns by name, so differing column order is fine
final_df = df_rc.unionByName(df_r).unionByName(df_c).unionByName(df_g)
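
In this particular case the four sets form the full cube of (region, category), so the DataFrame cube() method is a one-call alternative (same F alias as above):

# cube() emits every combination of the listed columns: (region, category),
# (region), (category), and (); NULL marks a rolled-up column
final_df = df.cube("region", "category").agg(F.sum("amount").alias("total_sales"))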

🔹 Challenge 3: Recursive CTE (Org Chart)

📘 Problem:
You have employees(emp_id, emp_name, manager_id).

Get each employee's reporting level from the CEO, and show path_to_ceo.


✅ SQL (Recursive CTE):

WITH RECURSIVE emp_tree AS (
  -- Anchor: the CEO (no manager), level 1
  SELECT emp_id, emp_name, manager_id, 1 AS level,
         CAST(emp_name AS VARCHAR) AS path_to_ceo
  FROM employees
  WHERE manager_id IS NULL

  UNION ALL

  -- Recursive step: attach each employee beneath their manager's row
  SELECT e.emp_id, e.emp_name, e.manager_id, t.level + 1,
         CONCAT(t.path_to_ceo, ' → ', e.emp_name)
  FROM employees e
  JOIN emp_tree t ON e.manager_id = t.emp_id
)
SELECT * FROM emp_tree;

🔁 PySpark: No built-in recursive CTE support.

You can simulate it with iterative self-joins in a loop (expanding the tree one level per pass), or use GraphFrames for DAG/tree traversal. We'll cover that in the S3 project soon; a minimal sketch of the loop approach follows.
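
A minimal sketch of the iterative approach, assuming an employees DataFrame with the same columns as the SQL table and an acyclic hierarchy (otherwise the loop never terminates):

from pyspark.sql import functions as F

# Seed: the CEO (no manager), level 1, path starts at their own name
frontier = (employees
    .filter(F.col("manager_id").isNull())
    .select("emp_id", "emp_name", "manager_id",
            F.lit(1).alias("level"),
            F.col("emp_name").alias("path_to_ceo")))
result = frontier

# Expand one level per pass; stop when a pass finds no new reports.
# Note: count() triggers a job each pass, so cache frontier for real workloads.
while frontier.count() > 0:
    frontier = (employees.alias("e")
        .join(frontier.alias("t"), F.col("e.manager_id") == F.col("t.emp_id"))
        .select(F.col("e.emp_id"), F.col("e.emp_name"), F.col("e.manager_id"),
                (F.col("t.level") + 1).alias("level"),
                F.concat(F.col("t.path_to_ceo"), F.lit(" → "),
                         F.col("e.emp_name")).alias("path_to_ceo")))
    result = result.unionByName(frontier)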


🎯 What's Next?
