Perfect. We’ll tackle both in sync:
✅ Practice Set: SQL Deep Dive Challenges
(Time-based, Grouping Sets, Recursive CTEs)
🐍 In Parallel: PySpark + S3 Project (next message)
🧪 Practice Set (3 Advanced Challenges)
🔹 Challenge 1: Time-Based Analysis (Rolling Average)
📘 Problem:
You have a table `daily_sales(date, product_id, amount)`.
Find the 7-day rolling average of sales per product (ending on each day).
✅ SQL (Window):

```sql
SELECT
    date,
    product_id,
    AVG(amount) OVER (
        PARTITION BY product_id
        ORDER BY date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS rolling_avg_7d
FROM daily_sales;
```

AVG handles the first six days of each product correctly (their frames hold fewer than 7 rows); dividing SUM(amount) by a hard-coded 7 would understate those days.
🐍 PySpark:

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Frame: the current row plus the 6 preceding rows, per product, ordered by date
w = Window.partitionBy("product_id").orderBy("date").rowsBetween(-6, 0)

df = df.withColumn("rolling_avg_7d", F.avg("amount").over(w))
```
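Note that rowsBetween counts rows, not days: if a product has no sales on some dates, the frame silently stretches past 7 calendar days. A minimal sketch of a calendar-accurate variant, assuming at most one row per product per date; it converts the date to an integer day index so rangeBetween can frame by actual days:

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Integer day index (days since epoch) so the frame spans calendar days
df = df.withColumn("day_idx", F.datediff("date", F.lit("1970-01-01")))

w = Window.partitionBy("product_id").orderBy("day_idx").rangeBetween(-6, 0)
df = df.withColumn("rolling_avg_7d", F.avg("amount").over(w))
```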
🔹 Challenge 2: GROUPING SETS (Multi-level Aggregation)
📘 Problem:
You have `sales(region, category, amount)`.
Create a single query to output:
- Sales by region-category
- Sales by region
- Sales by category
- Grand total
✅ SQL:

```sql
SELECT
    region,
    category,
    SUM(amount) AS total_sales
FROM sales
GROUP BY GROUPING SETS (
    (region, category),
    (region),
    (category),
    ()
);
```
Optional → drop the grand-total row (the one where both columns are rolled up):

```sql
HAVING GROUPING(region) + GROUPING(category) < 2
```

GROUPING(col) returns 1 when col is rolled up in that row, so a CASE on it can also relabel the NULLs in subtotal rows as 'ALL'.
🐍 PySpark: the classic DataFrame API has no general GROUPING SETS method (rollup() and cube() cover only the special cases), but you can simulate one with unionByName():

```python
from pyspark.sql import functions as F

null_str = F.lit(None).cast("string")  # typed NULL so the union schemas line up
df_rc = df.groupBy("region", "category").agg(F.sum("amount").alias("total_sales"))
df_r = df.groupBy("region").agg(F.sum("amount").alias("total_sales")).withColumn("category", null_str)
df_c = df.groupBy("category").agg(F.sum("amount").alias("total_sales")).withColumn("region", null_str)
df_g = df.agg(F.sum("amount").alias("total_sales")).withColumn("region", null_str).withColumn("category", null_str)
final_df = df_rc.unionByName(df_r).unionByName(df_c).unionByName(df_g)
```
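Two shortcuts worth knowing, sketched below under the assumption of an active SparkSession named `spark`: Spark SQL itself accepts the GROUPING SETS syntax on a temp view, and for this particular set of four groupings cube() produces the same result in one call.

```python
from pyspark.sql import functions as F

# Spark SQL supports GROUPING SETS directly
df.createOrReplaceTempView("sales")
result = spark.sql("""
    SELECT region, category, SUM(amount) AS total_sales
    FROM sales
    GROUP BY GROUPING SETS ((region, category), (region), (category), ())
""")

# cube() generates all 2^n combinations of the grouping columns,
# which for two columns is exactly the four sets above
result_cube = df.cube("region", "category").agg(F.sum("amount").alias("total_sales"))
```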
🔹 Challenge 3: Recursive CTE (Org Chart)
📘 Problem:
You have `employees(emp_id, emp_name, manager_id)`.
Get each employee's reporting level from the CEO, and show a `path_to_ceo` column.
✅ SQL (Recursive CTE):

```sql
WITH RECURSIVE emp_tree AS (
    SELECT emp_id, emp_name, manager_id, 1 AS level,
           CAST(emp_name AS VARCHAR) AS path_to_ceo
    FROM employees
    WHERE manager_id IS NULL
    UNION ALL
    SELECT e.emp_id, e.emp_name, e.manager_id, t.level + 1,
           CONCAT(t.path_to_ceo, ' → ', e.emp_name)
    FROM employees e
    JOIN emp_tree t ON e.manager_id = t.emp_id
)
SELECT * FROM emp_tree;
```
🐍 PySpark: no built-in recursive CTE support. You can simulate it with iterative self-joins in a loop (see the sketch below) or use GraphFrames for DAG/tree traversal; we'll cover that in the S3 project soon.
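A minimal sketch of the iterative-join approach, assuming a DataFrame `emp` with columns emp_id, emp_name, manager_id and a cycle-free hierarchy:

```python
from pyspark.sql import functions as F

# Seed: the CEO (no manager) at level 1; the path starts with their own name
frontier = (emp.filter(F.col("manager_id").isNull())
               .select("emp_id", "emp_name", "manager_id",
                       F.lit(1).alias("level"),
                       F.col("emp_name").alias("path_to_ceo")))
result = frontier

while frontier.count() > 0:
    # Attach the direct reports of the current frontier, one level deeper
    frontier = (emp.alias("e")
                   .join(frontier.alias("t"),
                         F.col("e.manager_id") == F.col("t.emp_id"))
                   .select(F.col("e.emp_id"), F.col("e.emp_name"),
                           F.col("e.manager_id"),
                           (F.col("t.level") + 1).alias("level"),
                           F.concat_ws(" → ", F.col("t.path_to_ceo"),
                                       F.col("e.emp_name")).alias("path_to_ceo")))
    result = result.unionByName(frontier)
```

The loop stops when a level comes back empty; count() triggers a Spark job on each pass, so for deep hierarchies consider caching the frontier between iterations.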