Step-by-Step Roadmap tailored to the Data Engineer target stack (SQL, PySpark, Python, AWS S3, Hadoop, Bitbucket)

Perfect. We'll tackle both in sync:


✅ Practice Set: SQL Deep Dive Challenges

(Time-based, Grouping Sets, Recursive CTEs)

🚀 In Parallel: PySpark + S3 Project (Next message)


🧪 Practice Set (3 Advanced Challenges)


🔹 Challenge 1: Time-Based Analysis (Rolling Average)

📘 Problem:
You have a table daily_sales(date, product_id, amount).
Find the 7-day rolling average of sales per product, with the window ending on each day.


✅ SQL (Window):

SELECT 
  date,
  product_id,
  -- AVG over the frame handles the first 6 days correctly (SUM/7 would understate them)
  AVG(amount) OVER (
    PARTITION BY product_id 
    ORDER BY date 
    ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
  ) AS rolling_avg_7d
FROM daily_sales;

🔁 PySpark:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Frame: current row plus the 6 preceding rows per product, ordered by date
w = Window.partitionBy("product_id").orderBy("date").rowsBetween(-6, 0)

# avg() handles partial windows at the start of each series, unlike sum()/7
df = df.withColumn("rolling_avg_7d", F.avg("amount").over(w))
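
One caveat: ROWS BETWEEN counts rows, not calendar days, so gaps in a product's history stretch the window. A gap-tolerant sketch (assuming date is a DATE column; SECONDS_PER_DAY and w7d are illustrative names) orders by epoch seconds and uses rangeBetween:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

SECONDS_PER_DAY = 86400

# Frame spans the last 7 calendar days (inclusive), even when some days are missing
w7d = (Window.partitionBy("product_id")
       .orderBy(F.col("date").cast("timestamp").cast("long"))
       .rangeBetween(-6 * SECONDS_PER_DAY, 0))

df = df.withColumn("rolling_avg_7d", F.avg("amount").over(w7d))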

🔹 Challenge 2: GROUPING SETS (Multi-level Aggregation)

📘 Problem:
You have sales(region, category, amount).
Create a single query to output:

  • Sales by region-category
  • Sales by region
  • Sales by category
  • Grand total

✅ SQL:

SELECT 
  region,
  category,
  SUM(amount) AS total_sales
FROM sales
GROUP BY GROUPING SETS (
  (region, category),
  (region),
  (category),
  ()
);
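
For reference, a single result set interleaves all four levels; NULL marks a rolled-up column (illustrative shape, not real data):

region | category | total_sales
-------+----------+------------
East   | Gadgets  | …            → (region, category) detail row
East   | NULL     | …            → (region) subtotal
NULL   | Gadgets  | …            → (category) subtotal
NULL   | NULL     | …            → grand total ()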

Optional → drop the grand-total row. GROUPING() returns 1 when a column is rolled up, so only the () set sums to 2; append after the GROUP BY:

HAVING GROUPING(region) + GROUPING(category) < 2

🔁 PySpark: the DataFrame API has no GROUPING SETS method (Spark SQL does accept the syntax via spark.sql), but you can simulate it with union():

from pyspark.sql import functions as F

# Aggregate at each level, padding the missing dimension(s) with typed NULLs
df_rc = df.groupBy("region", "category").agg(F.sum("amount").alias("total_sales"))
df_r  = df.groupBy("region").agg(F.sum("amount").alias("total_sales")).withColumn("category", F.lit(None).cast("string"))
df_c  = df.groupBy("category").agg(F.sum("amount").alias("total_sales")).withColumn("region", F.lit(None).cast("string"))
df_g  = df.agg(F.sum("amount").alias("total_sales")).withColumn("region", F.lit(None).cast("string")).withColumn("category", F.lit(None).cast("string"))

# unionByName matches columns by name, so differing column order is fine
final_df = df_rc.unionByName(df_r).unionByName(df_c).unionByName(df_g)
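
In this particular case the four sets form the full cube of (region, category), so the DataFrame cube() method is a one-call alternative (same F alias as above):

# cube() emits every combination of the listed columns: (region, category),
# (region), (category), and (); NULL marks a rolled-up column
final_df = df.cube("region", "category").agg(F.sum("amount").alias("total_sales"))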

🔹 Challenge 3: Recursive CTE (Org Chart)

📘 Problem:
You have employees(emp_id, emp_name, manager_id).

Get each employee's reporting level from the CEO, and show path_to_ceo.


✅ SQL (Recursive CTE):

WITH RECURSIVE emp_tree AS (
  -- Anchor: the CEO (no manager), level 1
  SELECT emp_id, emp_name, manager_id, 1 AS level,
         CAST(emp_name AS VARCHAR) AS path_to_ceo
  FROM employees
  WHERE manager_id IS NULL

  UNION ALL

  -- Recursive step: attach each employee beneath their manager's row
  SELECT e.emp_id, e.emp_name, e.manager_id, t.level + 1,
         CONCAT(t.path_to_ceo, ' → ', e.emp_name)
  FROM employees e
  JOIN emp_tree t ON e.manager_id = t.emp_id
)
SELECT * FROM emp_tree;

🔁 PySpark: No built-in recursive CTE support.

You can simulate it with iterative self-joins in a loop (expanding the tree one level per pass), or use GraphFrames for DAG/tree traversal. We'll cover that in the S3 project soon; a minimal sketch of the loop approach follows.
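
A minimal sketch of the iterative approach, assuming an employees DataFrame with the same columns as the SQL table and an acyclic hierarchy (otherwise the loop never terminates):

from pyspark.sql import functions as F

# Seed: the CEO (no manager), level 1, path starts at their own name
frontier = (employees
    .filter(F.col("manager_id").isNull())
    .select("emp_id", "emp_name", "manager_id",
            F.lit(1).alias("level"),
            F.col("emp_name").alias("path_to_ceo")))
result = frontier

# Expand one level per pass; stop when a pass finds no new reports.
# Note: count() triggers a job each pass, so cache frontier for real workloads.
while frontier.count() > 0:
    frontier = (employees.alias("e")
        .join(frontier.alias("t"), F.col("e.manager_id") == F.col("t.emp_id"))
        .select(F.col("e.emp_id"), F.col("e.emp_name"), F.col("e.manager_id"),
                (F.col("t.level") + 1).alias("level"),
                F.concat(F.col("t.path_to_ceo"), F.lit(" → "),
                         F.col("e.emp_name")).alias("path_to_ceo")))
    result = result.unionByName(frontier)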


🎯 What's Next?
