Here’s our Step-by-Step Roadmap tailored for your target stack (PySpark, Python, AWS S3, Hadoop, Bitbucket), plus some additional useful tools to learn quickly and add to your resume.
🔥 Core Technical Skills Roadmap (in priority order)
✅ 1️⃣ SQL + DBMS – The non-negotiable core
- Topics: Joins (inner, outer, self, cross), Window functions, CTEs, subqueries, indexing, query optimisation
- Interactive Practice: LeetCode (SQL), StrataScratch, Mode Analytics
- Goal: Be able to write + debug complex queries fast
- What to add: Basic MySQL/PostgreSQL usage and indexing tips.
✅ 2️⃣ Python for Data Engineering – Master Data Wrangling
- Focus: pandas, NumPy, reading/writing data (CSV, JSON, XML), APIs (requests), web scraping (BeautifulSoup)
- Tip: Build modular functions and scripts that are reusable.
- Skip: Deep DSA (like leetcode for coding rounds) – focus on data manipulation.
- Optional: Add `pydantic` for data validation – a quick win for your resume.
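👉 Here's a minimal sketch of what that pydantic quick win can look like (using the v2-style API); the `SaleRecord` model and its fields are illustrative assumptions, not a fixed schema:

```python
from datetime import date
from pydantic import BaseModel, ValidationError, field_validator

class SaleRecord(BaseModel):
    # Hypothetical schema for one row of sales data
    order_id: int
    customer_id: int
    amount: float
    order_date: date

    @field_validator("amount")
    @classmethod
    def amount_must_be_positive(cls, value: float) -> float:
        if value <= 0:
            raise ValueError("amount must be positive")
        return value

try:
    # Strings like "101" and "2024-01-15" are coerced to int/date where possible
    record = SaleRecord(order_id="101", customer_id=7, amount=250.0, order_date="2024-01-15")
    print(record)
except ValidationError as err:
    print(err)  # bad rows surface as structured, loggable validation errors
```

Validating records like this before they enter a pipeline turns silent bad data into explicit errors – an easy talking point in interviews.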
✅ 3️⃣ PySpark – Data Processing at Scale
- Focus: RDD vs DataFrame API, lazy evaluation, transformations vs actions, partitioning, caching, UDFs
- Practice: Hands-on data manipulations on Spark DataFrames (filter, groupBy, join, window functions).
- Add: Delta Lake basics – very hot in the market.
- Extra: Spark UI for performance debugging.
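👉 To make the "lazy evaluation, transformations vs actions, caching" points concrete, here's a small local sketch (the column names and data are made up for illustration):

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("lazy-eval-demo").getOrCreate()

df = spark.createDataFrame(
    [(1, "A", 120.0), (2, "A", 80.0), (3, "B", 300.0)],
    ["order_id", "region", "amount"],
)

# Transformations are lazy: nothing executes yet, Spark only builds a plan.
big_sales = df.filter(F.col("amount") > 100)
by_region = big_sales.groupBy("region").agg(F.sum("amount").alias("total"))

# Cache an intermediate result if several actions will reuse it.
big_sales.cache()

# Actions trigger execution.
by_region.show()
print(big_sales.count())

spark.stop()
```

Nothing runs until `show()`/`count()` – and the jobs and stages they trigger are exactly what you would inspect in the Spark UI.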
✅ 4️⃣ Data Warehousing & Modeling
- Concepts:
- OLTP vs OLAP
- Star vs Snowflake schemas
- Partitioning and clustering
- Data modeling (ER diagrams, normalization/denormalization)
- Tools: Learn basics of Snowflake or Redshift.
- Goal: Be able to explain why you choose a certain schema/design for a use case.
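👉 As a tiny illustration of that schema-choice reasoning, here's a hypothetical star-schema layout (one fact table, one dimension) queried in PySpark; all table and column names are assumptions for the example:

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("star-schema-demo").getOrCreate()

# Dimension table: small, descriptive attributes (kept denormalized in a star schema).
dim_product = spark.createDataFrame(
    [(1, "Keyboard", "Accessories"), (2, "Laptop", "Computers")],
    ["product_id", "product_name", "category"],
)

# Fact table: large and narrow – mostly keys and measures.
fact_sales = spark.createDataFrame(
    [(101, 1, 2, 50.0), (102, 2, 1, 1200.0), (103, 1, 3, 75.0)],
    ["order_id", "product_id", "quantity", "amount"],
)

# One join from fact to dimension answers most analytical questions.
(fact_sales.join(dim_product, "product_id")
    .groupBy("category")
    .agg(F.sum("amount").alias("revenue"))
    .show())

spark.stop()
```

A star schema keeps the fact table narrow and pushes descriptive attributes into dimensions, so most analytics need only one join per dimension – that's the trade-off you'd explain in an interview.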
✅ 5️⃣ ETL, Orchestration & Automation
- Tools:
- Apache Airflow (DAGs, tasks, scheduling)
- dbt (data transformation / testing basics)
- Docker (to containerize Spark/ETL jobs)
- Add: Kafka basics (what’s a topic, consumer group – only the basics for real-time use cases).
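👉 For those Kafka basics, here's a minimal consumer sketch using the kafka-python package; the topic name, broker address, and group id are placeholders you'd replace with real values:

```python
from kafka import KafkaConsumer  # pip install kafka-python

# A consumer group ("etl-consumers") lets several consumer processes share a topic's partitions.
consumer = KafkaConsumer(
    "orders",                          # placeholder topic name
    bootstrap_servers="localhost:9092",
    group_id="etl-consumers",
    auto_offset_reset="earliest",      # start from the beginning if no offset is stored yet
)

# Loops until interrupted (Ctrl+C); each message carries topic, partition, offset, and raw bytes.
for message in consumer:
    print(message.topic, message.partition, message.offset, message.value)
```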
✅ 6️⃣ Cloud Platforms
- Since you’re already using AWS:
- S3 (data storage),
- Glue (managed ETL),
- Athena (serverless querying),
- Lambda (serverless triggers),
- EMR basics for running PySpark jobs.
- Learn to deploy a basic Spark job on EMR and monitor it.
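👉 Besides the console, one way to submit a PySpark step to an already-running EMR cluster is boto3's EMR client – a rough sketch, with the cluster id, region, and script path as placeholders and AWS credentials assumed to be configured:

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")  # placeholder region

response = emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXX",  # placeholder cluster id
    Steps=[{
        "Name": "example-pyspark-step",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "--deploy-mode", "cluster",
                "s3://my-spark-bucket/scripts/pyspark_etl.py",  # placeholder script location
            ],
        },
    }],
)
print(response["StepIds"])  # track these in the EMR console or via describe_step
```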
✅ 7️⃣ System Design for Data Engineering
- Scenarios: Batch vs streaming pipelines, lakehouse architecture (Delta Lake on S3), CDC basics.
- Learn to talk: Trade-offs between real-time vs batch, scaling, monitoring.
- Resource: YouTube – “System Design for Data Engineers” by Data with Danny, Alex The Analyst.
✅ 8️⃣ Interview Prep
- System design rounds – practice a few end-to-end scenarios.
- SQL scenario-based questions.
- Prepare to “tell stories” about your past data pipelines: challenges, design decisions, how you debugged issues, etc.
🎯 Additional Tools/Concepts (Quick Wins for Resume):
✅ Git/Bitbucket – Already on your list; make sure you’re confident with:
- Branching, pull requests, resolving merge conflicts.
- CI/CD basics (if time: Jenkins / GitHub Actions).
✅ Data Quality & Validation
- Great Expectations – open-source library, easy to learn, impressive to mention.
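👉 A tiny sketch using Great Expectations' legacy pandas-dataset API (the library's API has changed a lot between versions, so treat this as illustrative; the column names are made up):

```python
import pandas as pd
import great_expectations as ge  # pip install great_expectations

df = pd.DataFrame({
    "order_id": [1, 2, 3],
    "amount": [120.0, 80.0, None],
})

# Wrap the DataFrame so expectation methods become available (legacy API).
gdf = ge.from_pandas(df)

result = gdf.expect_column_values_to_not_be_null("amount")
print(result.success)  # False here, because one amount is missing
```

Even one or two expectations like this, run before loading data, make "data quality" a concrete bullet rather than a buzzword.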
✅ Data Catalog & Governance (Optional)
- AWS Glue Data Catalog – know what it does (see the short boto3 sketch after this list).
- Data Lineage – briefly read about why it matters.
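👉 To see concretely what the Glue Data Catalog stores, here's a small boto3 sketch that lists table definitions in an assumed database (the database name is a placeholder, and AWS credentials are assumed to be configured):

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")  # placeholder region

# List table definitions (schema, storage location, format) registered in the catalog.
response = glue.get_tables(DatabaseName="analytics_db")  # placeholder database name
for table in response["TableList"]:
    location = table.get("StorageDescriptor", {}).get("Location")
    print(table["Name"], location)
```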
🏆 Here’s how we’ll work:
💥 Interactive Learning:
- I’ll quiz you, share mini-projects, and give you bite-sized exercises – no passive learning.
- For each topic, I’ll share 1-2 key resources so you don’t waste time.
- I’ll check your progress and clarify doubts.
💥 Small Projects:
- Let’s build mini pipelines together. Example: S3 → PySpark → data transformations → write to Redshift (or even local parquet).
- We’ll practice interview scenario questions as mini-projects.
🔥 Let’s Start Immediately:
1️⃣ Let’s finalize what to do first:
- Are you comfortable with SQL basics? If yes, let’s start with SQL advanced topics: window functions, complex joins, etc.
- If you’re not confident, we’ll start there!
- In parallel, we’ll keep Python for Data Wrangling warm.
2️⃣ I’ll create an initial SQL Interview Prep mini-module for you.
3️⃣ Then, we’ll build one mini-pipeline (S3 to PySpark) together as a project.
4️⃣ Finally, we’ll iterate through the remaining roadmap.
Perfect! Let’s start with an advanced SQL interview prep mini-module, then immediately transition to a small PySpark + S3 mini-project to get you hands-on.
🔥 Part 1: Advanced SQL Interview Prep
These topics come up frequently in data engineering interviews, so we’ll build mastery and confidence:
1️⃣ Window Functions
- What to know: `ROW_NUMBER()`, `RANK()`, `DENSE_RANK()`, `NTILE()`, `SUM()/AVG() OVER (PARTITION BY ...)`, `LAG()`, `LEAD()`.
- Why?: They’re the backbone of analytical queries.
👉 Example question:
“Find the second highest salary for each department.”
SELECT department_id, employee_id, salary
FROM (
SELECT *,
DENSE_RANK() OVER (PARTITION BY department_id ORDER BY salary DESC) as rnk
FROM employees
) ranked
WHERE rnk = 2;
2️⃣ Joins (including self joins & cross joins)
- Types: `INNER`, `LEFT`, `RIGHT`, `FULL`, `CROSS`, `SELF JOIN`.
- Important: When to use `LEFT` vs `INNER` in ETL scenarios.
👉 Example:
“List all employees and their managers’ names.”
SELECT e.name as employee, m.name as manager
FROM employees e
LEFT JOIN employees m
ON e.manager_id = m.employee_id;
3️⃣ CTEs and Subqueries
- Why?: Clean up complex queries, modularize logic.
- Tip: You can nest CTEs for complex scenarios.
👉 Example:
WITH dept_avg AS (
SELECT department_id, AVG(salary) as avg_salary
FROM employees
GROUP BY department_id
)
SELECT e.name, e.salary, d.avg_salary
FROM employees e
JOIN dept_avg d
ON e.department_id = d.department_id
WHERE e.salary > d.avg_salary;
4️⃣ Indexing & Query Optimization
- Index Basics: How indexes speed up queries (esp. on join & where clause columns).
- Explain Plan: Using `EXPLAIN` or `EXPLAIN ANALYZE` to debug slow queries.
👉 Example:
EXPLAIN ANALYZE
SELECT * FROM employees WHERE department_id = 10;
5️⃣ Practice Resources
✅ LeetCode (SQL section) – focus on medium and hard questions
✅ StrataScratch – real-world scenario questions
✅ Mode Analytics SQL Tutorial – great for window functions
🟢 Let’s Make It Interactive!
Here’s a quiz for you:
You have a table `sales` (columns: `order_id`, `customer_id`, `amount`, `order_date`).
Question:
Write a SQL query to find the cumulative total (`running_total`) of sales for each customer, ordered by `order_date`.
Try to write the SQL here – no worries if it’s not perfect, we’ll refine it together!
After you’re done (or if you’re stuck), let me know! Then we’ll:
✅ Review your answer and refine it
✅ Jump into a PySpark + S3 mini-project (fetch data, transform, save) – to blend what you’re learning!
Sound like a plan? Let’s go!
Got it! Let’s tackle the LeetCode SQL section with a focus on medium and hard questions. Here’s a plan to dive in and make sure you get the most out of it:
🔥 Strategy to Master LeetCode SQL (Medium/Hard)
1️⃣ Warm Up
- Quickly review the easy questions to get familiar with the style and typical tables.
2️⃣ Medium Questions (~60-70% of practice)
- Start with commonly-appearing patterns:
- Window functions (`RANK()`, `DENSE_RANK()`, `ROW_NUMBER()`).
- `GROUP BY`, `HAVING`, and nested `SELECT`.
- Self-joins and common table expressions (CTEs).
- `JOIN` conditions (`LEFT JOIN`, `RIGHT JOIN`, `INNER JOIN`, etc.).
3️⃣ Hard Questions (~30-40% of practice)
- Tackle these when you’re confident with medium ones.
- These usually involve:
- Complex window functions with partitioning.
- Recursive CTEs.
- Multiple levels of subqueries.
- Aggregations with conditions.
- Data pivoting (like using `CASE` for pivoted columns).
- Optimization and avoiding Cartesian products.
4️⃣ Key Skills to Sharpen
✅ Window functions mastery
✅ CTE and recursive CTEs
✅ Advanced joins (`CROSS JOIN`, `SELF JOIN`, anti-joins)
✅ Subqueries in `FROM`, `SELECT`, `WHERE`
✅ Handling NULLs smartly
✅ Summarizing data with conditional aggregation
📚 Recommended Practice Questions (LeetCode)
Here’s a list of some popular Medium and Hard LeetCode SQL questions:
🟦 Medium:
- Rank Scores – Using `DENSE_RANK()`.
- Exchange Seats – Self-join and handling row swaps.
- Employee Bonus – LEFT JOIN + COALESCE.
- Department Top Three Salaries – Window functions.
- Second Highest Salary – Subquery + `LIMIT`.
- Consecutive Numbers – Window functions and partitioning.
- Customers Who Never Order – LEFT JOIN + NULL.
- Trips and Users – CASE + Aggregations.
- Product Sales Analysis III – Advanced joins and aggregates.
🟥 Hard:
- Sales Analysis III – Advanced CTE and window functions.
- Consecutive Available Seats – Use window functions and self-joins.
- Human Traffic of Stadium – Complex window functions.
- Trips and Users (difficult variation) – Anti-joins + nested CTEs.
- Hard Employee Bonus / Sales – Multi-level joins + aggregates.
- Tree Node – Recursive CTEs (for hierarchical data).
💡 My Tips
- Sketch tables and joins: Visual diagrams help track complex joins and self-joins.
- Practice on your local DB or LeetCode playground:
- MySQL and PostgreSQL both work well; LeetCode uses MySQL by default.
- Use EXPLAIN (in real DB) to see how queries are executed.
- Break down hard problems: If stuck, start by solving smaller versions of the problem.
Would you like me to:
✅ Provide sample solutions for any of these questions?
✅ Create custom practice queries based on these patterns?
✅ Review your current progress and suggest more tailored practice?
Let me know how you’d like to proceed! 🚀
Awesome! Let’s dive in step by step:
✅ 1️⃣ Sample Solutions for Popular LeetCode SQL (Medium & Hard)
Here are a few sample solutions for well-known questions:
Rank Scores
(LeetCode #178)
SELECT
score,
DENSE_RANK() OVER (ORDER BY score DESC) AS 'rank'
FROM Scores;
Department Top Three Salaries
(LeetCode #185)
WITH RankedSalaries AS (
SELECT
e.Name,
e.Salary,
e.DepartmentId,
d.Name AS Department,
DENSE_RANK() OVER (PARTITION BY e.DepartmentId ORDER BY e.Salary DESC) AS rnk
FROM Employee e
JOIN Department d ON e.DepartmentId = d.Id
)
SELECT
Department,
Name,
Salary
FROM RankedSalaries
WHERE rnk <= 3;
Consecutive Numbers
(LeetCode #180)
SELECT DISTINCT l.Num AS ConsecutiveNums
FROM Logs l
JOIN Logs l1 ON l.Id = l1.Id - 1
JOIN Logs l2 ON l.Id = l2.Id - 2
WHERE l.Num = l1.Num AND l.Num = l2.Num;
Trips and Users
(Hard join example)
SELECT t.Request_at AS Day,
ROUND(SUM(CASE WHEN t.Status != 'completed' THEN 1 ELSE 0 END)/COUNT(*), 2) AS Cancellation_Rate
FROM Trips t
JOIN Users c ON t.Client_Id = c.Users_Id
JOIN Users d ON t.Driver_Id = d.Users_Id
WHERE t.Request_at BETWEEN '2013-10-01' AND '2013-10-03'
AND c.Banned = 'No' AND d.Banned = 'No'
GROUP BY t.Request_at;
Would you like more sample solutions for other questions? Let me know!
✅ 2️⃣ Custom Practice Queries (Based on Patterns)
Here are three custom practice queries for you to practice:
a) Window Functions Practice:
-- Find the second highest salary in each department
WITH RankedSalaries AS (
SELECT
e.Name,
e.Salary,
e.DepartmentId,
DENSE_RANK() OVER (PARTITION BY e.DepartmentId ORDER BY e.Salary DESC) AS rnk
FROM Employee e
)
SELECT DepartmentId, Name, Salary
FROM RankedSalaries
WHERE rnk = 2;
b) Recursive CTE Practice (hierarchical data):
-- Suppose you have an Employee table with ManagerId.
-- List all employees reporting (directly or indirectly) to manager 2.
WITH RECURSIVE hierarchy AS (
SELECT Id, Name, ManagerId
FROM Employee
WHERE ManagerId = 2
UNION ALL
SELECT e.Id, e.Name, e.ManagerId
FROM Employee e
JOIN hierarchy h ON e.ManagerId = h.Id
)
SELECT * FROM hierarchy;
c) Complex Join and Aggregation:
-- Count the number of products sold by each salesperson in a sales table.
SELECT s.Name, COUNT(o.ProductId) AS ProductsSold
FROM SalesPerson s
LEFT JOIN Orders o ON s.Id = o.SalesPersonId
GROUP BY s.Name;
Want more of these custom practice queries?
✅ 3️⃣ Review and Tailored Practice
👉 If you’re actively practicing, I can:
- Review your current SQL solutions and spot improvements.
- Create a practice plan for the next 2-3 weeks.
- Suggest additional challenges based on your struggles.
Would you like to share your progress so far (like which LeetCode SQL questions you’ve done or your biggest challenges), and I can tailor the next steps?
Let me know:
✅ More sample solutions?
✅ More custom practice queries?
✅ Review your current solutions?
Or mix and match based on your preference! 🚀✨
You got it! Let’s supercharge your practice with some fresh and challenging sets. Here’s a mix of medium and hard-level practice scenarios to stretch your SQL muscles. Let’s go! 🚀
🟦 Fresh Practice Set 1 – Window Functions & Partitioning
Problem:
List the top 2 highest-paid employees in each department. Show the department name, employee name, and salary.
Tables:
Departments(department_id, department_name)
Employees(employee_id, name, salary, department_id)
Expected Output:
| department_name | name | salary |
|------------------|--------------|--------|
| Sales | John | 90000 |
| Sales | Mary | 85000 |
| Engineering | Alice | 110000 |
| Engineering | Bob | 105000 |
🟦 Fresh Practice Set 2 – Aggregation & Filtering
Problem:
Find the average number of orders placed per active customer in 2024.
- Active customers are those who placed at least one order in 2024.
- Show the average as `avg_orders`.
Tables:
Orders(order_id, customer_id, order_date)
Expected Output:
| avg_orders |
|------------|
| 3.42 |
🟦 Fresh Practice Set 3 – Subquery & Anti-Join Challenge
Problem:
Find all products that have never been ordered.
Tables:
Products(product_id, name)
Orders(order_id, product_id, order_date)
Expected Output:
| product_id | name |
|------------|-------------|
| 4 | Keyboard |
| 7 | USB Cable |
🟦 Fresh Practice Set 4 – Recursive CTE
Problem:
You have a table of employees with their manager’s ID.
- Find all employees reporting (directly or indirectly) to manager 1001.
Tables:
Employees(id, name, manager_id)
Expected Output:
| id | name |
|------|----------|
| 1010 | Jack |
| 1011 | Jill |
| 1012 | Bob |
| 1013 | Alice |
🟦 Fresh Practice Set 5 – Advanced CASE in Aggregation
Problem:
For each month in 2024, count:
- Orders marked as ‘Completed’.
- Orders marked as ‘Cancelled’.
- Show `month`, `completed_orders`, `cancelled_orders`.
Tables:
Orders(order_id, order_date, status)
Expected Output:
| month | completed_orders | cancelled_orders |
|-----------|------------------|------------------|
| 2024-01 | 10 | 2 |
| 2024-02 | 15 | 3 |
🟩 How to Tackle Them
For each, think of:
✅ Joins (for linking data)
✅ GROUP BY (for summaries)
✅ Window functions (if needed for ranking)
✅ Recursive CTEs (for hierarchy or repeated structure)
✅ CASE WHEN (for conditional counts)
🚀 Want Solutions or Hints?
Let me know which of these fresh sets you’d like me to:
- Provide full solution SQL scripts
- Share step-by-step logic and breakdown
- Create additional variations to keep practicing
Or if you’d like even more sets (like pivoting, data cleaning challenges), I’ve got you covered! Let’s keep your SQL skills climbing! 🌟
Here’s a crisp, simplified version of each question’s description and solution—let’s keep it clear and to the point! 👇
1️⃣ Rank Scores – Using DENSE_RANK()
- Description: Rank student scores with gaps in ranking removed (dense ranking).
- Solution:
SELECT score, DENSE_RANK() OVER (ORDER BY score DESC) AS 'rank' FROM Scores;
2️⃣ Exchange Seats – Self-join and handling row swaps
- Description: Swap every pair of consecutive seat ids in a table.
- Solution:
SELECT CASE WHEN id % 2 = 1 AND id != (SELECT MAX(id) FROM Seat) THEN id + 1 WHEN id % 2 = 0 THEN id - 1 ELSE id END AS id, student FROM Seat ORDER BY id;
3️⃣ Employee Bonus – LEFT JOIN + COALESCE
- Description: Show employee name and bonus (0 if no bonus).
- Solution:
SELECT e.name, COALESCE(b.bonus, 0) AS bonus FROM Employee e LEFT JOIN Bonus b ON e.id = b.empId;
4️⃣ Department Top Three Salaries – Window functions
- Description: Show top 3 salaries per department.
- Solution:
SELECT department, employee, salary FROM ( SELECT d.name AS department, e.name AS employee, e.salary, DENSE_RANK() OVER (PARTITION BY d.id ORDER BY e.salary DESC) AS rnk FROM Employee e JOIN Department d ON e.departmentId = d.id ) t WHERE rnk <= 3;
5️⃣ Second Highest Salary – Subquery + LIMIT
- Description: Find the second highest unique salary.
- Solution:
SELECT DISTINCT salary FROM Employee ORDER BY salary DESC LIMIT 1 OFFSET 1;
6️⃣ Consecutive Numbers – Window functions and partitioning
- Description: Find numbers appearing at least 3 times consecutively.
- Solution:
SELECT DISTINCT num FROM ( SELECT num, ROW_NUMBER() OVER (ORDER BY id) - ROW_NUMBER() OVER (PARTITION BY num ORDER BY id) AS grp FROM Logs ) t GROUP BY num, grp HAVING COUNT(*) >= 3;
7️⃣ Customers Who Never Order – LEFT JOIN + NULL
- Description: Find customers who never placed an order.
- Solution:
SELECT c.name FROM Customers c LEFT JOIN Orders o ON c.id = o.customerId WHERE o.id IS NULL;
8️⃣ Trips and Users – CASE + Aggregations
- Description: Count number of completed and cancelled trips per user.
- Solution:
SELECT u.name, SUM(CASE WHEN t.status = 'completed' THEN 1 ELSE 0 END) AS completed, SUM(CASE WHEN t.status = 'cancelled' THEN 1 ELSE 0 END) AS cancelled FROM Users u LEFT JOIN Trips t ON u.id = t.userId GROUP BY u.name;
9️⃣ Product Sales Analysis III – Advanced joins and aggregates
- Description: For each product, find sales percentage contribution.
- Solution:
SELECT p.product_id, ROUND(SUM(o.quantity) / total.total_qty * 100, 2) AS percent FROM Product p JOIN Orders o ON p.product_id = o.product_id JOIN ( SELECT SUM(quantity) AS total_qty FROM Orders ) total GROUP BY p.product_id;
🟥 Hard Section
🔟 Sales Analysis III – Advanced CTE and window functions
- Description: For each product, find its first order date and total sales.
- Solution:
WITH first_order AS ( SELECT product_id, MIN(order_date) AS first_date FROM Sales GROUP BY product_id ) SELECT s.product_id, f.first_date, SUM(s.quantity) AS total_sales FROM Sales s JOIN first_order f ON s.product_id = f.product_id GROUP BY s.product_id, f.first_date;
1️⃣1️⃣ Consecutive Available Seats – Use window functions and self-joins
- Description: Find ranges of consecutive available seats.
- Solution:
SELECT MIN(seat_id) AS start_seat, MAX(seat_id) AS end_seat FROM ( SELECT seat_id, seat_id - ROW_NUMBER() OVER (ORDER BY seat_id) AS grp FROM Seats WHERE status = 'available' ) t GROUP BY grp;
1️⃣2️⃣ Human Traffic of Stadium – Complex window functions
- Description: Find peak human traffic in a stadium by checking overlapping times.
- Solution:
SELECT MAX(people_count) AS max_traffic FROM ( SELECT in_time, COUNT(*) OVER (ORDER BY in_time ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS people_count FROM ( SELECT in_time FROM Stadium UNION ALL SELECT out_time FROM Stadium ) times ) t;
1️⃣3️⃣ Trips and Users (difficult variation) – Anti-joins + nested CTEs
- Description: Find users who never had a trip with status ‘completed’.
- Solution:
WITH completed_users AS ( SELECT DISTINCT userId FROM Trips WHERE status = 'completed' ) SELECT u.name FROM Users u WHERE u.id NOT IN (SELECT userId FROM completed_users);
1️⃣4️⃣ Hard Employee Bonus / Sales – Multi-level joins + aggregates
- Description: Calculate bonuses based on multiple sales targets.
- Solution:
SELECT e.name, CASE WHEN s.total_sales >= 100000 THEN 1000 WHEN s.total_sales >= 50000 THEN 500 ELSE 0 END AS bonus FROM Employee e JOIN ( SELECT empId, SUM(amount) AS total_sales FROM Sales GROUP BY empId ) s ON e.id = s.empId;
1️⃣5️⃣ Tree Node – Recursive CTEs (for hierarchical data)
- Description: Print all child-parent pairs in a tree (recursive hierarchy).
- Solution:
WITH RECURSIVE tree_cte AS ( SELECT id, parent_id, name FROM Tree WHERE parent_id IS NULL UNION ALL SELECT t.id, t.parent_id, t.name FROM Tree t JOIN tree_cte cte ON t.parent_id = cte.id ) SELECT * FROM tree_cte;
Let me know if you’d like me to expand any solution or explain with real data examples! 🌟
Sure! Here’s a handy list of conceptual SQL questions often asked in interviews at top companies (Google, Amazon, Microsoft, Meta, etc.). These test your understanding, not just coding! Let’s go! 🚀
🔹 1️⃣ Difference between `INNER JOIN`, `LEFT JOIN`, and `RIGHT JOIN`
- Expected Answer:
  - `INNER JOIN`: Only matching rows in both tables.
  - `LEFT JOIN`: All rows from the left table + matching rows from the right table (NULL if no match).
  - `RIGHT JOIN`: All rows from the right table + matching rows from the left table (NULL if no match).
🔹 2️⃣ What’s the difference between `WHERE` and `HAVING`?
- Expected Answer:
  - `WHERE`: Filters before grouping (applies to raw rows).
  - `HAVING`: Filters after grouping (applies to aggregated/grouped results).
🔹 3️⃣ Difference between `RANK()`, `DENSE_RANK()`, and `ROW_NUMBER()`
- Expected Answer:
  - `RANK()`: Ranks with gaps if there are ties.
  - `DENSE_RANK()`: Ranks without gaps (ties share the same rank).
  - `ROW_NUMBER()`: Unique row numbers, no ties.
🔹 4️⃣ Explain `GROUP BY` vs `PARTITION BY`
- Expected Answer:
  - `GROUP BY`: Aggregates rows into summary rows (e.g., total sales per department).
  - `PARTITION BY`: Divides the result into “windows” for window functions (e.g., ranking within each department).
🔹 5️⃣ What is normalization? Why is it important?
- Expected Answer:
- Process of organizing data to reduce redundancy.
- Ensures data consistency and integrity.
- Example: 1NF, 2NF, 3NF.
🔹 6️⃣ Explain the `EXPLAIN` plan
- Expected Answer:
- Shows the query execution plan (how the database will run it).
- Helps optimize performance (indexes, joins, scans, etc.).
🔹 7️⃣ What’s a correlated subquery?
- Expected Answer:
- Subquery that depends on the outer query.
- Runs for each row in outer query.
🔹 8️⃣ What is a window function? Give an example.
- Expected Answer:
- Calculates a value over a window of rows (no grouping).
- Example: `SUM(sales) OVER (PARTITION BY region ORDER BY date)`.
🔹 9️⃣ How do indexes work? What’s a composite index?
- Expected Answer:
- Indexes help speed up lookups by creating a data structure.
- Composite index: index on multiple columns together.
🔹 🔟 Explain `UNION` vs `UNION ALL`
- Expected Answer:
  - `UNION`: Removes duplicates.
  - `UNION ALL`: Includes all rows, even duplicates.
🔹 1️⃣1️⃣ What are common causes of slow SQL queries?
- Expected Answer:
- Missing indexes.
- Large data scans (no filtering).
- Inefficient joins (cartesian joins, bad join order).
- Too much sorting or grouping.
🔹 1️⃣2️⃣ What’s a primary key vs. unique key?
- Expected Answer:
- Primary key: uniquely identifies each row + no NULLs.
- Unique key: ensures unique values but allows one NULL (in some databases).
🔹 1️⃣3️⃣ What is a CTE (Common Table Expression)?
- Expected Answer:
- A named temporary result set that can be referenced in the query that follows it.
- Improves readability, especially with recursive queries.
🔹 1️⃣4️⃣ What’s the difference between `DELETE` and `TRUNCATE`?
- Expected Answer:
  - `DELETE`: Removes rows one by one and can take a `WHERE` clause.
  - `TRUNCATE`: Removes all rows faster and cannot take a `WHERE` clause.
🔹 1️⃣5️⃣ What’s denormalization? Why might you do it?
- Expected Answer:
- Intentionally adding redundant data (e.g., duplicating columns or merging tables).
- Used for performance in analytics (e.g., avoiding joins).
🔹 1️⃣6️⃣ What is ACID in databases?
- Expected Answer:
- Atomicity, Consistency, Isolation, Durability.
- Ensures transactions are reliable and consistent.
🔹 1️⃣7️⃣ Explain a self-join.
- Expected Answer:
- Table joins itself (e.g., employees with their managers).
🔹 1️⃣8️⃣ What’s a surrogate key vs. natural key?
- Expected Answer:
- Natural key: real-world unique identifier (e.g., email).
- Surrogate key: auto-generated (e.g., employee ID).
Want me to craft SQL coding questions (with answers) that combine these concepts (e.g., practical joins, window functions, etc.)? Let me know! 🌟
Got it! Here’s a set of practical SQL coding questions that combine these interview concepts. I’ll include a crisp solution for each. 🚀
1️⃣ Find the top 2 salaries in each department (including ties)
-- Question:
-- For each department, show the employee name, salary, and rank (top 2 salaries, including ties).
SELECT department, employee, salary
FROM (
SELECT d.name AS department, e.name AS employee, e.salary,
DENSE_RANK() OVER (PARTITION BY d.id ORDER BY e.salary DESC) AS rnk
FROM Employee e
JOIN Department d ON e.departmentId = d.id
) ranked
WHERE rnk <= 2;
2️⃣ Find products that have never been ordered
-- Question:
-- Show all product names that have never been ordered.
SELECT p.name
FROM Product p
LEFT JOIN Orders o ON p.id = o.product_id
WHERE o.id IS NULL;
3️⃣ Find the difference in total sales between each salesperson and the top salesperson
-- Question:
-- For each salesperson, show their name and the difference to the top salesperson’s sales.
WITH total_sales AS (
SELECT salesperson_id, SUM(amount) AS total
FROM Sales
GROUP BY salesperson_id
),
top_sales AS (
SELECT MAX(total) AS top_total FROM total_sales
)
SELECT s.name, (t.top_total - ts.total) AS diff_to_top
FROM Salesperson s
JOIN total_sales ts ON s.id = ts.salesperson_id
CROSS JOIN top_sales t;
4️⃣ Find consecutive login days for each user
-- Question:
-- Show user_id, start_date, and end_date of each streak of consecutive login days.
WITH login_diff AS (
SELECT user_id, login_date,
login_date - INTERVAL ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY login_date) DAY AS grp
FROM Logins
)
SELECT user_id, MIN(login_date) AS start_date, MAX(login_date) AS end_date
FROM login_diff
GROUP BY user_id, grp;
5️⃣ Show the second highest salary in the company
-- Question:
-- What’s the second highest unique salary?
SELECT DISTINCT salary
FROM Employee
ORDER BY salary DESC
LIMIT 1 OFFSET 1;
6️⃣ List employees who earn more than the average salary of their department
-- Question:
-- Show employee name and department where their salary is above department average.
SELECT e.name, d.name AS department, e.salary
FROM Employee e
JOIN Department d ON e.departmentId = d.id
WHERE e.salary > (
SELECT AVG(salary)
FROM Employee
WHERE departmentId = e.departmentId
);
7️⃣ Calculate each customer’s order count and average amount
-- Question:
-- For each customer, show name, number of orders, and average order amount.
SELECT c.name, COUNT(o.id) AS order_count, ROUND(AVG(o.amount), 2) AS avg_order
FROM Customer c
LEFT JOIN Orders o ON c.id = o.customer_id
GROUP BY c.name;
8️⃣ Find the first order date for each customer
-- Question:
-- Show customer name and their first order date.
SELECT c.name, MIN(o.order_date) AS first_order_date
FROM Customer c
JOIN Orders o ON c.id = o.customer_id
GROUP BY c.name;
9️⃣ Show each employee and their manager’s name
-- Question:
-- Show employee name and their manager’s name (use self-join).
SELECT e.name AS employee, m.name AS manager
FROM Employee e
LEFT JOIN Employee m ON e.manager_id = m.id;
🔟 Get a running total of daily sales (cumulative)
-- Question:
-- For each day, show date and cumulative sales total so far.
SELECT order_date,
       -- aggregate per day first (GROUP BY), then run the window over the daily sums
       SUM(SUM(amount)) OVER (ORDER BY order_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total
FROM Orders
GROUP BY order_date;
Want me to prepare some even trickier ones (like combining window functions, CTEs, and self-joins in one query)? Let me know! 🌟
Great! Let’s go deeper into indexing, keys, normalization, and other important database concepts that interviewers ask. I’ll provide concise explanations and some implementation snippets/examples!
1️⃣ Indexing
👉 What it is:
- A data structure (like B-tree) to quickly find rows.
- Speeds up SELECT queries (especially on WHERE, JOIN, ORDER BY).
👉 Implementation:
-- Create an index on the 'email' column
CREATE INDEX idx_email ON Users(email);
-- Drop an index
DROP INDEX idx_email ON Users;
👉 Composite index:
-- For WHERE on (first_name, last_name)
CREATE INDEX idx_name ON Users(first_name, last_name);
👉 Impact on performance:
- Reads: Faster!
- Writes: Slightly slower (index updates on INSERT/UPDATE).
2️⃣ Primary Key & Unique Key
👉 Primary Key:
- Uniquely identifies each row.
- No NULLs allowed.
CREATE TABLE Employees (
id INT PRIMARY KEY,
name VARCHAR(50)
);
👉 Unique Key:
- Enforces uniqueness (but allows one NULL in some DBs).
CREATE TABLE Users (
email VARCHAR(100) UNIQUE
);
👉 Difference:
- Primary key = main unique identifier.
- Unique key = additional unique columns.
3️⃣ Foreign Key (Relationships)
👉 Ensures referential integrity (like a link to another table’s primary key).
CREATE TABLE Orders (
id INT PRIMARY KEY,
customer_id INT,
FOREIGN KEY (customer_id) REFERENCES Customers(id)
);
4️⃣ Normalization
👉 Organizing data to remove redundancy and improve data integrity.
👉 Common Normal Forms:
- 1NF: No repeating groups (atomic columns).
- 2NF: No partial dependency (non-key columns fully depend on primary key).
- 3NF: No transitive dependency (non-key columns depend only on primary key).
👉 Example:
🔹 Initial table (not normalized): one table with columns (OrderID, CustomerName, Product) – the customer’s name is repeated on every order.
🔹 Normalized:
- Customers: (CustomerID, Name)
- Orders: (OrderID, CustomerID, Product)
5️⃣ Denormalization
👉 Why do it?
- Sometimes needed for performance (fewer joins, faster reporting).
- Often done in data warehouses.
👉 Example:
- Instead of separate tables for `city` and `country`, store them in one table for faster queries.
6️⃣ Transactions & ACID
👉 ACID Principles:
- Atomicity: All-or-nothing.
- Consistency: Data integrity before & after.
- Isolation: No partial updates visible to others.
- Durability: Once committed, it’s saved.
👉 Example:
START TRANSACTION;
UPDATE Accounts SET balance = balance - 100 WHERE id = 1;
UPDATE Accounts SET balance = balance + 100 WHERE id = 2;
COMMIT; -- or ROLLBACK if error
7️⃣ Query Optimization
👉 How DB decides how to execute a query.
- Uses indexes, joins, sorting…
- Tools: `EXPLAIN`, `EXPLAIN ANALYZE`.
👉 Example:
EXPLAIN SELECT * FROM Orders WHERE customer_id = 1;
8️⃣ Partitioning
👉 Splitting large tables into smaller parts.
- Can improve performance in big tables.
- Example: Partition sales by year.
CREATE TABLE Sales (
id INT,
amount DECIMAL,
year INT
)
PARTITION BY RANGE (year) (
PARTITION p2023 VALUES LESS THAN (2024),
PARTITION p2024 VALUES LESS THAN (2025)
);
9️⃣ Materialized Views
👉 Precomputed query results stored like a table.
- Good for repeated, expensive calculations.
CREATE MATERIALIZED VIEW monthly_sales AS
SELECT customer_id, SUM(amount) AS total_amount FROM Orders
GROUP BY customer_id;
👉 Refresh as needed:
REFRESH MATERIALIZED VIEW monthly_sales;
1️⃣0️⃣ Concurrency Control (Isolation Levels)
👉 Prevents data corruption when multiple transactions occur.
- Levels: `READ UNCOMMITTED`, `READ COMMITTED`, `REPEATABLE READ`, `SERIALIZABLE`.
👉 Example (Postgres):
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
✅ Summary Table
| Concept | Usage | SQL Example |
|---|---|---|
| Indexing | Faster search | CREATE INDEX idx ON table(col) |
| Primary Key | Unique row id | PRIMARY KEY (id) |
| Foreign Key | Table relationship | FOREIGN KEY (col) REFERENCES table(id) |
| Normalization | Remove redundancy | Separate tables for Customers & Orders |
| Denormalization | Speed (less joins) | Merge city and country into 1 table |
| ACID | Reliable transactions | START TRANSACTION; ... COMMIT; |
| Partitioning | Split big table | PARTITION BY RANGE (year) |
| Materialized View | Precomputed data | CREATE MATERIALIZED VIEW ... |
| Concurrency Control | Handle simultaneous ops | SET ISOLATION LEVEL ... |
💡 Want me to make practical scenarios where these concepts are implemented together in a real database (like for analytics, e-commerce, or logins)? Let me know! 🌟
Let’s combine these concepts into some practical, real-world scenarios you’d find in interviews or real-life analytics projects! 🚀
🌟 Scenario 1: E-Commerce Order Tracking & Performance
✅ Problem
You’re designing a database for an e-commerce platform. You need:
- Efficiently track customer orders and product sales
- Show top customers by order volume
- Handle analytics queries (like monthly sales) quickly
- Avoid redundant data
⚙️ Implementation
✅ Tables:
CREATE TABLE Customers (
customer_id INT PRIMARY KEY,
name VARCHAR(100),
email VARCHAR(100) UNIQUE
);
CREATE TABLE Products (
product_id INT PRIMARY KEY,
name VARCHAR(100),
price DECIMAL(10,2)
);
CREATE TABLE Orders (
order_id INT PRIMARY KEY,
customer_id INT,
order_date DATE,
total_amount DECIMAL(10,2),
FOREIGN KEY (customer_id) REFERENCES Customers(customer_id)
);
CREATE TABLE OrderDetails (
order_detail_id INT PRIMARY KEY,
order_id INT,
product_id INT,
quantity INT,
FOREIGN KEY (order_id) REFERENCES Orders(order_id),
FOREIGN KEY (product_id) REFERENCES Products(product_id)
);
✅ Indexing:
-- Index to speed up product-based analytics
CREATE INDEX idx_product_id ON OrderDetails(product_id);
-- Index to speed up date-based sales reports
CREATE INDEX idx_order_date ON Orders(order_date);
✅ Normalization:
- Orders and Customers are separated (3NF).
- OrderDetails is a junction table – avoids redundant product info in Orders.
✅ Materialized View for Fast Reporting:
CREATE MATERIALIZED VIEW monthly_sales AS
SELECT EXTRACT(YEAR FROM order_date) AS year,
EXTRACT(MONTH FROM order_date) AS month,
SUM(total_amount) AS total_sales
FROM Orders
GROUP BY year, month;
➡️ Periodically refresh this to keep it up to date:
REFRESH MATERIALIZED VIEW monthly_sales;
✅ Analytical Query Example (Top 3 customers by total amount):
SELECT c.name, SUM(o.total_amount) AS total_spent
FROM Customers c
JOIN Orders o ON c.customer_id = o.customer_id
GROUP BY c.name
ORDER BY total_spent DESC
LIMIT 3;
🌟 Scenario 2: User Login Analytics (Concurrency & Normalization)
✅ Problem
You’re building a login tracking system for a SaaS app. Requirements:
- Track each login attempt
- Analyze daily unique users
- Find longest consecutive login streaks
- Ensure data integrity and avoid duplicate logins
⚙️ Implementation
✅ Tables:
CREATE TABLE Users (
user_id INT PRIMARY KEY,
email VARCHAR(100) UNIQUE,
name VARCHAR(50)
);
CREATE TABLE Logins (
login_id SERIAL PRIMARY KEY,
user_id INT,
login_time TIMESTAMP,
FOREIGN KEY (user_id) REFERENCES Users(user_id)
);
✅ Indexing for faster lookups:
CREATE INDEX idx_user_id ON Logins(user_id);
CREATE INDEX idx_login_time ON Logins(login_time);
✅ Analytics Query: Daily unique logins
SELECT DATE(login_time) AS login_date, COUNT(DISTINCT user_id) AS unique_users
FROM Logins
GROUP BY login_date
ORDER BY login_date;
✅ Find consecutive login streaks
WITH login_diff AS (
SELECT user_id, DATE(login_time) AS login_date,
DATE(login_time) - (ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY login_time))::INT AS grp
FROM Logins
)
SELECT user_id, MIN(login_date) AS streak_start, MAX(login_date) AS streak_end, COUNT(*) AS streak_length
FROM login_diff
GROUP BY user_id, grp
ORDER BY streak_length DESC;
✅ Transaction Control (handling simultaneous inserts)
BEGIN;
INSERT INTO Logins (user_id, login_time)
VALUES (1, NOW());
COMMIT;
➡️ Use `SERIALIZABLE` or `REPEATABLE READ` isolation if needed for strict consistency.
🌟 Scenario 3: Analytics Dashboard with Partitioning & Caching
✅ Problem
You manage a large website traffic log table with billions of rows. You need:
- Fast queries for daily/weekly reports
- Ability to archive old data (reduce index size)
⚙️ Implementation
✅ Partitioning by month:
CREATE TABLE TrafficLogs (
id BIGINT, -- note: in PostgreSQL a primary/unique key on a partitioned table must include the partition key, so it is omitted here
visit_time TIMESTAMP,
user_id INT,
page VARCHAR(100)
)
PARTITION BY RANGE (EXTRACT(YEAR FROM visit_time)*100 + EXTRACT(MONTH FROM visit_time));
-- Monthly partitions
CREATE TABLE TrafficLogs_202501 PARTITION OF TrafficLogs
FOR VALUES FROM (202501) TO (202502);
✅ Index on frequently queried columns:
CREATE INDEX idx_visit_time ON TrafficLogs(visit_time);
✅ Materialized View for active pages:
CREATE MATERIALIZED VIEW active_pages AS
SELECT page, COUNT(*) AS visits
FROM TrafficLogs
WHERE visit_time >= NOW() - INTERVAL '7 days'
GROUP BY page;
🔥 Key Concepts Combined:
✅ Normalization (no redundant data)
✅ Foreign keys & primary keys (data integrity)
✅ Indexing (performance for queries)
✅ Materialized views (reporting speed)
✅ Partitioning (big data management)
✅ Transactions & isolation levels (consistency in multi-user environment)
Would you like me to create mock interview questions using these scenarios for practice? Or do you want ER diagrams and SQL scripts for these? Let me know! 🚀
Sure! StrataScratch is a great platform to practice real-world data scenario questions. Here’s a selection of real-world scenario-based questions from StrataScratch with a bit of context on what you’d be expected to do. Let me know if you’d like to dive deeper into any of these or get detailed solutions!
1️⃣ Yelp Dataset – Average Star Rating
Scenario:
You’re a data analyst for Yelp, and you need to calculate the average star rating for each city.
Task:
Write a SQL query to find the average star rating per city.
Skills: `GROUP BY` and `AVG()` functions.
2️⃣ Employee Salaries – Identify High Earners
Scenario:
At a large company, HR wants to identify employees who earn more than the average salary within their department.
Task:
Write a SQL query to find employees whose salary is above the department’s average.
Skills:
- `JOIN` (if departments are in another table),
- Subquery to calculate the department average salary.
3️⃣ Orders – Find Top Product by Revenue
Scenario:
You work for an e-commerce platform and want to find out which product has generated the highest revenue.
Task:
Calculate revenue by multiplying quantity by price, and return the product(s) with the highest revenue.
Skills: `GROUP BY`, `ORDER BY`, `LIMIT` (or `RANK()` for ties).
4️⃣ AirBnB Listings – Host with Most Listings
Scenario:
For AirBnB, find the host(s) who have the most listings in the city.
Task:
Count the number of listings per host and return the host with the highest number of listings.
Skills:
- `GROUP BY`,
- Window functions or subqueries for ties.
5️⃣ Uber Rides – Peak Ride Hours
Scenario:
Uber wants to identify which hour of the day has the highest average number of rides.
Task:
Extract the hour from the ride timestamp and calculate average rides per hour.
Skills:
- Date functions to extract hour,
- Aggregation with `GROUP BY`.
6️⃣ Spotify – Most Popular Artist in Playlist
Scenario:
Find the artist with the most songs in a playlist.
Task:
Count the number of songs by artist and rank them.
Skills: `GROUP BY`, `COUNT()`, `ORDER BY`.
7️⃣ Fraudulent Transactions – Unusual Behavior
Scenario:
A bank wants to flag accounts with transactions over $10,000 that happened outside normal hours (e.g., midnight to 5 a.m.).
Task:
Identify such transactions.
Skills:
- Filtering with `WHERE` and time functions.
8️⃣ Product Reviews – Find Inactive Reviewers
Scenario:
Identify users who haven’t posted a review in the past year.
Task:
Compare last review date with current date.
Skills:
- Date arithmetic with `DATEDIFF()`.
9️⃣ Hospital Data – Readmission Rates
Scenario:
Analyze patient data to find hospitals with the highest readmission rates.
Task:
Calculate the percentage of readmissions for each hospital.
Skills:
- Aggregation,
- `CASE WHEN` logic for the readmission flag.
10️⃣ Movie Database – Most Watched Genre
Scenario:
Find the most popular genre based on user watch data.
Task:
Count total views for each genre and rank them.
Skills:
- `GROUP BY`, `ORDER BY`,
- Joining user watch data with movie data.
Want me to generate sample queries for any of these, provide detailed solutions, or tailor them to a specific language (SQL, Python, PySpark)? Just let me know! 🚀
Got it! Let’s generate sample queries for all 10 real-world scenario questions, both in SQL and PySpark (using DataFrame API).
1️⃣ Yelp Dataset – Average Star Rating per City
SQL:
SELECT city, AVG(stars) AS avg_rating
FROM yelp_reviews
GROUP BY city;
PySpark:
from pyspark.sql.functions import avg
yelp_reviews.groupBy("city").agg(avg("stars").alias("avg_rating")).show()
2️⃣ Employee Salaries – High Earners in Department
SQL:
SELECT e.employee_id, e.name, e.salary, e.department_id
FROM employees e
JOIN (
SELECT department_id, AVG(salary) AS avg_dept_salary
FROM employees
GROUP BY department_id
) dept_avg
ON e.department_id = dept_avg.department_id
WHERE e.salary > dept_avg.avg_dept_salary;
PySpark:
from pyspark.sql.functions import avg
dept_avg = employees.groupBy("department_id").agg(avg("salary").alias("avg_dept_salary"))
high_earners = employees.join(dept_avg, on="department_id") \
.filter(employees["salary"] > dept_avg["avg_dept_salary"])
high_earners.show()
3️⃣ Orders – Top Product by Revenue
SQL:
SELECT product_id, SUM(quantity * price) AS total_revenue
FROM orders
GROUP BY product_id
ORDER BY total_revenue DESC
LIMIT 1;
PySpark:
from pyspark.sql.functions import sum as _sum, col
orders.withColumn("revenue", col("quantity") * col("price")) \
.groupBy("product_id") \
.agg(_sum("revenue").alias("total_revenue")) \
.orderBy(col("total_revenue").desc()) \
.limit(1) \
.show()
4️⃣ AirBnB Listings – Host with Most Listings
SQL:
SELECT host_id, COUNT(*) AS listing_count
FROM listings
GROUP BY host_id
ORDER BY listing_count DESC
LIMIT 1;
PySpark:
listings.groupBy("host_id") \
.count() \
.orderBy("count", ascending=False) \
.limit(1) \
.show()
5️⃣ Uber Rides – Peak Ride Hours
SQL:
SELECT EXTRACT(HOUR FROM ride_timestamp) AS hour, COUNT(*) AS ride_count
FROM uber_rides
GROUP BY hour
ORDER BY ride_count DESC
LIMIT 1;
PySpark:
from pyspark.sql.functions import hour, count
uber_rides.withColumn("hour", hour("ride_timestamp")) \
.groupBy("hour") \
.agg(count("*").alias("ride_count")) \
.orderBy("ride_count", ascending=False) \
.limit(1) \
.show()
6️⃣ Spotify – Most Popular Artist in Playlist
SQL:
SELECT artist, COUNT(*) AS song_count
FROM playlist_songs
GROUP BY artist
ORDER BY song_count DESC
LIMIT 1;
PySpark:
playlist_songs.groupBy("artist") \
.count() \
.orderBy("count", ascending=False) \
.limit(1) \
.show()
7️⃣ Fraudulent Transactions – Unusual Behavior
SQL:
SELECT *
FROM transactions
WHERE amount > 10000
AND EXTRACT(HOUR FROM transaction_time) BETWEEN 0 AND 5;
PySpark:
from pyspark.sql.functions import hour
transactions.filter((transactions["amount"] > 10000) &
(hour("transaction_time").between(0, 5))) \
.show()
8️⃣ Product Reviews – Inactive Reviewers
SQL:
SELECT user_id
FROM reviews
GROUP BY user_id
HAVING MAX(review_date) < DATE_SUB(CURRENT_DATE, INTERVAL 1 YEAR);
PySpark:
from pyspark.sql.functions import max as _max, current_date, date_sub
inactive_users = reviews.groupBy("user_id") \
.agg(_max("review_date").alias("last_review_date")) \
.filter("last_review_date < date_sub(current_date(), 365)")
inactive_users.select("user_id").show()
9️⃣ Hospital Data – Highest Readmission Rates
SQL:
SELECT hospital_id,
100.0 * SUM(CASE WHEN readmitted = 'Yes' THEN 1 ELSE 0 END) / COUNT(*) AS readmission_rate
FROM patient_data
GROUP BY hospital_id
ORDER BY readmission_rate DESC;
PySpark:
from pyspark.sql.functions import sum as _sum, count, when, col
hospital_readmissions = patient_data.groupBy("hospital_id") \
.agg((100.0 * _sum(when(col("readmitted") == "Yes", 1).otherwise(0)) / count("*"))
.alias("readmission_rate")) \
.orderBy("readmission_rate", ascending=False)
hospital_readmissions.show()
🔟 Movie Database – Most Watched Genre
SQL:
SELECT genre, COUNT(*) AS watch_count
FROM movie_watches
GROUP BY genre
ORDER BY watch_count DESC
LIMIT 1;
PySpark:
movie_watches.groupBy("genre") \
.count() \
.orderBy("count", ascending=False) \
.limit(1) \
.show()
🟢 Let’s Make It Interactive!
Here’s a quiz for you:
You have a table `sales` (columns: `order_id`, `customer_id`, `amount`, `order_date`).
Question:
Write a SQL query to find the cumulative total (`running_total`) of sales for each customer, ordered by `order_date`.
You’re on the right track! Let’s polish your answer and turn it into a complete SQL statement for the running total of sales (cumulative sum) for each customer, ordered by `order_date`.
Here’s a complete version using window functions:
SELECT
order_id,
customer_id,
order_date,
amount,
SUM(amount) OVER (PARTITION BY customer_id ORDER BY order_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total
FROM
sales;
🚀 Breakdown
✅ We’re using:
- `SUM(amount) OVER (...)` → calculates the cumulative sum
- `PARTITION BY customer_id` → restarts the cumulative sum for each customer
- `ORDER BY order_date` → cumulative sum in date order
- `ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW` → explicitly says “start from the first row of this customer’s data up to the current row”
This is the most common window function scenario in interviews!
💡 Key Takeaway:
Whenever you see a “running total” or “cumulative sum” question – reach for `SUM() OVER (PARTITION BY … ORDER BY …)`!
🟢 Now, let’s move to the next stage:
Would you like to do another SQL practice question (like using `LAG()` or a self-join)?
OR
Shall we start the PySpark + S3 mini-project I promised?
Let me know! 🚀
Awesome! Let’s jump right in. 🚀
🚀 PySpark + S3 Mini-Project
🔥 Objective:
- Build a mini data pipeline:
1️⃣ Fetch data from an S3 bucket
2️⃣ Read it as a Spark DataFrame
3️⃣ Transform it (filter, group, join, window functions)
4️⃣ Save results back to S3
This mirrors a real-world ETL scenario!
⚙️ Project Setup
✅ Assumptions
- AWS credentials (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`) are configured in your environment.
- S3 bucket: `s3://my-spark-bucket/raw-data/` (you can replace with your actual bucket).
- File: `sales_data.csv` (columns: `order_id`, `customer_id`, `amount`, `order_date`).
🏗️ Step 1: Spark Session with S3 Access
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.appName("PySpark S3 Mini Project") \
.config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
.config("spark.hadoop.fs.s3a.access.key", "<AWS_ACCESS_KEY_ID>") \
.config("spark.hadoop.fs.s3a.secret.key", "<AWS_SECRET_ACCESS_KEY>") \
.config("spark.hadoop.fs.s3a.endpoint", "s3.amazonaws.com") \
.getOrCreate()
# Replace above with IAM roles if running on EMR
🏗️ Step 2: Read Data from S3
# Read CSV from S3
df = spark.read.csv("s3a://my-spark-bucket/raw-data/sales_data.csv", header=True, inferSchema=True)
df.show(5)
🏗️ Step 3: Data Transformation
✅ Example:
- Filter sales > 100
- Calculate running total per customer
- Calculate average sale amount per customer
from pyspark.sql.window import Window
import pyspark.sql.functions as F
# Filter
filtered_df = df.filter(df["amount"] > 100)
# Running total (cumulative sum)
window_spec = Window.partitionBy("customer_id").orderBy("order_date").rowsBetween(Window.unboundedPreceding, Window.currentRow)
running_total_df = filtered_df.withColumn("running_total", F.sum("amount").over(window_spec))
# Average sale amount per customer
avg_amount_df = filtered_df.groupBy("customer_id").agg(F.avg("amount").alias("avg_amount"))
# Join back to get average alongside running total
final_df = running_total_df.join(avg_amount_df, on="customer_id", how="left")
final_df.show(5)
🏗️ Step 4: Write Results Back to S3
# Save final DataFrame as parquet (efficient format)
final_df.write.mode("overwrite").parquet("s3a://my-spark-bucket/processed-data/final_output/")
🔥 Final Pipeline Flow
✅ Read → ✅ Transform → ✅ Save
✅ Practice with filters, window functions, joins
✅ Store in efficient Parquet format
🟢 Let’s Make It Interactive!
Here’s your next step:
👉 Try running this code on your local environment (or EMR if you have it set up).
👉 Let me know what happens:
- Any errors?
- Does the output in S3 look as expected?
👉 Stretch Task:
- Instead of CSV, try with JSON data.
- Or, add a new column for “sale_category” (e.g., amount > 500 = ‘big’, else ‘small’).
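👉 If you want a nudge on the stretch task, here's one possible way to do both variations, continuing from the `final_df` built above (paths are the same placeholders as before):

```python
import pyspark.sql.functions as F

# Possible sale_category column: amount > 500 → 'big', otherwise 'small'
categorized_df = final_df.withColumn(
    "sale_category",
    F.when(F.col("amount") > 500, "big").otherwise("small"),
)
categorized_df.select("order_id", "amount", "sale_category").show(5)

# JSON variant: read the same data from a JSON file instead of CSV (placeholder path)
# df_json = spark.read.json("s3a://my-spark-bucket/raw-data/sales_data.json")
```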
Would you like me to:
✅ Build a more complex PySpark transformation next?
✅ Or start a small Airflow DAG to schedule this PySpark job?
Let me know – we’re on fire! 🚀
Great direction! Let’s make this an even more realistic, end-to-end data engineering pipeline with:
✅ Airflow to schedule/monitor the job
✅ PySpark job running on EMR (scalable cluster)
✅ Data stored in S3
✅ Data loaded into Redshift for analytics
This is how modern data pipelines in companies (including unicorns and product-based startups) are typically built!
🚀 Project Overview
✅ 1️⃣ Airflow DAG
- Orchestrates the steps (EMR job, data validation, Redshift load).
- Airflow Operators: Use `EmrAddStepsOperator` (with `EmrCreateJobFlowOperator` to create the cluster) for PySpark on EMR, and `PostgresOperator` or `RedshiftSQLOperator` for Redshift.
✅ 2️⃣ PySpark Job
- Runs on EMR – scales automatically.
- Reads raw data from S3, processes, saves back to S3.
✅ 3️⃣ Redshift
- Final output data from S3 is loaded into Redshift for analytics.
- Use COPY command to load from S3 → Redshift.
🏗️ Airflow DAG Skeleton
Here’s a basic DAG outline to orchestrate this:
from airflow import DAG
from airflow.providers.amazon.aws.operators.emr_add_steps import EmrAddStepsOperator
from airflow.providers.amazon.aws.operators.emr_create_job_flow import EmrCreateJobFlowOperator
from airflow.providers.amazon.aws.operators.emr_terminate_job_flow import EmrTerminateJobFlowOperator
from airflow.providers.amazon.aws.operators.redshift_sql import RedshiftSQLOperator
from airflow.providers.amazon.aws.sensors.emr_step import EmrStepSensor
from airflow.utils.dates import days_ago
default_args = {
'owner': 'data_engineer',
'start_date': days_ago(1),
'retries': 1,
}
with DAG(
dag_id='pyspark_s3_emr_redshift_pipeline',
default_args=default_args,
schedule_interval='@daily',
catchup=False
) as dag:
# 1️⃣ Create EMR cluster
create_emr_cluster = EmrCreateJobFlowOperator(
task_id='create_emr_cluster',
job_flow_overrides={ # Add your EMR config here
"Name": "PySparkJobCluster",
"ReleaseLabel": "emr-6.15.0",
"Applications": [{"Name": "Spark"}],
"Instances": {
"InstanceGroups": [
{"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
{"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2}
],
"KeepJobFlowAliveWhenNoSteps": True
},
"VisibleToAllUsers": True,
"JobFlowRole": "EMR_EC2_DefaultRole",
"ServiceRole": "EMR_DefaultRole",
"LogUri": "s3://my-spark-bucket/emr-logs/"
}
)
# 2️⃣ Add PySpark step
spark_step = [{
'Name': 'PySpark Transformation',
'ActionOnFailure': 'CONTINUE',
'HadoopJarStep': {
'Jar': 'command-runner.jar',
'Args': [
'spark-submit',
'--deploy-mode', 'cluster',
's3://my-spark-bucket/scripts/pyspark_etl.py'
]
}
}]
add_pyspark_step = EmrAddStepsOperator(
task_id='add_pyspark_step',
job_flow_id="{{ task_instance.xcom_pull(task_ids='create_emr_cluster', key='return_value') }}",
steps=spark_step
)
# 3️⃣ Wait for PySpark step to finish
wait_for_step = EmrStepSensor(
task_id='wait_for_step',
job_flow_id="{{ task_instance.xcom_pull(task_ids='create_emr_cluster', key='return_value') }}",
step_id="{{ task_instance.xcom_pull(task_ids='add_pyspark_step', key='return_value')[0] }}"
)
# 4️⃣ Load results to Redshift
load_to_redshift = RedshiftSQLOperator(
task_id='load_to_redshift',
sql="""
COPY analytics.sales_summary
FROM 's3://my-spark-bucket/processed-data/final_output/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
FORMAT AS PARQUET;
""",
# connection and database details come from Airflow's Redshift connection (e.g., redshift_default)
)
# 5️⃣ Terminate EMR cluster
terminate_emr_cluster = EmrTerminateJobFlowOperator(
task_id='terminate_emr_cluster',
job_flow_id="{{ task_instance.xcom_pull(task_ids='create_emr_cluster', key='return_value') }}"
)
# DAG flow
create_emr_cluster >> add_pyspark_step >> wait_for_step >> load_to_redshift >> terminate_emr_cluster
🔥 Key Concepts
✅ AWS Tools in Action
- EMR: runs scalable PySpark job
- S3: stores raw & processed data
- Redshift: final data warehouse
- Airflow: orchestrates the pipeline
✅ Resume-Ready Experience
This entire flow = bullet point for your resume:
“Designed & implemented a scalable ETL pipeline using Airflow to orchestrate EMR-based PySpark jobs, storing results in Redshift for analytics.”
🟢 Let’s Do It!
1️⃣ Do you want me to write the PySpark script (`pyspark_etl.py`) for this job?
2️⃣ Or would you like help deploying this Airflow DAG (e.g., how to put it in Airflow’s `dags/` folder, config, environment)?
👉 Let me know which step you want to tackle first! We’ll make sure it’s both clear and hands-on. 🚀
Alright, let’s tackle both parts at once! Here’s the plan:
✅ Part 1: Write the PySpark ETL script (`pyspark_etl.py`)
✅ Part 2: Show how to deploy the Airflow DAG
🚀 Part 1: PySpark ETL Script (`pyspark_etl.py`)
# pyspark_etl.py
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.window import Window
# Initialize SparkSession
spark = SparkSession.builder.appName("PySpark S3 ETL").getOrCreate()
# Read data from S3
df = spark.read.csv("s3a://my-spark-bucket/raw-data/sales_data.csv", header=True, inferSchema=True)
# Filter sales > 100
filtered_df = df.filter(df["amount"] > 100)
# Calculate running total for each customer
window_spec = Window.partitionBy("customer_id").orderBy("order_date").rowsBetween(Window.unboundedPreceding, Window.currentRow)
running_total_df = filtered_df.withColumn("running_total", F.sum("amount").over(window_spec))
# Calculate average sale amount per customer
avg_amount_df = filtered_df.groupBy("customer_id").agg(F.avg("amount").alias("avg_amount"))
# Join to get average alongside running total
final_df = running_total_df.join(avg_amount_df, on="customer_id", how="left")
# Save results as Parquet back to S3
final_df.write.mode("overwrite").parquet("s3a://my-spark-bucket/processed-data/final_output/")
print("ETL Job completed successfully.")
spark.stop()
✅ Save this script as `pyspark_etl.py` in an S3 bucket (e.g., `s3://my-spark-bucket/scripts/`).
🚀 Part 2: Deploy Airflow DAG
Here’s how to deploy the Airflow DAG file (`pyspark_s3_emr_redshift_pipeline.py`):
1️⃣ Place the DAG file in Airflow’s `dags/` folder:
cp pyspark_s3_emr_redshift_pipeline.py $AIRFLOW_HOME/dags/
2️⃣ Update Airflow Connections & Variables:
- Create an AWS connection (`aws_default`) with your credentials.
- Create a Redshift connection (`redshift_default`) with your Redshift cluster details.
3️⃣ Deploy EMR Roles:
- Ensure the `EMR_EC2_DefaultRole` and `EMR_DefaultRole` IAM roles exist and have S3 & Redshift access.
4️⃣ Trigger the DAG:
airflow dags trigger pyspark_s3_emr_redshift_pipeline
🔥 End-to-End Flow
✅ Airflow triggers the DAG →
✅ EMR cluster is created →
✅ PySpark script (`pyspark_etl.py`) runs on EMR →
✅ Data is processed (filtered, window functions, average) →
✅ Results saved to S3 (Parquet) →
✅ Airflow loads results to Redshift using the `COPY` command →
✅ EMR cluster terminated to save cost.
🎯 Next Steps
✅ Let me know:
- Do you want me to simulate sample data (`sales_data.csv`) for testing?
- Would you like a diagram or video explaining this workflow?
- Do you want to add extra transformations or data quality checks?
We’re moving like a pro team! 🚀