Here’s our Step-by-Step Roadmap tailored for your target stack (PySpark, Python, AWS S3, Hadoop, Bitbucket), plus some additional useful tools to learn quickly and add to your resume.
🔥 Core Technical Skills Roadmap (in priority order)
✅ 1️⃣ SQL + DBMS – The non-negotiable core
- Topics: Joins (inner, outer, self, cross), Window functions, CTEs, subqueries, indexing, query optimisation
- Interactive Practice: LeetCode (SQL), StrataScratch, Mode Analytics
- Goal: Be able to write + debug complex queries fast
- What to add: Basic MySQL/PostgreSQL usage and indexing tips.
✅ 2️⃣ Python for Data Engineering – Master Data Wrangling
- Focus: pandas, NumPy, reading/writing data (CSV, JSON, XML), APIs (requests), web scraping (BeautifulSoup)
- Tip: Build modular functions and scripts that are reusable.
- Skip: Deep DSA (like leetcode for coding rounds) – focus on data manipulation.
- Optional: Add `pydantic` for data validation – a quick win for your resume.
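👉 Here's a minimal sketch of what that pydantic quick win can look like (using the v2-style API); the `SaleRecord` model and its fields are illustrative assumptions, not a fixed schema:

```python
from datetime import date
from pydantic import BaseModel, ValidationError, field_validator

class SaleRecord(BaseModel):
    # Hypothetical schema for one row of sales data
    order_id: int
    customer_id: int
    amount: float
    order_date: date

    @field_validator("amount")
    @classmethod
    def amount_must_be_positive(cls, value: float) -> float:
        if value <= 0:
            raise ValueError("amount must be positive")
        return value

try:
    # Strings like "101" and "2024-01-15" are coerced to int/date where possible
    record = SaleRecord(order_id="101", customer_id=7, amount=250.0, order_date="2024-01-15")
    print(record)
except ValidationError as err:
    print(err)  # bad rows surface as structured, loggable validation errors
```

Validating records like this before they enter a pipeline turns silent bad data into explicit errors – an easy talking point in interviews.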
✅ 3️⃣ PySpark – Data Processing at Scale
- Focus: RDD vs DataFrame API, lazy evaluation, transformations vs actions, partitioning, caching, UDFs
- Practice: Hands-on data manipulations on Spark DataFrames (filter, groupBy, join, window functions).
- Add: Delta Lake basics – very hot in the market.
- Extra: Spark UI for performance debugging.
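👉 To make the "lazy evaluation, transformations vs actions, caching" points concrete, here's a small local sketch (the column names and data are made up for illustration):

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("lazy-eval-demo").getOrCreate()

df = spark.createDataFrame(
    [(1, "A", 120.0), (2, "A", 80.0), (3, "B", 300.0)],
    ["order_id", "region", "amount"],
)

# Transformations are lazy: nothing executes yet, Spark only builds a plan.
big_sales = df.filter(F.col("amount") > 100)
by_region = big_sales.groupBy("region").agg(F.sum("amount").alias("total"))

# Cache an intermediate result if several actions will reuse it.
big_sales.cache()

# Actions trigger execution.
by_region.show()
print(big_sales.count())

spark.stop()
```

Nothing runs until `show()`/`count()` – and the jobs and stages they trigger are exactly what you would inspect in the Spark UI.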
✅ 4️⃣ Data Warehousing & Modeling
- Concepts:
- OLTP vs OLAP
- Star vs Snowflake schemas
- Partitioning and clustering
- Data modeling (ER diagrams, normalization/denormalization)
- Tools: Learn basics of Snowflake or Redshift.
- Goal: Be able to explain why you choose a certain schema/design for a use case.
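👉 As a tiny illustration of that schema-choice reasoning, here's a hypothetical star-schema layout (one fact table, one dimension) queried in PySpark; all table and column names are assumptions for the example:

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("star-schema-demo").getOrCreate()

# Dimension table: small, descriptive attributes (kept denormalized in a star schema).
dim_product = spark.createDataFrame(
    [(1, "Keyboard", "Accessories"), (2, "Laptop", "Computers")],
    ["product_id", "product_name", "category"],
)

# Fact table: large and narrow – mostly keys and measures.
fact_sales = spark.createDataFrame(
    [(101, 1, 2, 50.0), (102, 2, 1, 1200.0), (103, 1, 3, 75.0)],
    ["order_id", "product_id", "quantity", "amount"],
)

# One join from fact to dimension answers most analytical questions.
(fact_sales.join(dim_product, "product_id")
    .groupBy("category")
    .agg(F.sum("amount").alias("revenue"))
    .show())

spark.stop()
```

A star schema keeps the fact table narrow and pushes descriptive attributes into dimensions, so most analytics need only one join per dimension – that's the trade-off you'd explain in an interview.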
✅ 5️⃣ ETL, Orchestration & Automation
- Tools:
- Apache Airflow (DAGs, tasks, scheduling)
- dbt (data transformation / testing basics)
- Docker (to containerize Spark/ETL jobs)
- Add: Kafka basics (what’s a topic, consumer group – only the basics for real-time use cases).
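👉 For those Kafka basics, here's a minimal consumer sketch using the kafka-python package; the topic name, broker address, and group id are placeholders you'd replace with real values:

```python
from kafka import KafkaConsumer  # pip install kafka-python

# A consumer group ("etl-consumers") lets several consumer processes share a topic's partitions.
consumer = KafkaConsumer(
    "orders",                          # placeholder topic name
    bootstrap_servers="localhost:9092",
    group_id="etl-consumers",
    auto_offset_reset="earliest",      # start from the beginning if no offset is stored yet
)

# Loops until interrupted (Ctrl+C); each message carries topic, partition, offset, and raw bytes.
for message in consumer:
    print(message.topic, message.partition, message.offset, message.value)
```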
✅ 6️⃣ Cloud Platforms
- Since you’re already using AWS:
- S3 (data storage),
- Glue (managed ETL),
- Athena (serverless querying),
- Lambda (serverless triggers),
- EMR basics for running PySpark jobs.
- Learn to deploy a basic Spark job on EMR and monitor it.
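👉 Besides the console, one way to submit a PySpark step to an already-running EMR cluster is boto3's EMR client – a rough sketch, with the cluster id, region, and script path as placeholders and AWS credentials assumed to be configured:

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")  # placeholder region

response = emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXX",  # placeholder cluster id
    Steps=[{
        "Name": "example-pyspark-step",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "--deploy-mode", "cluster",
                "s3://my-spark-bucket/scripts/pyspark_etl.py",  # placeholder script location
            ],
        },
    }],
)
print(response["StepIds"])  # track these in the EMR console or via describe_step
```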
✅ 7️⃣ System Design for Data Engineering
- Scenarios: Batch vs streaming pipelines, lakehouse architecture (Delta Lake on S3), CDC basics.
- Learn to talk: Trade-offs between real-time vs batch, scaling, monitoring.
- Resource: YouTube – “System Design for Data Engineers” by Data with Danny, Alex The Analyst.
✅ 8️⃣ Interview Prep
- System design rounds – practice a few end-to-end scenarios.
- SQL scenario-based questions.
- Prepare to “tell stories” about your past data pipelines: challenges, design decisions, how you debugged issues, etc.
🎯 Additional Tools/Concepts (Quick Wins for Resume):
✅ Git/Bitbucket – Already on your list; make sure you’re confident with:
- Branching, pull requests, resolving merge conflicts.
- CI/CD basics (if time: Jenkins / GitHub Actions).
✅ Data Quality & Validation
- Great Expectations – open-source library, easy to learn, impressive to mention.
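👉 A tiny sketch using Great Expectations' legacy pandas-dataset API (the library's API has changed a lot between versions, so treat this as illustrative; the column names are made up):

```python
import pandas as pd
import great_expectations as ge  # pip install great_expectations

df = pd.DataFrame({
    "order_id": [1, 2, 3],
    "amount": [120.0, 80.0, None],
})

# Wrap the DataFrame so expectation methods become available (legacy API).
gdf = ge.from_pandas(df)

result = gdf.expect_column_values_to_not_be_null("amount")
print(result.success)  # False here, because one amount is missing
```

Even one or two expectations like this, run before loading data, make "data quality" a concrete bullet rather than a buzzword.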
✅ Data Catalog & Governance (Optional)
- AWS Glue Data Catalog – know what it does (see the short boto3 sketch after this list).
- Data Lineage – briefly read about why it matters.
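👉 To see concretely what the Glue Data Catalog stores, here's a small boto3 sketch that lists table definitions in an assumed database (the database name is a placeholder, and AWS credentials are assumed to be configured):

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")  # placeholder region

# List table definitions (schema, storage location, format) registered in the catalog.
response = glue.get_tables(DatabaseName="analytics_db")  # placeholder database name
for table in response["TableList"]:
    location = table.get("StorageDescriptor", {}).get("Location")
    print(table["Name"], location)
```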
🏆 Here’s how we’ll work:
💥 Interactive Learning:
- I’ll quiz you, share mini-projects, and give you bite-sized exercises – no passive learning.
- For each topic, I’ll share 1-2 key resources so you don’t waste time.
- I’ll check your progress and clarify doubts.
💥 Small Projects:
- Let’s build mini pipelines together. Example: S3 → PySpark → data transformations → write to Redshift (or even local parquet).
- We’ll practice interview scenario questions as mini-projects.
🔥 Let’s Start Immediately:
1️⃣ Let’s finalize what to do first:
- Are you comfortable with SQL basics? If yes, let’s start with SQL advanced topics: window functions, complex joins, etc.
- If you’re not confident, we’ll start there!
- In parallel, we’ll keep Python for Data Wrangling warm.
2️⃣ I’ll create an initial SQL Interview Prep mini-module for you.
3️⃣ Then, we’ll build one mini-pipeline (S3 to PySpark) together as a project.
4️⃣ Finally, we’ll iterate through the remaining roadmap.
Perfect! Let’s start with an advanced SQL interview prep mini-module, then immediately transition to a small PySpark + S3 mini-project to get you hands-on.
🔥 Part 1: Advanced SQL Interview Prep
These topics come up frequently in data engineering interviews, so we’ll build mastery and confidence:
1️⃣ Window Functions
- What to know: `ROW_NUMBER()`, `RANK()`, `DENSE_RANK()`, `NTILE()`, `SUM()/AVG() OVER (PARTITION BY ...)`, `LAG()`, `LEAD()`.
- Why?: They’re the backbone of analytical queries.
👉 Example question:
“Find the second highest salary for each department.”
SELECT department_id, employee_id, salary
FROM (
SELECT *,
DENSE_RANK() OVER (PARTITION BY department_id ORDER BY salary DESC) as rnk
FROM employees
) ranked
WHERE rnk = 2;
2️⃣ Joins (including self joins & cross joins)
- Types: `INNER`, `LEFT`, `RIGHT`, `FULL`, `CROSS`, `SELF JOIN`.
- Important: When to use `LEFT` vs `INNER` in ETL scenarios.
👉 Example:
“List all employees and their managers’ names.”
SELECT e.name as employee, m.name as manager
FROM employees e
LEFT JOIN employees m
ON e.manager_id = m.employee_id;
3️⃣ CTEs and Subqueries
- Why?: Clean up complex queries, modularize logic.
- Tip: You can nest CTEs for complex scenarios.
👉 Example:
WITH dept_avg AS (
SELECT department_id, AVG(salary) as avg_salary
FROM employees
GROUP BY department_id
)
SELECT e.name, e.salary, d.avg_salary
FROM employees e
JOIN dept_avg d
ON e.department_id = d.department_id
WHERE e.salary > d.avg_salary;
4️⃣ Indexing & Query Optimization
- Index Basics: How indexes speed up queries (esp. on join & where clause columns).
- Explain Plan: Using `EXPLAIN` or `EXPLAIN ANALYZE` to debug slow queries.
👉 Example:
EXPLAIN ANALYZE
SELECT * FROM employees WHERE department_id = 10;
5️⃣ Practice Resources
✅ LeetCode (SQL section) – focus on medium and hard questions
✅ StrataScratch – real-world scenario questions
✅ Mode Analytics SQL Tutorial – great for window functions
🟢 Let’s Make It Interactive!
Here’s a quiz for you:
You have a table `sales` (columns: `order_id`, `customer_id`, `amount`, `order_date`).
Question:
Write a SQL query to find the cumulative total (`running_total`) of sales for each customer, ordered by `order_date`.
Try to write the SQL here – no worries if it’s not perfect, we’ll refine it together!
After you’re done (or if you’re stuck), let me know! Then we’ll:
✅ Review your answer and refine it
✅ Jump into a PySpark + S3 mini-project (fetch data, transform, save) – to blend what you’re learning!
Sound like a plan? Let’s go!
Got it! Let’s tackle the LeetCode SQL section with a focus on medium and hard questions. Here’s a plan to dive in and make sure you get the most out of it:
🔥 Strategy to Master LeetCode SQL (Medium/Hard)
1️⃣ Warm Up
- Quickly review the easy questions to get familiar with the style and typical tables.
2️⃣ Medium Questions (~60-70% of practice)
- Start with commonly-appearing patterns:
- Window functions (`RANK()`, `DENSE_RANK()`, `ROW_NUMBER()`).
- `GROUP BY`, `HAVING`, and nested `SELECT`.
- Self-joins and common table expressions (CTEs).
- `JOIN` conditions (`LEFT JOIN`, `RIGHT JOIN`, `INNER JOIN`, etc.).
3️⃣ Hard Questions (~30-40% of practice)
- Tackle these when you’re confident with medium ones.
- These usually involve:
- Complex window functions with partitioning.
- Recursive CTEs.
- Multiple levels of subqueries.
- Aggregations with conditions.
- Data pivoting (like using `CASE` for pivoted columns).
- Optimization and avoiding Cartesian products.
4️⃣ Key Skills to Sharpen
✅ Window functions mastery
✅ CTE and recursive CTEs
✅ Advanced joins (`CROSS JOIN`, `SELF JOIN`, anti-joins)
✅ Subqueries in `FROM`, `SELECT`, `WHERE`
✅ Handling NULLs smartly
✅ Summarizing data with conditional aggregation
📚 Recommended Practice Questions (LeetCode)
Here’s a list of some popular Medium and Hard LeetCode SQL questions:
🟦 Medium:
- Rank Scores – Using `DENSE_RANK()`.
- Exchange Seats – Self-join and handling row swaps.
- Employee Bonus – LEFT JOIN + COALESCE.
- Department Top Three Salaries – Window functions.
- Second Highest Salary – Subquery + `LIMIT`.
- Consecutive Numbers – Window functions and partitioning.
- Customers Who Never Order – LEFT JOIN + NULL.
- Trips and Users – CASE + Aggregations.
- Product Sales Analysis III – Advanced joins and aggregates.
🟥 Hard:
- Sales Analysis III – Advanced CTE and window functions.
- Consecutive Available Seats – Use window functions and self-joins.
- Human Traffic of Stadium – Complex window functions.
- Trips and Users (difficult variation) – Anti-joins + nested CTEs.
- Hard Employee Bonus / Sales – Multi-level joins + aggregates.
- Tree Node – Recursive CTEs (for hierarchical data).
💡 My Tips
- Sketch tables and joins: Visual diagrams help track complex joins and self-joins.
- Practice on your local DB or LeetCode playground:
- MySQL and PostgreSQL both work well; LeetCode uses MySQL by default.
- Use EXPLAIN (in real DB) to see how queries are executed.
- Break down hard problems: If stuck, start by solving smaller versions of the problem.
Would you like me to:
✅ Provide sample solutions for any of these questions?
✅ Create custom practice queries based on these patterns?
✅ Review your current progress and suggest more tailored practice?
Let me know how you’d like to proceed! 🚀
Awesome! Let’s dive in step by step:
✅ 1️⃣ Sample Solutions for Popular LeetCode SQL (Medium & Hard)
Here are a few sample solutions for well-known questions:
Rank Scores
(LeetCode #178)
SELECT
score,
DENSE_RANK() OVER (ORDER BY score DESC) AS 'rank'
FROM Scores;
Department Top Three Salaries
(LeetCode #185)
WITH RankedSalaries AS (
SELECT
e.Name,
e.Salary,
e.DepartmentId,
d.Name AS Department,
DENSE_RANK() OVER (PARTITION BY e.DepartmentId ORDER BY e.Salary DESC) AS rnk
FROM Employee e
JOIN Department d ON e.DepartmentId = d.Id
)
SELECT
Department,
Name,
Salary
FROM RankedSalaries
WHERE rnk <= 3;
Consecutive Numbers
(LeetCode #180)
SELECT DISTINCT l.Num AS ConsecutiveNums
FROM Logs l
JOIN Logs l1 ON l.Id = l1.Id - 1
JOIN Logs l2 ON l.Id = l2.Id - 2
WHERE l.Num = l1.Num AND l.Num = l2.Num;
Trips and Users
(Hard join example)
SELECT t.Request_at AS Day,
ROUND(SUM(CASE WHEN t.Status != 'completed' THEN 1 ELSE 0 END)/COUNT(*), 2) AS Cancellation_Rate
FROM Trips t
JOIN Users c ON t.Client_Id = c.Users_Id
JOIN Users d ON t.Driver_Id = d.Users_Id
WHERE t.Request_at BETWEEN '2013-10-01' AND '2013-10-03'
AND c.Banned = 'No' AND d.Banned = 'No'
GROUP BY t.Request_at;
Would you like more sample solutions for other questions? Let me know!
✅ 2️⃣ Custom Practice Queries (Based on Patterns)
Here are three custom practice queries for you to practice:
a) Window Functions Practice:
-- Find the second highest salary in each department
WITH RankedSalaries AS (
SELECT
e.Name,
e.Salary,
e.DepartmentId,
DENSE_RANK() OVER (PARTITION BY e.DepartmentId ORDER BY e.Salary DESC) AS rnk
FROM Employee e
)
SELECT DepartmentId, Name, Salary
FROM RankedSalaries
WHERE rnk = 2;
b) Recursive CTE Practice (hierarchical data):
-- Suppose you have an Employee table with ManagerId.
-- List all employees reporting (directly or indirectly) to manager 2.
WITH RECURSIVE hierarchy AS (
SELECT Id, Name, ManagerId
FROM Employee
WHERE ManagerId = 2
UNION ALL
SELECT e.Id, e.Name, e.ManagerId
FROM Employee e
JOIN hierarchy h ON e.ManagerId = h.Id
)
SELECT * FROM hierarchy;
c) Complex Join and Aggregation:
-- Count the number of products sold by each salesperson in a sales table.
SELECT s.Name, COUNT(o.ProductId) AS ProductsSold
FROM SalesPerson s
LEFT JOIN Orders o ON s.Id = o.SalesPersonId
GROUP BY s.Name;
Want more of these custom practice queries?
✅ 3️⃣ Review and Tailored Practice
👉 If you’re actively practicing, I can:
- Review your current SQL solutions and spot improvements.
- Create a practice plan for the next 2-3 weeks.
- Suggest additional challenges based on your struggles.
Would you like to share your progress so far (like which LeetCode SQL questions you’ve done or your biggest challenges), and I can tailor the next steps?
Let me know:
✅ More sample solutions?
✅ More custom practice queries?
✅ Review your current solutions?
Or mix and match based on your preference! 🚀✨
You got it! Let’s supercharge your practice with some fresh and challenging sets. Here’s a mix of medium and hard-level practice scenarios to stretch your SQL muscles. Let’s go! 🚀
🟦 Fresh Practice Set 1 – Window Functions & Partitioning
Problem:
List the top 2 highest-paid employees in each department. Show the department name, employee name, and salary.
Tables:
Departments(department_id, department_name)
Employees(employee_id, name, salary, department_id)
Expected Output:
| department_name | name | salary |
|------------------|--------------|--------|
| Sales | John | 90000 |
| Sales | Mary | 85000 |
| Engineering | Alice | 110000 |
| Engineering | Bob | 105000 |
🟦 Fresh Practice Set 2 – Aggregation & Filtering
Problem:
Find the average number of orders placed per active customer in 2024.
- Active customers are those who placed at least one order in 2024.
- Show the average as `avg_orders`.
Tables:
Orders(order_id, customer_id, order_date)
Expected Output:
| avg_orders |
|------------|
| 3.42 |
🟦 Fresh Practice Set 3 – Subquery & Anti-Join Challenge
Problem:
Find all products that have never been ordered.
Tables:
Products(product_id, name)
Orders(order_id, product_id, order_date)
Expected Output:
| product_id | name |
|------------|-------------|
| 4 | Keyboard |
| 7 | USB Cable |
🟦 Fresh Practice Set 4 – Recursive CTE
Problem:
You have a table of employees with their manager’s ID.
- Find all employees reporting (directly or indirectly) to manager 1001.
Tables:
Employees(id, name, manager_id)
Expected Output:
| id | name |
|------|----------|
| 1010 | Jack |
| 1011 | Jill |
| 1012 | Bob |
| 1013 | Alice |
🟦 Fresh Practice Set 5 – Advanced CASE in Aggregation
Problem:
For each month in 2024, count:
- Orders marked as ‘Completed’.
- Orders marked as ‘Cancelled’.
- Show `month`, `completed_orders`, `cancelled_orders`.
Tables:
Orders(order_id, order_date, status)
Expected Output:
| month | completed_orders | cancelled_orders |
|-----------|------------------|------------------|
| 2024-01 | 10 | 2 |
| 2024-02 | 15 | 3 |
🟩 How to Tackle Them
For each, think of:
✅ Joins (for linking data)
✅ GROUP BY (for summaries)
✅ Window functions (if needed for ranking)
✅ Recursive CTEs (for hierarchy or repeated structure)
✅ CASE WHEN (for conditional counts)
🚀 Want Solutions or Hints?
Let me know which of these fresh sets you’d like me to:
- Provide full solution SQL scripts
- Share step-by-step logic and breakdown
- Create additional variations to keep practicing
Or if you’d like even more sets (like pivoting, data cleaning challenges), I’ve got you covered! Let’s keep your SQL skills climbing! 🌟
Here’s a crisp, simplified version of each question’s description and solution—let’s keep it clear and to the point! 👇
1️⃣ Rank Scores – Using DENSE_RANK()
- Description: Rank student scores with gaps in ranking removed (dense ranking).
- Solution:
SELECT score, DENSE_RANK() OVER (ORDER BY score DESC) AS 'rank' FROM Scores;
2️⃣ Exchange Seats – Self-join and handling row swaps
- Description: Swap every pair of consecutive seat ids in a table.
- Solution:
SELECT CASE WHEN id % 2 = 1 AND id != (SELECT MAX(id) FROM Seat) THEN id + 1 WHEN id % 2 = 0 THEN id - 1 ELSE id END AS id, student FROM Seat ORDER BY id;
3️⃣ Employee Bonus – LEFT JOIN + COALESCE
- Description: Show employee name and bonus (0 if no bonus).
- Solution:
SELECT e.name, COALESCE(b.bonus, 0) AS bonus FROM Employee e LEFT JOIN Bonus b ON e.id = b.empId;
4️⃣ Department Top Three Salaries – Window functions
- Description: Show top 3 salaries per department.
- Solution:
SELECT department, employee, salary FROM ( SELECT d.name AS department, e.name AS employee, e.salary, DENSE_RANK() OVER (PARTITION BY d.id ORDER BY e.salary DESC) AS rnk FROM Employee e JOIN Department d ON e.departmentId = d.id ) t WHERE rnk <= 3;
5️⃣ Second Highest Salary – Subquery + LIMIT
- Description: Find the second highest unique salary.
- Solution:
SELECT DISTINCT salary FROM Employee ORDER BY salary DESC LIMIT 1 OFFSET 1;
6️⃣ Consecutive Numbers – Window functions and partitioning
- Description: Find numbers appearing at least 3 times consecutively.
- Solution:
SELECT DISTINCT num FROM ( SELECT num, ROW_NUMBER() OVER (ORDER BY id) - ROW_NUMBER() OVER (PARTITION BY num ORDER BY id) AS grp FROM Logs ) t GROUP BY num, grp HAVING COUNT(*) >= 3;
7️⃣ Customers Who Never Order – LEFT JOIN + NULL
- Description: Find customers who never placed an order.
- Solution:
SELECT c.name FROM Customers c LEFT JOIN Orders o ON c.id = o.customerId WHERE o.id IS NULL;
8️⃣ Trips and Users – CASE + Aggregations
- Description: Count number of completed and cancelled trips per user.
- Solution:
SELECT u.name, SUM(CASE WHEN t.status = 'completed' THEN 1 ELSE 0 END) AS completed, SUM(CASE WHEN t.status = 'cancelled' THEN 1 ELSE 0 END) AS cancelled FROM Users u LEFT JOIN Trips t ON u.id = t.userId GROUP BY u.name;
9️⃣ Product Sales Analysis III – Advanced joins and aggregates
- Description: For each product, find sales percentage contribution.
- Solution:
SELECT p.product_id, ROUND(SUM(o.quantity) / total.total_qty * 100, 2) AS percent FROM Product p JOIN Orders o ON p.product_id = o.product_id JOIN ( SELECT SUM(quantity) AS total_qty FROM Orders ) total GROUP BY p.product_id;
🟥 Hard Section
🔟 Sales Analysis III – Advanced CTE and window functions
- Description: For each product, find its first order date and total sales.
- Solution:
WITH first_order AS ( SELECT product_id, MIN(order_date) AS first_date FROM Sales GROUP BY product_id ) SELECT s.product_id, f.first_date, SUM(s.quantity) AS total_sales FROM Sales s JOIN first_order f ON s.product_id = f.product_id GROUP BY s.product_id, f.first_date;
1️⃣1️⃣ Consecutive Available Seats – Use window functions and self-joins
- Description: Find ranges of consecutive available seats.
- Solution:
SELECT MIN(seat_id) AS start_seat, MAX(seat_id) AS end_seat FROM ( SELECT seat_id, seat_id - ROW_NUMBER() OVER (ORDER BY seat_id) AS grp FROM Seats WHERE status = 'available' ) t GROUP BY grp;
1️⃣2️⃣ Human Traffic of Stadium – Complex window functions
- Description: Find peak human traffic in a stadium by checking overlapping times.
- Solution:
SELECT MAX(people_count) AS max_traffic FROM ( SELECT in_time, COUNT(*) OVER (ORDER BY in_time ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS people_count FROM ( SELECT in_time FROM Stadium UNION ALL SELECT out_time FROM Stadium ) times ) t;
1️⃣3️⃣ Trips and Users (difficult variation) – Anti-joins + nested CTEs
- Description: Find users who never had a trip with status ‘completed’.
- Solution:
WITH completed_users AS ( SELECT DISTINCT userId FROM Trips WHERE status = 'completed' ) SELECT u.name FROM Users u WHERE u.id NOT IN (SELECT userId FROM completed_users);
1️⃣4️⃣ Hard Employee Bonus / Sales – Multi-level joins + aggregates
- Description: Calculate bonuses based on multiple sales targets.
- Solution:
SELECT e.name, CASE WHEN s.total_sales >= 100000 THEN 1000 WHEN s.total_sales >= 50000 THEN 500 ELSE 0 END AS bonus FROM Employee e JOIN ( SELECT empId, SUM(amount) AS total_sales FROM Sales GROUP BY empId ) s ON e.id = s.empId;
1️⃣5️⃣ Tree Node – Recursive CTEs (for hierarchical data)
- Description: Print all child-parent pairs in a tree (recursive hierarchy).
- Solution:
WITH RECURSIVE tree_cte AS ( SELECT id, parent_id, name FROM Tree WHERE parent_id IS NULL UNION ALL SELECT t.id, t.parent_id, t.name FROM Tree t JOIN tree_cte cte ON t.parent_id = cte.id ) SELECT * FROM tree_cte;
Let me know if you’d like me to expand any solution or explain with real data examples! 🌟
Sure! Here’s a handy list of conceptual SQL questions often asked in interviews at top companies (Google, Amazon, Microsoft, Meta, etc.). These test your understanding, not just coding! Let’s go! 🚀
🔹 1️⃣ Difference between `INNER JOIN`, `LEFT JOIN`, and `RIGHT JOIN`
- Expected Answer:
  - `INNER JOIN`: Only matching rows in both tables.
  - `LEFT JOIN`: All rows from the left table + matching rows from the right table (NULL if no match).
  - `RIGHT JOIN`: All rows from the right table + matching rows from the left table (NULL if no match).
🔹 2️⃣ What’s the difference between `WHERE` and `HAVING`?
- Expected Answer:
  - `WHERE`: Filters before grouping (applies to raw rows).
  - `HAVING`: Filters after grouping (applies to aggregated/grouped results).
🔹 3️⃣ Difference between `RANK()`, `DENSE_RANK()`, and `ROW_NUMBER()`
- Expected Answer:
  - `RANK()`: Ranks with gaps if there are ties.
  - `DENSE_RANK()`: Ranks without gaps (ties share the same rank).
  - `ROW_NUMBER()`: Unique row numbers, no ties.
🔹 4️⃣ Explain `GROUP BY` vs `PARTITION BY`
- Expected Answer:
  - `GROUP BY`: Aggregates rows into summary rows (e.g., total sales per department).
  - `PARTITION BY`: Divides the result into “windows” for window functions (e.g., ranking within each department).
🔹 5️⃣ What is normalization? Why is it important?
- Expected Answer:
- Process of organizing data to reduce redundancy.
- Ensures data consistency and integrity.
- Example: 1NF, 2NF, 3NF.
🔹 6️⃣ Explain the `EXPLAIN` plan
- Expected Answer:
- Shows the query execution plan (how the database will run it).
- Helps optimize performance (indexes, joins, scans, etc.).
🔹 7️⃣ What’s a correlated subquery?
- Expected Answer:
- Subquery that depends on the outer query.
- Runs for each row in outer query.
🔹 8️⃣ What is a window function? Give an example.
- Expected Answer:
- Calculates a value over a window of rows (no grouping).
- Example: `SUM(sales) OVER (PARTITION BY region ORDER BY date)`.
🔹 9️⃣ How do indexes work? What’s a composite index?
- Expected Answer:
- Indexes help speed up lookups by creating a data structure.
- Composite index: index on multiple columns together.
🔹 🔟 Explain `UNION` vs `UNION ALL`
- Expected Answer:
  - `UNION`: Removes duplicates.
  - `UNION ALL`: Includes all rows, even duplicates.
🔹 1️⃣1️⃣ What are common causes of slow SQL queries?
- Expected Answer:
- Missing indexes.
- Large data scans (no filtering).
- Inefficient joins (cartesian joins, bad join order).
- Too much sorting or grouping.
🔹 1️⃣2️⃣ What’s a primary key vs. unique key?
- Expected Answer:
- Primary key: uniquely identifies each row + no NULLs.
- Unique key: ensures unique values but allows one NULL (in some databases).
🔹 1️⃣3️⃣ What is a CTE (Common Table Expression)?
- Expected Answer:
- A named temporary result set that can be referenced in the query that follows it.
- Improves readability, especially with recursive queries.
🔹 1️⃣4️⃣ What’s the difference between `DELETE` and `TRUNCATE`?
- Expected Answer:
  - `DELETE`: Removes rows one by one and can take a `WHERE` clause.
  - `TRUNCATE`: Removes all rows faster and cannot take a `WHERE` clause.
🔹 1️⃣5️⃣ What’s denormalization? Why might you do it?
- Expected Answer:
- Intentionally adding redundant data (e.g., duplicating columns or merging tables).
- Used for performance in analytics (e.g., avoiding joins).
🔹 1️⃣6️⃣ What is ACID in databases?
- Expected Answer:
- Atomicity, Consistency, Isolation, Durability.
- Ensures transactions are reliable and consistent.
🔹 1️⃣7️⃣ Explain a self-join.
- Expected Answer:
- Table joins itself (e.g., employees with their managers).
🔹 1️⃣8️⃣ What’s a surrogate key vs. natural key?
- Expected Answer:
- Natural key: real-world unique identifier (e.g., email).
- Surrogate key: auto-generated (e.g., employee ID).
Want me to craft SQL coding questions (with answers) that combine these concepts (e.g., practical joins, window functions, etc.)? Let me know! 🌟
Got it! Here’s a set of practical SQL coding questions that combine these interview concepts. I’ll include a crisp solution for each. 🚀
1️⃣ Find the top 2 salaries in each department (including ties)
-- Question:
-- For each department, show the employee name, salary, and rank (top 2 salaries, including ties).
SELECT department, employee, salary
FROM (
SELECT d.name AS department, e.name AS employee, e.salary,
DENSE_RANK() OVER (PARTITION BY d.id ORDER BY e.salary DESC) AS rnk
FROM Employee e
JOIN Department d ON e.departmentId = d.id
) ranked
WHERE rnk <= 2;
2️⃣ Find products that have never been ordered
-- Question:
-- Show all product names that have never been ordered.
SELECT p.name
FROM Product p
LEFT JOIN Orders o ON p.id = o.product_id
WHERE o.id IS NULL;
3️⃣ Find the difference in total sales between each salesperson and the top salesperson
-- Question:
-- For each salesperson, show their name and the difference to the top salesperson’s sales.
WITH total_sales AS (
SELECT salesperson_id, SUM(amount) AS total
FROM Sales
GROUP BY salesperson_id
),
top_sales AS (
SELECT MAX(total) AS top_total FROM total_sales
)
SELECT s.name, (t.top_total - ts.total) AS diff_to_top
FROM Salesperson s
JOIN total_sales ts ON s.id = ts.salesperson_id
CROSS JOIN top_sales t;
4️⃣ Find consecutive login days for each user
-- Question:
-- Show user_id, start_date, and end_date of each streak of consecutive login days.
WITH login_diff AS (
SELECT user_id, login_date,
login_date - INTERVAL ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY login_date) DAY AS grp
FROM Logins
)
SELECT user_id, MIN(login_date) AS start_date, MAX(login_date) AS end_date
FROM login_diff
GROUP BY user_id, grp;
5️⃣ Show the second highest salary in the company
-- Question:
-- What’s the second highest unique salary?
SELECT DISTINCT salary
FROM Employee
ORDER BY salary DESC
LIMIT 1 OFFSET 1;
6️⃣ List employees who earn more than the average salary of their department
-- Question:
-- Show employee name and department where their salary is above department average.
SELECT e.name, d.name AS department, e.salary
FROM Employee e
JOIN Department d ON e.departmentId = d.id
WHERE e.salary > (
SELECT AVG(salary)
FROM Employee
WHERE departmentId = e.departmentId
);
7️⃣ Calculate each customer’s order count and average amount
-- Question:
-- For each customer, show name, number of orders, and average order amount.
SELECT c.name, COUNT(o.id) AS order_count, ROUND(AVG(o.amount), 2) AS avg_order
FROM Customer c
LEFT JOIN Orders o ON c.id = o.customer_id
GROUP BY c.name;
8️⃣ Find the first order date for each customer
-- Question:
-- Show customer name and their first order date.
SELECT c.name, MIN(o.order_date) AS first_order_date
FROM Customer c
JOIN Orders o ON c.id = o.customer_id
GROUP BY c.name;
9️⃣ Show each employee and their manager’s name
-- Question:
-- Show employee name and their manager’s name (use self-join).
SELECT e.name AS employee, m.name AS manager
FROM Employee e
LEFT JOIN Employee m ON e.manager_id = m.id;
🔟 Get a running total of daily sales (cumulative)
-- Question:
-- For each day, show date and cumulative sales total so far.
SELECT order_date,
       -- aggregate per day first (GROUP BY), then run the window over the daily sums
       SUM(SUM(amount)) OVER (ORDER BY order_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total
FROM Orders
GROUP BY order_date;
Want me to prepare some even trickier ones (like combining window functions, CTEs, and self-joins in one query)? Let me know! 🌟
Great! Let’s go deeper into indexing, keys, normalization, and other important database concepts that interviewers ask. I’ll provide concise explanations and some implementation snippets/examples!
1️⃣ Indexing
👉 What it is:
- A data structure (like B-tree) to quickly find rows.
- Speeds up SELECT queries (especially on WHERE, JOIN, ORDER BY).
👉 Implementation:
-- Create an index on the 'email' column
CREATE INDEX idx_email ON Users(email);
-- Drop an index
DROP INDEX idx_email ON Users;
👉 Composite index:
-- For WHERE on (first_name, last_name)
CREATE INDEX idx_name ON Users(first_name, last_name);
👉 Impact on performance:
- Reads: Faster!
- Writes: Slightly slower (index updates on INSERT/UPDATE).
2️⃣ Primary Key & Unique Key
👉 Primary Key:
- Uniquely identifies each row.
- No NULLs allowed.
CREATE TABLE Employees (
id INT PRIMARY KEY,
name VARCHAR(50)
);
👉 Unique Key:
- Enforces uniqueness (but allows one NULL in some DBs).
CREATE TABLE Users (
email VARCHAR(100) UNIQUE
);
👉 Difference:
- Primary key = main unique identifier.
- Unique key = additional unique columns.
3️⃣ Foreign Key (Relationships)
👉 Ensures referential integrity (like a link to another table’s primary key).
CREATE TABLE Orders (
id INT PRIMARY KEY,
customer_id INT,
FOREIGN KEY (customer_id) REFERENCES Customers(id)
);
4️⃣ Normalization
👉 Organizing data to remove redundancy and improve data integrity.
👉 Common Normal Forms:
- 1NF: No repeating groups (atomic columns).
- 2NF: No partial dependency (non-key columns fully depend on primary key).
- 3NF: No transitive dependency (non-key columns depend only on primary key).
👉 Example:
🔹 Initial table (not normalized): one table with columns (OrderID, CustomerName, Product) – the customer’s name is repeated on every order.
🔹 Normalized:
- Customers: (CustomerID, Name)
- Orders: (OrderID, CustomerID, Product)
5️⃣ Denormalization
👉 Why do it?
- Sometimes needed for performance (fewer joins, faster reporting).
- Often done in data warehouses.
👉 Example:
- Instead of separate tables for `city` and `country`, store them in one table for faster queries.
6️⃣ Transactions & ACID
👉 ACID Principles:
- Atomicity: All-or-nothing.
- Consistency: Data integrity before & after.
- Isolation: No partial updates visible to others.
- Durability: Once committed, it’s saved.
👉 Example:
START TRANSACTION;
UPDATE Accounts SET balance = balance - 100 WHERE id = 1;
UPDATE Accounts SET balance = balance + 100 WHERE id = 2;
COMMIT; -- or ROLLBACK if error
7️⃣ Query Optimization
👉 How DB decides how to execute a query.
- Uses indexes, joins, sorting…
- Tools: `EXPLAIN`, `EXPLAIN ANALYZE`.
👉 Example:
EXPLAIN SELECT * FROM Orders WHERE customer_id = 1;
8️⃣ Partitioning
👉 Splitting large tables into smaller parts.
- Can improve performance in big tables.
- Example: Partition sales by year.
CREATE TABLE Sales (
id INT,
amount DECIMAL,
year INT
)
PARTITION BY RANGE (year) (
PARTITION p2023 VALUES LESS THAN (2024),
PARTITION p2024 VALUES LESS THAN (2025)
);
9️⃣ Materialized Views
👉 Precomputed query results stored like a table.
- Good for repeated, expensive calculations.
CREATE MATERIALIZED VIEW monthly_sales AS
SELECT customer_id, SUM(amount) AS total_amount FROM Orders
GROUP BY customer_id;
👉 Refresh as needed:
REFRESH MATERIALIZED VIEW monthly_sales;
1️⃣0️⃣ Concurrency Control (Isolation Levels)
👉 Prevents data corruption when multiple transactions occur.
- Levels: `READ UNCOMMITTED`, `READ COMMITTED`, `REPEATABLE READ`, `SERIALIZABLE`.
👉 Example (Postgres):
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
✅ Summary Table
| Concept | Usage | SQL Example |
|---|---|---|
| Indexing | Faster search | CREATE INDEX idx ON table(col) |
| Primary Key | Unique row id | PRIMARY KEY (id) |
| Foreign Key | Table relationship | FOREIGN KEY (col) REFERENCES table(id) |
| Normalization | Remove redundancy | Separate tables for Customers & Orders |
| Denormalization | Speed (less joins) | Merge city and country into 1 table |
| ACID | Reliable transactions | START TRANSACTION; ... COMMIT; |
| Partitioning | Split big table | PARTITION BY RANGE (year) |
| Materialized View | Precomputed data | CREATE MATERIALIZED VIEW ... |
| Concurrency Control | Handle simultaneous ops | SET ISOLATION LEVEL ... |
💡 Want me to make practical scenarios where these concepts are implemented together in a real database (like for analytics, e-commerce, or logins)? Let me know! 🌟
Let’s combine these concepts into some practical, real-world scenarios you’d find in interviews or real-life analytics projects! 🚀
🌟 Scenario 1: E-Commerce Order Tracking & Performance
✅ Problem
You’re designing a database for an e-commerce platform. You need:
- Efficiently track customer orders and product sales
- Show top customers by order volume
- Handle analytics queries (like monthly sales) quickly
- Avoid redundant data
⚙️ Implementation
✅ Tables:
CREATE TABLE Customers (
customer_id INT PRIMARY KEY,
name VARCHAR(100),
email VARCHAR(100) UNIQUE
);
CREATE TABLE Products (
product_id INT PRIMARY KEY,
name VARCHAR(100),
price DECIMAL(10,2)
);
CREATE TABLE Orders (
order_id INT PRIMARY KEY,
customer_id INT,
order_date DATE,
total_amount DECIMAL(10,2),
FOREIGN KEY (customer_id) REFERENCES Customers(customer_id)
);
CREATE TABLE OrderDetails (
order_detail_id INT PRIMARY KEY,
order_id INT,
product_id INT,
quantity INT,
FOREIGN KEY (order_id) REFERENCES Orders(order_id),
FOREIGN KEY (product_id) REFERENCES Products(product_id)
);
✅ Indexing:
-- Index to speed up product-based analytics
CREATE INDEX idx_product_id ON OrderDetails(product_id);
-- Index to speed up date-based sales reports
CREATE INDEX idx_order_date ON Orders(order_date);
✅ Normalization:
- Orders and Customers are separated (3NF).
- OrderDetails is a junction table – avoids redundant product info in Orders.
✅ Materialized View for Fast Reporting:
CREATE MATERIALIZED VIEW monthly_sales AS
SELECT EXTRACT(YEAR FROM order_date) AS year,
EXTRACT(MONTH FROM order_date) AS month,
SUM(total_amount) AS total_sales
FROM Orders
GROUP BY year, month;
➡️ Periodically refresh this to keep it up to date:
REFRESH MATERIALIZED VIEW monthly_sales;
✅ Analytical Query Example (Top 3 customers by total amount):
SELECT c.name, SUM(o.total_amount) AS total_spent
FROM Customers c
JOIN Orders o ON c.customer_id = o.customer_id
GROUP BY c.name
ORDER BY total_spent DESC
LIMIT 3;
🌟 Scenario 2: User Login Analytics (Concurrency & Normalization)
✅ Problem
You’re building a login tracking system for a SaaS app. Requirements:
- Track each login attempt
- Analyze daily unique users
- Find longest consecutive login streaks
- Ensure data integrity and avoid duplicate logins
⚙️ Implementation
✅ Tables:
CREATE TABLE Users (
user_id INT PRIMARY KEY,
email VARCHAR(100) UNIQUE,
name VARCHAR(50)
);
CREATE TABLE Logins (
login_id SERIAL PRIMARY KEY,
user_id INT,
login_time TIMESTAMP,
FOREIGN KEY (user_id) REFERENCES Users(user_id)
);
✅ Indexing for faster lookups:
CREATE INDEX idx_user_id ON Logins(user_id);
CREATE INDEX idx_login_time ON Logins(login_time);
✅ Analytics Query: Daily unique logins
SELECT DATE(login_time) AS login_date, COUNT(DISTINCT user_id) AS unique_users
FROM Logins
GROUP BY login_date
ORDER BY login_date;
✅ Find consecutive login streaks
WITH login_diff AS (
SELECT user_id, DATE(login_time) AS login_date,
DATE(login_time) - (ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY login_time))::INT AS grp
FROM Logins
)
SELECT user_id, MIN(login_date) AS streak_start, MAX(login_date) AS streak_end, COUNT(*) AS streak_length
FROM login_diff
GROUP BY user_id, grp
ORDER BY streak_length DESC;
✅ Transaction Control (handling simultaneous inserts)
BEGIN;
INSERT INTO Logins (user_id, login_time)
VALUES (1, NOW());
COMMIT;
➡️ Use `SERIALIZABLE` or `REPEATABLE READ` isolation if needed for strict consistency.
🌟 Scenario 3: Analytics Dashboard with Partitioning & Caching
✅ Problem
You manage a large website traffic log table with billions of rows. You need:
- Fast queries for daily/weekly reports
- Ability to archive old data (reduce index size)
⚙️ Implementation
✅ Partitioning by month:
CREATE TABLE TrafficLogs (
id BIGINT, -- note: in PostgreSQL a primary/unique key on a partitioned table must include the partition key, so it is omitted here
visit_time TIMESTAMP,
user_id INT,
page VARCHAR(100)
)
PARTITION BY RANGE (EXTRACT(YEAR FROM visit_time)*100 + EXTRACT(MONTH FROM visit_time));
-- Monthly partitions
CREATE TABLE TrafficLogs_202501 PARTITION OF TrafficLogs
FOR VALUES FROM (202501) TO (202502);
✅ Index on frequently queried columns:
CREATE INDEX idx_visit_time ON TrafficLogs(visit_time);
✅ Materialized View for active pages:
CREATE MATERIALIZED VIEW active_pages AS
SELECT page, COUNT(*) AS visits
FROM TrafficLogs
WHERE visit_time >= NOW() - INTERVAL '7 days'
GROUP BY page;
🔥 Key Concepts Combined:
✅ Normalization (no redundant data)
✅ Foreign keys & primary keys (data integrity)
✅ Indexing (performance for queries)
✅ Materialized views (reporting speed)
✅ Partitioning (big data management)
✅ Transactions & isolation levels (consistency in multi-user environment)
Would you like me to create mock interview questions using these scenarios for practice? Or do you want ER diagrams and SQL scripts for these? Let me know! 🚀
Sure! StrataScratch is a great platform to practice real-world data scenario questions. Here’s a selection of real-world scenario-based questions from StrataScratch with a bit of context on what you’d be expected to do. Let me know if you’d like to dive deeper into any of these or get detailed solutions!
1️⃣ Yelp Dataset – Average Star Rating
Scenario:
You’re a data analyst for Yelp, and you need to calculate the average star rating for each city.
Task:
Write a SQL query to find the average star rating per city.
Skills: `GROUP BY` and `AVG()` functions.
2️⃣ Employee Salaries – Identify High Earners
Scenario:
At a large company, HR wants to identify employees who earn more than the average salary within their department.
Task:
Write a SQL query to find employees whose salary is above the department’s average.
Skills:
- `JOIN` (if departments are in another table),
- Subquery to calculate the department average salary.
3️⃣ Orders – Find Top Product by Revenue
Scenario:
You work for an e-commerce platform and want to find out which product has generated the highest revenue.
Task:
Calculate revenue by multiplying quantity by price, and return the product(s) with the highest revenue.
Skills: `GROUP BY`, `ORDER BY`, `LIMIT` (or `RANK()` for ties).
4️⃣ AirBnB Listings – Host with Most Listings
Scenario:
For AirBnB, find the host(s) who have the most listings in the city.
Task:
Count the number of listings per host and return the host with the highest number of listings.
Skills:
- `GROUP BY`,
- Window functions or subqueries for ties.
5️⃣ Uber Rides – Peak Ride Hours
Scenario:
Uber wants to identify which hour of the day has the highest average number of rides.
Task:
Extract the hour from the ride timestamp and calculate average rides per hour.
Skills:
- Date functions to extract hour,
- Aggregation with `GROUP BY`.
6️⃣ Spotify – Most Popular Artist in Playlist
Scenario:
Find the artist with the most songs in a playlist.
Task:
Count the number of songs by artist and rank them.
Skills: `GROUP BY`, `COUNT()`, `ORDER BY`.
7️⃣ Fraudulent Transactions – Unusual Behavior
Scenario:
A bank wants to flag accounts with transactions over $10,000 that happened outside normal hours (e.g., midnight to 5 a.m.).
Task:
Identify such transactions.
Skills:
- Filtering with `WHERE` and time functions.
8️⃣ Product Reviews – Find Inactive Reviewers
Scenario:
Identify users who haven’t posted a review in the past year.
Task:
Compare last review date with current date.
Skills:
- Date arithmetic with `DATEDIFF()`.
9️⃣ Hospital Data – Readmission Rates
Scenario:
Analyze patient data to find hospitals with the highest readmission rates.
Task:
Calculate the percentage of readmissions for each hospital.
Skills:
- Aggregation,
- `CASE WHEN` logic for the readmission flag.
10️⃣ Movie Database – Most Watched Genre
Scenario:
Find the most popular genre based on user watch data.
Task:
Count total views for each genre and rank them.
Skills:
- `GROUP BY`, `ORDER BY`,
- Joining user watch data with movie data.
Want me to generate sample queries for any of these, provide detailed solutions, or tailor them to a specific language (SQL, Python, PySpark)? Just let me know! 🚀
Got it! Let’s generate sample queries for all 10 real-world scenario questions, both in SQL and PySpark (using DataFrame API).
1️⃣ Yelp Dataset – Average Star Rating per City
SQL:
SELECT city, AVG(stars) AS avg_rating
FROM yelp_reviews
GROUP BY city;
PySpark:
from pyspark.sql.functions import avg
yelp_reviews.groupBy("city").agg(avg("stars").alias("avg_rating")).show()
2️⃣ Employee Salaries – High Earners in Department
SQL:
SELECT e.employee_id, e.name, e.salary, e.department_id
FROM employees e
JOIN (
SELECT department_id, AVG(salary) AS avg_dept_salary
FROM employees
GROUP BY department_id
) dept_avg
ON e.department_id = dept_avg.department_id
WHERE e.salary > dept_avg.avg_dept_salary;
PySpark:
from pyspark.sql.functions import avg
dept_avg = employees.groupBy("department_id").agg(avg("salary").alias("avg_dept_salary"))
high_earners = employees.join(dept_avg, on="department_id") \
.filter(employees["salary"] > dept_avg["avg_dept_salary"])
high_earners.show()
3️⃣ Orders – Top Product by Revenue
SQL:
SELECT product_id, SUM(quantity * price) AS total_revenue
FROM orders
GROUP BY product_id
ORDER BY total_revenue DESC
LIMIT 1;
PySpark:
from pyspark.sql.functions import sum as _sum, col
orders.withColumn("revenue", col("quantity") * col("price")) \
.groupBy("product_id") \
.agg(_sum("revenue").alias("total_revenue")) \
.orderBy(col("total_revenue").desc()) \
.limit(1) \
.show()
4️⃣ AirBnB Listings – Host with Most Listings
SQL:
SELECT host_id, COUNT(*) AS listing_count
FROM listings
GROUP BY host_id
ORDER BY listing_count DESC
LIMIT 1;
PySpark:
listings.groupBy("host_id") \
.count() \
.orderBy("count", ascending=False) \
.limit(1) \
.show()
5️⃣ Uber Rides – Peak Ride Hours
SQL:
SELECT EXTRACT(HOUR FROM ride_timestamp) AS hour, COUNT(*) AS ride_count
FROM uber_rides
GROUP BY hour
ORDER BY ride_count DESC
LIMIT 1;
PySpark:
from pyspark.sql.functions import hour, count
uber_rides.withColumn("hour", hour("ride_timestamp")) \
.groupBy("hour") \
.agg(count("*").alias("ride_count")) \
.orderBy("ride_count", ascending=False) \
.limit(1) \
.show()
6️⃣ Spotify – Most Popular Artist in Playlist
SQL:
SELECT artist, COUNT(*) AS song_count
FROM playlist_songs
GROUP BY artist
ORDER BY song_count DESC
LIMIT 1;
PySpark:
playlist_songs.groupBy("artist") \
.count() \
.orderBy("count", ascending=False) \
.limit(1) \
.show()
7️⃣ Fraudulent Transactions – Unusual Behavior
SQL:
SELECT *
FROM transactions
WHERE amount > 10000
AND EXTRACT(HOUR FROM transaction_time) BETWEEN 0 AND 5;
PySpark:
from pyspark.sql.functions import hour
transactions.filter((transactions["amount"] > 10000) &
(hour("transaction_time").between(0, 5))) \
.show()
8️⃣ Product Reviews – Inactive Reviewers
SQL:
SELECT user_id
FROM reviews
GROUP BY user_id
HAVING MAX(review_date) < DATE_SUB(CURRENT_DATE, INTERVAL 1 YEAR);
PySpark:
from pyspark.sql.functions import max as _max, current_date, date_sub
inactive_users = reviews.groupBy("user_id") \
.agg(_max("review_date").alias("last_review_date")) \
.filter("last_review_date < date_sub(current_date(), 365)")
inactive_users.select("user_id").show()
9️⃣ Hospital Data – Highest Readmission Rates
SQL:
SELECT hospital_id,
100.0 * SUM(CASE WHEN readmitted = 'Yes' THEN 1 ELSE 0 END) / COUNT(*) AS readmission_rate
FROM patient_data
GROUP BY hospital_id
ORDER BY readmission_rate DESC;
PySpark:
from pyspark.sql.functions import sum as _sum, count, when, col
hospital_readmissions = patient_data.groupBy("hospital_id") \
.agg((100.0 * _sum(when(col("readmitted") == "Yes", 1).otherwise(0)) / count("*"))
.alias("readmission_rate")) \
.orderBy("readmission_rate", ascending=False)
hospital_readmissions.show()
🔟 Movie Database – Most Watched Genre
SQL:
SELECT genre, COUNT(*) AS watch_count
FROM movie_watches
GROUP BY genre
ORDER BY watch_count DESC
LIMIT 1;
PySpark:
movie_watches.groupBy("genre") \
.count() \
.orderBy("count", ascending=False) \
.limit(1) \
.show()
🟢 Let’s Make It Interactive!
Here’s a quiz for you:
You have a table `sales` (columns: `order_id`, `customer_id`, `amount`, `order_date`).
Question:
Write a SQL query to find the cumulative total (`running_total`) of sales for each customer, ordered by `order_date`.
You’re on the right track! Let’s polish your answer and turn it into a complete SQL statement for the running total of sales (cumulative sum) for each customer, ordered by `order_date`.
Here’s a complete version using window functions:
SELECT
order_id,
customer_id,
order_date,
amount,
SUM(amount) OVER (PARTITION BY customer_id ORDER BY order_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total
FROM
sales;
🚀 Breakdown
✅ We’re using:
- `SUM(amount) OVER (...)` → calculates the cumulative sum
- `PARTITION BY customer_id` → restarts the cumulative sum for each customer
- `ORDER BY order_date` → cumulative sum in date order
- `ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW` → explicitly says “start from the first row of this customer’s data up to the current row”
This is the most common window function scenario in interviews!
💡 Key Takeaway:
Whenever you see a “running total” or “cumulative sum” question – reach for `SUM() OVER (PARTITION BY … ORDER BY …)`!
🟢 Now, let’s move to the next stage:
Would you like to do another SQL practice question (like using `LAG()` or a self-join)?
OR
Shall we start the PySpark + S3 mini-project I promised?
Let me know! 🚀
Awesome! Let’s jump right in. 🚀
🚀 PySpark + S3 Mini-Project
🔥 Objective:
- Build a mini data pipeline:
1️⃣ Fetch data from an S3 bucket
2️⃣ Read it as a Spark DataFrame
3️⃣ Transform it (filter, group, join, window functions)
4️⃣ Save results back to S3
This mirrors a real-world ETL scenario!
⚙️ Project Setup
✅ Assumptions
- AWS credentials (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`) are configured in your environment.
- S3 bucket: `s3://my-spark-bucket/raw-data/` (you can replace with your actual bucket).
- File: `sales_data.csv` (columns: `order_id`, `customer_id`, `amount`, `order_date`).
🏗️ Step 1: Spark Session with S3 Access
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.appName("PySpark S3 Mini Project") \
.config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
.config("spark.hadoop.fs.s3a.access.key", "<AWS_ACCESS_KEY_ID>") \
.config("spark.hadoop.fs.s3a.secret.key", "<AWS_SECRET_ACCESS_KEY>") \
.config("spark.hadoop.fs.s3a.endpoint", "s3.amazonaws.com") \
.getOrCreate()
# Replace above with IAM roles if running on EMR
🏗️ Step 2: Read Data from S3
# Read CSV from S3
df = spark.read.csv("s3a://my-spark-bucket/raw-data/sales_data.csv", header=True, inferSchema=True)
df.show(5)
🏗️ Step 3: Data Transformation
✅ Example:
- Filter sales > 100
- Calculate running total per customer
- Calculate average sale amount per customer
from pyspark.sql.window import Window
import pyspark.sql.functions as F
# Filter
filtered_df = df.filter(df["amount"] > 100)
# Running total (cumulative sum)
window_spec = Window.partitionBy("customer_id").orderBy("order_date").rowsBetween(Window.unboundedPreceding, Window.currentRow)
running_total_df = filtered_df.withColumn("running_total", F.sum("amount").over(window_spec))
# Average sale amount per customer
avg_amount_df = filtered_df.groupBy("customer_id").agg(F.avg("amount").alias("avg_amount"))
# Join back to get average alongside running total
final_df = running_total_df.join(avg_amount_df, on="customer_id", how="left")
final_df.show(5)
🏗️ Step 4: Write Results Back to S3
# Save final DataFrame as parquet (efficient format)
final_df.write.mode("overwrite").parquet("s3a://my-spark-bucket/processed-data/final_output/")
🔥 Final Pipeline Flow
✅ Read → ✅ Transform → ✅ Save
✅ Practice with filters, window functions, joins
✅ Store in efficient Parquet format
🟢 Let’s Make It Interactive!
Here’s your next step:
👉 Try running this code on your local environment (or EMR if you have it set up).
👉 Let me know what happens:
- Any errors?
- Does the output in S3 look as expected?
👉 Stretch Task:
- Instead of CSV, try with JSON data.
- Or, add a new column for “sale_category” (e.g., amount > 500 = ‘big’, else ‘small’).
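👉 If you want a nudge on the stretch task, here's one possible way to do both variations, continuing from the `final_df` built above (paths are the same placeholders as before):

```python
import pyspark.sql.functions as F

# Possible sale_category column: amount > 500 → 'big', otherwise 'small'
categorized_df = final_df.withColumn(
    "sale_category",
    F.when(F.col("amount") > 500, "big").otherwise("small"),
)
categorized_df.select("order_id", "amount", "sale_category").show(5)

# JSON variant: read the same data from a JSON file instead of CSV (placeholder path)
# df_json = spark.read.json("s3a://my-spark-bucket/raw-data/sales_data.json")
```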
Would you like me to:
✅ Build a more complex PySpark transformation next?
✅ Or start a small Airflow DAG to schedule this PySpark job?
Let me know – we’re on fire! 🚀
Great direction! Let’s make this an even more realistic, end-to-end data engineering pipeline with:
✅ Airflow to schedule/monitor the job
✅ PySpark job running on EMR (scalable cluster)
✅ Data stored in S3
✅ Data loaded into Redshift for analytics
This is how modern data pipelines in companies (including unicorns and product-based startups) are typically built!
🚀 Project Overview
✅ 1️⃣ Airflow DAG
- Orchestrates the steps (EMR job, data validation, Redshift load).
- Airflow Operators: Use `EmrAddStepsOperator` (with `EmrCreateJobFlowOperator` to create the cluster) for PySpark on EMR, and `PostgresOperator` or `RedshiftSQLOperator` for Redshift.
✅ 2️⃣ PySpark Job
- Runs on EMR – scales automatically.
- Reads raw data from S3, processes, saves back to S3.
✅ 3️⃣ Redshift
- Final output data from S3 is loaded into Redshift for analytics.
- Use COPY command to load from S3 → Redshift.
🏗️ Airflow DAG Skeleton
Here’s a basic DAG outline to orchestrate this:
from airflow import DAG
from airflow.providers.amazon.aws.operators.emr_add_steps import EmrAddStepsOperator
from airflow.providers.amazon.aws.operators.emr_create_job_flow import EmrCreateJobFlowOperator
from airflow.providers.amazon.aws.operators.emr_terminate_job_flow import EmrTerminateJobFlowOperator
from airflow.providers.amazon.aws.operators.redshift_sql import RedshiftSQLOperator
from airflow.providers.amazon.aws.sensors.emr_step import EmrStepSensor
from airflow.utils.dates import days_ago
default_args = {
'owner': 'data_engineer',
'start_date': days_ago(1),
'retries': 1,
}
with DAG(
dag_id='pyspark_s3_emr_redshift_pipeline',
default_args=default_args,
schedule_interval='@daily',
catchup=False
) as dag:
# 1️⃣ Create EMR cluster
create_emr_cluster = EmrCreateJobFlowOperator(
task_id='create_emr_cluster',
job_flow_overrides={ # Add your EMR config here
"Name": "PySparkJobCluster",
"ReleaseLabel": "emr-6.15.0",
"Applications": [{"Name": "Spark"}],
"Instances": {
"InstanceGroups": [
{"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
{"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2}
],
"KeepJobFlowAliveWhenNoSteps": True
},
"VisibleToAllUsers": True,
"JobFlowRole": "EMR_EC2_DefaultRole",
"ServiceRole": "EMR_DefaultRole",
"LogUri": "s3://my-spark-bucket/emr-logs/"
}
)
# 2️⃣ Add PySpark step
spark_step = [{
'Name': 'PySpark Transformation',
'ActionOnFailure': 'CONTINUE',
'HadoopJarStep': {
'Jar': 'command-runner.jar',
'Args': [
'spark-submit',
'--deploy-mode', 'cluster',
's3://my-spark-bucket/scripts/pyspark_etl.py'
]
}
}]
add_pyspark_step = EmrAddStepsOperator(
task_id='add_pyspark_step',
job_flow_id="{{ task_instance.xcom_pull(task_ids='create_emr_cluster', key='return_value') }}",
steps=spark_step
)
# 3️⃣ Wait for PySpark step to finish
wait_for_step = EmrStepSensor(
task_id='wait_for_step',
job_flow_id="{{ task_instance.xcom_pull(task_ids='create_emr_cluster', key='return_value') }}",
step_id="{{ task_instance.xcom_pull(task_ids='add_pyspark_step', key='return_value')[0] }}"
)
# 4️⃣ Load results to Redshift
load_to_redshift = RedshiftSQLOperator(
task_id='load_to_redshift',
sql="""
COPY analytics.sales_summary
FROM 's3://my-spark-bucket/processed-data/final_output/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
FORMAT AS PARQUET;
""",
# connection and database details come from Airflow's Redshift connection (e.g., redshift_default)
)
# 5️⃣ Terminate EMR cluster
terminate_emr_cluster = EmrTerminateJobFlowOperator(
task_id='terminate_emr_cluster',
job_flow_id="{{ task_instance.xcom_pull(task_ids='create_emr_cluster', key='return_value') }}"
)
# DAG flow
create_emr_cluster >> add_pyspark_step >> wait_for_step >> load_to_redshift >> terminate_emr_cluster
🔥 Key Concepts
✅ AWS Tools in Action
- EMR: runs scalable PySpark job
- S3: stores raw & processed data
- Redshift: final data warehouse
- Airflow: orchestrates the pipeline
✅ Resume-Ready Experience
This entire flow = bullet point for your resume:
“Designed & implemented a scalable ETL pipeline using Airflow to orchestrate EMR-based PySpark jobs, storing results in Redshift for analytics.”
🟢 Let’s Do It!
1️⃣ Do you want me to write the PySpark script (`pyspark_etl.py`) for this job?
2️⃣ Or would you like help deploying this Airflow DAG (e.g., how to put it in Airflow’s `dags/` folder, config, environment)?
👉 Let me know which step you want to tackle first! We’ll make sure it’s both clear and hands-on. 🚀
Alright, let’s tackle both parts at once! Here’s the plan:
✅ Part 1: Write the PySpark ETL script (`pyspark_etl.py`)
✅ Part 2: Show how to deploy the Airflow DAG
🚀 Part 1: PySpark ETL Script (`pyspark_etl.py`)
# pyspark_etl.py
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.window import Window
# Initialize SparkSession
spark = SparkSession.builder.appName("PySpark S3 ETL").getOrCreate()
# Read data from S3
df = spark.read.csv("s3a://my-spark-bucket/raw-data/sales_data.csv", header=True, inferSchema=True)
# Filter sales > 100
filtered_df = df.filter(df["amount"] > 100)
# Calculate running total for each customer
window_spec = Window.partitionBy("customer_id").orderBy("order_date").rowsBetween(Window.unboundedPreceding, Window.currentRow)
running_total_df = filtered_df.withColumn("running_total", F.sum("amount").over(window_spec))
# Calculate average sale amount per customer
avg_amount_df = filtered_df.groupBy("customer_id").agg(F.avg("amount").alias("avg_amount"))
# Join to get average alongside running total
final_df = running_total_df.join(avg_amount_df, on="customer_id", how="left")
# Save results as Parquet back to S3
final_df.write.mode("overwrite").parquet("s3a://my-spark-bucket/processed-data/final_output/")
print("ETL Job completed successfully.")
spark.stop()
✅ Save this script as `pyspark_etl.py` in an S3 bucket (e.g., `s3://my-spark-bucket/scripts/`).
🚀 Part 2: Deploy Airflow DAG
Here’s how to deploy the Airflow DAG file (`pyspark_s3_emr_redshift_pipeline.py`):
1️⃣ Place the DAG file in Airflow’s `dags/` folder:
cp pyspark_s3_emr_redshift_pipeline.py $AIRFLOW_HOME/dags/
2️⃣ Update Airflow Connections & Variables:
- Create an AWS connection (`aws_default`) with your credentials.
- Create a Redshift connection (`redshift_default`) with your Redshift cluster details.
3️⃣ Deploy EMR Roles:
- Ensure the `EMR_EC2_DefaultRole` and `EMR_DefaultRole` IAM roles exist and have S3 & Redshift access.
4️⃣ Trigger the DAG:
airflow dags trigger pyspark_s3_emr_redshift_pipeline
🔥 End-to-End Flow
✅ Airflow triggers the DAG →
✅ EMR cluster is created →
✅ PySpark script (`pyspark_etl.py`) runs on EMR →
✅ Data is processed (filtered, window functions, average) →
✅ Results saved to S3 (Parquet) →
✅ Airflow loads results to Redshift using the `COPY` command →
✅ EMR cluster terminated to save cost.
🎯 Next Steps
✅ Let me know:
- Do you want me to simulate sample data (`sales_data.csv`) for testing?
- Would you like a diagram or video explaining this workflow?
- Do you want to add extra transformations or data quality checks?
We’re moving like a pro team! 🚀