SQL Tricky Conceptual Interview Questions

Data cleaning in SQL is a crucial step in data preprocessing, especially when working with real-world messy datasets. Below is a structured breakdown of SQL data cleaning steps, methods, functions, and complex use cases you can apply in real projects or interviews.

✅ Common SQL Data Cleaning Steps & Methods

Step	Method / Function	Example
1. Remove Duplicates	`ROW_NUMBER()`, `DISTINCT`, `GROUP BY`	See below under “Removing Duplicates”
2. Handle NULLs / Missing	`COALESCE()`, `IS NULL`, `CASE WHEN`	Replace or remove nulls: `COALESCE(col, 0)`
3. Fix Inconsistent Values	`LOWER()`, `TRIM()`, `REPLACE()`	Normalize case, remove whitespace
4. Outlier Detection / Fix	`PERCENTILE_CONT`, manual thresholds	Filter ages > 120, salaries > 1 crore, etc.
5. Convert Data Types	`CAST()`, `TRY_CAST()`	`CAST(date_str AS DATE)`
6. Parse and Extract Fields	`SUBSTRING()`, `SPLIT_PART()` (PostgreSQL)	Extract name parts, email domain
7. Standardize Formats	`TO_DATE()`, `FORMAT()`	Fix inconsistent date/time strings
8. Correct Invalid Records	`CASE`, `REGEXP`	Regex to validate phone numbers or emails
9. Impute Values	`AVG() OVER()`, `LAST_VALUE()`	Fill missing using group average or last observed value
10. Remove Irrelevant Data	`WHERE` clause	Remove records with junk or unneeded values

🧠 Example Use Cases

🔁 1. Remove Duplicate Records

WITH Ranked AS (
  SELECT *,
         ROW_NUMBER() OVER(PARTITION BY email ORDER BY updated_at DESC) as rn
  FROM customers
)
SELECT * FROM Ranked
WHERE rn = 1;

Explanation: Keeps only the latest record per email based on updated_at.

🕳️ 2. Replace Missing/Null Values

SELECT id,
       COALESCE(phone, 'N/A') AS phone,
       COALESCE(age, AVG(age) OVER()) AS imputed_age
FROM users;

Explanation: Fills missing phone with ‘N/A’, and missing age with average.

🔤 3. Standardize Case and Trim Whitespace

SELECT 
  id,
  LOWER(TRIM(name)) AS clean_name
FROM employee;

Explanation: Fixes inconsistent casing and trailing/leading spaces.

🧪 4. Validate and Filter Email Format

SELECT *
FROM users
WHERE email ~ '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$';

Explanation: Uses regex to filter only valid email formats (PostgreSQL).

📅 5. Convert Date Strings and Handle Invalids

SELECT *,
       TRY_CAST(birthdate_str AS DATE) AS birthdate
FROM raw_birthdates
WHERE TRY_CAST(birthdate_str AS DATE) IS NOT NULL;

Explanation: Converts string to date; filters out bad records.

📊 6. Outlier Removal (e.g., salary)

WITH SalaryStats AS (
  SELECT PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY salary) AS p95
  FROM employees
)
SELECT e.*
FROM employees e
JOIN SalaryStats s ON e.salary <= s.p95;

Explanation: Filters salaries above 95th percentile (removes outliers).

🔢 7. Fix Date Formats

SELECT id,
       TO_DATE(date_string, 'MM/DD/YYYY') AS cleaned_date
FROM events
WHERE TRY_CAST(TO_DATE(date_string, 'MM/DD/YYYY') AS DATE) IS NOT NULL;

🔄 8. Join to Lookup Table for Standardization

SELECT a.id, l.standard_category
FROM raw_data a
LEFT JOIN category_lookup l ON a.cat_code = l.raw_cat_code;

Use Case: Fix inconsistent categories using a reference table.

🧬 9. Impute Group-Wise Averages

SELECT *,
       COALESCE(score, AVG(score) OVER(PARTITION BY department)) AS imputed_score
FROM performance;

Explanation: Fills NULL scores using department-wise average.

📤 10. Extract Domain from Email

SELECT email,
       SPLIT_PART(email, '@', 2) AS domain
FROM users;

(Use SUBSTRING_INDEX(email, '@', -1) in MySQL)

💡 Tips for Real-World SQL Cleaning:

Use CTEs for multi-step cleaning logic
Create temporary/staging tables for safe operations
Use REGEX functions to validate or extract patterns
Avoid deleting rows directly; prefer marking them inactive
Track changes using IS_CLEANED or last_updated_by metadata fields

🧠 Bonus Complex Challenge Example

Problem:

You receive a user table where:

email is duplicated
age has NULLs and outliers
name is messy (whitespace, inconsistent case)
join_date is in string format

Cleaning Pipeline (All-in-One CTE)

WITH deduped AS (
  SELECT *, ROW_NUMBER() OVER(PARTITION BY email ORDER BY updated_at DESC) AS rn
  FROM raw_users
),
cleaned AS (
  SELECT id,
         LOWER(TRIM(name)) AS clean_name,
         email,
         CASE 
           WHEN age IS NULL THEN AVG(age) OVER()
           WHEN age > 100 THEN NULL
           ELSE age
         END AS fixed_age,
         TO_DATE(join_date, 'YYYY-MM-DD') AS fixed_date
  FROM deduped
  WHERE rn = 1
)
SELECT * FROM cleaned;

-- SQL Data Cleaning Use Cases Notebook (PostgreSQL compatible)

-- 1. Removing Duplicate Records (Keep latest)
WITH deduped AS (
  SELECT *,
         ROW_NUMBER() OVER(PARTITION BY email ORDER BY updated_at DESC) AS rn
  FROM users_raw
)
SELECT * FROM deduped WHERE rn = 1;

-- 2. Handling NULLs: Replace missing phone with 'N/A'
SELECT id, name, COALESCE(phone, 'N/A') AS phone
FROM users_raw;

-- 3. Handling NULLs: Impute missing age with average
SELECT id, name,
       COALESCE(age, AVG(age) OVER()) AS imputed_age
FROM users_raw;

-- 4. Standardizing Case and Trimming
SELECT id, LOWER(TRIM(name)) AS clean_name
FROM users_raw;

-- 5. Removing Outliers (e.g., salary above 95th percentile)
WITH salary_stats AS (
  SELECT PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY salary) AS p95
  FROM employees
)
SELECT e.*
FROM employees e
JOIN salary_stats s ON e.salary <= s.p95;

-- 6. Extracting Domain from Email
SELECT email, SPLIT_PART(email, '@', 2) AS domain
FROM users_raw;

-- 7. Fixing Date Strings
SELECT id, TO_DATE(date_str, 'MM/DD/YYYY') AS fixed_date
FROM event_logs
WHERE TRY_CAST(TO_DATE(date_str, 'MM/DD/YYYY') AS DATE) IS NOT NULL;

-- 8. Regex: Valid email check (PostgreSQL)
SELECT *
FROM users_raw
WHERE email ~ '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$';

-- 9. Group-wise Imputation (e.g., scores per department)
SELECT *,
       COALESCE(score, AVG(score) OVER(PARTITION BY department)) AS imputed_score
FROM performance;

-- 10. Join with Lookup Table for Category Cleanup
SELECT raw.id, raw.cat_code, lookup.standard_category
FROM raw_data raw
LEFT JOIN category_lookup lookup
ON raw.cat_code = lookup.raw_cat_code;

✅ Here’s a SQL notebook covering practical data cleaning use cases (PostgreSQL-compatible). You can adapt it for MySQL or BigQuery with slight syntax tweaks.

🔗 Real Datasets for Practice (Download Links)

Here are great datasets from Kaggle and StrataScratch for cleaning tasks:

Dataset Name	Description	Link
🧍 User Behavior Data	Contains missing and inconsistent user data	Kaggle – User Data
🧾 311 Service Requests	Noisy real-world city data	NYC 311 Calls
📊 Marketing Leads Data	Nulls, date issues, duplicates	StrataScratch Datasets → Register → SQL Practice
💼 HR Data Set	Null ages, duplicate employees, wrong departments	HR Analytics

🔄 PySpark & Pandas Conversion Example

Let’s take the “Clean name, handle null age, drop duplicates by email” example:

🔹 PySpark

from pyspark.sql.functions import col, lower, trim, avg, row_number
from pyspark.sql.window import Window

df = spark.read.csv("users_raw.csv", header=True, inferSchema=True)

# Clean name and impute age
window = Window.partitionBy("email").orderBy(col("updated_at").desc())
df = df.withColumn("rn", row_number().over(window)) \
       .withColumn("clean_name", lower(trim(col("name")))) \
       .withColumn("age", col("age").cast("int"))

avg_age = df.select(avg("age")).collect()[0][0]
df_cleaned = df.filter("rn = 1").fillna({'age': avg_age})

🔹 Pandas

import pandas as pd

df = pd.read_csv("users_raw.csv")
df['clean_name'] = df['name'].str.strip().str.lower()

# Drop duplicates by email (keep latest)
df['updated_at'] = pd.to_datetime(df['updated_at'])
df = df.sort_values('updated_at', ascending=False).drop_duplicates('email')

# Fill missing age
df['age'] = df['age'].fillna(df['age'].mean())

HintsToday

recent posts

about