Data cleaning in SQL is a crucial step in data preprocessing, especially when working with real-world messy datasets. Below is a structured breakdown of SQL data cleaning steps, methods, functions, and complex use cases you can apply in real projects or interviews.
✅ Common SQL Data Cleaning Steps & Methods
Step | Method / Function | Example |
---|---|---|
1. Remove Duplicates | ROW_NUMBER() , DISTINCT , GROUP BY | See below under “Removing Duplicates” |
2. Handle NULLs / Missing | COALESCE() , IS NULL , CASE WHEN | Replace or remove nulls: COALESCE(col, 0) |
3. Fix Inconsistent Values | LOWER() , TRIM() , REPLACE() | Normalize case, remove whitespace |
4. Outlier Detection / Fix | PERCENTILE_CONT , manual thresholds | Filter ages > 120, salaries > 1 crore, etc. |
5. Convert Data Types | CAST() , TRY_CAST() | CAST(date_str AS DATE) |
6. Parse and Extract Fields | SUBSTRING() , SPLIT_PART() (PostgreSQL) | Extract name parts, email domain |
7. Standardize Formats | TO_DATE() , FORMAT() | Fix inconsistent date/time strings |
8. Correct Invalid Records | CASE , REGEXP | Regex to validate phone numbers or emails |
9. Impute Values | AVG() OVER() , LAST_VALUE() | Fill missing using group average or last observed value |
10. Remove Irrelevant Data | WHERE clause | Remove records with junk or unneeded values |
🧠 Example Use Cases
🔁 1. Remove Duplicate Records
WITH Ranked AS (
SELECT *,
ROW_NUMBER() OVER(PARTITION BY email ORDER BY updated_at DESC) as rn
FROM customers
)
SELECT * FROM Ranked
WHERE rn = 1;
Explanation: Keeps only the latest record per email based on updated_at
.
🕳️ 2. Replace Missing/Null Values
SELECT id,
COALESCE(phone, 'N/A') AS phone,
COALESCE(age, AVG(age) OVER()) AS imputed_age
FROM users;
Explanation: Fills missing phone with ‘N/A’, and missing age with average.
🔤 3. Standardize Case and Trim Whitespace
SELECT
id,
LOWER(TRIM(name)) AS clean_name
FROM employee;
Explanation: Fixes inconsistent casing and trailing/leading spaces.
🧪 4. Validate and Filter Email Format
SELECT *
FROM users
WHERE email ~ '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$';
Explanation: Uses regex to filter only valid email formats (PostgreSQL).
📅 5. Convert Date Strings and Handle Invalids
SELECT *,
TRY_CAST(birthdate_str AS DATE) AS birthdate
FROM raw_birthdates
WHERE TRY_CAST(birthdate_str AS DATE) IS NOT NULL;
Explanation: Converts string to date; filters out bad records.
📊 6. Outlier Removal (e.g., salary)
WITH SalaryStats AS (
SELECT PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY salary) AS p95
FROM employees
)
SELECT e.*
FROM employees e
JOIN SalaryStats s ON e.salary <= s.p95;
Explanation: Filters salaries above 95th percentile (removes outliers).
🔢 7. Fix Date Formats
SELECT id,
TO_DATE(date_string, 'MM/DD/YYYY') AS cleaned_date
FROM events
WHERE TRY_CAST(TO_DATE(date_string, 'MM/DD/YYYY') AS DATE) IS NOT NULL;
🔄 8. Join to Lookup Table for Standardization
SELECT a.id, l.standard_category
FROM raw_data a
LEFT JOIN category_lookup l ON a.cat_code = l.raw_cat_code;
Use Case: Fix inconsistent categories using a reference table.
🧬 9. Impute Group-Wise Averages
SELECT *,
COALESCE(score, AVG(score) OVER(PARTITION BY department)) AS imputed_score
FROM performance;
Explanation: Fills NULL scores using department-wise average.
📤 10. Extract Domain from Email
SELECT email,
SPLIT_PART(email, '@', 2) AS domain
FROM users;
(Use SUBSTRING_INDEX(email, '@', -1)
in MySQL)
💡 Tips for Real-World SQL Cleaning:
- Use CTEs for multi-step cleaning logic
- Create temporary/staging tables for safe operations
- Use REGEX functions to validate or extract patterns
- Avoid deleting rows directly; prefer marking them inactive
- Track changes using
IS_CLEANED
orlast_updated_by
metadata fields
🧠 Bonus Complex Challenge Example
Problem:
You receive a user table where:
email
is duplicatedage
has NULLs and outliersname
is messy (whitespace, inconsistent case)join_date
is in string format
Cleaning Pipeline (All-in-One CTE)
WITH deduped AS (
SELECT *, ROW_NUMBER() OVER(PARTITION BY email ORDER BY updated_at DESC) AS rn
FROM raw_users
),
cleaned AS (
SELECT id,
LOWER(TRIM(name)) AS clean_name,
email,
CASE
WHEN age IS NULL THEN AVG(age) OVER()
WHEN age > 100 THEN NULL
ELSE age
END AS fixed_age,
TO_DATE(join_date, 'YYYY-MM-DD') AS fixed_date
FROM deduped
WHERE rn = 1
)
SELECT * FROM cleaned;
-- SQL Data Cleaning Use Cases Notebook (PostgreSQL compatible)
-- 1. Removing Duplicate Records (Keep latest)
WITH deduped AS (
SELECT *,
ROW_NUMBER() OVER(PARTITION BY email ORDER BY updated_at DESC) AS rn
FROM users_raw
)
SELECT * FROM deduped WHERE rn = 1;
-- 2. Handling NULLs: Replace missing phone with 'N/A'
SELECT id, name, COALESCE(phone, 'N/A') AS phone
FROM users_raw;
-- 3. Handling NULLs: Impute missing age with average
SELECT id, name,
COALESCE(age, AVG(age) OVER()) AS imputed_age
FROM users_raw;
-- 4. Standardizing Case and Trimming
SELECT id, LOWER(TRIM(name)) AS clean_name
FROM users_raw;
-- 5. Removing Outliers (e.g., salary above 95th percentile)
WITH salary_stats AS (
SELECT PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY salary) AS p95
FROM employees
)
SELECT e.*
FROM employees e
JOIN salary_stats s ON e.salary <= s.p95;
-- 6. Extracting Domain from Email
SELECT email, SPLIT_PART(email, '@', 2) AS domain
FROM users_raw;
-- 7. Fixing Date Strings
SELECT id, TO_DATE(date_str, 'MM/DD/YYYY') AS fixed_date
FROM event_logs
WHERE TRY_CAST(TO_DATE(date_str, 'MM/DD/YYYY') AS DATE) IS NOT NULL;
-- 8. Regex: Valid email check (PostgreSQL)
SELECT *
FROM users_raw
WHERE email ~ '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$';
-- 9. Group-wise Imputation (e.g., scores per department)
SELECT *,
COALESCE(score, AVG(score) OVER(PARTITION BY department)) AS imputed_score
FROM performance;
-- 10. Join with Lookup Table for Category Cleanup
SELECT raw.id, raw.cat_code, lookup.standard_category
FROM raw_data raw
LEFT JOIN category_lookup lookup
ON raw.cat_code = lookup.raw_cat_code;
✅ Here’s a SQL notebook covering practical data cleaning use cases (PostgreSQL-compatible). You can adapt it for MySQL or BigQuery with slight syntax tweaks.
🔗 Real Datasets for Practice (Download Links)
Here are great datasets from Kaggle and StrataScratch for cleaning tasks:
Dataset Name | Description | Link |
---|---|---|
🧍 User Behavior Data | Contains missing and inconsistent user data | Kaggle – User Data |
🧾 311 Service Requests | Noisy real-world city data | NYC 311 Calls |
📊 Marketing Leads Data | Nulls, date issues, duplicates | StrataScratch Datasets → Register → SQL Practice |
💼 HR Data Set | Null ages, duplicate employees, wrong departments | HR Analytics |
🔄 PySpark & Pandas Conversion Example
Let’s take the “Clean name, handle null age, drop duplicates by email” example:
🔹 PySpark
from pyspark.sql.functions import col, lower, trim, avg, row_number
from pyspark.sql.window import Window
df = spark.read.csv("users_raw.csv", header=True, inferSchema=True)
# Clean name and impute age
window = Window.partitionBy("email").orderBy(col("updated_at").desc())
df = df.withColumn("rn", row_number().over(window)) \
.withColumn("clean_name", lower(trim(col("name")))) \
.withColumn("age", col("age").cast("int"))
avg_age = df.select(avg("age")).collect()[0][0]
df_cleaned = df.filter("rn = 1").fillna({'age': avg_age})
🔹 Pandas
import pandas as pd
df = pd.read_csv("users_raw.csv")
df['clean_name'] = df['name'].str.strip().str.lower()
# Drop duplicates by email (keep latest)
df['updated_at'] = pd.to_datetime(df['updated_at'])
df = df.sort_values('updated_at', ascending=False).drop_duplicates('email')
# Fill missing age
df['age'] = df['age'].fillna(df['age'].mean())
Leave a Reply