Data cleaning in SQL is a crucial step in data preprocessing, especially when working with real-world messy datasets. Below is a structured breakdown of SQL data cleaning steps, methods, functions, and complex use cases you can apply in real projects or interviews.


✅ Common SQL Data Cleaning Steps & Methods

StepMethod / FunctionExample
1. Remove DuplicatesROW_NUMBER(), DISTINCT, GROUP BYSee below under “Removing Duplicates”
2. Handle NULLs / MissingCOALESCE(), IS NULL, CASE WHENReplace or remove nulls: COALESCE(col, 0)
3. Fix Inconsistent ValuesLOWER(), TRIM(), REPLACE()Normalize case, remove whitespace
4. Outlier Detection / FixPERCENTILE_CONT, manual thresholdsFilter ages > 120, salaries > 1 crore, etc.
5. Convert Data TypesCAST(), TRY_CAST()CAST(date_str AS DATE)
6. Parse and Extract FieldsSUBSTRING(), SPLIT_PART() (PostgreSQL)Extract name parts, email domain
7. Standardize FormatsTO_DATE(), FORMAT()Fix inconsistent date/time strings
8. Correct Invalid RecordsCASE, REGEXPRegex to validate phone numbers or emails
9. Impute ValuesAVG() OVER(), LAST_VALUE()Fill missing using group average or last observed value
10. Remove Irrelevant DataWHERE clauseRemove records with junk or unneeded values

🧠 Example Use Cases

🔁 1. Remove Duplicate Records

WITH Ranked AS (
  SELECT *,
         ROW_NUMBER() OVER(PARTITION BY email ORDER BY updated_at DESC) as rn
  FROM customers
)
SELECT * FROM Ranked
WHERE rn = 1;

Explanation: Keeps only the latest record per email based on updated_at.


🕳️ 2. Replace Missing/Null Values

SELECT id,
       COALESCE(phone, 'N/A') AS phone,
       COALESCE(age, AVG(age) OVER()) AS imputed_age
FROM users;

Explanation: Fills missing phone with ‘N/A’, and missing age with average.


🔤 3. Standardize Case and Trim Whitespace

SELECT 
  id,
  LOWER(TRIM(name)) AS clean_name
FROM employee;

Explanation: Fixes inconsistent casing and trailing/leading spaces.


🧪 4. Validate and Filter Email Format

SELECT *
FROM users
WHERE email ~ '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$';

Explanation: Uses regex to filter only valid email formats (PostgreSQL).


📅 5. Convert Date Strings and Handle Invalids

SELECT *,
       TRY_CAST(birthdate_str AS DATE) AS birthdate
FROM raw_birthdates
WHERE TRY_CAST(birthdate_str AS DATE) IS NOT NULL;

Explanation: Converts string to date; filters out bad records.


📊 6. Outlier Removal (e.g., salary)

WITH SalaryStats AS (
  SELECT PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY salary) AS p95
  FROM employees
)
SELECT e.*
FROM employees e
JOIN SalaryStats s ON e.salary <= s.p95;

Explanation: Filters salaries above 95th percentile (removes outliers).


🔢 7. Fix Date Formats

SELECT id,
       TO_DATE(date_string, 'MM/DD/YYYY') AS cleaned_date
FROM events
WHERE TRY_CAST(TO_DATE(date_string, 'MM/DD/YYYY') AS DATE) IS NOT NULL;

🔄 8. Join to Lookup Table for Standardization

SELECT a.id, l.standard_category
FROM raw_data a
LEFT JOIN category_lookup l ON a.cat_code = l.raw_cat_code;

Use Case: Fix inconsistent categories using a reference table.


🧬 9. Impute Group-Wise Averages

SELECT *,
       COALESCE(score, AVG(score) OVER(PARTITION BY department)) AS imputed_score
FROM performance;

Explanation: Fills NULL scores using department-wise average.


📤 10. Extract Domain from Email

SELECT email,
       SPLIT_PART(email, '@', 2) AS domain
FROM users;

(Use SUBSTRING_INDEX(email, '@', -1) in MySQL)


💡 Tips for Real-World SQL Cleaning:

  • Use CTEs for multi-step cleaning logic
  • Create temporary/staging tables for safe operations
  • Use REGEX functions to validate or extract patterns
  • Avoid deleting rows directly; prefer marking them inactive
  • Track changes using IS_CLEANED or last_updated_by metadata fields

🧠 Bonus Complex Challenge Example

Problem:

You receive a user table where:

  • email is duplicated
  • age has NULLs and outliers
  • name is messy (whitespace, inconsistent case)
  • join_date is in string format

Cleaning Pipeline (All-in-One CTE)

WITH deduped AS (
  SELECT *, ROW_NUMBER() OVER(PARTITION BY email ORDER BY updated_at DESC) AS rn
  FROM raw_users
),
cleaned AS (
  SELECT id,
         LOWER(TRIM(name)) AS clean_name,
         email,
         CASE 
           WHEN age IS NULL THEN AVG(age) OVER()
           WHEN age > 100 THEN NULL
           ELSE age
         END AS fixed_age,
         TO_DATE(join_date, 'YYYY-MM-DD') AS fixed_date
  FROM deduped
  WHERE rn = 1
)
SELECT * FROM cleaned;

-- SQL Data Cleaning Use Cases Notebook (PostgreSQL compatible)

-- 1. Removing Duplicate Records (Keep latest)
WITH deduped AS (
  SELECT *,
         ROW_NUMBER() OVER(PARTITION BY email ORDER BY updated_at DESC) AS rn
  FROM users_raw
)
SELECT * FROM deduped WHERE rn = 1;

-- 2. Handling NULLs: Replace missing phone with 'N/A'
SELECT id, name, COALESCE(phone, 'N/A') AS phone
FROM users_raw;

-- 3. Handling NULLs: Impute missing age with average
SELECT id, name,
       COALESCE(age, AVG(age) OVER()) AS imputed_age
FROM users_raw;

-- 4. Standardizing Case and Trimming
SELECT id, LOWER(TRIM(name)) AS clean_name
FROM users_raw;

-- 5. Removing Outliers (e.g., salary above 95th percentile)
WITH salary_stats AS (
  SELECT PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY salary) AS p95
  FROM employees
)
SELECT e.*
FROM employees e
JOIN salary_stats s ON e.salary <= s.p95;

-- 6. Extracting Domain from Email
SELECT email, SPLIT_PART(email, '@', 2) AS domain
FROM users_raw;

-- 7. Fixing Date Strings
SELECT id, TO_DATE(date_str, 'MM/DD/YYYY') AS fixed_date
FROM event_logs
WHERE TRY_CAST(TO_DATE(date_str, 'MM/DD/YYYY') AS DATE) IS NOT NULL;

-- 8. Regex: Valid email check (PostgreSQL)
SELECT *
FROM users_raw
WHERE email ~ '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$';

-- 9. Group-wise Imputation (e.g., scores per department)
SELECT *,
       COALESCE(score, AVG(score) OVER(PARTITION BY department)) AS imputed_score
FROM performance;

-- 10. Join with Lookup Table for Category Cleanup
SELECT raw.id, raw.cat_code, lookup.standard_category
FROM raw_data raw
LEFT JOIN category_lookup lookup
ON raw.cat_code = lookup.raw_cat_code;

✅ Here’s a SQL notebook covering practical data cleaning use cases (PostgreSQL-compatible). You can adapt it for MySQL or BigQuery with slight syntax tweaks.


🔗 Real Datasets for Practice (Download Links)

Here are great datasets from Kaggle and StrataScratch for cleaning tasks:

Dataset NameDescriptionLink
🧍 User Behavior DataContains missing and inconsistent user dataKaggle – User Data
🧾 311 Service RequestsNoisy real-world city dataNYC 311 Calls
📊 Marketing Leads DataNulls, date issues, duplicatesStrataScratch Datasets → Register → SQL Practice
💼 HR Data SetNull ages, duplicate employees, wrong departmentsHR Analytics

🔄 PySpark & Pandas Conversion Example

Let’s take the “Clean name, handle null age, drop duplicates by email” example:

🔹 PySpark

from pyspark.sql.functions import col, lower, trim, avg, row_number
from pyspark.sql.window import Window

df = spark.read.csv("users_raw.csv", header=True, inferSchema=True)

# Clean name and impute age
window = Window.partitionBy("email").orderBy(col("updated_at").desc())
df = df.withColumn("rn", row_number().over(window)) \
       .withColumn("clean_name", lower(trim(col("name")))) \
       .withColumn("age", col("age").cast("int"))

avg_age = df.select(avg("age")).collect()[0][0]
df_cleaned = df.filter("rn = 1").fillna({'age': avg_age})

🔹 Pandas

import pandas as pd

df = pd.read_csv("users_raw.csv")
df['clean_name'] = df['name'].str.strip().str.lower()

# Drop duplicates by email (keep latest)
df['updated_at'] = pd.to_datetime(df['updated_at'])
df = df.sort_values('updated_at', ascending=False).drop_duplicates('email')

# Fill missing age
df['age'] = df['age'].fillna(df['age'].mean())

Pages: 1 2 3 4


Discover more from HintsToday

Subscribe to get the latest posts sent to your email.

Posted in

Leave a Reply

Discover more from HintsToday

Subscribe now to keep reading and get access to the full archive.

Continue reading