

💎 Post 4: Delta Lake Deep Dive in Databricks

Powerful Features to Scale Your Data Engineering Projects

Delta Lake is at the heart of the Lakehouse architecture. If you’ve already started exploring Spark and saved your data in Delta format, you’re ready to unlock its real power.

In this guide, we go deep into:

  • MERGE INTO for Upserts
  • Schema Evolution
  • OPTIMIZE, ZORDER, VACUUM
  • Time Travel and Versioning

🚀 Why Delta Lake?

Delta Lake builds ACID transactions, schema enforcement, and versioned data on top of Parquet.
This solves key problems of raw data lakes:

Problem | Delta Lake Solution
Dirty/inconsistent data | ACID transactions
No schema enforcement | Automatic & manual schema control
Hard to debug / roll back | Time travel via the transaction log
Query performance issues | OPTIMIZE + ZORDER

⚙️ Setup: Load a Sample Delta Table

We’ll reuse /delta/customers from the previous post or create a new one:

df = spark.read.option("header", True).csv("/FileStore/tables/customers.csv")
df.write.format("delta").mode("overwrite").save("/delta/customers")

Register it:

CREATE TABLE IF NOT EXISTS customers
USING DELTA
LOCATION '/delta/customers'
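
As a quick sanity check that the table is registered and readable (a minimal snippet; the table name comes from the CREATE TABLE above):

# Confirm the registered table resolves to the Delta path and has rows
spark.sql("SELECT COUNT(*) AS row_count FROM customers").show()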

🔁 1. MERGE INTO – The UPSERT Power

Delta supports UPSERT natively (not available in Parquet or CSV).

🔹 Scenario:

Update existing customers or insert new ones from a staging table.

from pyspark.sql import Row

# Create dummy updates
updates = spark.createDataFrame([
    Row(customer_id=101, city="Mumbai", gender="Male"),
    Row(customer_id=999, city="Delhi", gender="Female")
])

updates.createOrReplaceTempView("updates")

# MERGE INTO
spark.sql("""
MERGE INTO customers AS target
USING updates AS source
ON target.customer_id = source.customer_id
WHEN MATCHED THEN
  UPDATE SET *
WHEN NOT MATCHED THEN
  INSERT *
""")

Note: UPDATE SET * and INSERT * match columns by name, so the source must supply every column of the target table; to update only a few columns, list them explicitly (e.g. UPDATE SET target.city = source.city).

✅ Now check the changes:

SELECT * FROM customers WHERE customer_id IN (101, 999);
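
If you prefer to stay in Python, the same upsert can be expressed with the DeltaTable API (a minimal sketch, assuming the delta-spark package is available, as it is on Databricks; it reuses the updates DataFrame created above):

from delta.tables import DeltaTable

# Bind to the existing Delta table by its storage path
target = DeltaTable.forPath(spark, "/delta/customers")

# Same upsert as the SQL MERGE above; UpdateAll/InsertAll mirror SET * / INSERT *
(target.alias("target")
    .merge(updates.alias("source"), "target.customer_id = source.customer_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())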

🧬 2. Schema Evolution

Delta enforces the table schema by default. To change the schema while overwriting a table, enable the overwriteSchema option (for appends, the mergeSchema option plays the same role).

🔹 Add New Column Example:

from pyspark.sql.functions import lit

# Add a new column to the DataFrame read earlier
new_df = df.withColumn("loyalty_score", lit(100))

new_df.write \
  .format("delta") \
  .mode("overwrite") \
  .option("overwriteSchema", True) \
  .save("/delta/customers")

Now the Delta table has a new column loyalty_score.
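
If you would rather append new rows that carry an extra column instead of rewriting the whole table, Delta also supports schema merging on append (a sketch of the alternative, using the same path as above):

# Alternative to the overwrite above: append rows and let Delta
# add any columns missing from the current table schema
(new_df.write
    .format("delta")
    .mode("append")
    .option("mergeSchema", True)
    .save("/delta/customers"))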


⚡ 3. OPTIMIZE + ZORDER

These improve read performance by compacting small files and clustering the data on key columns.

🔹 OPTIMIZE

OPTIMIZE customers

This compacts small files into fewer large ones → improves scan speed.

🔹 ZORDER (Column Indexing)

OPTIMIZE customers ZORDER BY (city)

This helps prune files during filters like:

SELECT * FROM customers WHERE city = 'Delhi'
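
Both commands can also be run from a Python cell, which makes it easy to inspect the metrics they return (a minimal sketch; customers is the table registered earlier):

# OPTIMIZE returns a one-row DataFrame with compaction metrics
spark.sql("OPTIMIZE customers ZORDER BY (city)").show(truncate=False)

# DESCRIBE DETAIL shows the current file count and table size
spark.sql("DESCRIBE DETAIL customers").select("numFiles", "sizeInBytes").show()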

🧹 4. VACUUM – Clean Up Old Data

Delta keeps old versions for time travel. Use VACUUM to clean them.

-- Remove files older than 7 days (default)
VACUUM customers

-- Aggressive cleanup (use with caution!)
VACUUM customers RETAIN 0 HOURS

🛑 Warning: Time travel no longer works for versions whose files have been vacuumed away!
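
By default Delta refuses a retention window shorter than 7 days. To actually run the aggressive example above in a notebook, you would first have to disable the safety check, shown here only as a sketch and not recommended outside throwaway experiments:

# Danger zone: disabling this check lets VACUUM delete files that
# concurrent readers or time-travel queries may still need
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")

spark.sql("VACUUM customers RETAIN 0 HOURS")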


🕰️ 5. Time Travel and Versioning

Delta Lake lets you query previous versions of your table.

🔹 View History

DESCRIBE HISTORY customers
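
The same history is available from Python through the DeltaTable API, which is handy if you want to filter or join it (a small sketch, assuming delta-spark as before):

from delta.tables import DeltaTable

# Version, timestamp and operation for every commit on the table
DeltaTable.forName(spark, "customers").history() \
    .select("version", "timestamp", "operation").show(truncate=False)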

🔹 Query Previous Version

df_v0 = spark.read.format("delta") \
    .option("versionAsOf", 0) \
    .load("/delta/customers")

df_v0.show()

Or use timestamp:

df_ts = spark.read.format("delta") \
    .option("timestampAsOf", "2024-06-01T10:00:00") \
    .load("/delta/customers")

🧪 Bonus: Check Schema Before and After

print("Current Schema:")
df.printSchema()

print("Historical Schema:")
df_v0.printSchema()

💡 Pro Tips

Use Case | Delta Feature to Use
Insert + update daily records | MERGE INTO
Overwrite with a changed schema | overwriteSchema
Append rows with new columns | mergeSchema
Improve query performance | OPTIMIZE + ZORDER
Query past states of data | Time Travel (versionAsOf)
Clean up unneeded files | VACUUM

📌 Summary

✅ You learned to:

  • Use MERGE INTO for upserts
  • Use overwriteSchema for schema evolution
  • Optimize Delta tables for better performance
  • Use version history and query older data
  • Clean unused files with VACUUM

⏭️ Coming Next:

Post 5: Medallion Architecture with Delta (Bronze → Silver → Gold)
We’ll implement a layered data pipeline using:

  • Raw ingestion (Bronze)
  • Cleaning + joining (Silver)
  • Aggregation for reporting (Gold)

