💎 Post 4: Delta Lake Deep Dive in Databricks
Powerful Features to Scale Your Data Engineering Projects
Delta Lake is at the heart of the Lakehouse architecture. If you’ve already started exploring Spark and saved your data in Delta format, you’re ready to unlock its real power.
In this guide, we go deep into:
- MERGE INTO for Upserts
- Schema Evolution
- OPTIMIZE, ZORDER, VACUUM
- Time Travel and Versioning
🚀 Why Delta Lake?
Delta Lake adds ACID transactions, schema enforcement, and data versioning on top of Parquet.
This solves key problems of raw data lakes (schema enforcement is demonstrated in a short sketch after the table):
| Problem | Delta Lake Solution |
|---|---|
| Dirty/inconsistent data | ACID Transactions |
| No schema enforcement | Auto & manual schema control |
| Hard to debug / rollback | Time travel via transaction log |
| Query performance issues | OPTIMIZE + ZORDER indexing |
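Schema enforcement is easy to see in action. Here is a minimal sketch you can try once the customers table from the setup step below exists (the extra column is purely illustrative): appending a DataFrame with a column the table does not have is rejected, whereas a plain Parquet write would silently accept it.

```python
# Minimal sketch of schema enforcement (unexpected_col is illustrative)
bad_df = spark.createDataFrame(
    [(1, "oops")],
    ["customer_id", "unexpected_col"]
)

try:
    # Fails: unexpected_col is not part of the customers table schema
    bad_df.write.format("delta").mode("append").save("/delta/customers")
except Exception as e:
    print("Schema enforcement blocked the write:", e)
```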
⚙️ Setup: Load a Sample Delta Table
We’ll reuse /delta/customers from the previous post or create a new one:
```python
df = spark.read.option("header", True).csv("/FileStore/tables/customers.csv")
df.write.format("delta").mode("overwrite").save("/delta/customers")
```
Register it:
```sql
CREATE TABLE IF NOT EXISTS customers
USING DELTA
LOCATION '/delta/customers'
```
🔁 1. MERGE INTO – The UPSERT Power
Delta supports UPSERT natively (not available in Parquet or CSV).
🔹 Scenario:
Update existing customers or insert new ones from a staging table.
```python
from pyspark.sql import Row

# Create dummy updates (a staging set of changed/new customers)
updates = spark.createDataFrame([
    Row(customer_id=101, city="Mumbai", gender="Male"),
    Row(customer_id=999, city="Delhi", gender="Female")
])
updates.createOrReplaceTempView("updates")

# MERGE INTO
# Note: UPDATE SET * / INSERT * expect the source to provide the same columns
# as the target table; use explicit column mappings if your schemas differ.
spark.sql("""
    MERGE INTO customers AS target
    USING updates AS source
    ON target.customer_id = source.customer_id
    WHEN MATCHED THEN
      UPDATE SET *
    WHEN NOT MATCHED THEN
      INSERT *
""")
```
✅ Now check changes:
```sql
SELECT * FROM customers WHERE customer_id IN (101, 999);
```
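If you prefer to stay in Python, the same upsert can be expressed with the DeltaTable API instead of SQL. A sketch, reusing the updates DataFrame from above (the delta.tables module ships with Databricks runtimes):

```python
from delta.tables import DeltaTable

# Same upsert as the SQL MERGE, via the Python API
target = DeltaTable.forPath(spark, "/delta/customers")

(target.alias("target")
    .merge(updates.alias("source"), "target.customer_id = source.customer_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```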
🧬 2. Schema Evolution
Delta enforces the table schema by default, so schema changes must be opted into, for example with the overwriteSchema option when overwriting a table.
🔹 Add New Column Example:
```python
from pyspark.sql.functions import lit

new_df = df.withColumn("loyalty_score", lit(100))

(new_df.write
    .format("delta")
    .mode("overwrite")
    .option("overwriteSchema", True)
    .save("/delta/customers"))
```
Now the Delta table has a new column, loyalty_score.
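If you need to append rows that carry a new column, rather than overwriting the whole table, mergeSchema is the append-mode counterpart. A sketch, with signup_channel as a purely illustrative column:

```python
from pyspark.sql.functions import lit

# Append rows carrying an extra column; mergeSchema evolves the table schema
# instead of rejecting the write (signup_channel is illustrative)
more_rows = df.withColumn("signup_channel", lit("web"))

(more_rows.write
    .format("delta")
    .mode("append")
    .option("mergeSchema", True)
    .save("/delta/customers"))
```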
⚡ 3. OPTIMIZE + ZORDER
These improve read performance by compacting small files and co-locating data on key columns so that more files can be skipped at query time.
🔹 OPTIMIZE
```sql
OPTIMIZE customers
```
This compacts small files into fewer large ones → improves scan speed.
🔹 ZORDER (Column Indexing)
```sql
OPTIMIZE customers ZORDER BY (city)
```
This helps prune files during filters like:
```sql
SELECT * FROM customers WHERE city = 'Delhi'
```
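To confirm that compaction actually did something, you can compare the table's file count around the OPTIMIZE call. A sketch using DESCRIBE DETAIL, which reports numFiles for a Delta table:

```python
# Compare file counts before and after compaction (sketch)
before = spark.sql("DESCRIBE DETAIL customers").first()["numFiles"]

# OPTIMIZE also returns a DataFrame of per-run metrics you can inspect
spark.sql("OPTIMIZE customers ZORDER BY (city)")

after = spark.sql("DESCRIBE DETAIL customers").first()["numFiles"]
print(f"Files before: {before}, after: {after}")
```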
🧹 4. VACUUM – Clean Up Old Data
Delta keeps old versions around for time travel. Use VACUUM to clean them up.
```sql
-- Remove files older than 7 days (the default retention period)
VACUUM customers

-- Aggressive cleanup (use with caution!)
-- Requires spark.databricks.delta.retentionDurationCheck.enabled = false,
-- otherwise Delta refuses retention periods shorter than the default
VACUUM customers RETAIN 0 HOURS
```
🛑 Warning: Time travel will not work for deleted versions!
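Before deleting anything for real, you can preview which files VACUUM would remove. A sketch using the DRY RUN option:

```python
# List files that would be removed, without actually deleting them (sketch)
spark.sql("VACUUM customers DRY RUN").show(truncate=False)
```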
🕰️ 5. Time Travel and Versioning
Delta Lake lets you query previous versions of your table.
🔹 View History
```sql
DESCRIBE HISTORY customers
```
🔹 Query Previous Version
```python
df_v0 = spark.read.format("delta") \
    .option("versionAsOf", 0) \
    .load("/delta/customers")

df_v0.show()
```
Or use timestamp:
```python
df_ts = spark.read.format("delta") \
    .option("timestampAsOf", "2024-06-01T10:00:00") \
    .load("/delta/customers")
```
🧪 Bonus: Check Schema Before and After
print("Current Schema:")
df.printSchema()
print("Historical Schema:")
df_v0.printSchema()
💡 Pro Tips
| Use Case | Delta Feature to Use |
|---|---|
| Insert + Update daily records | MERGE INTO |
| Overwrite with a changed schema | overwriteSchema=True |
| Append with a changed schema | mergeSchema=True |
| Improve query performance | OPTIMIZE + ZORDER |
| Query past states of data | Time Travel (versionAsOf) |
| Clean up unneeded files | VACUUM |
📌 Summary
✅ You learned to:
- Do MERGE INTO for upserts
- Use overwriteSchema for schema evolution
- Optimize Delta tables for better performance
- Use version history and query older data
- Clean unused files with VACUUM
⏭️ Coming Next:
Post 5: Medallion Architecture with Delta (Bronze → Silver → Gold)
We’ll implement a layered data pipeline using:
- Raw ingestion (Bronze)
- Cleaning + joining (Silver)
- Aggregation for reporting (Gold)