

💎 Post 4: Delta Lake Deep Dive in Databricks

Powerful Features to Scale Your Data Engineering Projects

Delta Lake is at the heart of the Lakehouse architecture. If you’ve already started exploring Spark and saved your data in Delta format, you’re ready to unlock its real power.

In this guide, we go deep into:

  • MERGE INTO for Upserts
  • Schema Evolution
  • OPTIMIZE, ZORDER, VACUUM
  • Time Travel and Versioning

🚀 Why Delta Lake?

Delta Lake builds ACID transactions, schema enforcement, and versioned data on top of Parquet.
This solves key problems of raw data lakes:

Problem | Delta Lake Solution
Dirty/inconsistent data | ACID transactions
No schema enforcement | Automatic & manual schema control
Hard to debug / roll back | Time travel via the transaction log
Query performance issues | OPTIMIZE + ZORDER

⚙️ Setup: Load a Sample Delta Table

We’ll reuse /delta/customers from the previous post or create a new one:

df = spark.read.option("header", True).csv("/FileStore/tables/customers.csv")
df.write.format("delta").mode("overwrite").save("/delta/customers")

Register it:

CREATE TABLE IF NOT EXISTS customers
USING DELTA
LOCATION '/delta/customers'
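
As a quick sanity check that the table is registered and readable (a minimal snippet; the table name comes from the CREATE TABLE above):

# Confirm the registered table resolves to the Delta path and has rows
spark.sql("SELECT COUNT(*) AS row_count FROM customers").show()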

🔁 1. MERGE INTO – The UPSERT Power

Delta supports UPSERT natively (not available in Parquet or CSV).

🔹 Scenario:

Update existing customers or insert new ones from a staging table.

from pyspark.sql import Row

# Create dummy updates
updates = spark.createDataFrame([
    Row(customer_id=101, city="Mumbai", gender="Male"),
    Row(customer_id=999, city="Delhi", gender="Female")
])

updates.createOrReplaceTempView("updates")

# MERGE INTO
spark.sql("""
MERGE INTO customers AS target
USING updates AS source
ON target.customer_id = source.customer_id
WHEN MATCHED THEN
  UPDATE SET *
WHEN NOT MATCHED THEN
  INSERT *
""")

Note: UPDATE SET * and INSERT * match columns by name, so the source must supply every column of the target table; to update only a few columns, list them explicitly (e.g. UPDATE SET target.city = source.city).

✅ Now check the changes:

SELECT * FROM customers WHERE customer_id IN (101, 999);
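
If you prefer to stay in Python, the same upsert can be expressed with the DeltaTable API (a minimal sketch, assuming the delta-spark package is available, as it is on Databricks; it reuses the updates DataFrame created above):

from delta.tables import DeltaTable

# Bind to the existing Delta table by its storage path
target = DeltaTable.forPath(spark, "/delta/customers")

# Same upsert as the SQL MERGE above; UpdateAll/InsertAll mirror SET * / INSERT *
(target.alias("target")
    .merge(updates.alias("source"), "target.customer_id = source.customer_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())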

🧬 2. Schema Evolution

Delta enforces the table schema by default. To change the schema while overwriting a table, enable the overwriteSchema option (for appends, the mergeSchema option plays the same role).

🔹 Add New Column Example:

from pyspark.sql.functions import lit

# Add a new column to the DataFrame read earlier
new_df = df.withColumn("loyalty_score", lit(100))

new_df.write \
  .format("delta") \
  .mode("overwrite") \
  .option("overwriteSchema", True) \
  .save("/delta/customers")

Now the Delta table has a new column loyalty_score.
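
If you would rather append new rows that carry an extra column instead of rewriting the whole table, Delta also supports schema merging on append (a sketch of the alternative, using the same path as above):

# Alternative to the overwrite above: append rows and let Delta
# add any columns missing from the current table schema
(new_df.write
    .format("delta")
    .mode("append")
    .option("mergeSchema", True)
    .save("/delta/customers"))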


⚡ 3. OPTIMIZE + ZORDER

These improve read performance by compacting small files and clustering the data on key columns.

🔹 OPTIMIZE

OPTIMIZE customers

This compacts small files into fewer large ones → improves scan speed.

🔹 ZORDER (Column Indexing)

OPTIMIZE customers ZORDER BY (city)

This helps prune files during filters like:

SELECT * FROM customers WHERE city = 'Delhi'
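
Both commands can also be run from a Python cell, which makes it easy to inspect the metrics they return (a minimal sketch; customers is the table registered earlier):

# OPTIMIZE returns a one-row DataFrame with compaction metrics
spark.sql("OPTIMIZE customers ZORDER BY (city)").show(truncate=False)

# DESCRIBE DETAIL shows the current file count and table size
spark.sql("DESCRIBE DETAIL customers").select("numFiles", "sizeInBytes").show()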

🧹 4. VACUUM – Clean Up Old Data

Delta keeps old versions for time travel. Use VACUUM to clean them.

-- Remove files older than 7 days (default)
VACUUM customers

-- Aggressive cleanup (use with caution!)
VACUUM customers RETAIN 0 HOURS

🛑 Warning: Time travel no longer works for versions whose files have been vacuumed away!
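
By default Delta refuses a retention window shorter than 7 days. To actually run the aggressive example above in a notebook, you would first have to disable the safety check, shown here only as a sketch and not recommended outside throwaway experiments:

# Danger zone: disabling this check lets VACUUM delete files that
# concurrent readers or time-travel queries may still need
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")

spark.sql("VACUUM customers RETAIN 0 HOURS")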


🕰️ 5. Time Travel and Versioning

Delta Lake lets you query previous versions of your table.

🔹 View History

DESCRIBE HISTORY customers
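
The same history is available from Python through the DeltaTable API, which is handy if you want to filter or join it (a small sketch, assuming delta-spark as before):

from delta.tables import DeltaTable

# Version, timestamp and operation for every commit on the table
DeltaTable.forName(spark, "customers").history() \
    .select("version", "timestamp", "operation").show(truncate=False)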

🔹 Query Previous Version

df_v0 = spark.read.format("delta") \
    .option("versionAsOf", 0) \
    .load("/delta/customers")

df_v0.show()

Or use timestamp:

df_ts = spark.read.format("delta") \
    .option("timestampAsOf", "2024-06-01T10:00:00") \
    .load("/delta/customers")

🧪 Bonus: Check Schema Before and After

print("Current Schema:")
df.printSchema()

print("Historical Schema:")
df_v0.printSchema()

💡 Pro Tips

Use Case | Delta Feature to Use
Insert + update daily records | MERGE INTO
Overwrite with a changed schema | overwriteSchema
Append rows with new columns | mergeSchema
Improve query performance | OPTIMIZE + ZORDER
Query past states of data | Time Travel (versionAsOf)
Clean up unneeded files | VACUUM

📌 Summary

✅ You learned to:

  • Use MERGE INTO for upserts
  • Use overwriteSchema for schema evolution
  • Optimize Delta tables for better performance
  • Use version history and query older data
  • Clean unused files with VACUUM

⏭️ Coming Next:

Post 5: Medallion Architecture with Delta (Bronze → Silver → Gold)
We’ll implement a layered data pipeline using:

  • Raw ingestion (Bronze)
  • Cleaning + joining (Silver)
  • Aggregation for reporting (Gold)

