Azure Databricks tutorial roadmap (Beginner → Advanced), tailored for Data Engineering interviews in India

Here's a detailed response covering common interview use cases, plus an updated 📘 Databricks Delta Cheat Sheet with key Spark/Delta commands.


🎯 Interview Use Cases (Scenario-Based Answers)


🧩 "How would you design a pipeline to process 1 TB of data daily using Databricks?"

✅ Design Overview (Medallion + Performance + Autoscaling; a Bronze-to-Silver code sketch follows the list):

  1. Ingestion:
    • Use Auto Loader (with cloudFiles) to incrementally load data from Azure Blob/ADLS
    • Enable schema evolution and checkpointing
  2. Bronze Layer:
    • Store raw JSON/CSV as Delta with minimal transformation
    • Partition by date or source ID
  3. Silver Layer:
    • Perform ETL: parsing, filtering, deduplication, normalization
    • Use Delta MERGE (upserts) for Slowly Changing Dimensions
  4. Gold Layer:
    • Aggregate for BI (e.g., daily revenue)
    • Apply ZORDER on frequently queried columns (e.g., user_id, event_date)
    • Enable data skipping
  5. Performance:
    • Enable the Photon runtime
    • Use auto-scaling job clusters
    • Run jobs as parallel tasks using Workflows
  6. Monitoring & Orchestration:
    • Use Databricks Workflows
    • Include email alerts and failure retries
    • Track lineage via Unity Catalog
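
To ground steps 1–3, here is a minimal PySpark sketch of the Bronze → Silver hop. All paths, checkpoint locations, table names, and the event_id key are illustrative assumptions, not part of any standard setup:

```python
from delta.tables import DeltaTable

# Bronze: incremental ingestion with Auto Loader (paths are illustrative)
bronze = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/mnt/schema/events")
    .load("/mnt/raw/events")
)

q = (
    bronze.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/bronze_events")
    .trigger(availableNow=True)  # drain all available files, then stop
    .table("bronze_events")
)
q.awaitTermination()

# Silver: deduplicate, then MERGE (upsert) into the curated table
updates = spark.read.table("bronze_events").dropDuplicates(["event_id"])

silver = DeltaTable.forName(spark, "silver_events")  # assumes the table already exists
(
    silver.alias("t")
    .merge(updates.alias("s"), "t.event_id = s.event_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```

Using trigger(availableNow=True) lets the same streaming code run as a scheduled daily job, which fits a 1 TB/day cadence without a permanently running cluster.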

🧩 "Your Delta table has grown too large; how would you optimize it?"

✅ Optimizations:

  1. ZORDER by frequently queried columns:
     OPTIMIZE big_table ZORDER BY (customer_id, event_date)
  2. VACUUM old data:
     VACUUM big_table RETAIN 168 HOURS
  3. Partition pruning:
    • Partition by a logical column (region, event_date) so queries can skip irrelevant partitions
  4. Auto Compaction + Optimized Writes (Delta properties; a table-level alternative follows the list):
     spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
     spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")
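
The spark.conf settings above apply only for the current session. If you want the behavior pinned to the table itself, Databricks also documents equivalent Delta table properties; a minimal sketch (big_table as above):

```python
# Persist optimized writes + auto compaction on the table itself,
# instead of per-session Spark confs (big_table is an illustrative name)
spark.sql("""
    ALTER TABLE big_table SET TBLPROPERTIES (
        'delta.autoOptimize.optimizeWrite' = 'true',
        'delta.autoOptimize.autoCompact'   = 'true'
    )
""")
```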

🧩 "Describe how to handle schema drift during streaming ingestion."

✅ Solution Using Auto Loader with Evolving Schema:

  1. Enable schema evolution on the read side (Auto Loader tracks the schema stored at cloudFiles.schemaLocation):

     ```python
     df = (spark.readStream.format("cloudFiles")
           .option("cloudFiles.format", "json")
           .option("cloudFiles.schemaLocation", "/mnt/schema")
           .option("cloudFiles.schemaEvolutionMode", "addNewColumns")
           .load("/mnt/data"))
     ```

  2. Use mergeSchema during the write so new columns propagate to the Delta table:

     ```python
     (df.writeStream
        .format("delta")
        .option("mergeSchema", "true")
        .outputMode("append")
        .option("checkpointLocation", "/mnt/checkpoint")
        .table("bronze_table"))
     ```

  3. Persist the evolving schema in /mnt/schema/ so it auto-updates as new fields arrive (rescued-data handling is sketched below).
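
Beyond new columns, Auto Loader parks values that do not fit the inferred schema in a _rescued_data column. A minimal sketch for quarantining such rows, with illustrative table and checkpoint names:

```python
from pyspark.sql import functions as F

# Rows where Auto Loader rescued unparseable or unexpected fields
quarantine = df.filter(F.col("_rescued_data").isNotNull())

(quarantine.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/mnt/checkpoint_quarantine")  # illustrative path
    .table("bronze_quarantine"))                                 # illustrative name
```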

📘 Cheat Sheet: Delta Lake Key Commands

# ✅ Read from Blob/ADLS
df = spark.read.format("csv") \
  .option("header", True) \
  .load("dbfs:/mnt/myblob/data.csv")

# ✅ Write as Delta
df.write.format("delta") \
  .mode("overwrite") \
  .save("/mnt/delta/bronze")

# ✅ Read with Time Travel
df = spark.read.format("delta") \
  .option("versionAsOf", 3) \
  .load("/mnt/delta/bronze")

# ✅ Read Stream with Auto Loader
df = spark.readStream.format("cloudFiles") \
    .option("cloudFiles.format", "json") \
    .option("cloudFiles.schemaLocation", "/mnt/schema") \
    .load("/mnt/data")

-- ✅ Merge (UPSERT)
MERGE INTO target_table USING updates
ON target_table.id = updates.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *

-- ✅ Optimize with ZORDER
OPTIMIZE gold_table ZORDER BY (user_id)

-- ✅ Clean Old Files (default retention: 7 days = 168 hours)
VACUUM gold_table RETAIN 168 HOURS
