Here’s a detailed response covering your interview use cases, plus a clean, updated 📘 Databricks Delta Cheat Sheet with the latest Spark/Delta SQL commands.


🎯 Interview Use Cases (Scenario-Based Answers)


🧩 “How would you design a pipeline to process 1 TB of data daily using Databricks?”

Design Overview (Medallion + Performance + Autoscaling):

  1. Ingestion:
    • Use Auto Loader (with cloudFiles) to incrementally load data from Azure Blob/ADLS
    • Enable schema evolution and checkpointing
  2. Bronze Layer:
    • Store raw JSON/CSV as Delta with minimal transformation
    • Partition by date or source ID
  3. Silver Layer:
    • Perform ETL: parsing, filtering, deduplication, normalization
    • Use Delta format with merge/upsert for Slowly Changing Dimensions
  4. Gold Layer:
    • Aggregate for BI (e.g., daily revenue)
    • Apply ZORDER on key columns (e.g., user_id, event_date)
    • Enable data skipping
  5. Performance:
    • Enable Photon runtime
    • Use Auto-scaling Job Clusters
    • Run jobs in parallel tasks using Workflows
  6. Monitoring & Orchestration:
    • Use Databricks Workflows
    • Include email alerts, failure retries
    • Track lineage via Unity Catalog
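
As a concrete starting point, here is a minimal PySpark sketch of steps 1–2 (Auto Loader into a Bronze Delta table). The paths, the bronze.events table name, and the event_date partition column are illustrative assumptions, not fixed parts of the design:

```python
# Minimal sketch: Auto Loader ingestion into the Bronze layer (steps 1-2 above).
# Paths, the bronze.events table name, and the event_date column are assumptions.
bronze_stream = (
    spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", "/mnt/bronze/_schema")  # tracks the evolving schema
        .load("/mnt/raw/events")
)

(bronze_stream.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/bronze/_checkpoint")  # restartable, exactly-once ingestion
    .option("mergeSchema", "true")                            # tolerate new columns in the source
    .partitionBy("event_date")                                # assumes the payload carries event_date
    .trigger(availableNow=True)                               # incremental batch run for a daily job
    .table("bronze.events")
)
```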

🧩 “Your delta table has grown too large; how would you optimize it?”

Optimizations:

  1. ZORDER by frequently queried columns: OPTIMIZE big_table ZORDER BY (customer_id, event_date)
  2. VACUUM old data: VACUUM big_table RETAIN 168 HOURS
  3. Partition pruning:
    • Repartition by a logical column (region, event_date)
  4. Auto Compaction + Optimized Writes (Delta properties); a table-level variant is sketched below:
    • spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
    • spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")
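
Instead of session-level configs, the same behavior can be pinned on the table itself. A minimal sketch, assuming the Databricks Delta table property names and reusing big_table from above:

```python
# Sketch: enable Optimized Writes / Auto Compaction via Delta table properties
# (property names assume Databricks Delta; big_table is the example table above).
spark.sql("""
    ALTER TABLE big_table SET TBLPROPERTIES (
        'delta.autoOptimize.optimizeWrite' = 'true',
        'delta.autoOptimize.autoCompact'   = 'true'
    )
""")

# Routine maintenance run combining steps 1 and 2:
spark.sql("OPTIMIZE big_table ZORDER BY (customer_id, event_date)")
spark.sql("VACUUM big_table RETAIN 168 HOURS")
```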

🧩 “Describe how to handle schema drift during streaming ingestion.”

Solution Using Auto Loader with Evolving Schema:

  1. Enable schema evolution on the read:

     df = spark.readStream.format("cloudFiles") \
       .option("cloudFiles.format", "json") \
       .option("cloudFiles.schemaLocation", "/mnt/schema") \
       .option("mergeSchema", "true") \
       .load("/mnt/data")

  2. Use mergeSchema during the write:

     df.writeStream \
       .format("delta") \
       .option("mergeSchema", "true") \
       .outputMode("append") \
       .option("checkpointLocation", "/mnt/checkpoint") \
       .table("bronze_table")

  3. Persist the evolving schema in /mnt/schema/ so it auto-updates as new fields arrive.
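
If unexpected fields should be quarantined rather than merged into the schema, Auto Loader's rescue mode is an alternative. A minimal sketch, assuming the documented cloudFiles.schemaEvolutionMode option and the default _rescued_data column, reusing the paths above:

```python
# Sketch: quarantine drifting fields instead of evolving the schema.
# Option and column names assume Databricks Auto Loader defaults; paths reuse the example above.
df = (
    spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", "/mnt/schema")
        .option("cloudFiles.schemaEvolutionMode", "rescue")  # capture unknown fields, don't add columns
        .load("/mnt/data")
)

# Unexpected fields land as JSON in the _rescued_data column for later inspection, e.g.:
# spark.read.table("bronze_table").where("_rescued_data IS NOT NULL")
```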

📘 Cheat Sheet: Delta Lake Key Commands

# ✅ Read from Blob/ADLS
df = spark.read.format("csv") \
  .option("header", True) \
  .load("dbfs:/mnt/myblob/data.csv")

# ✅ Write as Delta
df.write.format("delta") \
  .mode("overwrite") \
  .save("/mnt/delta/bronze")

# ✅ Read with Time Travel
spark.read.format("delta") \
  .option("versionAsOf", 3) \
  .load("/mnt/delta/bronze")

# ✅ Read Stream with Auto Loader
df = spark.readStream.format("cloudFiles") \
    .option("cloudFiles.format", "json") \
    .option("cloudFiles.schemaLocation", "/mnt/schema") \
    .load("/mnt/data")

# ✅ Merge (UPSERT)
MERGE INTO target_table USING updates
ON target_table.id = updates.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *

# ✅ Optimize with ZORDER
OPTIMIZE gold_table ZORDER BY (user_id)

# ✅ Clean Old Files
VACUUM gold_table RETAIN 168 HOURS
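
The same UPSERT can also be run from PySpark through the DeltaTable API. A minimal sketch, assuming target_table is a registered Delta table and updates_df is a DataFrame of incoming rows:

```python
# Sketch: the MERGE above expressed via the Python DeltaTable API (delta-spark).
# Assumes target_table is a registered Delta table and updates_df holds the incoming rows.
from delta.tables import DeltaTable

target = DeltaTable.forName(spark, "target_table")
(target.alias("t")
    .merge(updates_df.alias("u"), "t.id = u.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```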
