Here's a detailed response covering your interview use cases, plus a clean, updated 📘 Databricks Delta Cheat Sheet with the latest Spark/Delta SQL commands.
🎯 Interview Use Cases (Scenario-Based Answers)
🧩 "How would you design a pipeline to process 1 TB of data daily using Databricks?"
✅ Design Overview (Medallion + Performance + Autoscaling), with an end-to-end code sketch after the list:
- Ingestion:
- Use Auto Loader (with cloudFiles) to incrementally load data from Azure Blob/ADLS
- Enable schema evolution and checkpointing
- Bronze Layer:
- Store raw JSON/CSV as Delta with minimal transformation
- Partition by date or source ID
- Silver Layer:
- Perform ETL: parsing, filtering, deduplication, normalization
- Use Delta format with merge/upsert for Slowly Changing Dimensions
- Gold Layer:
- Aggregate for BI (e.g., daily revenue)
- Apply ZORDER on key columns (e.g., user_id, event_date)
- Enable data skipping
- Performance:
- Enable Photon runtime
- Use Auto-scaling Job Clusters
- Run jobs in parallel tasks using Workflows
- Monitoring & Orchestration:
- Use Databricks Workflows
- Include email alerts, failure retries
- Track lineage via Unity Catalog
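Below is a minimal PySpark sketch of this flow. Table names (bronze_events, silver_events, gold_daily_revenue), paths under /mnt/, and columns (event_id, user_id, event_date, amount) are illustrative assumptions, and the MERGE step assumes the silver_events Delta table already exists:

```python
from pyspark.sql import functions as F

# Bronze: incremental ingestion with Auto Loader; inferred schema is tracked in schemaLocation
bronze_query = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/mnt/schema/bronze_events")
    .load("/mnt/raw/events")
    .writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/bronze_events")
    .partitionBy("event_date")            # partition by date for pruning
    .trigger(availableNow=True)           # process the daily backlog, then stop
    .outputMode("append")
    .toTable("bronze_events")
)
bronze_query.awaitTermination()

# Silver: filter and deduplicate, then upsert so late or corrected records are handled
(spark.read.table("bronze_events")
    .filter(F.col("user_id").isNotNull())
    .dropDuplicates(["event_id"])
    .createOrReplaceTempView("silver_updates"))

spark.sql("""
    MERGE INTO silver_events AS t
    USING silver_updates AS s
    ON t.event_id = s.event_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")

# Gold: BI-level aggregate (daily revenue), then ZORDER for data skipping
(spark.read.table("silver_events")
    .groupBy("event_date")
    .agg(F.sum("amount").alias("daily_revenue"))
    .write.format("delta")
    .mode("overwrite")
    .saveAsTable("gold_daily_revenue"))

spark.sql("OPTIMIZE gold_daily_revenue ZORDER BY (event_date)")
```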
🧩 "Your Delta table has grown too large; how would you optimize it?"
✅ Optimizations:
- ZORDER by frequently queried columns:
OPTIMIZE big_table ZORDER BY (customer_id, event_date)
- VACUUM old data:
VACUUM big_table RETAIN 168 HOURS
- Partition pruning:
- Repartition by a logical column (region, event_date)
- Auto Compaction + Optimize Writes (Delta Properties):
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")
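These properties and maintenance commands can be combined into a small scheduled job (a sketch reusing the example names big_table, customer_id, and event_date from this section):

```python
# Encourage larger files at write time and compact small files automatically
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")

# Periodic maintenance: co-locate hot columns, then drop unreferenced files
spark.sql("OPTIMIZE big_table ZORDER BY (customer_id, event_date)")
spark.sql("VACUUM big_table RETAIN 168 HOURS")   # default 7-day retention
```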
🧩 "Describe how to handle schema drift during streaming ingestion."
✅ Solution Using Auto Loader with Evolving Schema:
- Enable schema evolution:
df = spark.readStream.format("cloudFiles") \
    .option("cloudFiles.format", "json") \
    .option("cloudFiles.schemaLocation", "/mnt/schema") \
    .option("cloudFiles.schemaEvolutionMode", "addNewColumns") \
    .load("/mnt/data")
- Use mergeSchema during the write:
df.writeStream \
    .format("delta") \
    .option("mergeSchema", "true") \
    .outputMode("append") \
    .option("checkpointLocation", "/mnt/checkpoint") \
    .toTable("bronze_table")
- Persist the evolving schema in /mnt/schema/ so it auto-updates.
📘 Cheat Sheet: Delta Lake Key Commands
# ✅ Read from Blob/ADLS
spark.read.format("csv") \
.option("header", True) \
.load("dbfs:/mnt/myblob/data.csv")
# ✅ Write as Delta
df.write.format("delta") \
.mode("overwrite") \
.save("/mnt/delta/bronze")
# ✅ Read with Time Travel
spark.read.format("delta") \
.option("versionAsOf", 3) \
.load("/mnt/delta/bronze")
# ✅ Read Stream with Auto Loader
df = spark.readStream.format("cloudFiles") \
.option("cloudFiles.format", "json") \
.option("cloudFiles.schemaLocation", "/mnt/schema") \
.load("/mnt/data")
# ✅ Merge (UPSERT)
MERGE INTO target_table USING updates
ON target_table.id = updates.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
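The same upsert in PySpark with the DeltaTable API (a sketch; updates_df is an assumed DataFrame of incoming rows keyed by id):

```python
from delta.tables import DeltaTable

# Upsert updates_df into target_table on the id key
target = DeltaTable.forName(spark, "target_table")
(target.alias("t")
    .merge(updates_df.alias("u"), "t.id = u.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```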
# ✅ Optimize with ZORDER
OPTIMIZE gold_table ZORDER BY (user_id)
# ✅ Clean Old Files
VACUUM gold_table RETAIN 168 HOURS