# Databricks Advanced: Medallion Architecture, CI/CD, Performance
# COMMAND ----------
# ✅ Bronze Layer: Ingest raw data using Auto Loader
from pyspark.sql.functions import current_timestamp, to_date, count
raw_path = "/mnt/raw/events/"
bronze_table = "bronze_events"
bronze_df = (
spark.readStream
.format("cloudFiles")
.option("cloudFiles.format", "json")
.option("cloudFiles.schemaLocation", "/mnt/schemas/events/")
.load(raw_path)
.withColumn("ingestion_date", current_timestamp())
)
(bronze_df.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/bronze/")
    .outputMode("append")
    .table(bronze_table))
# COMMAND ----------
# ✅ Silver Layer: Clean and transform Bronze data
silver_table = "silver_events"
bronze_read = spark.readStream.table(bronze_table)
silver_df = (
    bronze_read
    .filter("eventType IS NOT NULL")
    .withColumn("event_date", to_date("eventTime"))
)
(silver_df.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/silver/")
    .outputMode("append")
    .table(silver_table))
# COMMAND ----------
# ✅ Gold Layer: Aggregation for analytics
gold_df = (
    spark.read.table(silver_table)
    .groupBy("event_date", "eventType")
    .agg(count("*").alias("event_count"))
)
(gold_df.write
    .format("delta")
    .mode("overwrite")
    .option("overwriteSchema", "true")
    .saveAsTable("gold_event_summary"))
# COMMAND ----------
# 🚀 Performance: OPTIMIZE and ZORDER
# Optimize gold table for fast queries
spark.sql("OPTIMIZE gold_event_summary ZORDER BY (event_date)")
# COMMAND ----------
# ⚙️ Git Integration with Repos (UI: Repos > Add Repo)
# Or via CLI/API to sync GitHub or Azure DevOps
# Example: cloning into /Workspace/Repos/<user-email>/project-name
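# A minimal sketch of creating a Repo programmatically via the Repos REST API
# rather than the UI. The repo URL and workspace path are hypothetical; auth
# assumes a personal access token in DATABRICKS_TOKEN and the workspace URL in
# DATABRICKS_HOST.
import os
import requests

resp = requests.post(
    f"{os.environ['DATABRICKS_HOST']}/api/2.0/repos",
    headers={"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"},
    json={
        "url": "https://github.com/acme/project-name",   # hypothetical Git repo
        "provider": "gitHub",                            # or azureDevOpsServices, etc.
        "path": "/Repos/user@example.com/project-name",  # hypothetical workspace path
    },
)
resp.raise_for_status()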
# COMMAND ----------
# ⚙️ Job Orchestration with Databricks Workflows
# UI Steps:
# - Create Task 1 (Bronze notebook)
# - Task 2 depends on Task 1 (Silver notebook)
# - Task 3 (Gold notebook)
# - Add schedule, retries, and email alerts
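# A minimal sketch of the same three-task pipeline defined through the Jobs API 2.1
# instead of the UI. The job name, notebook paths, and email address are hypothetical;
# auth assumes DATABRICKS_HOST/DATABRICKS_TOKEN env vars as in the Repos example.
# Each task would also need compute (existing_cluster_id or a job cluster), omitted here.
import os
import requests

job_spec = {
    "name": "medallion-pipeline",
    "tasks": [
        {"task_key": "bronze",
         "notebook_task": {"notebook_path": "/Repos/user@example.com/project-name/bronze"},
         "max_retries": 2},
        {"task_key": "silver",
         "depends_on": [{"task_key": "bronze"}],
         "notebook_task": {"notebook_path": "/Repos/user@example.com/project-name/silver"}},
        {"task_key": "gold",
         "depends_on": [{"task_key": "silver"}],
         "notebook_task": {"notebook_path": "/Repos/user@example.com/project-name/gold"}},
    ],
    "email_notifications": {"on_failure": ["user@example.com"]},
    "schedule": {"quartz_cron_expression": "0 0 6 * * ?", "timezone_id": "UTC"},
}
requests.post(
    f"{os.environ['DATABRICKS_HOST']}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"},
    json=job_spec,
).raise_for_status()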
# COMMAND ----------
# 🧠 Technical Term: Trigger.AvailableNow (non-continuous streaming)
# - Processes all data available when the query starts, then stops
# - Useful for incremental, batch-style runs without a long-running stream
# Example:
# Note: PySpark has no Trigger class to import; the trigger is set via a keyword argument
(silver_df.writeStream
    .trigger(availableNow=True)
    .format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/silver_available_now/")  # each sink needs its own checkpoint
    .outputMode("append")
    .table("silver_events_once"))
# COMMAND ----------
# ✅ Unity Catalog: Centralized data governance
# - Manage permissions per catalog/schema/table
# - Integrates with Azure Purview
# - Enforced row-level/column-level access controls
# Example (SQL):
# GRANT SELECT ON TABLE main.analytics.gold_event_summary TO `finance_team`;
# COMMAND ----------
# ✅ Access Modes
# - Single User: a single named user; supports Unity Catalog
# - Shared: multiple users with enforced isolation; supports Unity Catalog
# - No Isolation Shared (legacy): no user isolation; not Unity Catalog-compatible
# Use Single User or Shared mode for Unity Catalog workloads
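# A minimal sketch of pinning the access mode when creating a cluster through the
# Clusters API 2.0. The runtime version and node type are hypothetical; the
# data_security_mode values (SINGLE_USER, USER_ISOLATION, NONE) are the API-side
# names for the modes listed above.
cluster_spec = {
    "cluster_name": "uc-single-user",
    "spark_version": "13.3.x-scala2.12",     # hypothetical LTS runtime
    "node_type_id": "Standard_DS3_v2",       # hypothetical (Azure) node type
    "num_workers": 2,
    "data_security_mode": "SINGLE_USER",     # Unity Catalog-capable access mode
    "single_user_name": "user@example.com",  # hypothetical cluster owner
}
# POST cluster_spec to /api/2.0/clusters/create with the same token auth as above.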
# ✅ Workspace Files vs Git Repos
# - Workspace Files: stored only in the workspace, no Git versioning
# - Repos: Synced with Git (GitHub, DevOps), supports pull/push/versioning
To recap, this advanced tutorial covered:
✅ Concepts Explained
1. Medallion Architecture (Bronze → Silver → Gold)
- Bronze: Raw data ingestion (semi-structured/unstructured)
- Silver: Cleansed, joined, enriched data
- Gold: Aggregated, analytics-ready tables
2. Auto Loader vs Batch Load
| Auto Loader | Batch Load |
|---|---|
| Incremental file detection | One-time or scheduled loads |
| Cloud file notification or directory listing | Manual or cron-based triggers |
| Scalable, with schema evolution support | Mostly static schema |
3. Databricks Workflows (Jobs)
- Allows multi-task jobs with dependencies
- Use notebooks, Python scripts, JARs
- Add retries, email alerts, and triggers
4. CI/CD in Databricks
- Use Repos to sync with GitHub/Azure DevOps
- Code versioning + branch deployment
- Integrate with Azure Pipelines or GitHub Actions
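In a pipeline, the deployment step is often just pointing the workspace Repo at the latest commit of a branch. A sketch using the Repos API (the repo ID is hypothetical; auth follows the same env-var pattern as the earlier API examples):

import os
import requests

repo_id = 1234567890  # hypothetical; list repos via GET /api/2.0/repos to find yours
requests.patch(
    f"{os.environ['DATABRICKS_HOST']}/api/2.0/repos/{repo_id}",
    headers={"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"},
    json={"branch": "main"},  # checks out and pulls the latest commit of main
).raise_for_status()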
5. Cluster Modes
| Mode | Use Case |
|---|---|
| Standard | Interactive dev/testing |
| Job | Automated production workflows |
| Shared | Multi-user shared compute |
✅ Technical Terms
| Term | Meaning |
|---|---|
| Trigger.AvailableNow | Stream all available data once, then stop |
| Checkpoints | Track streaming state for fault tolerance |
| Unity Catalog | Fine-grained, centralized access control for tables across workspaces |
| Access Modes | Single User and Shared (both Unity Catalog-capable); No Isolation is legacy |
| Repos | Git-connected notebooks with push/pull/versioning support |
🎯 Interview Questions & Answers
1. Explain the Medallion Architecture in Databricks.
It’s a multi-layered Lakehouse pattern to separate raw → refined → aggregated data:
- Bronze: Raw ingestion using Auto Loader or streaming
- Silver: Data cleaning, joins, deduplication
- Gold: BI-ready aggregations for dashboards
2. What’s the difference between Auto Loader and Structured Streaming?
- Auto Loader is a file ingestion tool that uses file notification (via cloud events or listing) to incrementally detect and load new files.
- Structured Streaming is a streaming engine to continuously process data from any source (Kafka, files, sockets).
3. How do you handle Git-based versioning in Databricks?
- Use the Repos feature to connect to GitHub/Azure DevOps
- Clone repo inside workspace → edit → commit & push
- Supports branch switching, conflict resolution, and CI/CD integrations
4. How do you improve the performance of a slow PySpark job?
- Optimize join strategy: broadcast small tables
- Repartition wisely (reduce shuffles)
- Cache reused DataFrames
- Use OPTIMIZE ZORDER BY on Delta tables
- Enable Adaptive Query Execution
- Use appropriate cluster size and runtime
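A few of these in code (a sketch; fact and small_dim are hypothetical tables):

from pyspark.sql.functions import broadcast

spark.conf.set("spark.sql.adaptive.enabled", "true")  # Adaptive Query Execution

fact = spark.read.table("fact")            # hypothetical large table
small_dim = spark.read.table("small_dim")  # hypothetical small dimension table

# Broadcast the small side so the join avoids shuffling the large table
joined = fact.join(broadcast(small_dim), "key")

# Cache a DataFrame that several downstream actions reuse, then materialize it
joined.cache()
joined.count()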