# Databricks Advanced: Medallion Architecture, CI/CD, Performance
# COMMAND ----------
# ✅ Bronze Layer: Ingest raw data using Auto Loader
from pyspark.sql.functions import current_timestamp, to_date, count
raw_path = "/mnt/raw/events/"
bronze_table = "bronze_events"
bronze_df = (
spark.readStream
.format("cloudFiles")
.option("cloudFiles.format", "json")
.option("cloudFiles.schemaLocation", "/mnt/schemas/events/")
.load(raw_path)
.withColumn("ingestion_date", current_timestamp())
)
(bronze_df.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/bronze/")
    .outputMode("append")
    .table(bronze_table))
# COMMAND ----------
# ✅ Silver Layer: Clean and transform Bronze data
silver_table = "silver_events"
bronze_read = spark.readStream.table(bronze_table)
silver_df = (
    bronze_read
    .filter("eventType IS NOT NULL")
    .withColumn("event_date", to_date("eventTime"))
)
(silver_df.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/silver/")
    .outputMode("append")
    .table(silver_table))
# COMMAND ----------
# ✅ Gold Layer: Aggregation for analytics
gold_df = (
    spark.read.table(silver_table)
    .groupBy("event_date", "eventType")
    .agg(count("*").alias("event_count"))
)
(gold_df.write
    .format("delta")
    .mode("overwrite")
    .option("overwriteSchema", "true")
    .saveAsTable("gold_event_summary"))
# COMMAND ----------
# 🚀 Performance: OPTIMIZE and ZORDER
# Optimize gold table for fast queries
spark.sql("OPTIMIZE gold_event_summary ZORDER BY (event_date)")
# COMMAND ----------
# ⚙️ Git Integration with Repos (UI: Repos > Add Repo)
# Or via CLI/API to sync GitHub or Azure DevOps
# Example: cloning into /Workspace/Repos/<user-email>/project-name
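# A minimal sketch of creating a Repo programmatically via the Repos REST API
# rather than the UI. The repo URL and workspace path are hypothetical; auth
# assumes a personal access token in DATABRICKS_TOKEN and the workspace URL in
# DATABRICKS_HOST.
import os
import requests

resp = requests.post(
    f"{os.environ['DATABRICKS_HOST']}/api/2.0/repos",
    headers={"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"},
    json={
        "url": "https://github.com/acme/project-name",   # hypothetical Git repo
        "provider": "gitHub",                            # or azureDevOpsServices, etc.
        "path": "/Repos/user@example.com/project-name",  # hypothetical workspace path
    },
)
resp.raise_for_status()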
# COMMAND ----------
# ⚙️ Job Orchestration with Databricks Workflows
# UI Steps:
# - Create Task 1 (Bronze notebook)
# - Task 2 depends on Task 1 (Silver notebook)
# - Task 3 (Gold notebook)
# - Add schedule, retries, and email alerts
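# A minimal sketch of the same three-task pipeline defined through the Jobs API 2.1
# instead of the UI. The job name, notebook paths, and email address are hypothetical;
# auth assumes DATABRICKS_HOST/DATABRICKS_TOKEN env vars as in the Repos example.
# Each task would also need compute (existing_cluster_id or a job cluster), omitted here.
import os
import requests

job_spec = {
    "name": "medallion-pipeline",
    "tasks": [
        {"task_key": "bronze",
         "notebook_task": {"notebook_path": "/Repos/user@example.com/project-name/bronze"},
         "max_retries": 2},
        {"task_key": "silver",
         "depends_on": [{"task_key": "bronze"}],
         "notebook_task": {"notebook_path": "/Repos/user@example.com/project-name/silver"}},
        {"task_key": "gold",
         "depends_on": [{"task_key": "silver"}],
         "notebook_task": {"notebook_path": "/Repos/user@example.com/project-name/gold"}},
    ],
    "email_notifications": {"on_failure": ["user@example.com"]},
    "schedule": {"quartz_cron_expression": "0 0 6 * * ?", "timezone_id": "UTC"},
}
requests.post(
    f"{os.environ['DATABRICKS_HOST']}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"},
    json=job_spec,
).raise_for_status()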
# COMMAND ----------
# 🧠 Technical Term: Trigger.AvailableNow (non-continuous streaming)
# - Processes all data available when the query starts, then stops
# - Useful for incremental, batch-style runs without a long-running stream
# Example:
# Note: PySpark has no Trigger class to import; the trigger is set via a keyword argument
(silver_df.writeStream
    .trigger(availableNow=True)
    .format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/silver_available_now/")  # each sink needs its own checkpoint
    .outputMode("append")
    .table("silver_events_once"))
# COMMAND ----------
# ✅ Unity Catalog: Centralized data governance
# - Manage permissions per catalog/schema/table
# - Integrates with Azure Purview
# - Enforced row-level/column-level access controls
# Example (SQL):
# GRANT SELECT ON TABLE main.analytics.gold_event_summary TO `finance_team`;
# COMMAND ----------
# ✅ Access Modes
# - Single User: a single named user; supports Unity Catalog
# - Shared: multiple users with enforced isolation; supports Unity Catalog
# - No Isolation Shared (legacy): no user isolation; not Unity Catalog-compatible
# Use Single User or Shared mode for Unity Catalog workloads
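# A minimal sketch of pinning the access mode when creating a cluster through the
# Clusters API 2.0. The runtime version and node type are hypothetical; the
# data_security_mode values (SINGLE_USER, USER_ISOLATION, NONE) are the API-side
# names for the modes listed above.
cluster_spec = {
    "cluster_name": "uc-single-user",
    "spark_version": "13.3.x-scala2.12",     # hypothetical LTS runtime
    "node_type_id": "Standard_DS3_v2",       # hypothetical (Azure) node type
    "num_workers": 2,
    "data_security_mode": "SINGLE_USER",     # Unity Catalog-capable access mode
    "single_user_name": "user@example.com",  # hypothetical cluster owner
}
# POST cluster_spec to /api/2.0/clusters/create with the same token auth as above.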
# ✅ Workspace Files vs Git Repos
# - Workspace Files: stored only in the workspace, no Git versioning
# - Repos: Synced with Git (GitHub, DevOps), supports pull/push/versioning
To recap, this advanced tutorial covered:
✅ Concepts Explained
1. Medallion Architecture (Bronze → Silver → Gold)
- Bronze: Raw data ingestion (semi-structured/unstructured)
- Silver: Cleansed, joined, enriched data
- Gold: Aggregated, analytics-ready tables
2. Auto Loader vs Batch Load
| Auto Loader | Batch Load |
|---|---|
| Incremental file detection | One-time or scheduled loads |
| Cloud file notification or directory listing | Manual or cron-based triggers |
| Scalable, with schema evolution support | Mostly static schema |
3. Databricks Workflows (Jobs)
- Allows multi-task jobs with dependencies
- Use notebooks, Python scripts, JARs
- Add retries, email alerts, and triggers
4. CI/CD in Databricks
- Use Repos to sync with GitHub/Azure DevOps
- Code versioning + branch deployment
- Integrate with Azure Pipelines or GitHub Actions
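In a pipeline, the deployment step is often just pointing the workspace Repo at the latest commit of a branch. A sketch using the Repos API (the repo ID is hypothetical; auth follows the same env-var pattern as the earlier API examples):

import os
import requests

repo_id = 1234567890  # hypothetical; list repos via GET /api/2.0/repos to find yours
requests.patch(
    f"{os.environ['DATABRICKS_HOST']}/api/2.0/repos/{repo_id}",
    headers={"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"},
    json={"branch": "main"},  # checks out and pulls the latest commit of main
).raise_for_status()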
5. Cluster Modes
| Mode | Use Case |
|---|---|
| Standard | Interactive dev/testing |
| Job | Automated production workflows |
| Shared | Multi-user shared compute |
✅ Technical Terms
| Term | Meaning |
|---|---|
| Trigger.AvailableNow | Stream all available data once, then stop |
| Checkpoints | Track streaming state for fault tolerance |
| Unity Catalog | Fine-grained, centralized access control for tables across workspaces |
| Access Modes | Single User and Shared (both Unity Catalog-capable); No Isolation is legacy |
| Repos | Git-connected notebooks with push/pull/versioning support |
🎯 Interview Questions & Answers
1. Explain the Medallion Architecture in Databricks.
It’s a multi-layered Lakehouse pattern to separate raw → refined → aggregated data:
- Bronze: Raw ingestion using Auto Loader or streaming
- Silver: Data cleaning, joins, deduplication
- Gold: BI-ready aggregations for dashboards
2. What’s the difference between Auto Loader and Structured Streaming?
- Auto Loader is a file ingestion tool that uses file notification (via cloud events or listing) to incrementally detect and load new files.
- Structured Streaming is a streaming engine to continuously process data from any source (Kafka, files, sockets).
3. How do you handle Git-based versioning in Databricks?
- Use the Repos feature to connect to GitHub/Azure DevOps
- Clone repo inside workspace → edit → commit & push
- Supports branch switching, conflict resolution, and CI/CD integrations
4. How do you improve the performance of a slow PySpark job?
- Optimize join strategy: broadcast small tables
- Repartition wisely (reduce shuffles)
- Cache reused DataFrames
- Use OPTIMIZE ZORDER BY on Delta tables
- Enable Adaptive Query Execution
- Use appropriate cluster size and runtime
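A few of these in code (a sketch; fact and small_dim are hypothetical tables):

from pyspark.sql.functions import broadcast

spark.conf.set("spark.sql.adaptive.enabled", "true")  # Adaptive Query Execution

fact = spark.read.table("fact")            # hypothetical large table
small_dim = spark.read.table("small_dim")  # hypothetical small dimension table

# Broadcast the small side so the join avoids shuffling the large table
joined = fact.join(broadcast(small_dim), "key")

# Cache a DataFrame that several downstream actions reuse, then materialize it
joined.cache()
joined.count()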