Azure Databricks tutorial roadmap (Beginner → Advanced), tailored for Data Engineering interviews in India

# Databricks Advanced: Medallion Architecture, CI/CD, Performance

# COMMAND ----------
# ✅ Bronze Layer: Ingest raw data using Auto Loader

from pyspark.sql.functions import current_timestamp, to_date, count

raw_path = "/mnt/raw/events/"
bronze_table = "bronze_events"

bronze_df = (
  spark.readStream
       .format("cloudFiles")
       .option("cloudFiles.format", "json")
       .option("cloudFiles.schemaLocation", "/mnt/schemas/events/")
       .load(raw_path)
       .withColumn("ingestion_date", current_timestamp())
)

bronze_df.writeStream \
    .format("delta") \
    .option("checkpointLocation", "/mnt/checkpoints/bronze/") \
    .outputMode("append") \
    .table(bronze_table)

# COMMAND ----------
# ✅ Silver Layer: Clean and transform Bronze data

silver_table = "silver_events"
bronze_read = spark.readStream.table(bronze_table)

silver_df = bronze_read \
    .filter("eventType IS NOT NULL") \
    .withColumn("event_date", to_date("eventTime"))

silver_df.writeStream \
    .format("delta") \
    .option("checkpointLocation", "/mnt/checkpoints/silver/") \
    .outputMode("append") \
    .table(silver_table)

# COMMAND ----------
# ✅ Gold Layer: Aggregation for analytics

gold_df = spark.read.table(silver_table) \
    .groupBy("event_date", "eventType") \
    .agg(count("*").alias("event_count"))

(gold_df.write \
    .format("delta") \
    .mode("overwrite") \
    .option("overwriteSchema", "true") \
    .saveAsTable("gold_event_summary"))

# COMMAND ----------
# 🚀 Performance: OPTIMIZE and ZORDER

# Optimize gold table for fast queries
spark.sql("OPTIMIZE gold_event_summary ZORDER BY (event_date)")

# COMMAND ----------
# ⚙️ Git Integration with Repos (UI: Repos > Add Repo)
# Or via CLI/API to sync GitHub or Azure DevOps
# Example: Cloning into /Workspace/Repos/<user-email>/<project-name>
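
# A minimal, hedged sketch of cloning a repo programmatically through the
# Repos REST API (POST /api/2.0/repos). Host, token, org, and e-mail below
# are placeholders, not values from this tutorial:
import requests

host = "https://<databricks-instance>"  # e.g. adb-1234567890123456.7.azuredatabricks.net
token = "<personal-access-token>"       # store in a secret scope in practice

resp = requests.post(
    f"{host}/api/2.0/repos",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "url": "https://github.com/your-org/project-name",
        "provider": "gitHub",
        "path": "/Repos/user@example.com/project-name",
    },
)
resp.raise_for_status()
repo_id = resp.json()["id"]  # keep this id for later branch updates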

# COMMAND ----------
# ⚙️ Job Orchestration with Databricks Workflows
# UI Steps:
# - Create Task 1 (Bronze notebook)
# - Task 2 depends on Task 1 (Silver notebook)
# - Task 3 (Gold notebook)
# - Add schedule, retries, and email alerts
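
# The same Bronze → Silver → Gold chain can be created programmatically via
# the Jobs API 2.1 (POST /api/2.1/jobs/create). A hedged sketch; notebook
# paths, cluster id, cron schedule, and e-mail are placeholders:
import requests

host = "https://<databricks-instance>"
token = "<personal-access-token>"

job_spec = {
    "name": "medallion_pipeline",
    "schedule": {"quartz_cron_expression": "0 0 2 * * ?",
                 "timezone_id": "Asia/Kolkata"},
    "email_notifications": {"on_failure": ["team@example.com"]},
    "tasks": [
        {"task_key": "bronze", "max_retries": 2,
         "notebook_task": {"notebook_path": "/Repos/user@example.com/project-name/01_bronze"},
         "existing_cluster_id": "<cluster-id>"},
        {"task_key": "silver", "max_retries": 2,
         "depends_on": [{"task_key": "bronze"}],
         "notebook_task": {"notebook_path": "/Repos/user@example.com/project-name/02_silver"},
         "existing_cluster_id": "<cluster-id>"},
        {"task_key": "gold",
         "depends_on": [{"task_key": "silver"}],
         "notebook_task": {"notebook_path": "/Repos/user@example.com/project-name/03_gold"},
         "existing_cluster_id": "<cluster-id>"},
    ],
}

resp = requests.post(f"{host}/api/2.1/jobs/create",
                     headers={"Authorization": f"Bearer {token}"},
                     json=job_spec)
resp.raise_for_status()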

# COMMAND ----------
# 🧠 Technical Term: Trigger.AvailableNow (non-continuous streaming)
# - Runs on all available data once, then stops
# - Useful for processing a backlog incrementally without a long-running stream

# Example (in PySpark, availableNow is passed as a keyword argument to
# .trigger(); no Trigger import is needed):

(silver_df.writeStream
    .trigger(availableNow=True)
    .format("delta")
    # Each stream needs its own checkpoint location; reusing the Silver
    # stream's checkpoint above would corrupt its state
    .option("checkpointLocation", "/mnt/checkpoints/silver_once/")
    .outputMode("append")
    .table("silver_events_once"))

# COMMAND ----------
# ✅ Unity Catalog: Centralized data governance
# - Manage permissions per catalog/schema/table
# - Integrates with Azure Purview
# - Enforced row-level/column-level access controls

# Example (SQL):
# GRANT SELECT ON TABLE main.analytics.gold_event_summary TO `finance_team`;
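
# Row-level security sketch (Unity Catalog row-filter syntax; the function
# name and the eventType-based rule are illustrative, not from this tutorial):
# CREATE FUNCTION main.analytics.purchases_only(eventType STRING)
#   RETURN IF(is_account_group_member('finance_team'), TRUE, eventType = 'purchase');
# ALTER TABLE main.analytics.silver_events
#   SET ROW FILTER main.analytics.purchases_only ON (eventType);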

# COMMAND ----------
# ✅ Access Modes
# - Single User: one assigned principal; interactive notebooks, full Unity Catalog support
# - Shared: multiple users on the same cluster with user isolation; Unity Catalog supported with some limitations
# - No Isolation Shared (legacy): no user isolation, deprecated

# Use Single User or Shared access mode with Unity Catalog

# ✅ Workspace Files vs Git Repos
# - Workspace Files: stored in the workspace, no Git versioning
# - Repos: Synced with Git (GitHub, DevOps), supports pull/push/versioning

The rest of this page recaps the concepts, technical terms, and interview questions covered above.

✅ Concepts Explained

1. Medallion Architecture (Bronze → Silver → Gold)

  • Bronze: Raw data ingestion (semi-structured/unstructured)
  • Silver: Cleansed, joined, enriched data
  • Gold: Aggregated, analytics-ready tables

2. Auto Loader vs Batch Load

| Auto Loader | Batch Load |
| --- | --- |
| Incremental file detection | One-time or scheduled loads |
| Uses cloud file notification | Manual or cron-based |
| Scalable and schema-evolving | Mostly static schema |

3. Databricks Workflows (Jobs)

  • Allows multi-task jobs with dependencies
  • Use notebooks, Python scripts, JARs
  • Add retries, email alerts, and triggers

4. CI/CD in Databricks

  • Use Repos to sync with GitHub/Azure DevOps
  • Code versioning + branch deployment
  • Integrate with Azure Pipelines or GitHub Actions
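
A typical deployment step is pointing a production repo at a release branch through the Repos API (PATCH /api/2.0/repos/{repo_id}); a hedged sketch with placeholder host, token, and repo id:

import requests

host = "https://<databricks-instance>"
token = "<personal-access-token>"   # in a pipeline, inject via secret variables
repo_id = 123                       # hypothetical id returned when the repo was cloned

resp = requests.patch(
    f"{host}/api/2.0/repos/{repo_id}",
    headers={"Authorization": f"Bearer {token}"},
    json={"branch": "release"},     # check out the branch being deployed
)
resp.raise_for_status()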

5. Cluster Modes

| Mode | Use Case |
| --- | --- |
| Standard | Interactive dev/testing |
| Job | Automated production workflows |
| Shared | Multi-user shared compute |

✅ Technical Terms

| Term | Meaning |
| --- | --- |
| Trigger.AvailableNow | Stream all available data once, then stop |
| Checkpoints | Track the state of streaming jobs for fault tolerance |
| Unity Catalog | Fine-grained, centralized access control for all tables across workspaces |
| Access Modes | Single User (one principal, full Unity Catalog support), Shared (multi-user, Unity Catalog with limitations) |
| Repos | Git-connected notebooks with push/pull/versioning support |

🎯 Interview Questions & Answers

1. Explain the Medallion Architecture in Databricks.

It’s a multi-layered Lakehouse pattern to separate raw → refined → aggregated data:

  • Bronze: Raw ingestion using Auto Loader or streaming
  • Silver: Data cleaning, joins, deduplication
  • Gold: BI-ready aggregations for dashboards

2. What’s the difference between Auto Loader and Structured Streaming?

  • Auto Loader is a file-ingestion source built on Structured Streaming that uses directory listing or cloud file notifications to incrementally detect and load new files.
  • Structured Streaming is a streaming engine to continuously process data from any source (Kafka, files, sockets).
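
To make the contrast concrete, a sketch of Structured Streaming reading from Kafka instead of files (broker address and topic name are placeholders):

kafka_df = (
    spark.readStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "broker1:9092")
         .option("subscribe", "events")
         .load()
)
# Kafka delivers key/value as binary, so cast before parsing
events = kafka_df.selectExpr("CAST(value AS STRING) AS json_payload")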

3. How do you handle Git-based versioning in Databricks?

  • Use the Repos feature to connect to GitHub/Azure DevOps
  • Clone repo inside workspace → edit → commit & push
  • Supports branch switching, conflict resolution, and CI/CD integrations

4. How do you improve the performance of a slow PySpark job?

  • Optimize join strategy: broadcast small tables
  • Repartition wisely (reduce shuffles)
  • Cache reused DataFrames
  • Use OPTIMIZE ZORDER BY on Delta tables
  • Enable Adaptive Query Execution
  • Use appropriate cluster size and runtime
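
A few of these tips in code form (the dimension table name is illustrative; AQE is on by default in recent runtimes):

from pyspark.sql.functions import broadcast

spark.conf.set("spark.sql.adaptive.enabled", "true")  # Adaptive Query Execution

facts = spark.read.table("silver_events")
small_dim = spark.read.table("dim_event_types")       # hypothetical small lookup table

# Broadcasting the small side avoids a shuffle join
joined = facts.join(broadcast(small_dim), "eventType")

# Cache only if the result is reused several times downstream
joined.cache()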
