# Databricks Advanced: Medallion Architecture, CI/CD, Performance

# COMMAND ----------
# ✅ Bronze Layer: Ingest raw data using Auto Loader

from pyspark.sql.functions import current_timestamp, to_date, count

raw_path = "/mnt/raw/events/"
bronze_table = "bronze_events"

bronze_df = (
  spark.readStream
       .format("cloudFiles")
       .option("cloudFiles.format", "json")
       .option("cloudFiles.schemaLocation", "/mnt/schemas/events/")
       .load(raw_path)
       .withColumn("ingestion_date", current_timestamp())
)

bronze_df.writeStream \
    .format("delta") \
    .option("checkpointLocation", "/mnt/checkpoints/bronze/") \
    .outputMode("append") \
    .toTable(bronze_table)

# COMMAND ----------
# ✅ Silver Layer: Clean and transform Bronze data

silver_table = "silver_events"
bronze_read = spark.readStream.table(bronze_table)

silver_df = bronze_read \
    .filter("eventType IS NOT NULL") \
    .withColumn("event_date", to_date("eventTime"))

silver_df.writeStream \
    .format("delta") \
    .option("checkpointLocation", "/mnt/checkpoints/silver/") \
    .outputMode("append") \
    .toTable(silver_table)

# COMMAND ----------
# ✅ Gold Layer: Aggregation for analytics

gold_df = spark.read.table(silver_table) \
    .groupBy("event_date", "eventType") \
    .agg(count("*").alias("event_count"))

(gold_df.write \
    .format("delta") \
    .mode("overwrite") \
    .option("overwriteSchema", "true") \
    .saveAsTable("gold_event_summary"))

# COMMAND ----------
# 🚀 Performance: OPTIMIZE and ZORDER

# Optimize gold table for fast queries
spark.sql("OPTIMIZE gold_event_summary ZORDER BY (event_date)")

# COMMAND ----------
# ⚙️ Git Integration with Repos (UI: Repos > Add Repo)
# Or via the CLI/REST API to sync with GitHub or Azure DevOps
# Example: cloning into /Workspace/Repos/<user-email>/project-name
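
# A sketch of the same setup via the Repos REST API (POST /api/2.0/repos);
# the host, token, Git URL, and path below are placeholders, not values from this tutorial.
import requests

host = "https://<your-workspace>.azuredatabricks.net"  # placeholder workspace URL
token = "<personal-access-token>"                      # placeholder PAT

resp = requests.post(
    f"{host}/api/2.0/repos",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "url": "https://github.com/<org>/project-name",  # placeholder Git URL
        "provider": "gitHub",
        "path": "/Repos/<user-email>/project-name",
    },
)
resp.raise_for_status()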

# COMMAND ----------
# ⚙️ Job Orchestration with Databricks Workflows
# UI Steps:
# - Create Task 1 (Bronze notebook)
# - Task 2 depends on Task 1 (Silver notebook)
# - Task 3 (Gold notebook)
# - Add schedule, retries, and email alerts
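
# A sketch of this pipeline expressed as a Jobs API 2.1 payload (POST /api/2.1/jobs/create);
# notebook paths, cluster id, schedule, and email address are placeholders.
job_payload = {
    "name": "medallion-pipeline",
    "schedule": {"quartz_cron_expression": "0 0 6 * * ?", "timezone_id": "UTC"},
    "tasks": [
        {
            "task_key": "bronze",
            "notebook_task": {"notebook_path": "/Repos/<user-email>/project-name/bronze"},
            "existing_cluster_id": "<cluster-id>",
            "max_retries": 2,
        },
        {
            "task_key": "silver",
            "depends_on": [{"task_key": "bronze"}],  # runs only after bronze succeeds
            "notebook_task": {"notebook_path": "/Repos/<user-email>/project-name/silver"},
            "existing_cluster_id": "<cluster-id>",
        },
        {
            "task_key": "gold",
            "depends_on": [{"task_key": "silver"}],
            "notebook_task": {"notebook_path": "/Repos/<user-email>/project-name/gold"},
            "existing_cluster_id": "<cluster-id>",
        },
    ],
    "email_notifications": {"on_failure": ["<user-email>"]},
}
# POST job_payload to {host}/api/2.1/jobs/create with the same auth header as above.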

# COMMAND ----------
# 🧠 Technical Term: Trigger.AvailableNow (non-continuous streaming)
# - Processes everything available when the query starts (possibly across several micro-batches), then stops
# - Useful for incremental, scheduled processing without a long-running stream

# Example (PySpark takes the trigger as a keyword argument, so no import is needed):
query = (silver_df.writeStream
    .trigger(availableNow=True)
    .format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/silver_once/")  # each query needs its own checkpoint
    .outputMode("append")
    .toTable("silver_events_once"))

query.awaitTermination()  # returns once all available data has been processed

# COMMAND ----------
# ✅ Unity Catalog: Centralized data governance
# - Manage permissions per catalog/schema/table
# - Integrates with Microsoft Purview (formerly Azure Purview)
# - Enforces row-level and column-level access controls

# Example (SQL):
# GRANT SELECT ON TABLE main.analytics.gold_event_summary TO `finance_team`;
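
# The same grants can be issued from a notebook with spark.sql; the catalog,
# schema, and group names are the ones from the example above.
spark.sql("GRANT USE CATALOG ON CATALOG main TO `finance_team`")
spark.sql("GRANT USE SCHEMA ON SCHEMA main.analytics TO `finance_team`")
spark.sql("GRANT SELECT ON TABLE main.analytics.gold_event_summary TO `finance_team`")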

# COMMAND ----------
# ✅ Access Modes
# - Single User: One designated user; for interactive notebooks and single-owner jobs
# - Shared: Multiple users share the cluster with user isolation
# - No Isolation Shared (legacy): No isolation between users; avoid for new workloads

# Unity Catalog requires the Single User or Shared access mode;
# No Isolation Shared clusters cannot access Unity Catalog data.

# ✅ Workspace Files vs Git Repos
# - Workspace Files: Local, no versioning
# - Repos: Synced with Git (GitHub, DevOps), supports pull/push/versioning

The sections below recap the concepts, technical terms, and interview questions covered in this tutorial.

✅ Concepts Explained

1. Medallion Architecture (Bronze → Silver → Gold)

  • Bronze: Raw data ingestion (semi-structured/unstructured)
  • Silver: Cleansed, joined, enriched data
  • Gold: Aggregated, analytics-ready tables

2. Auto Loader vs Batch Load

| Auto Loader | Batch Load |
| --- | --- |
| Incremental file detection | One-time or scheduled loads |
| Uses cloud file notifications | Manual or cron-based |
| Scalable and schema-evolving | Mostly static schema |

3. Databricks Workflows (Jobs)

  • Allows multi-task jobs with dependencies
  • Use notebooks, Python scripts, JARs
  • Add retries, email alerts, and triggers

4. CI/CD in Databricks

  • Use Repos to sync with GitHub/Azure DevOps
  • Code versioning + branch deployment
  • Integrate with Azure Pipelines or GitHub Actions

5. Cluster Modes

| Mode | Use Case |
| --- | --- |
| Standard | Interactive dev/testing |
| Job | Automated production workflows |
| Shared | Multi-user shared compute |

✅ Technical Terms

| Term | Meaning |
| --- | --- |
| Trigger.AvailableNow | Process all available data once, then stop |
| Checkpoints | Track streaming-query state for fault tolerance |
| Unity Catalog | Fine-grained, centralized access control for tables across workspaces |
| Access Modes | Single User and Shared support Unity Catalog; No Isolation Shared is legacy |
| Repos | Git-connected notebooks with push/pull/versioning support |

🎯 Interview Questions & Answers

1. Explain the Medallion Architecture in Databricks.

It’s a multi-layered Lakehouse pattern to separate raw → refined → aggregated data:

  • Bronze: Raw ingestion using Auto Loader or streaming
  • Silver: Data cleaning, joins, deduplication
  • Gold: BI-ready aggregations for dashboards

2. What’s the difference between Auto Loader and Structured Streaming?

  • Auto Loader is a file-ingestion source built on top of Structured Streaming; it uses cloud file notifications or directory listing to incrementally detect and load new files.
  • Structured Streaming is the general-purpose streaming engine that continuously processes data from any supported source (Kafka, files, sockets).
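
A minimal sketch of a Structured Streaming read from Kafka, for contrast with the file-based Auto Loader reads above (the broker address and topic name are placeholders):

kafka_df = (
    spark.readStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "<broker>:9092")
         .option("subscribe", "<topic>")
         .load()
)
events = kafka_df.selectExpr("CAST(value AS STRING) AS json_payload")  # Kafka values arrive as binary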

3. How do you handle Git-based versioning in Databricks?

  • Use the Repos feature to connect to GitHub/Azure DevOps
  • Clone repo inside workspace → edit → commit & push
  • Supports branch switching, conflict resolution, and CI/CD integrations

4. How do you improve the performance of a slow PySpark job?

  • Optimize join strategy: broadcast small tables
  • Repartition wisely (reduce shuffles)
  • Cache reused DataFrames
  • Use OPTIMIZE ZORDER BY on Delta tables
  • Enable Adaptive Query Execution
  • Use appropriate cluster size and runtime
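
A short sketch combining several of these techniques; the small lookup table dim_event_types is hypothetical, while silver_events is the table built earlier in this tutorial:

from pyspark.sql.functions import broadcast

spark.conf.set("spark.sql.adaptive.enabled", "true")  # AQE (on by default in recent runtimes)

events = spark.read.table("silver_events")
event_types = spark.read.table("dim_event_types")  # hypothetical small dimension table

joined = events.join(broadcast(event_types), "eventType")  # broadcast the small side: no shuffle join
joined.cache()  # cache only when several downstream actions reuse this DataFrame

summary = joined.repartition("event_date").groupBy("event_date").count()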
