Designing and developing scalable data pipelines using Azure Databricks and the Medallion Architecture (Bronze, Silver, Gold) is a common and robust strategy for modern data engineering. Below is a complete practical guide, including:
- Architecture design
- Technology choices (especially on Azure)
- Pipeline stages (Bronze, Silver, Gold)
- Sample Databricks notebooks (PySpark)
- Optimization, governance, and interview-ready notes
1. What Is Medallion Architecture?
The Medallion Architecture breaks a data pipeline into three stages:
Layer | Purpose | Example Ops |
---|---|---|
Bronze | Raw ingestion (landing zone) | Ingest from source, add ingestion metadata |
Silver | Cleaned/transformed data | Filter, join, enrich, deduplicate |
Gold | Business-level aggregates or dimensions/facts | Aggregations, KPIs, reporting-ready |
2. Azure Databricks + Medallion Architecture
Key Azure Services:
Component | Purpose |
---|---|
Azure Data Lake Gen2 (ADLS) | Storage for all layers |
Azure Databricks | Compute engine (Spark) |
Delta Lake | Storage layer with ACID support |
Azure Event Hub / ADF | Ingestion trigger (batch/stream) |
Unity Catalog | Governance, security, lineage |
3. Folder Structure in ADLS (Example)
abfss://datalake@storageaccount.dfs.core.windows.net/
├── bronze/
│   └── sales_raw/
├── silver/
│   └── sales_clean/
└── gold/
    └── sales_summary/
Use the Delta format throughout all layers.
4. Sample Pipeline Overview
Let’s design a Sales Data Pipeline:
Bronze Layer (Raw Ingestion)
from pyspark.sql.functions import input_file_name, current_timestamp
raw_df = (spark.read
.option("header", True)
.csv("abfss://raw@storageaccount.dfs.core.windows.net/sales/*.csv"))
bronze_df = raw_df.withColumn("source_file", input_file_name()) \
.withColumn("ingestion_time", current_timestamp())
bronze_df.write.format("delta").mode("append") \
.save("abfss://datalake@storageaccount.dfs.core.windows.net/bronze/sales_raw")
Silver Layer (Cleansing + Enrichment)
bronze_df = spark.read.format("delta").load("abfss://datalake@storageaccount.dfs.core.windows.net/bronze/sales_raw")
silver_df = bronze_df.filter("amount > 0").dropDuplicates(["order_id"]) \
.withColumnRenamed("cust_id", "customer_id")
silver_df.write.format("delta").mode("overwrite") \
.save("abfss://datalake@storageaccount.dfs.core.windows.net/silver/sales_clean")
Gold Layer (Aggregated for BI)
silver_df = spark.read.format("delta").load("abfss://datalake@storageaccount.dfs.core.windows.net/silver/sales_clean")
from pyspark.sql.functions import sum, countDistinct

gold_df = silver_df.groupBy("region").agg(
sum("amount").alias("total_sales"),
countDistinct("order_id").alias("orders_count")
)
gold_df.write.format("delta").mode("overwrite") \
.save("abfss://datalake@storageaccount.dfs.core.windows.net/gold/sales_summary")
5. Optimization & Best Practices
Delta Lake Features
- Schema evolution
- Time travel
- Merge (Upsert)
- Z-Ordering (for file pruning)
- Auto Optimize + Auto Compaction
OPTIMIZE delta.`/gold/sales_summary` ZORDER BY (region)
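Two of the features above in code form, continuing the gold_df example (the version number and the appended columns are illustrative):

# Time travel: read an earlier version of the Gold table
gold_path = "abfss://datalake@storageaccount.dfs.core.windows.net/gold/sales_summary"
previous_df = (spark.read.format("delta")
    .option("versionAsOf", 0)
    .load(gold_path))

# Schema evolution: allow new columns when appending to an existing Delta table
(gold_df.write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .save(gold_path))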
Partitioning Strategy
Partition by date fields: partitionBy("sale_date")
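For example, writing the Silver table partitioned by a sale_date column (assuming the source data actually carries such a column):

# Partition the Silver table by date to enable partition pruning (sale_date is assumed to exist)
(silver_df.write.format("delta")
    .mode("overwrite")
    .partitionBy("sale_date")
    .save("abfss://datalake@storageaccount.dfs.core.windows.net/silver/sales_clean"))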
6. Governance and Security (Unity Catalog on Azure)
Use Unity Catalog for:
- Fine-grained access control (table/column/row-level)
- Lineage tracking
- Tagging PII/Confidential fields
GRANT SELECT ON TABLE gold.sales_summary TO `data_analyst_group`
7. Orchestration on Azure
Tool | Purpose |
---|---|
Azure Data Factory | Schedule and orchestrate notebooks |
Azure Event Hub | Real-time data triggers |
Databricks Workflows | Native orchestration inside Databricks |
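For lightweight chaining inside Databricks, a small driver notebook can call the three layer notebooks in sequence with dbutils.notebook.run; a sketch assuming the notebook paths from section 8 (in production, Databricks Workflows or ADF would own scheduling and retries):

# Driver notebook: run the layers in sequence (paths assumed from section 8)
timeout_seconds = 3600

dbutils.notebook.run("/notebooks/bronze_ingestion_sales", timeout_seconds)
dbutils.notebook.run("/notebooks/silver_transform_sales", timeout_seconds)
dbutils.notebook.run("/notebooks/gold_aggregate_sales", timeout_seconds)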
8. Sample Notebooks (Structure)
You can organize notebooks as follows:
/notebooks/
├── bronze_ingestion_sales.py
├── silver_transform_sales.py
├── gold_aggregate_sales.py
└── utils/
    └── helpers.py   # Common code: paths, configs, logging
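A minimal helpers.py could centralize paths so every notebook builds them the same way (the storage account name is a placeholder):

# utils/helpers.py - shared path configuration (sketch)
STORAGE_ACCOUNT = "storageaccount"  # placeholder
LAKE_ROOT = f"abfss://datalake@{STORAGE_ACCOUNT}.dfs.core.windows.net"

def layer_path(layer: str, dataset: str) -> str:
    """Build the ADLS path for a layer and dataset, e.g. layer_path('bronze', 'sales_raw')."""
    return f"{LAKE_ROOT}/{layer}/{dataset}"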
9. Interview / Design Notes
Q: Why use Medallion Architecture?
- Modularity (each layer has a purpose)
- Auditing & Debugging (bronze holds raw)
- Cost-effective: only process what's needed
Q: Why Delta Lake?
- ACID transactions + versioning
- Works seamlessly with Spark
- Reliable for batch + streaming
Q: When to use streaming?
- For near-real-time dashboards or IoT logs
- Use readStream and writeStream in Spark Structured Streaming (see the sketch below)
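A minimal Structured Streaming sketch that keeps Silver continuously in sync with Bronze (the checkpoint path is illustrative; dropDuplicates without a watermark keeps state indefinitely, which is acceptable for a sketch):

# Stream Bronze -> Silver with Structured Streaming (sketch)
bronze_path = "abfss://datalake@storageaccount.dfs.core.windows.net/bronze/sales_raw"
silver_path = "abfss://datalake@storageaccount.dfs.core.windows.net/silver/sales_clean"
checkpoint = "abfss://datalake@storageaccount.dfs.core.windows.net/_checkpoints/silver_sales"

stream_df = (spark.readStream.format("delta").load(bronze_path)
    .filter("amount > 0")
    .dropDuplicates(["order_id"]))

(stream_df.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", checkpoint)
    .start(silver_path))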
10. Optional Enhancements
Feature | Tool / Method |
---|---|
Data Quality Checks | Deequ or Great Expectations |
Alerting | Azure Monitor / email from ADF |
Versioned Metadata | Delta + Unity Catalog |
Logs | Audit tables in Delta |
Retry Logic | Use try-except in notebook |
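A simple data quality gate without extra libraries might look like the sketch below; in practice Deequ or Great Expectations would replace it (the rules and the silver_df reference are illustrative):

# Minimal data quality checks before promoting Silver data (rules are illustrative)
from pyspark.sql.functions import col

def run_quality_checks(df):
    checks = {
        "no_null_order_ids": df.filter(col("order_id").isNull()).count() == 0,
        "positive_amounts": df.filter(col("amount") <= 0).count() == 0,
    }
    failed = [name for name, passed in checks.items() if not passed]
    if failed:
        raise ValueError(f"Data quality checks failed: {failed}")

run_quality_checks(silver_df)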
Final Output for End Users
Layer | Consumed By |
---|---|
Bronze | Data Engineers (debugging) |
Silver | ML Models / Data Scientists |
Gold | BI Tools (Power BI, Tableau) |
Gold layer data can be loaded directly into Power BI via Azure Synapse or a Databricks SQL endpoint.
Here's a full set of inline Python Databricks notebooks using the Medallion Architecture (Bronze → Silver → Gold) on Azure, plus examples of Unity Catalog policies for secure, governed access.
Databricks Notebooks - Inline Python Examples
1_bronze_ingestion_sales.py
# Databricks notebook: Bronze layer - ingest raw sales data
from pyspark.sql.functions import input_file_name, current_timestamp
# Define path
bronze_path = "abfss://datalake@<your_storage>.dfs.core.windows.net/bronze/sales_raw/"
# Read CSV from raw landing zone
df_raw = (spark.read
.option("header", True)
.csv("abfss://raw@<your_storage>.dfs.core.windows.net/sales/*.csv"))
# Add metadata
df_bronze = df_raw.withColumn("ingestion_time", current_timestamp()) \
.withColumn("source_file", input_file_name())
# Save to Delta
df_bronze.write.format("delta").mode("append").save(bronze_path)
print("Bronze ingestion completed.")
2_silver_transform_sales.py
# Databricks notebook: Silver layer - clean and enrich sales data
from pyspark.sql.functions import col
# Load from Bronze
bronze_path = "abfss://datalake@<your_storage>.dfs.core.windows.net/bronze/sales_raw/"
silver_path = "abfss://datalake@<your_storage>.dfs.core.windows.net/silver/sales_clean/"
df_bronze = spark.read.format("delta").load(bronze_path)
# Cleaning & transformation
df_silver = df_bronze.filter(col("amount") > 0) \
.dropDuplicates(["order_id"]) \
.withColumnRenamed("cust_id", "customer_id")
# Save to Silver layer
df_silver.write.format("delta").mode("overwrite").save(silver_path)
print("Silver transformation completed.")
3_gold_aggregate_sales.py
# Databricks notebook: Gold layer - create business aggregates
from pyspark.sql.functions import sum, countDistinct
silver_path = "abfss://datalake@<your_storage>.dfs.core.windows.net/silver/sales_clean/"
gold_path = "abfss://datalake@<your_storage>.dfs.core.windows.net/gold/sales_summary/"
df_silver = spark.read.format("delta").load(silver_path)
# Aggregation by region
df_gold = df_silver.groupBy("region").agg(
sum("amount").alias("total_sales"),
countDistinct("order_id").alias("order_count")
)
df_gold.write.format("delta").mode("overwrite").save(gold_path)
print("Gold layer aggregated output created.")
4_optimize_zorder.py
# Optional optimization step for Gold Delta table
gold_path = "abfss://datalake@<your_storage>.dfs.core.windows.net/gold/sales_summary/"
spark.sql(f"OPTIMIZE delta.`{gold_path}` ZORDER BY (region)")
Unity Catalog Policy Examples
Assuming Unity Catalog is enabled and the table is saved in catalog → schema → table form (e.g., sales_catalog.sales_schema.sales_summary), here's how to secure it.
1. Create Managed Table in Unity Catalog
# Save Gold table in Unity Catalog
df_gold.write.format("delta").mode("overwrite").saveAsTable("sales_catalog.sales_schema.sales_summary")
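Note that the catalog and schema must already exist before saveAsTable is called; a sketch of the one-time setup DDL (catalog and schema names follow the example above):

# One-time setup: create the catalog and schema (requires metastore privileges)
spark.sql("CREATE CATALOG IF NOT EXISTS sales_catalog")
spark.sql("CREATE SCHEMA IF NOT EXISTS sales_catalog.sales_schema")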
2. Grant Access to Data Analysts
-- Allow a group to read the summary data
GRANT SELECT ON TABLE sales_catalog.sales_schema.sales_summary TO `data_analyst_group`;
3. Restrict PII Columns
-- Unity Catalog column masks: define a masking function, then attach it to the email column (if it exists)
CREATE OR REPLACE FUNCTION sales_catalog.sales_schema.mask_email(email STRING)
RETURNS STRING
RETURN CASE
  WHEN is_account_group_member('pii_readers') THEN email
  ELSE '*** MASKED ***'
END;
-- Attach the mask to the column
ALTER TABLE sales_catalog.sales_schema.customer_info
ALTER COLUMN email
SET MASK sales_catalog.sales_schema.mask_email;
4. Tag Columns for Sensitive Data
-- Add tags to mark sensitive fields for governance
ALTER TABLE sales_catalog.sales_schema.customer_info
ALTER COLUMN email
SET TAGS ('sensitivity' = 'pii', 'classification' = 'confidential');
5. Lineage Tracking & Auditing (Automatic in Unity Catalog)
Databricks automatically tracks:
- Source-to-target lineage
- Who accessed what data
- When and how (SQL, notebook, job, etc.)
You can view this in Data Explorer > Table > Lineage.
Final Notes
- Replace <your_storage> with your real storage account name.
- Register datasets as Unity Catalog tables to get full governance.
- Store notebooks in workspace folders like /Repos/MedallionPipeline/bronze_ingestion_sales.