Designing and developing scalable data pipelines using Azure Databricks and the Medallion Architecture (Bronze, Silver, Gold) is a common and robust strategy for modern data engineering. Below is a complete practical guide, including:

  • Architecture design
  • Technology choices (especially on Azure)
  • Pipeline stages (Bronze, Silver, Gold)
  • Sample Databricks notebooks (PySpark)
  • Optimization, governance, and interview-ready notes

🔷 1. What Is Medallion Architecture?

The Medallion Architecture breaks a data pipeline into three stages:

Layer  | Purpose                                       | Example Ops
Bronze | Raw ingestion (landing zone)                  | Ingest from source, add ingestion metadata
Silver | Cleaned/transformed data                      | Filter, join, enrich, deduplicate
Gold   | Business-level aggregates or dimensions/facts | Aggregations, KPIs, reporting-ready

🔷 2. Azure Databricks + Medallion Architecture

Key Azure Services:

Component                   | Purpose
Azure Data Lake Gen2 (ADLS) | Storage for all layers
Azure Databricks            | Compute engine (Spark)
Delta Lake                  | Storage layer with ACID support
Azure Event Hub / ADF       | Ingestion trigger (batch/stream)
Unity Catalog               | Governance, security, lineage

🏗️ 3. Folder Structure in ADLS (Example)

abfss://datalake@storageaccount.dfs.core.windows.net/
├── bronze/
│   └── sales_raw/
├── silver/
│   └── sales_clean/
├── gold/
│   └── sales_summary/

Use Delta format throughout all layers.
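
Before a notebook can read or write these paths, Spark needs access to the storage account. A minimal sketch, assuming account-key authentication via a secret scope named "adls-scope" (in production, a service principal or managed identity is more typical); <your_storage> and the secret names are placeholders:

storage_account = "<your_storage>"

# Grant this cluster access to ADLS Gen2 using an account key stored in a secret scope
spark.conf.set(
    f"fs.azure.account.key.{storage_account}.dfs.core.windows.net",
    dbutils.secrets.get(scope="adls-scope", key="storage-account-key")
)

# Base path reused when building the bronze/silver/gold paths below
base_path = f"abfss://datalake@{storage_account}.dfs.core.windows.net"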


🧪 4. Sample Pipeline Overview

Let's design a Sales Data Pipeline:

🔸 Bronze Layer (Raw Ingestion)

from pyspark.sql.functions import input_file_name, current_timestamp

raw_df = (spark.read
          .option("header", True)
          .csv("abfss://raw@storageaccount.dfs.core.windows.net/sales/*.csv"))

bronze_df = raw_df.withColumn("source_file", input_file_name()) \
                  .withColumn("ingestion_time", current_timestamp())

bronze_df.write.format("delta").mode("append") \
    .save("abfss://datalake@storageaccount.dfs.core.windows.net/bronze/sales_raw")

🔹 Silver Layer (Cleansing + Enrichment)

bronze_df = spark.read.format("delta").load("abfss://datalake@storageaccount.dfs.core.windows.net/bronze/sales_raw")

silver_df = bronze_df.filter("amount > 0").dropDuplicates(["order_id"]) \
                     .withColumnRenamed("cust_id", "customer_id")

silver_df.write.format("delta").mode("overwrite") \
    .save("abfss://datalake@storageaccount.dfs.core.windows.net/silver/sales_clean")

🟑 Gold Layer (Aggregated for BI)

silver_df = spark.read.format("delta").load("abfss://datalake@storageaccount.dfs.core.windows.net/silver/sales_clean")

from pyspark.sql.functions import sum, countDistinct

gold_df = silver_df.groupBy("region").agg(
    sum("amount").alias("total_sales"),
    countDistinct("order_id").alias("orders_count")
)

gold_df.write.format("delta").mode("overwrite") \
    .save("abfss://datalake@storageaccount.dfs.core.windows.net/gold/sales_summary")

⚙️ 5. Optimization & Best Practices

✅ Delta Lake Features

  • Schema evolution
  • Time travel
  • Merge (Upsert), see the sketch below
  • Z-Ordering (for file pruning)
  • Auto Optimize + Auto Compaction
OPTIMIZE delta.`/gold/sales_summary` ZORDER BY (region)
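
For the Merge (Upsert) item above, a minimal sketch of upserting a batch of new or changed records into the Silver table keyed on order_id; updates_df is a hypothetical DataFrame of incoming rows, and the path matches the Silver location used earlier:

from delta.tables import DeltaTable

silver_path = "abfss://datalake@<your_storage>.dfs.core.windows.net/silver/sales_clean"

# Update matching orders, insert the rest
silver_table = DeltaTable.forPath(spark, silver_path)
(silver_table.alias("t")
    .merge(updates_df.alias("s"), "t.order_id = s.order_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())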

✅ Partitioning Strategy

Partition by date fields: partitionBy("sale_date")
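
A minimal sketch of the Silver write with a date partition, assuming the data carries a sale_date column:

# Partition the Silver table by sale_date so date-filtered queries prune files
(silver_df.write
    .format("delta")
    .mode("overwrite")
    .partitionBy("sale_date")
    .save("abfss://datalake@<your_storage>.dfs.core.windows.net/silver/sales_clean"))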


🔐 6. Governance and Security (Unity Catalog on Azure)

  • Use Unity Catalog for:
    • Fine-grained access control (table/column/row-level)
    • Lineage tracking
    • Tagging PII/Confidential fields
GRANT SELECT ON TABLE gold.sales_summary TO `data_analyst_group`

🧰 7. Orchestration on Azure

Tool                 | Purpose
Azure Data Factory   | Schedule and orchestrate notebooks
Azure Event Hub      | Real-time data triggers
Databricks Workflows | Native orchestration inside Databricks
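
For the native option, one simple pattern is a driver notebook that chains the layer notebooks with dbutils.notebook.run; a minimal sketch (the notebook paths and the 600-second timeout are assumptions):

# Driver notebook: run each layer in order; each call blocks until the child notebook finishes
dbutils.notebook.run("/Repos/MedallionPipeline/bronze_ingestion_sales", 600)
dbutils.notebook.run("/Repos/MedallionPipeline/silver_transform_sales", 600)
dbutils.notebook.run("/Repos/MedallionPipeline/gold_aggregate_sales", 600)

In production, Databricks Workflows or ADF would typically define these as separate tasks with dependencies and retries rather than a driver notebook.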

📦 8. Sample Notebooks (Structure)

You can organize notebooks as follows:

/notebooks/
├── bronze_ingestion_sales.py
├── silver_transform_sales.py
├── gold_aggregate_sales.py
├── utils/
│   └── helpers.py  # Common code: paths, configs, logging
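
A hedged sketch of what utils/helpers.py could contain; the constants and logger setup are illustrative, not part of the pipeline above:

# utils/helpers.py - shared paths, configuration, and logging for the pipeline notebooks
import logging

STORAGE_ACCOUNT = "<your_storage>"
BASE_PATH = f"abfss://datalake@{STORAGE_ACCOUNT}.dfs.core.windows.net"

LAYER_PATHS = {
    "bronze": f"{BASE_PATH}/bronze/sales_raw",
    "silver": f"{BASE_PATH}/silver/sales_clean",
    "gold": f"{BASE_PATH}/gold/sales_summary",
}

def get_logger(name):
    """Return a simple console logger shared by all notebooks."""
    logger = logging.getLogger(name)
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s"))
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger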

🧠 9. Interview / Design Notes

Q: Why use Medallion Architecture?

  • Modularity (each layer has a purpose)
  • Auditing & Debugging (bronze holds raw)
  • Cost-effective: only process what's needed

Q: Why Delta Lake?

  • ACID transactions + versioning
  • Works seamlessly with Spark
  • Reliable for batch + streaming

Q: When to use streaming?

  • For near-real-time dashboards or IoT logs
  • readStream and writeStream in Spark (see the sketch below)
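
A minimal streaming sketch of the Bronze ingestion using Auto Loader; the schema and checkpoint locations are assumptions for illustration:

from pyspark.sql.functions import current_timestamp, input_file_name

raw_path = "abfss://raw@<your_storage>.dfs.core.windows.net/sales/"
bronze_path = "abfss://datalake@<your_storage>.dfs.core.windows.net/bronze/sales_raw"

# Incrementally pick up new CSV files as they land in the raw zone
stream_df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.schemaLocation", f"{bronze_path}/_schema")
    .option("header", True)
    .load(raw_path)
    .withColumn("ingestion_time", current_timestamp())
    .withColumn("source_file", input_file_name()))

# Append to the Bronze Delta table; availableNow processes the backlog and then stops
(stream_df.writeStream
    .format("delta")
    .option("checkpointLocation", f"{bronze_path}/_checkpoint")
    .outputMode("append")
    .trigger(availableNow=True)
    .start(bronze_path))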

📝 10. Optional Enhancements

Feature             | Tool / Method
Data Quality Checks | Deequ or Great Expectations
Alerting            | Azure Monitor / email from ADF
Versioned Metadata  | Delta + Unity Catalog
Logs                | Audit tables in Delta
Retry Logic         | Use try-except in notebook
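
A hedged sketch of the Data Quality Checks and Retry Logic rows using plain PySpark and try-except; the function names and checks are illustrative, and Deequ or Great Expectations would replace the manual assertions:

import time
from pyspark.sql.functions import col

def check_silver_quality(df):
    """Fail fast if basic expectations on the Silver data are violated."""
    assert df.filter(col("amount") <= 0).count() == 0, "Found non-positive amounts"
    assert df.filter(col("order_id").isNull()).count() == 0, "Found null order_id values"

def run_with_retries(step, retries=3, wait_seconds=60):
    """Run a pipeline step, retrying a few times before failing the notebook."""
    for attempt in range(1, retries + 1):
        try:
            return step()
        except Exception as err:
            if attempt == retries:
                raise
            print(f"Attempt {attempt} failed ({err}); retrying in {wait_seconds}s")
            time.sleep(wait_seconds)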

🔚 Final Output for End Users

Layer  | Consumed By
Bronze | Data Engineers (debugging)
Silver | ML Models / Data Scientists
Gold   | BI Tools (Power BI, Tableau)

Gold layer data can be loaded directly into Power BI via Azure Synapse or a Databricks SQL endpoint.
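
One way to make the Gold output visible to those tools is to register the Delta location as a table that Databricks SQL (and Power BI through its connector) can query; a minimal sketch, assuming the gold schema name:

# Register the Gold Delta path as a table for BI consumption
gold_path = "abfss://datalake@<your_storage>.dfs.core.windows.net/gold/sales_summary"

spark.sql("CREATE SCHEMA IF NOT EXISTS gold")
spark.sql(f"""
    CREATE TABLE IF NOT EXISTS gold.sales_summary
    USING DELTA
    LOCATION '{gold_path}'
""")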


Below is a full set of inline Python Databricks notebooks using the Medallion Architecture (Bronze → Silver → Gold) on Azure, along with examples of Unity Catalog policies for secure, governed access.


📓 Databricks Notebooks – Inline Python Examples


✅ 1_bronze_ingestion_sales.py

# Databricks notebook: Bronze layer - ingest raw sales data

from pyspark.sql.functions import input_file_name, current_timestamp

# Define path
bronze_path = "abfss://datalake@<your_storage>.dfs.core.windows.net/bronze/sales_raw/"

# Read CSV from raw landing zone
df_raw = (spark.read
          .option("header", True)
          .csv("abfss://raw@<your_storage>.dfs.core.windows.net/sales/*.csv"))

# Add metadata
df_bronze = df_raw.withColumn("ingestion_time", current_timestamp()) \
                  .withColumn("source_file", input_file_name())

# Save to Delta
df_bronze.write.format("delta").mode("append").save(bronze_path)

print("Bronze ingestion completed.")

✅ 2_silver_transform_sales.py

# Databricks notebook: Silver layer - clean and enrich sales data

from pyspark.sql.functions import col

# Load from Bronze
bronze_path = "abfss://datalake@<your_storage>.dfs.core.windows.net/bronze/sales_raw/"
silver_path = "abfss://datalake@<your_storage>.dfs.core.windows.net/silver/sales_clean/"

df_bronze = spark.read.format("delta").load(bronze_path)

# Cleaning & transformation
df_silver = df_bronze.filter(col("amount") > 0) \
                     .dropDuplicates(["order_id"]) \
                     .withColumnRenamed("cust_id", "customer_id")

# Save to Silver layer
df_silver.write.format("delta").mode("overwrite").save(silver_path)

print("Silver transformation completed.")

✅ 3_gold_aggregate_sales.py

# Databricks notebook: Gold layer - create business aggregates

from pyspark.sql.functions import sum, countDistinct

silver_path = "abfss://datalake@<your_storage>.dfs.core.windows.net/silver/sales_clean/"
gold_path = "abfss://datalake@<your_storage>.dfs.core.windows.net/gold/sales_summary/"

df_silver = spark.read.format("delta").load(silver_path)

# Aggregation by region
df_gold = df_silver.groupBy("region").agg(
    sum("amount").alias("total_sales"),
    countDistinct("order_id").alias("order_count")
)

df_gold.write.format("delta").mode("overwrite").save(gold_path)

print("Gold layer aggregated output created.")

✅ 4_optimize_zorder.py

# Optional optimization step for Gold Delta table
gold_path = "abfss://datalake@<your_storage>.dfs.core.windows.net/gold/sales_summary/"

spark.sql(f"OPTIMIZE delta.`{gold_path}` ZORDER BY (region)")

🗂️ Unity Catalog Policy Examples

Assuming Unity Catalog is enabled, and the table is saved in a catalog → schema → table format (e.g., sales_catalog.sales_schema.sales_summary), here's how to secure it.


✅ 1. Create Managed Table in Unity Catalog

# Save Gold table in Unity Catalog
df_gold.write.format("delta").mode("overwrite").saveAsTable("sales_catalog.sales_schema.sales_summary")

✅ 2. Grant Access to Data Analysts

-- Allow a group to read the summary data
GRANT SELECT ON TABLE sales_catalog.sales_schema.sales_summary TO `data_analyst_group`;

✅ 3. Restrict PII Columns

-- Create a column mask function for the email column (if it exists)
CREATE OR REPLACE FUNCTION sales_catalog.sales_schema.mask_email(email STRING)
RETURNS STRING
RETURN CASE
    WHEN is_account_group_member('pii_readers') THEN email
    ELSE '*** MASKED ***'
  END;

-- Attach the mask to the column
ALTER TABLE sales_catalog.sales_schema.customer_info
  ALTER COLUMN email
  SET MASK sales_catalog.sales_schema.mask_email;

✅ 4. Tag Columns for Sensitive Data

-- Add tags to mark sensitive fields for governance
ALTER TABLE sales_catalog.sales_schema.customer_info
  ALTER COLUMN email
  SET TAGS ('sensitivity' = 'pii', 'classification' = 'confidential');

✅ 5. Lineage Tracking & Auditing (Automatic in Unity Catalog)

Databricks automatically tracks:

  • Source-to-target lineage
  • Who accessed what data
  • When and how (SQL, notebook, job, etc.)

You can view this in Data Explorer > Table > Lineage.


✅ Final Notes

  • Replace <your_storage> with your real storage account.
  • Register datasets as Unity Catalog tables to get full governance.
  • Store notebooks in workspace folders like: /Repos/MedallionPipeline/bronze_ingestion_sales

