Here's a complete tutorial on Apache Spark DataFrames in Azure Databricks, covering everything from the basics to capabilities exclusive to Azure Databricks (not available in standard on-prem PySpark setups).
Azure Databricks DataFrame Tutorial (2025)
Part 1: What is a DataFrame in Spark?
A DataFrame is a distributed, table-like collection of data with a high-level API for structured and semi-structured processing.
Internals:
- Built on top of RDDs.
- Uses the Catalyst Optimizer (logical + physical plan optimization); see the explain() sketch after the benefits list below.
- Uses the Tungsten execution engine (bytecode generation for CPU/memory efficiency).
Benefits:
- SQL-like operations
- Faster than RDDs
- Schema-aware
- Optimized under the hood
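To see the Catalyst Optimizer at work, print a query's plans with explain(). A minimal sketch, assuming a small hand-built DataFrame (the column names are illustrative):
# Build a tiny DataFrame and inspect how Catalyst plans a filtered query.
data = [("East", 120.0), ("West", 80.0)]
df = spark.createDataFrame(data, ["region", "amount"])
# explain(True) prints the parsed, analyzed, and optimized logical plans,
# plus the physical plan Catalyst ultimately selects.
df.filter(df.amount > 100).select("region").explain(True)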
Part 2: Creating DataFrames in Azure Databricks
From CSV/Parquet/JSON:
df = spark.read.option("header", "true").csv("/databricks-datasets/retail-org/sales.csv")
From a Python list of tuples:
data = [("Alice", 29), ("Bob", 31)]
schema = ["name", "age"]
df = spark.createDataFrame(data, schema)
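If you want typed columns instead of relying on inference, pass an explicit schema. A minimal sketch using StructType (the fields mirror the example above):
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
# Explicit schema: no type inference, and nullability is under your control.
schema = StructType([
    StructField("name", StringType(), False),
    StructField("age", IntegerType(), True),
])
df = spark.createDataFrame([("Alice", 29), ("Bob", 31)], schema)
df.printSchema()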
From Azure Blob or ADLS Gen2:
df = spark.read.format("csv").option("header", "true") \
    .load("abfss://container@storageaccount.dfs.core.windows.net/data/sales.csv")
Part 3: Common Operations
Basic Actions
df.show()
df.printSchema()
df.columns
df.describe().show()
Filtering & Transformation
df.filter(df.amount > 100).select("region", "amount").show()
df.withColumn("amount_tax", df.amount * 1.18).show()
Grouping & Aggregation
df.groupBy("region").agg({"amount": "sum", "id": "count"}).show()
Part 4: SQL on DataFrames
Temp View
df.createOrReplaceTempView("sales")
spark.sql("SELECT region, SUM(amount) FROM sales GROUP BY region").show()
Part 5: Joins
df1.join(df2, df1.id == df2.id, "inner").show()
Join types: inner, left, right, outer, semi, anti
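A quick sketch of one of the less common types, left_anti, using two small hand-built DataFrames (names are illustrative):
customers = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])
orders = spark.createDataFrame([(1, 250.0)], ["id", "amount"])
# left_anti keeps rows from the left side with no match on the right (here: Bob).
customers.join(orders, "id", "left_anti").show()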
Part 6: Performance Optimization in Azure Databricks
Catalyst Optimizer
- Pushes down filters
- Eliminates unnecessary scans
- Rewrites queries for efficiency
Tungsten Engine
- Memory management
- Cache-aware computation
- Whole-stage code generation
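You can inspect the code that whole-stage code generation produces for a query. A minimal sketch (the mode argument to explain() requires Spark 3.0+, which current Databricks runtimes ship):
# Prints the generated Java code for the fused (whole-stage) pipeline.
df.groupBy("region").count().explain(mode="codegen")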
Part 7: Azure Databricks-Specific Capabilities (Not Available in On-Prem PySpark)
| Feature | Description |
|---|---|
| Unity Catalog | Fine-grained access control across workspaces, tables, and views |
| Auto Loader | Incrementally loads new files from cloud storage without manual tracking |
| Delta Live Tables (DLT) | Declarative ETL pipelines with error handling and automatic dependency management |
| Notebook Workflows | Orchestrate multi-notebook jobs with parameters, conditions, and retries |
| Cluster-Scoped Init Scripts | Automatically inject libraries and secrets into clusters |
| Git-backed Repos in Workspace | Source control with GitHub and Azure DevOps built into the UI |
| Optimized Databricks Runtime | Preconfigured Spark runtime with MLlib, Delta, TensorFlow, RAPIDS, etc. |
| Photon Engine (SQL Acceleration) | Vectorized execution engine that accelerates large SQL workloads |
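As a taste of one row in this table, here is an Auto Loader sketch; cloudFiles is the documented source format, while the paths are placeholders:
# Auto Loader: incrementally ingest new files as they land in cloud storage.
stream_df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/mnt/schemas/sales")  # placeholder path
    .load("abfss://container@storageaccount.dfs.core.windows.net/landing/"))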
Part 8: Working with Azure Services
Azure Blob Storage / ADLS Gen2
spark.conf.set("fs.azure.account.key.<storage>.blob.core.windows.net", "<access-key>")
df = spark.read.text("wasbs://<container>@<storage>.blob.core.windows.net/file.txt")
Azure Key Vault Integration
dbutils.secrets.get(scope="kv-scope", key="db-password")
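In practice, the storage access key from the previous snippet should come from a Key Vault-backed secret scope rather than being pasted into the notebook. A small sketch (the scope and key names are illustrative):
# Fetch the storage key from a secret scope instead of hardcoding it.
access_key = dbutils.secrets.get(scope="kv-scope", key="storage-access-key")
spark.conf.set("fs.azure.account.key.<storage>.blob.core.windows.net", access_key)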
Part 9: DataFrame Use Cases on Azure Databricks
1. Data Cleansing + Validation
cleaned_df = df.dropna().filter("amount > 0")
2. Streaming Data Ingestion + Analysis
# Assumes `schema` is a StructType describing the incoming CSV files.
streaming_df = spark.readStream.format("csv").schema(schema).load("/mnt/streaming/")
streaming_df.groupBy("region").count().writeStream.outputMode("complete").format("console").start()
3. Machine Learning Prep
from pyspark.ml.feature import StringIndexer, VectorAssembler
indexer = StringIndexer(inputCol="category", outputCol="category_index")
df = indexer.fit(df).transform(df)
# VectorAssembler packs feature columns into the single vector column MLlib expects.
assembler = VectorAssembler(inputCols=["category_index", "amount"], outputCol="features")
df = assembler.transform(df)
Part 10: Summary Table
| Concept | Example |
|---|---|
| Read CSV | spark.read.csv() |
| Filter Rows | df.filter(df.amount > 100) |
| Group & Aggregate | df.groupBy("region").sum("amount") |
| SQL View | createOrReplaceTempView("sales") |
| Join | df1.join(df2, "id") |
| Auto Loader | format("cloudFiles") (Databricks only) |
| Delta Live Tables | Declarative ETL (Databricks only) |
| Git Integration | Built-in Git Repos |