Here's a complete tutorial on Apache Spark DataFrames in Azure Databricks, covering everything from the basics to capabilities exclusive to Azure Databricks (not available in standard on-prem PySpark setups).
Azure Databricks DataFrame Tutorial (2025)
Part 1: What is a DataFrame in Spark?
A DataFrame is a distributed, table-like collection of data with a high-level API for structured and semi-structured processing.
Internals:
- Built on top of RDDs.
- Uses the Catalyst Optimizer (logical + physical plan optimization); see the explain() sketch after the benefits list below.
- Uses the Tungsten execution engine (bytecode generation for CPU/memory efficiency).
Benefits:
- SQL-like operations
- Faster than RDDs
- Schema-aware
- Optimized under the hood
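To see the Catalyst Optimizer at work, print a query's plans with explain(). A minimal sketch, assuming a small hand-built DataFrame (the column names are illustrative):
# Build a tiny DataFrame and inspect how Catalyst plans a filtered query.
data = [("East", 120.0), ("West", 80.0)]
df = spark.createDataFrame(data, ["region", "amount"])
# explain(True) prints the parsed, analyzed, and optimized logical plans,
# plus the physical plan Catalyst ultimately selects.
df.filter(df.amount > 100).select("region").explain(True)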
Part 2: Creating DataFrames in Azure Databricks
From CSV/Parquet/JSON:
df = spark.read.option("header", "true").csv("/databricks-datasets/retail-org/sales.csv")
From a Python list of tuples:
data = [("Alice", 29), ("Bob", 31)]
schema = ["name", "age"]
df = spark.createDataFrame(data, schema)
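If you want typed columns instead of relying on inference, pass an explicit schema. A minimal sketch using StructType (the fields mirror the example above):
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
# Explicit schema: no type inference, and nullability is under your control.
schema = StructType([
    StructField("name", StringType(), False),
    StructField("age", IntegerType(), True),
])
df = spark.createDataFrame([("Alice", 29), ("Bob", 31)], schema)
df.printSchema()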
From Azure Blob or ADLS Gen2:
df = spark.read.format("csv").option("header", "true") \
    .load("abfss://container@storageaccount.dfs.core.windows.net/data/sales.csv")
Part 3: Common Operations
Basic Actions
df.show()
df.printSchema()
df.columns
df.describe().show()
Filtering & Transformation
df.filter(df.amount > 100).select("region", "amount").show()
df.withColumn("amount_tax", df.amount * 1.18).show()
Grouping & Aggregation
df.groupBy("region").agg({"amount": "sum", "id": "count"}).show()
Part 4: SQL on DataFrames
Temp View
df.createOrReplaceTempView("sales")
spark.sql("SELECT region, SUM(amount) FROM sales GROUP BY region").show()
Part 5: Joins
df1.join(df2, df1.id == df2.id, "inner").show()
Join types: inner, left, right, outer, semi, anti
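A quick sketch of one of the less common types, left_anti, using two small hand-built DataFrames (names are illustrative):
customers = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])
orders = spark.createDataFrame([(1, 250.0)], ["id", "amount"])
# left_anti keeps rows from the left side with no match on the right (here: Bob).
customers.join(orders, "id", "left_anti").show()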
Part 6: Performance Optimization in Azure Databricks
Catalyst Optimizer
- Pushes down filters
- Eliminates unnecessary scans
- Rewrites queries for efficiency
Tungsten Engine
- Memory management
- Cache-aware computation
- Whole-stage code generation
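You can inspect the code that whole-stage code generation produces for a query. A minimal sketch (the mode argument to explain() requires Spark 3.0+, which current Databricks runtimes ship):
# Prints the generated Java code for the fused (whole-stage) pipeline.
df.groupBy("region").count().explain(mode="codegen")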
Part 7: Azure Databricks-Specific Capabilities (Not Available in On-Prem PySpark)
| Feature | Description |
|---|---|
| Unity Catalog | Fine-grained access control across workspaces, tables, and views |
| Auto Loader | Incrementally loads new files from cloud storage without manual tracking |
| Delta Live Tables (DLT) | Declarative ETL pipelines with error handling and automatic dependency management |
| Notebook Workflows | Orchestrate multi-notebook jobs with parameters, conditions, and retries |
| Cluster-Scoped Init Scripts | Automatically inject libraries and secrets into clusters |
| Git-backed Repos in Workspace | Source control with GitHub and Azure DevOps built into the UI |
| Optimized Databricks Runtime | Preconfigured Spark runtime with MLlib, Delta, TensorFlow, RAPIDS, etc. |
| Photon Engine (SQL Acceleration) | Vectorized execution engine that accelerates large SQL workloads |
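As a taste of one row in this table, here is an Auto Loader sketch; cloudFiles is the documented source format, while the paths are placeholders:
# Auto Loader: incrementally ingest new files as they land in cloud storage.
stream_df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/mnt/schemas/sales")  # placeholder path
    .load("abfss://container@storageaccount.dfs.core.windows.net/landing/"))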
Part 8: Working with Azure Services
Azure Blob Storage / ADLS Gen2
spark.conf.set("fs.azure.account.key.<storage>.blob.core.windows.net", "<access-key>")
df = spark.read.text("wasbs://<container>@<storage>.blob.core.windows.net/file.txt")
Azure Key Vault Integration
dbutils.secrets.get(scope="kv-scope", key="db-password")
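In practice, the storage access key from the previous snippet should come from a Key Vault-backed secret scope rather than being pasted into the notebook. A small sketch (the scope and key names are illustrative):
# Fetch the storage key from a secret scope instead of hardcoding it.
access_key = dbutils.secrets.get(scope="kv-scope", key="storage-access-key")
spark.conf.set("fs.azure.account.key.<storage>.blob.core.windows.net", access_key)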
Part 9: DataFrame Use Cases on Azure Databricks
1. Data Cleansing + Validation
cleaned_df = df.dropna().filter("amount > 0")
2. Streaming Data Ingestion + Analysis
# Assumes `schema` is a StructType describing the incoming CSV files.
streaming_df = spark.readStream.format("csv").schema(schema).load("/mnt/streaming/")
streaming_df.groupBy("region").count().writeStream.outputMode("complete").format("console").start()
3. Machine Learning Prep
from pyspark.ml.feature import StringIndexer, VectorAssembler
indexer = StringIndexer(inputCol="category", outputCol="category_index")
df = indexer.fit(df).transform(df)
# VectorAssembler packs feature columns into the single vector column MLlib expects.
assembler = VectorAssembler(inputCols=["category_index", "amount"], outputCol="features")
df = assembler.transform(df)
Part 10: Summary Table
| Concept | Example |
|---|---|
| Read CSV | spark.read.csv() |
| Filter Rows | df.filter(df.amount > 100) |
| Group & Aggregate | df.groupBy("region").sum("amount") |
| SQL View | createOrReplaceTempView("sales") |
| Join | df1.join(df2, "id") |
| Auto Loader | format("cloudFiles") (Databricks only) |
| Delta Live Tables | Declarative ETL (Databricks only) |
| Git Integration | Built-in Git Repos |