# Databricks Notebook: Azure Databricks DataFrame Tutorial
# COMMAND ----------
# Section 1: Creating DataFrames
# From CSV
sales_df = spark.read.option("header", True).option("inferSchema", True).csv("/databricks-datasets/retail-org/sales.csv")  # inferSchema so numeric columns (e.g. amount) are typed for the filters/aggregations below
sales_df.show(5)
# From Python data
from pyspark.sql import Row
data = [Row(name="Alice", age=25), Row(name="Bob", age=30)]
df = spark.createDataFrame(data)
df.show()
# COMMAND ----------
# Section 2: DataFrame Operations
# Filter and transform
df_filtered = sales_df.filter(sales_df.amount > 100).withColumnRenamed("amount", "sales_amount")
df_filtered.select("region", "sales_amount").show()
# Aggregation
df_agg = sales_df.groupBy("region").sum("amount")
df_agg.show()
# COMMAND ----------
# Section 3: SQL on DataFrame
sales_df.createOrReplaceTempView("sales")
spark.sql("SELECT region, SUM(amount) as total FROM sales GROUP BY region").show()
# COMMAND ----------
# Section 4: Join Example
# Dummy 2nd DF to demonstrate join
from pyspark.sql import Row
region_df = spark.createDataFrame([Row(region="West"), Row(region="East")])
sales_df.join(region_df, "region", "inner").show()
# COMMAND ----------
# Section 5: Auto Loader Example (Streaming)
auto_df = (spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("header", True)
    .schema(sales_df.schema)
    .load("/mnt/bronze/autoloader-input"))
# Aggregation over stream
auto_df.groupBy("region").count().writeStream.outputMode("complete").format("console").start()
# COMMAND ----------
# Section 6: Unity Catalog Table Access (Assuming UC Enabled)
# List catalogs
spark.sql("SHOW CATALOGS").show()
# List schemas from catalog
spark.sql("USE CATALOG main")
spark.sql("SHOW SCHEMAS").show()
# Access a table
spark.sql("SELECT * FROM main.default.sample_table LIMIT 10").show()
# COMMAND ----------
# Section 7: Notebook Workflow Demo (for Jobs)
# Define parameters using widgets
# In Job UI, these values can be overridden
dbutils.widgets.text("input_path", "/mnt/sample-data", "Input Path")
input_path = dbutils.widgets.get("input_path")
print(f"Reading from: {input_path}")
# Simulate downstream transformation
input_df = spark.read.option("header", True).csv(input_path)
input_df.groupBy("region").count().show()
# COMMAND ----------
# Section 8: Git Repo Clone Instructions
# Go to "Repos" in left sidebar
# Click "Add Repo" > paste your GitHub repo link
# Example: https://github.com/your-org/azure-databricks-tutorial
# Authenticate using personal access token (PAT)
# Repo appears in Workspace > Repos
# Notebooks, init scripts, configs can be version-controlled directly
# COMMAND ----------
# End of Tutorial
Here’s a concise, interview-style Q&A sheet covering both conceptual and scenario-based questions on Spark and Databricks, tailored for data engineering roles in Azure-based environments:
Conceptual Interview Questions with Answers
1. What are the differences between RDD and DataFrame in Spark?
| Feature | RDD | DataFrame |
|---|---|---|
| Abstraction | Low-level (row-level) | High-level (table-like) |
| Type Safety | Type-safe | Not type-safe |
| Optimization | No built-in optimizer | Catalyst Optimizer used |
| Ease of Use | Verbose API | SQL-like, concise |
| Performance | Slower | Faster (Catalyst + Tungsten engine) |
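For a quick contrast, here is a minimal sketch of the same aggregation written against both APIs, using a small in-memory dataset invented for illustration:

```python
# Hypothetical in-memory data: (region, amount) pairs
pairs = [("West", 120.0), ("East", 80.0), ("West", 50.0)]

# RDD API: low-level, manual key/value handling
rdd = spark.sparkContext.parallelize(pairs)
print(rdd.reduceByKey(lambda a, b: a + b).collect())

# DataFrame API: declarative; Catalyst plans and optimizes the aggregation
df = spark.createDataFrame(pairs, ["region", "amount"])
df.groupBy("region").sum("amount").show()
```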
2. Explain Catalyst and Tungsten optimizers.
- Catalyst Optimizer: Optimizes logical and physical query plans using rules like predicate pushdown, constant folding, etc.
- Tungsten Engine: Low-level optimization for CPU/memory using whole-stage code generation and off-heap memory.
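One way to see Catalyst at work is to print a query plan. The sketch below reuses the `sales` temp view registered in Section 3 of the notebook above; the exact plan output will vary with your Spark/Databricks Runtime version:

```python
# Catalyst folds the constant expression (50 + 50 -> 100) and pushes the filter
# toward the scan; explain(True) prints the parsed, analyzed, optimized, and physical plans.
spark.sql("SELECT region, amount FROM sales WHERE amount > 50 + 50").explain(True)
```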
3. What are the benefits of Delta Lake over traditional data lakes?
- ACID transactions
- Time travel (versioning)
- Schema enforcement & evolution
- Efficient file compaction (OPTIMIZE)
- Scalable merge/upsert support (MERGE INTO)
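A minimal sketch of a few of these features, assuming a scratch path such as /tmp/delta/sales_demo (purely illustrative):

```python
# ACID write: save a small DataFrame as a Delta table
df = spark.createDataFrame([("West", 100), ("East", 200)], ["region", "amount"])
df.write.format("delta").mode("overwrite").save("/tmp/delta/sales_demo")

# Time travel: read the table as of an earlier version
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta/sales_demo")

# Register the path as a table, then compact small files
spark.sql("CREATE TABLE IF NOT EXISTS sales_demo USING DELTA LOCATION '/tmp/delta/sales_demo'")
spark.sql("OPTIMIZE sales_demo")
```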
4. What is Lakehouse Architecture and why is it needed?
A Lakehouse combines the reliability of data warehouses (schema, transactions) with the scalability of data lakes.
Needed because:
- Traditional lakes lack ACID + governance.
- Warehouses are costly & rigid.
- Lakehouse (via Delta Lake) brings unified data engineering, BI, and ML.
5. How do you handle schema evolution in Delta Lake?
- Use `mergeSchema=true` during the write:
  `df.write.option("mergeSchema", "true").format("delta").mode("append").save("/path")`
- Use `ALTER TABLE ... ADD COLUMNS` to add columns manually.
- Delta records every schema change in the transaction log, which also enables time travel across schema versions.
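A hedged end-to-end sketch of the first two options, assuming a hypothetical customers Delta table at /tmp/delta/customers whose current schema is (id, name):

```python
# The incoming batch carries an extra "email" column not yet in the table.
new_df = spark.createDataFrame([(1, "Alice", "alice@example.com")],
                               ["id", "name", "email"])

# Option 1: evolve the schema automatically during the append
(new_df.write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .save("/tmp/delta/customers"))

# Option 2: evolve the schema explicitly on a registered table
spark.sql("ALTER TABLE customers ADD COLUMNS (email STRING)")
```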
6. What is the difference between Interactive and Job Clusters in Databricks?
| Cluster Type | Use Case | Behavior |
|---|---|---|
| Interactive | Development & notebooks | Auto-terminates after inactivity |
| Job Cluster | Production/automated jobs | Created & destroyed per job run |
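As a rough illustration of the job-cluster model, here is a hedged sketch of a Jobs API 2.1 create payload; the job name, notebook path, runtime version, and node type are all placeholders, and in practice the same spec is usually configured through the Jobs UI, CLI, or SDK:

```python
# Illustrative only: a job cluster is declared inside the job definition, created
# for each run, and terminated when the run finishes.
job_payload = {
    "name": "nightly-sales-aggregation",                                   # placeholder job name
    "tasks": [{
        "task_key": "aggregate",
        "notebook_task": {"notebook_path": "/Repos/your-org/etl/aggregate"},  # placeholder path
        "new_cluster": {
            "spark_version": "13.3.x-scala2.12",   # example Databricks Runtime
            "node_type_id": "Standard_DS3_v2",     # example Azure VM size
            "num_workers": 2,
        },
    }],
}
# POSTing job_payload to /api/2.1/jobs/create (with a bearer token) registers the job.
```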
7. What is the role of createOrReplaceTempView() in Spark SQL?
- Registers a DataFrame as a temporary SQL view in the current SparkSession.
- Enables running SQL queries against it, for example:
```python
df.createOrReplaceTempView("orders")
spark.sql("SELECT * FROM orders WHERE amount > 100").show()
```
Scenario-Based Questions with Answers
1. You need to implement slowly changing dimensions (SCD) in a table. Which Delta Lake feature would you use?
- Use MERGE INTO for SCD Type 1/2:
```sql
MERGE INTO dim_customer
USING staging_customer
  ON dim_customer.id = staging_customer.id
WHEN MATCHED THEN UPDATE SET ...
WHEN NOT MATCHED THEN INSERT ...
```
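A fuller, hedged sketch of an SCD Type 1 style merge run from PySpark; the table and column names are hypothetical:

```python
# SCD Type 1: overwrite current attribute values for matched keys, insert new keys.
spark.sql("""
    MERGE INTO dim_customer AS d
    USING staging_customer AS s
      ON d.id = s.id
    WHEN MATCHED THEN
      UPDATE SET d.name = s.name, d.city = s.city
    WHEN NOT MATCHED THEN
      INSERT (id, name, city) VALUES (s.id, s.name, s.city)
""")
```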
2. Your Spark job runs slowly. How do you optimize joins and partitioning?
Optimization checklist (see the sketch after this list):
- Use broadcast joins for small tables.
- Repartition on the join key: df.repartition("join_key")
- Cache DataFrames that are reused.
- Use EXPLAIN (or df.explain()) to inspect the query plan and spot skew; spark.sql("SET -v") lists the current Spark SQL configuration.
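A minimal sketch of the broadcast and repartition points, using small placeholder DataFrames in place of a real fact/dimension pair:

```python
from pyspark.sql.functions import broadcast

# Placeholder DataFrames standing in for a large fact table and a small dimension table.
large_fact_df = spark.createDataFrame([(1, 100.0), (2, 250.0), (1, 75.0)], ["join_key", "amount"])
small_dim_df = spark.createDataFrame([(1, "West"), (2, "East")], ["join_key", "region"])

# Broadcast the small dimension table so the large side is not shuffled.
joined = large_fact_df.join(broadcast(small_dim_df), "join_key")
joined.explain()

# Repartition on the join key before a shuffle join, then inspect the chosen join strategy.
large_fact_df.repartition("join_key").join(small_dim_df, "join_key").explain()
```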
3. You receive semi-structured JSON data daily. How will you process and clean it using the Bronze → Silver → Gold model?
- Bronze: raw ingestion using Auto Loader
- Silver: parsed + cleaned JSON → normalized schema
- Gold: aggregated/reporting-ready format
```python
from pyspark.sql import functions as F

# Bronze (batch read shown; see the Auto Loader streaming variant below)
raw_df = spark.read.json("/mnt/raw/json/")

# Silver: flatten nested and array fields into a normalized schema
clean_df = raw_df.select(
    "id",
    F.col("user.name").alias("user_name"),
    F.expr("events[0].timestamp").alias("first_event_ts"),
)

# Gold
agg_df = clean_df.groupBy("id").agg(...)
```
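Since the Bronze layer is usually ingested with Auto Loader rather than a batch read, here is a hedged streaming variant; the input, schema, checkpoint, and output paths are placeholders:

```python
# Streaming Bronze ingestion with Auto Loader (schema inferred and tracked at schemaLocation)
bronze_stream = (spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/mnt/bronze/_schemas/events")   # placeholder path
    .load("/mnt/raw/json/"))                                              # placeholder path

(bronze_stream.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/bronze/_checkpoints/events")      # placeholder path
    .start("/mnt/bronze/events"))                                         # placeholder path
```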
4. How would you integrate Azure Key Vault to manage credentials in Azure Databricks?
- Set up a Databricks secret scope backed by Azure Key Vault:
  `databricks secrets create-scope --scope kv-scope --scope-backend-type AZURE_KEYVAULT ...`
- Access secrets in notebooks:
  `password = dbutils.secrets.get(scope="kv-scope", key="db-password")`
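As a hedged follow-up, a retrieved secret can be used, for example, to configure ADLS Gen2 access; the scope, key, storage account, and container names below are placeholders:

```python
# Use a Key Vault-backed secret to authenticate to ADLS Gen2 with an account key
storage_key = dbutils.secrets.get(scope="kv-scope", key="adls-account-key")
spark.conf.set("fs.azure.account.key.mystorageacct.dfs.core.windows.net", storage_key)

# Read directly from the storage account using the abfss URI scheme
events_df = spark.read.json("abfss://raw@mystorageacct.dfs.core.windows.net/events/")
```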
5. You want to create dashboards directly on Delta tables. How do you connect Power BI or Synapse to your Lakehouse?
Power BI:
- Use the Azure Databricks connector (DirectQuery or Import)
- Authenticate via Azure AD or a personal access token (PAT)
Synapse:
- Use a Spark pool or a serverless SQL pool
- Mount the Lakehouse storage (ADLS Gen2) or query Delta Lake via Spark