# Databricks Notebook: Azure Databricks DataFrame Tutorial
# COMMAND ----------
# Section 1: Creating DataFrames
# From CSV
sales_df = spark.read.option("header", True).option("inferSchema", True).csv("/databricks-datasets/retail-org/sales.csv")  # inferSchema so numeric columns (e.g. amount) are typed for the filters/aggregations below
sales_df.show(5)
# From Python data
from pyspark.sql import Row
data = [Row(name="Alice", age=25), Row(name="Bob", age=30)]
df = spark.createDataFrame(data)
df.show()
# COMMAND ----------
# Section 2: DataFrame Operations
# Filter and transform
df_filtered = sales_df.filter(sales_df.amount > 100).withColumnRenamed("amount", "sales_amount")
df_filtered.select("region", "sales_amount").show()
# Aggregation
df_agg = sales_df.groupBy("region").sum("amount")
df_agg.show()
# COMMAND ----------
# Section 3: SQL on DataFrame
sales_df.createOrReplaceTempView("sales")
spark.sql("SELECT region, SUM(amount) as total FROM sales GROUP BY region").show()
# COMMAND ----------
# Section 4: Join Example
# Dummy 2nd DF to demonstrate join
from pyspark.sql import Row
region_df = spark.createDataFrame([Row(region="West"), Row(region="East")])
sales_df.join(region_df, "region", "inner").show()
# COMMAND ----------
# Section 5: Auto Loader Example (Streaming)
auto_df = (spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("header", True)
    .schema(sales_df.schema)
    .load("/mnt/bronze/autoloader-input"))
# Aggregation over stream
auto_df.groupBy("region").count().writeStream.outputMode("complete").format("console").start()
# COMMAND ----------
# Section 6: Unity Catalog Table Access (Assuming UC Enabled)
# List catalogs
spark.sql("SHOW CATALOGS").show()
# List schemas from catalog
spark.sql("USE CATALOG main")
spark.sql("SHOW SCHEMAS").show()
# Access a table
spark.sql("SELECT * FROM main.default.sample_table LIMIT 10").show()
# COMMAND ----------
# Section 7: Notebook Workflow Demo (for Jobs)
# Define parameters using widgets
# In Job UI, these values can be overridden
dbutils.widgets.text("input_path", "/mnt/sample-data", "Input Path")
input_path = dbutils.widgets.get("input_path")
print(f"Reading from: {input_path}")
# Simulate downstream transformation
input_df = spark.read.option("header", True).csv(input_path)
input_df.groupBy("region").count().show()
# COMMAND ----------
# Section 8: Git Repo Clone Instructions
# Go to "Repos" in left sidebar
# Click "Add Repo" > paste your GitHub repo link
# Example: https://github.com/your-org/azure-databricks-tutorial
# Authenticate using personal access token (PAT)
# Repo appears in Workspace > Repos
# Notebooks, init scripts, configs can be version-controlled directly
# COMMAND ----------
# End of Tutorial
Here’s a concise, interview-style Q&A sheet covering both conceptual and scenario-based questions on Spark and Databricks, tailored for data engineering roles in Azure-based environments:
Conceptual Interview Questions with Answers
1. What are the differences between RDD and DataFrame in Spark?
| Feature | RDD | DataFrame |
|---|---|---|
| Abstraction | Low-level (row-level) | High-level (table-like) |
| Type Safety | Type-safe | Not type-safe |
| Optimization | No built-in optimizer | Catalyst Optimizer used |
| Ease of Use | Verbose API | SQL-like, concise |
| Performance | Slower | Faster (Catalyst + Tungsten engine) |
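For a quick contrast, here is a minimal sketch of the same aggregation written against both APIs, using a small in-memory dataset invented for illustration:

```python
# Hypothetical in-memory data: (region, amount) pairs
pairs = [("West", 120.0), ("East", 80.0), ("West", 50.0)]

# RDD API: low-level, manual key/value handling
rdd = spark.sparkContext.parallelize(pairs)
print(rdd.reduceByKey(lambda a, b: a + b).collect())

# DataFrame API: declarative; Catalyst plans and optimizes the aggregation
df = spark.createDataFrame(pairs, ["region", "amount"])
df.groupBy("region").sum("amount").show()
```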
2. Explain Catalyst and Tungsten optimizers.
- Catalyst Optimizer: Optimizes logical and physical query plans using rules like predicate pushdown, constant folding, etc.
- Tungsten Engine: Low-level optimization for CPU/memory using whole-stage code generation and off-heap memory.
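One way to see Catalyst at work is to print a query plan. The sketch below reuses the `sales` temp view registered in Section 3 of the notebook above; the exact plan output will vary with your Spark/Databricks Runtime version:

```python
# Catalyst folds the constant expression (50 + 50 -> 100) and pushes the filter
# toward the scan; explain(True) prints the parsed, analyzed, optimized, and physical plans.
spark.sql("SELECT region, amount FROM sales WHERE amount > 50 + 50").explain(True)
```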
3. What are the benefits of Delta Lake over traditional data lakes?
- ACID transactions
- Time travel (versioning)
- Schema enforcement & evolution
- Efficient file compaction (OPTIMIZE)
- Scalable merge/upsert support (MERGE INTO)
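A minimal sketch of a few of these features, assuming a scratch path such as /tmp/delta/sales_demo (purely illustrative):

```python
# ACID write: save a small DataFrame as a Delta table
df = spark.createDataFrame([("West", 100), ("East", 200)], ["region", "amount"])
df.write.format("delta").mode("overwrite").save("/tmp/delta/sales_demo")

# Time travel: read the table as of an earlier version
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta/sales_demo")

# Register the path as a table, then compact small files
spark.sql("CREATE TABLE IF NOT EXISTS sales_demo USING DELTA LOCATION '/tmp/delta/sales_demo'")
spark.sql("OPTIMIZE sales_demo")
```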
4. What is Lakehouse Architecture and why is it needed?
A Lakehouse combines the reliability of data warehouses (schema, transactions) with the scalability of data lakes.
Needed because:
- Traditional lakes lack ACID + governance.
- Warehouses are costly & rigid.
- Lakehouse (via Delta Lake) brings unified data engineering, BI, and ML.
5. How do you handle schema evolution in Delta Lake?
- Use `mergeSchema=true` during the write:
  `df.write.option("mergeSchema", "true").format("delta").mode("append").save("/path")`
- Use `ALTER TABLE ... ADD COLUMNS` to add columns manually.
- Delta records every schema change in the transaction log, which also enables time travel across schema versions.
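A hedged end-to-end sketch of the first two options, assuming a hypothetical customers Delta table at /tmp/delta/customers whose current schema is (id, name):

```python
# The incoming batch carries an extra "email" column not yet in the table.
new_df = spark.createDataFrame([(1, "Alice", "alice@example.com")],
                               ["id", "name", "email"])

# Option 1: evolve the schema automatically during the append
(new_df.write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .save("/tmp/delta/customers"))

# Option 2: evolve the schema explicitly on a registered table
spark.sql("ALTER TABLE customers ADD COLUMNS (email STRING)")
```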
6. What is the difference between Interactive and Job Clusters in Databricks?
| Cluster Type | Use Case | Behavior |
|---|---|---|
| Interactive | Development & notebooks | Auto-terminates after inactivity |
| Job Cluster | Production/automated jobs | Created & destroyed per job run |
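As a rough illustration of the job-cluster model, here is a hedged sketch of a Jobs API 2.1 create payload; the job name, notebook path, runtime version, and node type are all placeholders, and in practice the same spec is usually configured through the Jobs UI, CLI, or SDK:

```python
# Illustrative only: a job cluster is declared inside the job definition, created
# for each run, and terminated when the run finishes.
job_payload = {
    "name": "nightly-sales-aggregation",                                   # placeholder job name
    "tasks": [{
        "task_key": "aggregate",
        "notebook_task": {"notebook_path": "/Repos/your-org/etl/aggregate"},  # placeholder path
        "new_cluster": {
            "spark_version": "13.3.x-scala2.12",   # example Databricks Runtime
            "node_type_id": "Standard_DS3_v2",     # example Azure VM size
            "num_workers": 2,
        },
    }],
}
# POSTing job_payload to /api/2.1/jobs/create (with a bearer token) registers the job.
```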
7. What is the role of createOrReplaceTempView() in Spark SQL?
- Registers a DataFrame as a temporary SQL view in the current SparkSession.
- Enables running SQL queries against it, for example:
```python
df.createOrReplaceTempView("orders")
spark.sql("SELECT * FROM orders WHERE amount > 100").show()
```
Scenario-Based Questions with Answers
1. You need to implement slowly changing dimensions (SCD) in a table. Which Delta Lake feature would you use?
- Use MERGE INTO for SCD Type 1/2:
```sql
MERGE INTO dim_customer
USING staging_customer
  ON dim_customer.id = staging_customer.id
WHEN MATCHED THEN UPDATE SET ...
WHEN NOT MATCHED THEN INSERT ...
```
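A fuller, hedged sketch of an SCD Type 1 style merge run from PySpark; the table and column names are hypothetical:

```python
# SCD Type 1: overwrite current attribute values for matched keys, insert new keys.
spark.sql("""
    MERGE INTO dim_customer AS d
    USING staging_customer AS s
      ON d.id = s.id
    WHEN MATCHED THEN
      UPDATE SET d.name = s.name, d.city = s.city
    WHEN NOT MATCHED THEN
      INSERT (id, name, city) VALUES (s.id, s.name, s.city)
""")
```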
2. Your Spark job runs slowly. How do you optimize joins and partitioning?
Optimization checklist (see the sketch after this list):
- Use broadcast joins for small tables.
- Repartition on the join key: df.repartition("join_key")
- Cache DataFrames that are reused.
- Use EXPLAIN (or df.explain()) to inspect the query plan and spot skew; spark.sql("SET -v") lists the current Spark SQL configuration.
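A minimal sketch of the broadcast and repartition points, using small placeholder DataFrames in place of a real fact/dimension pair:

```python
from pyspark.sql.functions import broadcast

# Placeholder DataFrames standing in for a large fact table and a small dimension table.
large_fact_df = spark.createDataFrame([(1, 100.0), (2, 250.0), (1, 75.0)], ["join_key", "amount"])
small_dim_df = spark.createDataFrame([(1, "West"), (2, "East")], ["join_key", "region"])

# Broadcast the small dimension table so the large side is not shuffled.
joined = large_fact_df.join(broadcast(small_dim_df), "join_key")
joined.explain()

# Repartition on the join key before a shuffle join, then inspect the chosen join strategy.
large_fact_df.repartition("join_key").join(small_dim_df, "join_key").explain()
```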
3. You receive semi-structured JSON data daily. How will you process and clean it using the Bronze → Silver → Gold model?
- Bronze: raw ingestion using Auto Loader
- Silver: parsed + cleaned JSON → normalized schema
- Gold: aggregated/reporting-ready format
```python
from pyspark.sql import functions as F

# Bronze (batch read shown; see the Auto Loader streaming variant below)
raw_df = spark.read.json("/mnt/raw/json/")

# Silver: flatten nested and array fields into a normalized schema
clean_df = raw_df.select(
    "id",
    F.col("user.name").alias("user_name"),
    F.expr("events[0].timestamp").alias("first_event_ts"),
)

# Gold
agg_df = clean_df.groupBy("id").agg(...)
```
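Since the Bronze layer is usually ingested with Auto Loader rather than a batch read, here is a hedged streaming variant; the input, schema, checkpoint, and output paths are placeholders:

```python
# Streaming Bronze ingestion with Auto Loader (schema inferred and tracked at schemaLocation)
bronze_stream = (spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/mnt/bronze/_schemas/events")   # placeholder path
    .load("/mnt/raw/json/"))                                              # placeholder path

(bronze_stream.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/bronze/_checkpoints/events")      # placeholder path
    .start("/mnt/bronze/events"))                                         # placeholder path
```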
4. How would you integrate Azure Key Vault to manage credentials in Azure Databricks?
- Set up a Databricks secret scope backed by Azure Key Vault:
  `databricks secrets create-scope --scope kv-scope --scope-backend-type AZURE_KEYVAULT ...`
- Access secrets in notebooks:
  `password = dbutils.secrets.get(scope="kv-scope", key="db-password")`
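As a hedged follow-up, a retrieved secret can be used, for example, to configure ADLS Gen2 access; the scope, key, storage account, and container names below are placeholders:

```python
# Use a Key Vault-backed secret to authenticate to ADLS Gen2 with an account key
storage_key = dbutils.secrets.get(scope="kv-scope", key="adls-account-key")
spark.conf.set("fs.azure.account.key.mystorageacct.dfs.core.windows.net", storage_key)

# Read directly from the storage account using the abfss URI scheme
events_df = spark.read.json("abfss://raw@mystorageacct.dfs.core.windows.net/events/")
```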
5. You want to create dashboards directly on Delta tables. How do you connect Power BI or Synapse to your Lakehouse?
Power BI:
- Use the Azure Databricks connector (DirectQuery or Import)
- Authenticate via Azure AD or a personal access token (PAT)
Synapse:
- Use a Spark pool or a serverless SQL pool
- Mount the Lakehouse storage (ADLS Gen2) or query Delta Lake via Spark