Azure Databricks tutorial roadmap (Beginner → Advanced), tailored for Data Engineering interviews in India

# Databricks Notebook: Azure Databricks DataFrame Tutorial

# COMMAND ----------
# 📘 Section 1: Creating DataFrames

# From CSV
sales_df = spark.read.option("header", True).csv("/databricks-datasets/retail-org/sales.csv")
sales_df.show(5)

# From Python data
from pyspark.sql import Row
data = [Row(name="Alice", age=25), Row(name="Bob", age=30)]
df = spark.createDataFrame(data)
df.show()

# COMMAND ----------
# 📘 Section 2: DataFrame Operations

# Filter and transform
df_filtered = sales_df.filter(sales_df.amount > 100).withColumnRenamed("amount", "sales_amount")
df_filtered.select("region", "sales_amount").show()

# Aggregation
df_agg = sales_df.groupBy("region").sum("amount")
df_agg.show()

# COMMAND ----------
# 📘 Section 3: SQL on DataFrame

sales_df.createOrReplaceTempView("sales")
spark.sql("SELECT region, SUM(amount) as total FROM sales GROUP BY region").show()

# COMMAND ----------
# 📘 Section 4: Join Example

# Dummy 2nd DF to demonstrate join
from pyspark.sql import Row
region_df = spark.createDataFrame([Row(region="West"), Row(region="East")])
sales_df.join(region_df, "region", "inner").show()

# COMMAND ----------
# 📘 Section 5: Auto Loader Example (Streaming)

auto_df = (spark.readStream.format("cloudFiles")
           .option("cloudFiles.format", "csv")
           .option("header", True)
           .schema(sales_df.schema)
           .load("/mnt/bronze/autoloader-input"))

# Aggregation over stream
auto_df.groupBy("region").count().writeStream.outputMode("complete").format("console").start()
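# The console sink above is demo-only. In production the stream would usually land in a Delta
# table with a checkpoint so it can recover after restarts; a minimal sketch (the output and
# checkpoint paths below are illustrative):
(auto_df.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/mnt/bronze/_checkpoints/autoloader-demo")
    .start("/mnt/bronze/autoloader-output"))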

# COMMAND ----------
# 📘 Section 6: Unity Catalog Table Access (Assuming UC Enabled)

# List catalogs
spark.sql("SHOW CATALOGS").show()

# List schemas from catalog
spark.sql("USE CATALOG main")
spark.sql("SHOW SCHEMAS").show()

# Access a table
spark.sql("SELECT * FROM main.default.sample_table LIMIT 10").show()

# COMMAND ----------
# 📘 Section 7: Notebook Workflow Demo (for Jobs)

# Define parameters using widgets
# In Job UI, these values can be overridden

dbutils.widgets.text("input_path", "/mnt/sample-data", "Input Path")
input_path = dbutils.widgets.get("input_path")
print(f"Reading from: {input_path}")

# Simulate downstream transformation
input_df = spark.read.option("header", True).csv(input_path)
input_df.groupBy("region").count().show()

# COMMAND ----------
# 📘 Section 8: Git Repo Clone Instructions

# Go to "Repos" in left sidebar
# Click "Add Repo" > paste your GitHub repo link
# Example: https://github.com/your-org/azure-databricks-tutorial

# Authenticate using personal access token (PAT)
# Repo appears in Workspace > Repos

# Notebooks, init scripts, configs can be version-controlled directly

# COMMAND ----------
# ✅ End of Tutorial

Here’s a crisp interview-style Q&A sheet covering both conceptual and scenario-based questions on Spark & Databricks, tailored for Data Engineering roles in Azure-based environments:


🎯 Conceptual Interview Questions with Answers


1. What are the differences between RDD and DataFrame in Spark?

| Feature      | RDD                   | DataFrame                            |
|--------------|-----------------------|--------------------------------------|
| Abstraction  | Low-level (row-level) | High-level (table-like)              |
| Type safety  | Type-safe             | Not type-safe                        |
| Optimization | No built-in optimizer | Catalyst Optimizer used              |
| Ease of use  | Verbose API           | SQL-like, concise                    |
| Performance  | Slower                | Faster (Catalyst + Tungsten engine)  |
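In an interview it helps to back the table up with a tiny side-by-side example; a minimal sketch that computes the same totals through both APIs (the data is made up):

rdd = spark.sparkContext.parallelize([("West", 100), ("East", 200), ("West", 50)])
print(rdd.reduceByKey(lambda a, b: a + b).collect())   # RDD API: you spell out the computation

df_cmp = spark.createDataFrame(rdd, ["region", "amount"])
df_cmp.groupBy("region").sum("amount").show()          # DataFrame API: declarative, Catalyst plans it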

2. Explain Catalyst and Tungsten optimizers.

  • Catalyst Optimizer: Optimizes logical and physical query plans using rules like predicate pushdown, constant folding, etc.
  • Tungsten Engine: Low-level optimization for CPU/memory using whole-stage code generation and off-heap memory.
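Both are easy to point at in a query plan; a minimal sketch, reusing sales_df from Section 1:

# explain(True) prints the parsed, analyzed, optimized (Catalyst) and physical plans;
# WholeStageCodegen nodes in the physical plan come from Tungsten's code generation.
sales_df.filter("amount > 100").groupBy("region").count().explain(True)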

3. What are the benefits of Delta Lake over traditional data lakes?

  • ACID transactions
  • Time travel (versioning)
  • Schema enforcement & evolution
  • Efficient file compaction (OPTIMIZE)
  • Scalable merge/upsert support (MERGE INTO)
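Most of these can be demonstrated in a few lines; a minimal sketch, assuming the small df from Section 1 and an illustrative path:

df.write.format("delta").mode("overwrite").save("/mnt/demo/delta/people")

spark.read.format("delta").option("versionAsOf", 0).load("/mnt/demo/delta/people").show()  # time travel
spark.sql("OPTIMIZE delta.`/mnt/demo/delta/people`")                                       # file compaction
spark.sql("DESCRIBE HISTORY delta.`/mnt/demo/delta/people`").show()                        # version log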

4. What is Lakehouse Architecture and why is it needed?

A Lakehouse combines the reliability of data warehouses (schema, transactions) with the scalability of data lakes.

✅ Needed because:

  • Traditional lakes lack ACID + governance.
  • Warehouses are costly & rigid.
  • Lakehouse (via Delta Lake) brings unified data engineering, BI, and ML.

5. How do you handle schema evolution in Delta Lake?

  • Use the mergeSchema option during the write:
    df.write.option("mergeSchema", "true").format("delta").mode("append").save("/path")
  • Use ALTER TABLE ... ADD COLUMNS to add columns manually.
  • Delta records every schema change as a new table version in the transaction log, so older versions stay queryable via time travel.
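A minimal sketch of both options, assuming an existing Delta table at an illustrative path and an incoming batch new_df that carries an extra column:

# Option 1: let the write merge the wider schema into the table
new_df.write.option("mergeSchema", "true").format("delta").mode("append").save("/mnt/demo/delta/people")

# Option 2: evolve the schema explicitly
spark.sql("ALTER TABLE delta.`/mnt/demo/delta/people` ADD COLUMNS (discount DOUBLE)")

# Older versions keep their original schema and remain readable via time travel
spark.read.format("delta").option("versionAsOf", 0).load("/mnt/demo/delta/people").printSchema()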

6. What is the difference between Interactive and Job Clusters in Databricks?

| Cluster Type | Use Case                  | Behavior                          |
|--------------|---------------------------|-----------------------------------|
| Interactive  | Development & notebooks   | Auto-terminates after inactivity  |
| Job Cluster  | Production/automated jobs | Created & destroyed per job run   |

7. What is the role of createOrReplaceTempView() in Spark SQL?

  • Registers a DataFrame as a temporary SQL view in the current Spark session.
  • Enables running SQL queries against it, for example:
    df.createOrReplaceTempView("orders")
    spark.sql("SELECT * FROM orders WHERE amount > 100")

๐Ÿ” Scenario-Based Questions with Answers


1. 🧪 You need to implement slowly changing dimensions (SCD) in a table. Which Delta Lake feature would you use?

  • Use MERGE INTO for SCD Type 1/2 (a PySpark version is sketched below):
    MERGE INTO dim_customer
    USING staging_customer
    ON dim_customer.id = staging_customer.id
    WHEN MATCHED THEN UPDATE SET ...
    WHEN NOT MATCHED THEN INSERT ...
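The same merge can be expressed from PySpark with the DeltaTable API; a minimal SCD Type 1 sketch (table and column names are illustrative):

from delta.tables import DeltaTable

dim = DeltaTable.forName(spark, "dim_customer")
staging = spark.table("staging_customer")

(dim.alias("d")
   .merge(staging.alias("s"), "d.id = s.id")
   .whenMatchedUpdate(set={"name": "s.name", "city": "s.city"})   # Type 1: overwrite attributes in place
   .whenNotMatchedInsertAll()
   .execute())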

2. ๐Ÿ” Your Spark job runs slowly. How do you optimize joins and partitioning?

✅ Optimization checklist:

  • Use broadcast joins for small tables (sketched below).
  • Repartition on the join key: df.repartition("join_key")
  • Cache DataFrames that are reused across multiple actions.
  • Use EXPLAIN / df.explain() to inspect the physical plan, and the Spark UI to spot skewed partitions.
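The first two items look like this in code; a minimal sketch reusing sales_df and region_df from the notebook above:

from pyspark.sql.functions import broadcast

# Broadcast the small dimension table so the join avoids shuffling it
joined = sales_df.join(broadcast(region_df), "region")
joined.explain()                       # BroadcastHashJoin should appear in the physical plan

# Repartition the large side by the join key ahead of shuffle-heavy operations
sales_repart = sales_df.repartition("region")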

3. 🧼 You receive semi-structured JSON data daily. How will you process and clean it using the Bronze → Silver → Gold model?

  • Bronze: raw ingestion using Auto Loader
  • Silver: parsed and cleaned JSON → normalized schema
  • Gold: aggregated, reporting-ready format

    # Bronze
    raw_df = spark.read.json("/mnt/raw/json/")
    # Silver
    clean_df = raw_df.selectExpr("id", "user.name AS user_name", "events[0].timestamp AS first_event_ts")
    # Gold
    agg_df = clean_df.groupBy("id").agg(...)
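In practice each layer is also persisted as its own Delta table so the next layer and BI tools can read it. A minimal continuation of the snippet above, with illustrative paths and the elided aggregation filled in with a simple count:

from pyspark.sql import functions as F

agg_df = clean_df.groupBy("id").agg(F.count("*").alias("event_count"))        # illustrative aggregation

raw_df.write.format("delta").mode("append").save("/mnt/bronze/events")        # Bronze
clean_df.write.format("delta").mode("overwrite").save("/mnt/silver/events")   # Silver
agg_df.write.format("delta").mode("overwrite").save("/mnt/gold/event_counts") # Gold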

4. 🔒 How would you integrate Azure Key Vault to manage credentials in Azure Databricks?

  1. Set up a Databricks secret scope backed by Key Vault:
     databricks secrets create-scope --scope kv-scope --scope-backend-type AZURE_KEYVAULT ...
  2. Access secrets in notebooks:
     password = dbutils.secrets.get(scope="kv-scope", key="db-password")
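Once the scope exists, the secret is consumed like any other value (it is redacted in notebook output); a minimal sketch of a JDBC read that uses it, with illustrative host, database, and table names:

password = dbutils.secrets.get(scope="kv-scope", key="db-password")

jdbc_df = (spark.read.format("jdbc")
           .option("url", "jdbc:sqlserver://myserver.database.windows.net:1433;database=salesdb")
           .option("dbtable", "dbo.orders")
           .option("user", "etl_user")
           .option("password", password)
           .load())
jdbc_df.show(5)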

5. 📊 You want to create dashboards directly on Delta tables. How do you connect Power BI or Synapse to your Lakehouse?

✅ Power BI:

  • Use Azure Databricks connector (DirectQuery/Import)
  • Authenticate via Azure AD or PAT

✅ Synapse:

  • Use Spark or Serverless SQL pool
  • Mount Lakehouse storage (ADLS Gen2) or query Delta Lake via Spark
