# Databricks Notebook: Azure Databricks DataFrame Tutorial

# COMMAND ----------
# 📘 Section 1: Creating DataFrames

# From CSV
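# Note: this sample path is illustrative; if it's not present in your workspace,
# point to any CSV with 'region' and 'amount' columns (they are used in later cells).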
sales_df = spark.read.option("header", True).csv("/databricks-datasets/retail-org/sales.csv")
sales_df.show(5)

# From Python data
from pyspark.sql import Row
data = [Row(name="Alice", age=25), Row(name="Bob", age=30)]
df = spark.createDataFrame(data)
df.show()

# COMMAND ----------
# 📘 Section 2: DataFrame Operations

# Filter and transform
df_filtered = sales_df.filter(sales_df.amount > 100).withColumnRenamed("amount", "sales_amount")
df_filtered.select("region", "sales_amount").show()

# Aggregation
df_agg = sales_df.groupBy("region").sum("amount")
df_agg.show()

# COMMAND ----------
# 📘 Section 3: SQL on DataFrame

sales_df.createOrReplaceTempView("sales")
spark.sql("SELECT region, SUM(amount) as total FROM sales GROUP BY region").show()

# COMMAND ----------
# 📘 Section 4: Join Example

# Dummy 2nd DF to demonstrate join
from pyspark.sql import Row
region_df = spark.createDataFrame([Row(region="West"), Row(region="East")])
sales_df.join(region_df, "region", "inner").show()

# COMMAND ----------
# 📘 Section 5: Auto Loader Example (Streaming)

auto_df = (spark.readStream.format("cloudFiles")
           .option("cloudFiles.format", "csv")
           .option("header", True)
           .schema(sales_df.schema)
           .load("/mnt/bronze/autoloader-input"))

# Aggregation over stream
auto_df.groupBy("region").count().writeStream.outputMode("complete").format("console").start()
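# Note: the console sink is for demo purposes only; a production stream should also set
# .option("checkpointLocation", <path>) and write to a durable sink such as a Delta table.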

# COMMAND ----------
# 📘 Section 6: Unity Catalog Table Access (Assuming UC Enabled)

# List catalogs
spark.sql("SHOW CATALOGS").show()

# List schemas from catalog
spark.sql("USE CATALOG main")
spark.sql("SHOW SCHEMAS").show()

# Access a table
spark.sql("SELECT * FROM main.default.sample_table LIMIT 10").show()

# COMMAND ----------
# 📘 Section 7: Notebook Workflow Demo (for Jobs)

# Define parameters using widgets
# In Job UI, these values can be overridden

dbutils.widgets.text("input_path", "/mnt/sample-data", "Input Path")
input_path = dbutils.widgets.get("input_path")
print(f"Reading from: {input_path}")

# Simulate downstream transformation
input_df = spark.read.option("header", True).csv(input_path)
input_df.groupBy("region").count().show()

# COMMAND ----------
# 📘 Section 8: Git Repo Clone Instructions

# Go to "Repos" in left sidebar
# Click "Add Repo" > paste your GitHub repo link
# Example: https://github.com/your-org/azure-databricks-tutorial

# Authenticate using personal access token (PAT)
# Repo appears in Workspace > Repos

# Notebooks, init scripts, configs can be version-controlled directly

# COMMAND ----------
# ✅ End of Tutorial

Here’s a crisp interview-style Q&A sheet covering both conceptual and scenario-based questions on Spark & Databricks, tailored for Data Engineering roles in Azure-based environments:


🎯 Conceptual Interview Questions with Answers


1. What are the differences between RDD and DataFrame in Spark?

| Feature      | RDD                   | DataFrame                           |
|--------------|-----------------------|-------------------------------------|
| Abstraction  | Low-level (row-level) | High-level (table-like)             |
| Type Safety  | Type-safe             | Not type-safe                       |
| Optimization | No built-in optimizer | Catalyst Optimizer used             |
| Ease of Use  | Verbose API           | SQL-like, concise                   |
| Performance  | Slower                | Faster (Catalyst + Tungsten engine) |
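
To make the contrast concrete, here is a minimal, illustrative comparison on the same data, using the notebook's existing SparkSession:

```python
# Low-level RDD aggregation: hand-written logic, no optimizer involved
rdd = spark.sparkContext.parallelize([("West", 120), ("East", 80)])
print(rdd.reduceByKey(lambda a, b: a + b).collect())

# Same result as a DataFrame: Catalyst plans and optimizes the aggregation
df = spark.createDataFrame(rdd, ["region", "amount"])
df.groupBy("region").sum("amount").show()
```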

2. Explain Catalyst and Tungsten optimizers.

  • Catalyst Optimizer: Optimizes logical and physical query plans using rules like predicate pushdown, constant folding, etc.
  • Tungsten Engine: Low-level optimization for CPU/memory using whole-stage code generation and off-heap memory. A quick way to inspect the resulting plans is shown below.
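
One quick way to see this in a notebook is to print a DataFrame's plans (illustrative example; any DataFrame will do):

```python
df = spark.range(1000).filter("id > 10").selectExpr("id * 2 AS doubled")
df.explain(True)  # prints parsed, analyzed, and optimized logical plans plus the physical plan
```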

3. What are the benefits of Delta Lake over traditional data lakes?

  • ACID transactions
  • Time travel (versioning); see the sketch after this list
  • Schema enforcement & evolution
  • Efficient file compaction (OPTIMIZE)
  • Scalable merge/upsert support (MERGE INTO)
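
A minimal sketch of two of these features (time travel and OPTIMIZE), using an illustrative DBFS path /tmp/delta_demo:

```python
# Two writes create versions 0 and 1 of the Delta table
df1 = spark.createDataFrame([("West", 100)], ["region", "amount"])
df1.write.format("delta").mode("overwrite").save("/tmp/delta_demo")   # version 0

df2 = spark.createDataFrame([("East", 200)], ["region", "amount"])
df2.write.format("delta").mode("append").save("/tmp/delta_demo")      # version 1

# Time travel: read the table as it looked at version 0
spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta_demo").show()

# Compact small files
spark.sql("OPTIMIZE delta.`/tmp/delta_demo`")
```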

4. What is Lakehouse Architecture and why is it needed?

A Lakehouse combines the reliability of data warehouses (schema, transactions) with the scalability of data lakes.

✅ Needed because:

  • Traditional lakes lack ACID + governance.
  • Warehouses are costly & rigid.
  • Lakehouse (via Delta Lake) brings unified data engineering, BI, and ML.

5. How do you handle schema evolution in Delta Lake?

  • Set mergeSchema=true during the write: df.write.option("mergeSchema", "true").format("delta").save("/path") (a fuller sketch follows this list)
  • Use ALTER TABLE to manually add columns.
  • Delta automatically tracks schema versions for time travel.
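
A fuller sketch of the mergeSchema write, assuming an illustrative Delta location /mnt/silver/events and a hypothetical DataFrame with a new channel column:

```python
# Hypothetical batch whose schema adds a 'channel' column not present in the existing table
new_df = spark.createDataFrame([("e1", "West", "mobile")], ["id", "region", "channel"])

(new_df.write
    .format("delta")
    .option("mergeSchema", "true")   # let Delta add the new column to the table schema
    .mode("append")
    .save("/mnt/silver/events"))     # illustrative path; replace with your table location
```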

6. What is the difference between Interactive and Job Clusters in Databricks?

| Cluster Type | Use Case                  | Behavior                         |
|--------------|---------------------------|----------------------------------|
| Interactive  | Development & notebooks   | Auto-terminates after inactivity |
| Job Cluster  | Production/automated jobs | Created & destroyed per job run  |

7. What is the role of createOrReplaceTempView() in Spark SQL?

  • Registers a DataFrame as a temporary SQL view in the session.
  • Enables running SQL queries over the DataFrame from the same session, as shown in the sketch below.
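
For example, with a small hypothetical orders DataFrame:

```python
df = spark.createDataFrame([("o1", 150), ("o2", 80)], ["order_id", "amount"])
df.createOrReplaceTempView("orders")
spark.sql("SELECT * FROM orders WHERE amount > 100").show()
```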

๐Ÿ” Scenario-Based Questions with Answers


1. 🧪 You need to implement slowly changing dimensions (SCD) in a table. Which Delta Lake feature would you use?

  • Use MERGE INTO for SCD Type 1/2; a sketch of the statement follows.
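
A minimal sketch (SCD Type 1 shown), assuming hypothetical dim_customer and staging_customer tables with id, name, and city columns:

```python
spark.sql("""
    MERGE INTO dim_customer AS d
    USING staging_customer AS s
    ON d.id = s.id
    WHEN MATCHED THEN UPDATE SET name = s.name, city = s.city
    WHEN NOT MATCHED THEN INSERT (id, name, city) VALUES (s.id, s.name, s.city)
""")
```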

2. ๐Ÿ” Your Spark job runs slowly. How do you optimize joins and partitioning?

✅ Optimization checklist:

  • Use Broadcast joins for small tables.
  • Repartition based on join key: df.repartition("join_key")
  • Cache reused DataFrames.
  • Use EXPLAIN / df.explain() to inspect the query plan, and the Spark UI to spot skewed partitions (see the sketch after this list).
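
A minimal sketch of a broadcast join plus a plan check, with hypothetical orders_df and dim_region_df DataFrames:

```python
from pyspark.sql.functions import broadcast

orders_df = spark.createDataFrame([("o1", "West", 150), ("o2", "East", 80)],
                                  ["order_id", "region", "amount"])
dim_region_df = spark.createDataFrame([("West", "US-W"), ("East", "US-E")],
                                      ["region", "zone"])

# Broadcast the small dimension table so the join avoids shuffling orders_df
joined = orders_df.join(broadcast(dim_region_df), "region")
joined.explain()   # the physical plan should show a BroadcastHashJoin
joined.show()
```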

3. 🧼 You receive semi-structured JSON data daily. How will you process and clean it using the Bronze → Silver → Gold model?

  • Bronze: Raw ingestion using Auto Loader
  • Silver: Parsed and cleaned JSON → normalized schema
  • Gold: Aggregated, reporting-ready tables (see the sketch after this list)
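
A minimal Bronze → Silver → Gold sketch, assuming illustrative paths and a JSON layout with nested user and events fields:

```python
from pyspark.sql.functions import col, count

# Bronze: raw JSON as landed (batch read shown; Auto Loader works the same way for streams)
raw_df = spark.read.json("/mnt/raw/json/")

# Silver: flatten the nested structure into a clean, typed schema
clean_df = raw_df.select(
    col("id"),
    col("user.name").alias("user_name"),
    col("events")[0]["timestamp"].alias("first_event_ts"),  # first element of the events array
)

# Gold: aggregate for reporting
agg_df = clean_df.groupBy("user_name").agg(count("id").alias("event_count"))
```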

4. 🔒 How would you integrate Azure Key Vault to manage credentials in Azure Databricks?

  1. Create a Databricks secret scope backed by Key Vault (Databricks CLI): databricks secrets create-scope --scope kv-scope --scope-backend-type AZURE_KEYVAULT ...
  2. Access secrets in notebooks via dbutils.secrets.get(), as shown in the sketch below.
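
In a notebook, the secret is then read and used like this (scope and key names come from the step above; the JDBC connection details are placeholders, assuming the SQL Server JDBC driver bundled with Databricks runtimes):

```python
# The secret value is redacted in notebook output; pass it straight to connection options.
password = dbutils.secrets.get(scope="kv-scope", key="db-password")

jdbc_df = (spark.read.format("jdbc")
           .option("url", "jdbc:sqlserver://<server>.database.windows.net:1433;database=<db>")  # placeholder
           .option("dbtable", "dbo.sales")    # hypothetical table
           .option("user", "etl_user")        # hypothetical user
           .option("password", password)
           .load())
```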

5. 📊 You want to create dashboards directly on Delta tables. How do you connect Power BI or Synapse to your Lakehouse?

✅ Power BI:

  • Use Azure Databricks connector (DirectQuery/Import)
  • Authenticate via Azure AD or PAT

✅ Synapse:

  • Use Spark or Serverless SQL pool
  • Mount Lakehouse storage (ADLS Gen2) or query Delta Lake via Spark
