Great! Here's how we'll structure both:
Hands-On Databricks Notebooks (Ready-to-Use)
Each notebook is short and focused on one concept, designed for execution in Azure Databricks.
Notebook 1: Spark RDD Basics
- Create and transform RDDs
- Word count from text RDD
- Filter, map, reduce examples (a filter snippet follows the word count below)
rdd = spark.sparkContext.textFile("/databricks-datasets/shakespeare/*")
words = rdd.flatMap(lambda line: line.split(" "))
result = words.map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)
result.take(5)
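To cover the filter bullet above, a quick filter over the word-count result; the threshold is arbitrary:
# Keep only words that appear more than 100 times (illustrative cutoff)
frequent = result.filter(lambda pair: pair[1] > 100)
frequent.take(5)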
Notebook 2: DataFrame Basics
- Load CSV/JSON data (a JSON read example follows the CSV snippet below)
- Filter, group, aggregation
- SQL queries over DataFrames
df = spark.read.option("header", "true").option("inferSchema", "true").csv("/databricks-datasets/retail-org/sales.csv")
df.filter(df.amount > 500).groupBy("region").sum("amount").show()
df.createOrReplaceTempView("sales")
spark.sql("SELECT region, SUM(amount) FROM sales GROUP BY region").show()
Notebook 3: Delta Lake & Lakehouse
- Create Delta table
- Upserts with MERGE INTO
- Time Travel with VERSION AS OF
df.write.format("delta").mode("overwrite").saveAsTable("bronze_sales")
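The MERGE below assumes an updates source exists; a minimal sketch that fakes one as a temp view (the ids and columns are made up for illustration):
# Hypothetical incoming records to upsert into silver_sales
updates_df = spark.createDataFrame(
    [(1, "2024-01-01", 250.0), (2, "2024-01-02", 980.0)],
    ["id", "order_date", "amount"],
)
updates_df.createOrReplaceTempView("updates")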
spark.sql("""
MERGE INTO silver_sales USING updates
ON silver_sales.id = updates.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
""")
spark.read.format("delta").option("versionAsOf", 0).table("silver_sales").show()
Notebook 4: Databricks Workspace Basics
- How to use Repos, Notebooks, and Clusters
- Link a GitHub repo
- Create a Job from a Notebook (see the REST API sketch below)
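A rough sketch of creating a job from a notebook programmatically, assuming the Databricks Jobs REST API 2.1; the workspace URL, secret scope, cluster ID, and notebook path are placeholders, and the Workflows UI achieves the same result without code:
import requests

# Placeholder workspace URL; token pulled from an assumed secret scope
host = "https://<your-workspace>.azuredatabricks.net"
token = dbutils.secrets.get(scope="demo", key="databricks-pat")

job_spec = {
    "name": "daily-sales-ingest",
    "tasks": [{
        "task_key": "ingest",
        "notebook_task": {"notebook_path": "/Repos/<user>/<repo>/notebooks/ingest"},
        "existing_cluster_id": "<cluster-id>",
    }],
}

resp = requests.post(f"{host}/api/2.1/jobs/create",
                     headers={"Authorization": f"Bearer {token}"},
                     json=job_spec)
print(resp.json())  # returns the new job_id on success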
Sample Interview Questions (Conceptual + Scenario)
Spark & Databricks Concepts
- What are the differences between RDD and DataFrame in Spark?
- Explain the Catalyst and Tungsten optimizers.
- What are the benefits of Delta Lake over traditional data lakes?
- What is a Lakehouse Architecture and why is it needed?
- How do you handle schema evolution in Delta Lake?
- What is the difference between Interactive and Job Clusters in Databricks?
- What is the role of createOrReplaceTempView() in Spark SQL?
Scenario-Based Questions
- You need to implement slowly changing dimensions (SCD) in a table. Which Delta Lake feature would you use?
- Your Spark job runs slowly. How do you optimize joins and partitioning?
- You receive semi-structured JSON data daily. How will you process and clean it using the Bronze → Silver → Gold model?
- How would you integrate Azure Key Vault to manage credentials in Azure Databricks?
- You want to create dashboards directly on Delta tables. How do you connect Power BI or Synapse to your Lakehouse?
Here's a complete tutorial for Spark RDDs on Azure Databricks, covering all key concepts, transformations, actions, and real-world use cases, fully aligned with data engineering interview preparation.
Spark RDD Tutorial on Azure Databricks
You'll learn everything about RDDs: creation, transformations, actions, and when to use them over DataFrames.
1. What is an RDD?
An RDD (Resilient Distributed Dataset) is the core low-level API in Spark for distributed processing (a short demo follows the list below). Key properties:
- Immutable
- Partitioned
- Lazy Evaluation
- Fault Tolerant (Lineage)
- In-memory computing
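A minimal sketch of the partitioning and lazy-evaluation properties (the numbers are arbitrary):
# Data is split across partitions; transformations don't run until an action
rdd = spark.sparkContext.parallelize(range(10), numSlices=4)
print(rdd.getNumPartitions())       # 4

doubled = rdd.map(lambda x: x * 2)  # lazy: nothing executes yet
print(doubled.collect())            # the action triggers the computation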
2. How to Create an RDD
From a Python collection:
rdd1 = spark.sparkContext.parallelize([1, 2, 3, 4, 5])
From external data (text file):
rdd2 = spark.sparkContext.textFile("/databricks-datasets/README.md")
3. Transformations (Lazy)
map(): Element-wise transformation
rdd = rdd1.map(lambda x: x * x)
filter(): Filter elements
even_rdd = rdd1.filter(lambda x: x % 2 == 0)
flatMap(): Break lines into words
rdd = spark.sparkContext.textFile("/databricks-datasets/README.md")
words = rdd.flatMap(lambda line: line.split(" "))
distinct(), union(), intersection()
rdd1 = spark.sparkContext.parallelize([1, 2, 3])
rdd2 = spark.sparkContext.parallelize([2, 3, 4])
rdd1.union(rdd2).distinct().collect()
4. Actions (Trigger computation)
collect(), take(n)
rdd1.collect()
rdd1.take(3)
reduce(), count(), first()
total = rdd1.reduce(lambda a, b: a + b)
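For completeness, the other two actions named above, applied to rdd1 from the previous cells:
print(rdd1.count())   # number of elements
print(rdd1.first())   # first element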
foreach()
rdd1.foreach(lambda x: print(x))  # note: on a cluster, output appears in the executor logs, not in the notebook
5. Pair RDD (Key-Value)
Create a Pair RDD:
pair_rdd = spark.sparkContext.parallelize([("a", 1), ("b", 1), ("a", 2)])
Group and Reduce:
pair_rdd.reduceByKey(lambda a, b: a + b).collect()
pair_rdd.groupByKey().mapValues(list).collect()
6. RDD Use Cases
1. Word Count (Classic Example)
lines = spark.sparkContext.textFile("/databricks-datasets/README.md")
words = lines.flatMap(lambda line: line.split())
word_pairs = words.map(lambda word: (word.lower(), 1))
word_counts = word_pairs.reduceByKey(lambda a, b: a + b)
word_counts.take(10)
2. Log File Analysis
log_rdd = spark.sparkContext.textFile("/databricks-datasets/logs/sample_logs")
errors = log_rdd.filter(lambda line: "ERROR" in line)
print("Total errors:", errors.count())
3. Clickstream Analytics
rdd = spark.sparkContext.parallelize([
("user1", "product1"),
("user2", "product2"),
("user1", "product2")
])
rdd.groupByKey().mapValues(list).collect()
4. Custom ETL (Non-tabular JSON)
import json
raw = spark.sparkContext.textFile("/databricks-datasets/iot/iot_stream.json")
parsed = raw.map(lambda x: json.loads(x))
temp_data = parsed.map(lambda x: (x["device_id"], x["temp"]))
high_temp = temp_data.filter(lambda x: x[1] > 70)
7. When to Use RDD Over DataFrames
| Use Case | Use RDD? |
|---|---|
| Custom transformations or logic | Yes |
| Full control over data partitions | Yes |
| Tabular data (structured) | No, use DataFrame |
| SQL queries or Catalyst optimization | No, use DataFrame |
| Handling unstructured/semi-structured data | Yes |
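In practice you often move between the two APIs; a minimal sketch (the column names are illustrative):
# RDD -> DataFrame, once the data turns out to be tabular
pair_rdd = spark.sparkContext.parallelize([("a", 1), ("b", 2)])
df = pair_rdd.toDF(["key", "value"])

# DataFrame -> RDD, when row-level custom logic is needed
rows = df.rdd.map(lambda row: (row.key, row.value * 10))
print(rows.collect())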
8. RDD Lineage and Fault Tolerance
Each RDD stores the lineage of transformations that produced it, so if a node fails, Spark can recompute the lost partitions.
rdd = spark.sparkContext.parallelize([1, 2, 3])
rdd2 = rdd.map(lambda x: x + 1)
print(rdd2.toDebugString().decode("utf-8"))  # prints the chain of transformations (the lineage)
9. RDD vs DataFrame vs Dataset
| Feature | RDD | DataFrame | Dataset (Scala/Java) |
|---|---|---|---|
| API Level | Low-level | High-level | Mid-level |
| Performance | Slower | Optimized | Optimized |
| Use Case | Complex logic | Structured data | Type-safety + logic |
| Catalyst Optimization | No | Yes | Yes |
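To make the Catalyst row concrete, here is the earlier word count expressed with the DataFrame API, which the optimizer can plan (same README path as before):
from pyspark.sql import functions as F

lines_df = spark.read.text("/databricks-datasets/README.md")
words_df = lines_df.select(
    F.explode(F.split(F.lower(F.col("value")), r"\s+")).alias("word")
)
words_df.where(F.col("word") != "").groupBy("word").count().orderBy(F.desc("count")).show(10)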
Final Tip: Cluster & Notebook Setup on Databricks
- Go to Azure Databricks → Create Workspace
- Launch a Cluster (choose a runtime with Apache Spark)
- Create a Notebook → Language: Python
- Run the RDD snippets above to practice end-to-end!