Absolutely! Here's a complete Spark RDD tutorial, structured to take you from the basics to advanced, interview-level understanding.
Spark RDD Tutorial: Beginner to Advanced
1. What is an RDD?
- RDD (Resilient Distributed Dataset) is the core abstraction in Apache Spark.
- It is an immutable, distributed collection of objects that can be processed in parallel.
- RDDs support fault tolerance, lazy evaluation, lineage information, and in-memory computations.
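All of the snippets below assume an active SparkContext named sc. A minimal sketch of one way to obtain it in a local PySpark session (the app name is just a placeholder):
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("rdd-tutorial").getOrCreate()
sc = spark.sparkContext  # the `sc` used in all RDD examples below

print(sc.parallelize([1, 2, 3]).collect())  # [1, 2, 3]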
2. How RDDs Are Created
From a collection (parallelize):
rdd1 = sc.parallelize([1, 2, 3, 4, 5])
From an external file (textFile):
rdd2 = sc.textFile("hdfs:///data/file.txt")
3. RDD Lineage and DAG
Every transformation creates a new RDD. Spark tracks how each RDD was derived from others using a lineage graph, helping with fault recovery.
Lazy Evaluation: RDD transformations (like map, filter) are not executed immediately. They are executed only when an action (like collect, count) is called.
Example of Lineage:
rdd = sc.textFile("data.txt")
words = rdd.flatMap(lambda x: x.split(" "))
filtered = words.filter(lambda x: x != "")
pairs = filtered.map(lambda x: (x, 1))
counts = pairs.reduceByKey(lambda a, b: a + b)
Here, Spark builds a lineage graph internally:
textFile → flatMap → filter → map → reduceByKey
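You can ask Spark to print the lineage it has recorded. A small sketch using toDebugString() on the counts RDD from the example above (in PySpark this may return bytes, hence the decode):
lineage = counts.toDebugString()
print(lineage.decode("utf-8") if isinstance(lineage, bytes) else lineage)
# Prints the chain of dependencies, with shuffle boundaries shown by indentation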
4. Types of RDDs: Normal vs Pair RDD
Normal RDD: elements are individual items (e.g., ["apple", "banana"])
Pair RDD: elements are key-value pairs (e.g., [("a", 1), ("b", 2)])
rdd = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
Useful for aggregations by key (reduceByKey, groupByKey, etc.)
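A short sketch (with made-up data) showing how a normal RDD becomes a pair RDD and is then aggregated by key:
fruits = sc.parallelize(["apple", "banana", "avocado", "blueberry"])

by_letter = fruits.keyBy(lambda w: w[0])            # ("a", "apple"), ("b", "banana"), ...
counts = by_letter.mapValues(lambda w: 1) \
                  .reduceByKey(lambda a, b: a + b)  # words per first letter
print(counts.collect())                             # e.g. [('a', 2), ('b', 2)]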
5. Common RDD Transformations
For Normal RDD (a sketch follows this list):
- map(func)
- filter(func)
- flatMap(func)
- distinct()
- union(rdd)
- intersection(rdd)
- sample(withReplacement, fraction)
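A quick sketch (illustrative data) exercising several of these normal-RDD transformations:
nums = sc.parallelize([1, 2, 2, 3, 4, 5])
lines = sc.parallelize(["a b", "c"])

evens   = nums.filter(lambda x: x % 2 == 0)   # 2, 2, 4
squares = nums.map(lambda x: x * x)           # 1, 4, 4, 9, 16, 25
uniques = nums.distinct()                     # 1, 2, 3, 4, 5
words   = lines.flatMap(lambda s: s.split())  # "a", "b", "c"
both    = evens.union(uniques)                # union keeps duplicates; apply distinct() to drop them
print(both.collect())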
For Pair RDD:
- reduceByKey(func)
- groupByKey()
- mapValues(func)
- flatMapValues(func)
- join(other)
- leftOuterJoin(other)
- cogroup(other)
Example:
rdd = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
result = rdd.reduceByKey(lambda x, y: x + y).collect()
# Output: [('a', 4), ('b', 2)]
6. Actions on RDD
- collect()
- count()
- first()
- take(n)
- saveAsTextFile(path)
- reduce(func)
- foreach(func)
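A small sketch (illustrative data) of the common actions; note that foreach runs on the executors, so its side effects are not visible on the driver:
rdd = sc.parallelize([5, 3, 8, 1])

print(rdd.count())                     # 4
print(rdd.first())                     # 5
print(rdd.take(2))                     # [5, 3]
print(rdd.reduce(lambda a, b: a + b))  # 17
# rdd.saveAsTextFile("/tmp/rdd-out")   # writes one part-file per partition (path is a placeholder)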
7. Converting Between DataFrame and RDD
DataFrame to RDD:
df = spark.read.csv("data.csv", header=True)
rdd = df.rdd
When and Why:
Use RDD when:
- You need fine-grained control over low-level transformations
- You need custom partitioning
- Working with legacy RDD-based APIs
- Complex unstructured data parsing
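A sketch of the custom-parsing case: dropping from a DataFrame to its underlying RDD to apply arbitrary per-row Python (the column name "name" is an assumption about data.csv):
df = spark.read.csv("data.csv", header=True)

# Each element of df.rdd is a Row object; apply any Python logic per record.
cleaned = df.rdd.map(lambda row: (row["name"].strip().lower(), row))
print(cleaned.take(5))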
RDD to DataFrame:
from pyspark.sql import Row
rdd = sc.parallelize([Row(name="Alice", age=30), Row(name="Bob", age=28)])
df = spark.createDataFrame(rdd)
8. RDD Use Cases (When to Use)
- Low-level transformations or custom operations
- Iterative algorithms (e.g., ML, Graph processing)
- Avoiding Spark Catalyst/Tungsten overhead when query optimization adds little value
- Higher-level APIs such as DataFrames and Structured Streaming still compile down to RDDs under the hood
9. Interview Questions (Conceptual)
- What is an RDD? How is it different from a DataFrame?
- What are narrow and wide transformations?
- How does lineage work in RDDs?
- Difference between reduceByKey and groupByKey?
- When would you use RDDs over DataFrames?
- What is the difference between map, flatMap, and mapPartitions?
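For the last question, a minimal sketch contrasting the three (illustrative data):
lines = sc.parallelize(["a b", "c d e"])

print(lines.map(lambda s: s.split()).collect())      # [['a', 'b'], ['c', 'd', 'e']]  - one output per input
print(lines.flatMap(lambda s: s.split()).collect())  # ['a', 'b', 'c', 'd', 'e']      - outputs flattened

# mapPartitions gets an iterator over a whole partition, so per-partition
# setup (e.g. opening a connection) happens once instead of once per element.
def word_counts(partition):
    yield sum(len(s.split()) for s in partition)

print(lines.mapPartitions(word_counts).collect())    # one count per partition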
10. Complex Coding Challenges
Q1. Word Count (classic)
rdd = sc.textFile("file.txt")
result = rdd.flatMap(lambda x: x.split()) \
.map(lambda x: (x, 1)) \
.reduceByKey(lambda a, b: a + b)
Q2. Join two datasets and compute average marks per student
students = sc.parallelize([("1", "John"), ("2", "Alice")])
marks = sc.parallelize([("1", 80), ("1", 90), ("2", 85)])
joined = students.join(marks) # ("1", ("John", 80))
avg_marks = joined.map(lambda x: (x[1][0], x[1][1])) \
.groupByKey() \
.mapValues(lambda x: sum(x) / len(x))
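The same averages can be computed without groupByKey by reducing (sum, count) pairs, which keeps the shuffle small (see the optimization tips below); a sketch:
avg_marks = joined.map(lambda x: (x[1][0], (x[1][1], 1))) \
                  .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1])) \
                  .mapValues(lambda t: t[0] / t[1])
print(avg_marks.collect())  # e.g. [('John', 85.0), ('Alice', 85.0)]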
Q3. Invert a Key-Value RDD and sort
rdd = sc.parallelize([("a", 5), ("b", 3), ("c", 10)])
result = rdd.map(lambda x: (x[1], x[0])).sortByKey()
11. Optimization Tips
- Use reduceByKey instead of groupByKey
- Use persist() or cache() only when an RDD is reused
- Avoid using collect() on large datasets
- Use partitioning to optimize joins and shuffles
- Prefer mapPartitions() over map() for batch efficiency
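A short sketch tying several of these tips together, reusing the word-count pipeline from above (file.txt is the same placeholder path):
words = sc.textFile("file.txt").flatMap(lambda x: x.split())

# reduceByKey combines values map-side before the shuffle;
# groupByKey would ship every single (word, 1) pair across the network.
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)

counts.cache()                                         # cached because it is reused twice below
top10 = counts.takeOrdered(10, key=lambda kv: -kv[1])  # small result, safe to bring to the driver
total = counts.count()
counts.unpersist()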
12. Visual Lineage & DAG (Conceptual View)
Input File
|
textFile("file.txt")
|
flatMap(lambda x: x.split())
|
filter(lambda x: x != "")
|
map(lambda x: (x, 1))
|
reduceByKey(lambda a, b: a + b)
|
Action (collect)