A complete Spark RDD tutorial with a structured flow, taking you from the basics to an advanced, interview-level understanding.


πŸ”₯ Spark RDD Tutorial: Beginner to Advanced


🧠 1. What is an RDD?

  • RDD (Resilient Distributed Dataset) is the core abstraction in Apache Spark.
  • It is an immutable, distributed collection of objects that can be processed in parallel.
  • RDDs support fault tolerance, lazy evaluation, lineage information, and in-memory computations.

πŸ› οΈ 2. How RDDs Are Created

From a collection (parallelize):

rdd1 = sc.parallelize([1, 2, 3, 4, 5])

From an external file (textFile):

rdd2 = sc.textFile("hdfs:///data/file.txt")
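Here, sc is the SparkContext. In modern PySpark it is usually obtained from a SparkSession; a minimal sketch, assuming a local run (the app name is arbitrary):

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("rdd-tutorial") \
    .master("local[*]") \
    .getOrCreate()
sc = spark.sparkContext   # SparkContext used in the examples below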

πŸ”„ 3. RDD Lineage and DAG

Every transformation creates a new RDD. Spark tracks how each RDD was derived from others using a lineage graph, helping with fault recovery.

πŸ“Œ Lazy Evaluation: RDD transformations (like map, filter) are not executed immediately. They’re executed when an action (like collect, count) is called.

πŸ”— Example of Lineage:

rdd = sc.textFile("data.txt")
words = rdd.flatMap(lambda x: x.split(" "))
filtered = words.filter(lambda x: x != "")
pairs = filtered.map(lambda x: (x, 1))
counts = pairs.reduceByKey(lambda a, b: a + b)

Here, Spark builds a lineage graph internally:

textFile β†’ flatMap β†’ filter β†’ map β†’ reduceByKey
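Up to this point no job has actually run; Spark has only recorded the lineage. Only when an action is called on counts does the whole chain execute (assuming data.txt exists at the given path):

# No computation has happened yet; Spark has only built the lineage graph.
top10 = counts.take(10)   # action: triggers reading the file and running the full chain
print(top10)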

🧩 4. Types of RDDs: Normal vs Pair RDD

βœ… Normal RDD:

Elements are individual items (e.g., ["apple", "banana"])

βœ… Pair RDD:

Elements are key-value pairs (e.g., [("a", 1), ("b", 2)])

rdd = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])

Useful for aggregations by key (reduceByKey, groupByKey, etc.)
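A pair RDD is usually built from a normal RDD with map; a small illustration with made-up data:

fruits = sc.parallelize(["apple", "banana", "apple"])
pairs = fruits.map(lambda f: (f, 1))                # normal RDD -> pair RDD
pairs.reduceByKey(lambda a, b: a + b).collect()     # e.g. [('apple', 2), ('banana', 1)] (order may vary)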


🧰 5. Common RDD Transformations

πŸ”Ή For Normal RDD:

  • map(func)
  • filter(func)
  • flatMap(func)
  • distinct()
  • union(rdd)
  • intersection(rdd)
  • sample(withReplacement, fraction)

πŸ”Ή For Pair RDD:

  • reduceByKey(func)
  • groupByKey()
  • mapValues(func)
  • flatMapValues(func)
  • join(other)
  • leftOuterJoin(other)
  • cogroup(other)

Example:

rdd = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
result = rdd.reduceByKey(lambda x, y: x + y).collect()
# Output: [('a', 4), ('b', 2)]
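Joins also operate on pair RDDs by matching keys; a sketch with made-up data:

emp = sc.parallelize([("d1", "Alice"), ("d2", "Bob"), ("d3", "Carol")])
dept = sc.parallelize([("d1", "HR"), ("d2", "Engineering")])

emp.join(dept).collect()           # [('d1', ('Alice', 'HR')), ('d2', ('Bob', 'Engineering'))]
emp.leftOuterJoin(dept).collect()  # unmatched keys keep None: ('d3', ('Carol', None))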

⚑ 6. Actions on RDD

  • collect()
  • count()
  • first()
  • take(n)
  • saveAsTextFile(path)
  • reduce(func)
  • foreach(func)

πŸ” 7. Converting Between DataFrame and RDD

βœ… DataFrame to RDD:

df = spark.read.csv("data.csv", header=True)
rdd = df.rdd
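df.rdd yields an RDD of Row objects whose fields can be accessed by name; a minimal sketch, assuming the CSV has a name column:

names = df.rdd.map(lambda row: row["name"])  # or row.name
names.take(5)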

πŸ“Œ When and Why:
Use RDD when:

  • You need fine-grained control over low-level transformations
  • You need custom partitioning
  • Working with legacy RDD-based APIs
  • Complex unstructured data parsing

βœ… RDD to DataFrame:

from pyspark.sql import Row

rdd = sc.parallelize([Row(name="Alice", age=30), Row(name="Bob", age=28)])
df = spark.createDataFrame(rdd)
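An often shorter route is toDF on an RDD of tuples, supplying the column names (available once a SparkSession is active):

rdd = sc.parallelize([("Alice", 30), ("Bob", 28)])
df = rdd.toDF(["name", "age"])
df.show()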

🎯 8. RDD Use Cases (When to Use)

  • Low-level transformations or custom operations
  • Iterative algorithms (e.g., ML, Graph processing)
  • Avoiding Spark Catalyst/ Tungsten overhead
  • Structured streaming under the hood still uses RDD

🧠 9. Interview Questions (Conceptual)

  1. What is an RDD? How is it different from a DataFrame?
  2. What are narrow and wide transformations?
  3. How does lineage work in RDDs?
  4. Difference between reduceByKey and groupByKey?
  5. When would you use RDDs over DataFrames?
  6. What is the difference between map, flatMap, and mapPartitions?
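For Q6, a small sketch showing the difference in output shape (data is illustrative; two partitions are forced to make the mapPartitions result predictable):

lines = sc.parallelize(["a b", "c d e"], 2)
lines.map(lambda s: s.split()).collect()      # [['a', 'b'], ['c', 'd', 'e']] (one output per input)
lines.flatMap(lambda s: s.split()).collect()  # ['a', 'b', 'c', 'd', 'e'] (outputs flattened)
lines.mapPartitions(lambda it: [sum(len(s) for s in it)]).collect()  # [3, 5] (one value per partition)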

πŸ” 10. Complex Coding Challenges

Q1. Word Count (classic)

rdd = sc.textFile("file.txt")
result = rdd.flatMap(lambda x: x.split()) \
            .map(lambda x: (x, 1)) \
            .reduceByKey(lambda a, b: a + b)

Q2. Join two datasets and compute average marks per student

students = sc.parallelize([("1", "John"), ("2", "Alice")])
marks = sc.parallelize([("1", 80), ("1", 90), ("2", 85)])

joined = students.join(marks)  # ("1", ("John", 80))
avg_marks = joined.map(lambda x: (x[1][0], x[1][1])) \
                  .groupByKey() \
                  .mapValues(lambda x: sum(x) / len(x))
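In line with the optimization tips in Section 11, the same averages can be computed without groupByKey by carrying a (sum, count) pair per key with aggregateByKey; a sketch using the same students and marks RDDs:

sum_count = marks.aggregateByKey(
    (0, 0),
    lambda acc, v: (acc[0] + v, acc[1] + 1),       # fold a mark into (sum, count)
    lambda a, b: (a[0] + b[0], a[1] + b[1]))       # merge partial (sum, count) pairs
avg_by_id = sum_count.mapValues(lambda p: p[0] / p[1])
avg_marks = students.join(avg_by_id).map(lambda x: x[1])  # ('John', 85.0), ('Alice', 85.0)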

Q3. Invert a Key-Value RDD and sort

rdd = sc.parallelize([("a", 5), ("b", 3), ("c", 10)])
result = rdd.map(lambda x: (x[1], x[0])).sortByKey()

πŸ“Œ 11. Optimization Tips

  • Use reduceByKey instead of groupByKey
  • Use persist() or cache() only when reused
  • Avoid using collect() on large datasets
  • Use partitioning to optimize joins and shuffles
  • Prefer mapPartitions() over map() for batch efficiency

πŸ“Ž 12. Visual Lineage & DAG (Conceptual View)

Input File
   |
textFile("file.txt")
   |
flatMap(lambda x: x.split())
   |
filter(lambda x: x != "")
   |
map(lambda x: (x, 1))
   |
reduceByKey(lambda a, b: a + b)
   |
Action (collect)
