Here is a structured comparison of the RDD, DataFrame, and Dataset APIs in Apache Spark:
RDD vs DataFrame vs Dataset
| Feature | RDD (Resilient Distributed Dataset) | DataFrame | Dataset |
|---|---|---|---|
| Introduced In | Spark 1.0 | Spark 1.3 | Spark 1.6 |
| Type Safety | ✅ Compile-time type safety (for RDD[T]) | ❌ Not type-safe (rows with schema) | ✅ Type-safe (Scala/Java only) |
| Ease of Use | ❌ Low-level APIs (map, flatMap, filter) | ✅ High-level SQL-like APIs | ✅ High-level APIs + typed objects |
| Performance | ❌ Slower (no Catalyst or Tungsten) | ✅ Optimized (Catalyst + Tungsten) | ✅ Optimized (Catalyst + Tungsten) |
| Serialization | Java serialization (slow) | Tungsten binary format (fast) | Tungsten binary format (fast) |
| Memory Usage | High (no optimization) | Optimized | Optimized |
| Transformations | Functional: map, filter, reduce | SQL-style: select, groupBy, agg | Combines functional and SQL-style |
| Lazy Evaluation | ✅ Yes | ✅ Yes | ✅ Yes |
| Custom Functions | ✅ Full flexibility with lambdas/UDFs | ✅ UDFs allowed but slower | ✅ UDFs + compile-time checks |
| Support in PySpark | ✅ Yes | ✅ Yes | ❌ No (Dataset API is Scala/Java only) |
| When to Use | Complex, low-level data transformations | SQL-style analytics, aggregations | Type-safe transformations (Scala/Java only) |
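The Custom Functions and Performance rows in the table are related: a Python UDF on a DataFrame works, but each row has to leave the Tungsten binary format and be handed to a Python worker, so Catalyst treats the function as a black box and the UDF is usually slower than an equivalent built-in expression. A minimal PySpark sketch of the difference (assuming an active SparkSession named spark, as in the examples below):

from pyspark.sql.functions import col, udf
from pyspark.sql.types import IntegerType

df = spark.createDataFrame([("a", 1), ("b", 2)], ["key", "value"])

# Python UDF: flexible, but rows are serialized out to a Python worker,
# so Catalyst cannot optimize the function body.
add_one = udf(lambda v: v + 1, IntegerType())
df.select("key", add_one(col("value")).alias("value_plus_1")).show()

# Equivalent built-in expression: stays inside the JVM/Tungsten engine
# and is optimized by Catalyst.
df.select("key", (col("value") + 1).alias("value_plus_1")).show()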
Code Examples
RDD
# Pair RDD of (key, value) tuples; add 1 to each value
rdd = spark.sparkContext.parallelize([("a", 1), ("b", 2)])
rdd.map(lambda x: (x[0], x[1] + 1)).collect()  # [('a', 2), ('b', 3)]
DataFrame
# Same data with a named schema; the expression is optimized by Catalyst
df = spark.createDataFrame([("a", 1), ("b", 2)], ["key", "value"])
df.selectExpr("key", "value + 1 as value_plus_1").show()
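The comparison table credits DataFrame performance to Catalyst and Tungsten; one way to see the optimizer at work is to print the query plan. A minimal sketch, assuming the same spark session and the df created just above:

# explain(True) prints the parsed, analyzed, and optimized logical plans
# plus the physical plan chosen by Catalyst.
df.selectExpr("key", "value + 1 as value_plus_1").explain(True)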
Dataset (Scala/Java only; Scala example)
// toDS() needs the implicit encoders provided by the active SparkSession
import spark.implicits._

case class Record(key: String, value: Int)
val ds = Seq(Record("a", 1), Record("b", 2)).toDS()
ds.map(r => r.copy(value = r.value + 1)).show()
Summary
| Use Case | Recommended API |
|---|---|
| Type-safety + compile-time error catching (Scala/Java) | Dataset |
| SQL-like analytics, best performance, PySpark-friendly | DataFrame |
| Full control, low-level transformations, complex UDFs | RDD |
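Because PySpark exposes both the RDD and DataFrame APIs over the same data, you can also move between them instead of committing to one. A minimal sketch of that round trip (assuming an active SparkSession named spark):

# Start with a low-level RDD of (key, value) tuples.
rdd = spark.sparkContext.parallelize([("a", 1), ("b", 2)])

# Promote it to a DataFrame to get Catalyst/Tungsten optimizations.
df = spark.createDataFrame(rdd, ["key", "value"])

# Drop back to an RDD of Row objects when a transformation is easier
# to express with low-level functional operations.
pairs = df.rdd.map(lambda row: (row["key"], row["value"] + 1)).collect()
print(pairs)  # [('a', 2), ('b', 3)]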