Here’s a clear and structured comparison of RDD, DataFrame, and Dataset in Apache Spark:
RDD vs DataFrame vs Dataset
| Feature | RDD (Resilient Distributed Dataset) | DataFrame | Dataset |
|---|---|---|---|
| Introduced In | Spark 1.0 | Spark 1.3 | Spark 1.6 |
| Type Safety | ✅ Compile-time type safety (for `RDD[T]`) | ❌ Not type-safe (rows with schema) | ✅ Type-safe (Scala/Java only) |
| Ease of Use | ❌ Low-level APIs (`map`, `flatMap`, `filter`) | ✅ High-level SQL-like APIs | ✅ High-level + typed objects |
| Performance | ❌ Slower (no Catalyst or Tungsten) | ✅ Optimized (Catalyst + Tungsten) | ✅ Optimized (Catalyst + Tungsten) |
| Serialization | Java serialization (slow) | Tungsten binary format (fast) | Tungsten binary format (fast) |
| Memory Usage | High (no optimization) | Optimized | Optimized |
| Transformations | Functional: `map`, `filter`, `reduce` | SQL-style: `select`, `groupBy`, `agg` | Combines functional and SQL-style |
| Lazy Evaluation | ✅ Yes | ✅ Yes | ✅ Yes |
| Custom Functions | ✅ Full flexibility with lambdas/UDFs | ✅ UDFs allowed but slower | ✅ UDFs + compile-time checks |
| Support in PySpark | ✅ Yes | ✅ Yes | ❌ No (Dataset API is Scala/Java only) |
| When to Use | Complex, low-level data transformations | SQL-style analytics, aggregations | Type-safe transformations (Scala/Java only) |
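To make the lazy-evaluation and Catalyst rows above concrete, here is a minimal PySpark sketch (the app name and sample data are illustrative, not from the original examples):

```python
from pyspark.sql import SparkSession

# Hypothetical session setup; in the pyspark shell a `spark` session already exists.
spark = SparkSession.builder.appName("lazy-eval-demo").getOrCreate()

df = spark.createDataFrame([("a", 1), ("b", 2), ("a", 3)], ["key", "value"])

# Transformations only build a logical plan; nothing executes here.
filtered = df.filter(df.value > 1).select("key", "value")

# explain() prints the Catalyst-optimized physical plan without running the job.
filtered.explain()

# Only an action (count/collect/show) triggers actual execution.
print(filtered.count())
```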
Code Examples
RDD

```python
# Assumes an active SparkSession named `spark` (as in the pyspark shell).
# Low-level functional API: parallelize tuples and transform each element.
rdd = spark.sparkContext.parallelize([("a", 1), ("b", 2)])
rdd.map(lambda x: (x[0], x[1] + 1)).collect()
```
DataFrame

```python
# SQL-like API with a schema; Catalyst optimizes the expression.
df = spark.createDataFrame([("a", 1), ("b", 2)], ["key", "value"])
df.selectExpr("key", "value + 1 as value_plus_1").show()
```
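For the SQL-style `groupBy`/`agg` path mentioned in the table, a short sketch along the same lines (the sample data is made up):

```python
from pyspark.sql import functions as F

# Aggregate values per key using the built-in, Catalyst-optimized functions.
sales = spark.createDataFrame([("a", 1), ("b", 2), ("a", 3)], ["key", "value"])
sales.groupBy("key").agg(F.sum("value").alias("total")).show()
```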
Dataset (Scala/Java only; example in Scala)

```scala
import spark.implicits._   // required for toDS()

// Typed API: the case class gives compile-time checking on column types.
case class Record(key: String, value: Int)

val ds = Seq(Record("a", 1), Record("b", 2)).toDS()
ds.map(r => r.copy(value = r.value + 1)).show()
```
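Since the Dataset API is not available in PySpark, a commonly used workaround (shown here as an illustrative sketch, not part of the original example) is to drop to the DataFrame's underlying RDD of `Row` objects for per-record logic:

```python
from pyspark.sql import Row

# No Dataset API in PySpark: use the DataFrame's underlying RDD of Rows
# for per-record logic, then rebuild a DataFrame. Sample data is illustrative.
df = spark.createDataFrame([("a", 1), ("b", 2)], ["key", "value"])
bumped = df.rdd.map(lambda r: Row(key=r.key, value=r.value + 1))
spark.createDataFrame(bumped).show()
```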
Summary
| Use Case | Recommended API |
|---|---|
| Type safety + compile-time error catching (Scala/Java) | Dataset |
| SQL-like analytics, best performance, PySpark-friendly | DataFrame |
| Full control, low-level transformations, complex UDFs | RDD |
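These APIs also interoperate, so the choice is not exclusive; a rough sketch of switching between RDD and DataFrame within one job (sample data is illustrative):

```python
# Start low-level, promote to a DataFrame for optimized analytics,
# then drop back to an RDD when fine-grained control is needed.
rdd = spark.sparkContext.parallelize([("a", 1), ("b", 2)])
df = spark.createDataFrame(rdd, ["key", "value"])   # RDD -> DataFrame
back_to_rdd = df.rdd.map(tuple)                     # DataFrame -> RDD of tuples
print(back_to_rdd.collect())
```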