Here is a structured comparison of the RDD, DataFrame, and Dataset APIs in Apache Spark:
RDD vs DataFrame vs Dataset
| Feature | RDD (Resilient Distributed Dataset) | DataFrame | Dataset |
|---|---|---|---|
| Introduced In | Spark 1.0 | Spark 1.3 | Spark 1.6 |
| Type Safety | ✅ Compile-time type safety (for RDD[T]) | ❌ Not type-safe (rows with schema) | ✅ Type-safe (Scala/Java only) |
| Ease of Use | ❌ Low-level APIs (map, flatMap, filter) | ✅ High-level SQL-like APIs | ✅ High-level APIs + typed objects |
| Performance | ❌ Slower (no Catalyst or Tungsten) | ✅ Optimized (Catalyst + Tungsten) | ✅ Optimized (Catalyst + Tungsten) |
| Serialization | Java serialization (slow) | Tungsten binary format (fast) | Tungsten binary format (fast) |
| Memory Usage | High (no optimization) | Optimized | Optimized |
| Transformations | Functional: map, filter, reduce | SQL-style: select, groupBy, agg | Combines functional and SQL-style |
| Lazy Evaluation | ✅ Yes | ✅ Yes | ✅ Yes |
| Custom Functions | ✅ Full flexibility with lambdas/UDFs | ✅ UDFs allowed but slower | ✅ UDFs + compile-time checks |
| Support in PySpark | ✅ Yes | ✅ Yes | ❌ No (Dataset API is Scala/Java only) |
| When to Use | Complex, low-level data transformations | SQL-style analytics, aggregations | Type-safe transformations (Scala/Java only) |
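The Custom Functions and Performance rows in the table are related: a Python UDF on a DataFrame works, but each row has to leave the Tungsten binary format and be handed to a Python worker, so Catalyst treats the function as a black box and the UDF is usually slower than an equivalent built-in expression. A minimal PySpark sketch of the difference (assuming an active SparkSession named spark, as in the examples below):

from pyspark.sql.functions import col, udf
from pyspark.sql.types import IntegerType

df = spark.createDataFrame([("a", 1), ("b", 2)], ["key", "value"])

# Python UDF: flexible, but rows are serialized out to a Python worker,
# so Catalyst cannot optimize the function body.
add_one = udf(lambda v: v + 1, IntegerType())
df.select("key", add_one(col("value")).alias("value_plus_1")).show()

# Equivalent built-in expression: stays inside the JVM/Tungsten engine
# and is optimized by Catalyst.
df.select("key", (col("value") + 1).alias("value_plus_1")).show()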
Code Examples
RDD
# Pair RDD of (key, value) tuples; add 1 to each value
rdd = spark.sparkContext.parallelize([("a", 1), ("b", 2)])
rdd.map(lambda x: (x[0], x[1] + 1)).collect()  # [('a', 2), ('b', 3)]
DataFrame
# Same data with a named schema; the expression is optimized by Catalyst
df = spark.createDataFrame([("a", 1), ("b", 2)], ["key", "value"])
df.selectExpr("key", "value + 1 as value_plus_1").show()
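The comparison table credits DataFrame performance to Catalyst and Tungsten; one way to see the optimizer at work is to print the query plan. A minimal sketch, assuming the same spark session and the df created just above:

# explain(True) prints the parsed, analyzed, and optimized logical plans
# plus the physical plan chosen by Catalyst.
df.selectExpr("key", "value + 1 as value_plus_1").explain(True)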
Dataset (Scala/Java only; Scala example)
// toDS() needs the implicit encoders provided by the active SparkSession
import spark.implicits._

case class Record(key: String, value: Int)
val ds = Seq(Record("a", 1), Record("b", 2)).toDS()
ds.map(r => r.copy(value = r.value + 1)).show()
Summary
| Use Case | Recommended API |
|---|---|
| Type-safety + compile-time error catching (Scala/Java) | Dataset |
| SQL-like analytics, best performance, PySpark-friendly | DataFrame |
| Full control, low-level transformations, complex UDFs | RDD |
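Because PySpark exposes both the RDD and DataFrame APIs over the same data, you can also move between them instead of committing to one. A minimal sketch of that round trip (assuming an active SparkSession named spark):

# Start with a low-level RDD of (key, value) tuples.
rdd = spark.sparkContext.parallelize([("a", 1), ("b", 2)])

# Promote it to a DataFrame to get Catalyst/Tungsten optimizations.
df = spark.createDataFrame(rdd, ["key", "value"])

# Drop back to an RDD of Row objects when a transformation is easier
# to express with low-level functional operations.
pairs = df.rdd.map(lambda row: (row["key"], row["value"] + 1)).collect()
print(pairs)  # [('a', 2), ('b', 3)]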