✅ What is a DataFrame in PySpark?
A DataFrame in PySpark is a distributed collection of data organized into named columns, similar to a table in a relational database or a Pandas DataFrame.
It is built on top of RDDs and provides:
- Schema awareness (column names & data types)
- High-level APIs such as select, groupBy, and join (a short sketch follows this list)
- Query optimization via the Catalyst engine
- Integration with SQL, Hive, Delta, Parquet, etc.
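A minimal sketch of these high-level APIs, assuming an active SparkSession named spark and two small, made-up DataFrames:
# Hypothetical sample data, built inline just for illustration
people = spark.createDataFrame([("Alice", 30, "NY"), ("Bob", 25, "SF")], ["name", "age", "city"])
cities = spark.createDataFrame([("NY", "New York"), ("SF", "San Francisco")], ["city", "city_name"])
people.select("name", "age").show()                 # projection
people.groupBy("city").count().show()               # aggregation
people.join(cities, on="city", how="inner").show()  # join on a shared column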
📊 DataFrame = RDD + Schema
Under the hood:
DataFrame = RDD[Row] + Schema
So while an RDD is just distributed records with no inherent structure, a DataFrame adds:
- Schema: like a table
- Optimizations: Catalyst query planner
- Better performance: due to SQL-style planning
📎 Example: From RDD to DataFrame
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.appName("rdd-to-df").getOrCreate()
# An RDD of Row objects; each Row carries field names and values
rdd = spark.sparkContext.parallelize([Row(name="Alice", age=30), Row(name="Bob", age=25)])
# Convert to a DataFrame; the schema is inferred from the Row fields
df = spark.createDataFrame(rdd)
df.printSchema()
df.show()
📌 Output:
root
 |-- name: string (nullable = true)
 |-- age: long (nullable = true)

+-----+---+
| name|age|
+-----+---+
|Alice| 30|
|  Bob| 25|
+-----+---+
🆚 DataFrame vs RDD
Feature | RDD | DataFrame |
---|---|---|
Structure | Unstructured | Structured (schema: columns & types) |
APIs | Low-level (map, filter, reduce) | High-level (select, groupBy, SQL) |
Performance | Slower | Faster (optimized via Catalyst) |
Optimization | Manual | Automatic (Catalyst + Tungsten) |
Ease of Use | More code | Concise SQL-like operations |
Use Cases | Fine-grained control, custom ops | Most ETL, Analytics, SQL workloads |
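To make the contrast concrete, here is the same small aggregation written both ways. This is an illustrative sketch: the item/price data and column names are made up, and an active SparkSession named spark is assumed.
# Low-level RDD ops: you describe *how* to compute the totals
sales_rdd = spark.sparkContext.parallelize([("book", 12.0), ("pen", 2.5), ("book", 8.0)])
print(sales_rdd.reduceByKey(lambda a, b: a + b).collect())
# DataFrame API: you describe *what* you want; Catalyst plans the execution
sales_df = spark.createDataFrame(sales_rdd, ["item", "price"])
sales_df.groupBy("item").sum("price").show()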
🧠 When to Use What?
- Use RDD if:
- You need fine-grained control (custom partitioning, complex transformations)
- You’re doing low-level transformations or working with unstructured data
- Use DataFrame if:
- You want performance & ease
- You’re working with structured/semi-structured data such as JSON, CSV, Parquet, or Hive (a short reading sketch follows this list)
- You want to write SQL-like logic
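A minimal sketch of reading structured sources with the DataFrame API; the file paths here are placeholders, not real datasets:
# Each reader returns a DataFrame with an inferred or declared schema
json_df = spark.read.json("events.json")                             # semi-structured JSON
csv_df = spark.read.csv("sales.csv", header=True, inferSchema=True)  # CSV with a header row
parquet_df = spark.read.parquet("warehouse/sales.parquet")           # columnar Parquet
csv_df.printSchema()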
🚀 Bonus: PySpark SQL + DataFrame
DataFrames integrate with SQL:
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 28").show()
✅ Summary
- PySpark DataFrame is the go-to API for structured big data processing.
- It’s built on RDDs, but adds structure and optimization.
- Immutable and distributed just like RDDs, but more powerful for most real-world use cases.
In PySpark, DataFrames are immutable, just like RDDs.
✅ What Does Immutable Mean?
Once a DataFrame is created, it cannot be changed in-place. Any operation you perform returns a new DataFrame with the transformation applied.
🔁 Example:
df = spark.read.csv("data.csv", header=True)
# filter does not change df, it returns a new DataFrame
df_filtered = df.filter(df["price"] > 100)
# df is still the original
df.show()
df_filtered.show()
Here:
- df is unchanged.
- df_filtered is a new DataFrame with a subset of rows.
🧠 Why Are DataFrames Immutable?
- Fault tolerance: Spark builds a lineage graph, so if a node fails, Spark can recompute lost partitions.
- Optimizations: the Catalyst optimizer rewrites queries knowing transformations don’t mutate the source (see the explain() sketch after this list).
- Parallelism: Safe to operate across clusters without race conditions.
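A quick way to see this in action is explain(), which prints the query plan Spark has recorded rather than any mutated data. This small sketch reuses the hypothetical price filter from above:
# Nothing has been mutated or even computed yet; Spark has only recorded a plan
df = spark.read.csv("data.csv", header=True)
df_filtered = df.filter(df["price"] > 100)
# Show the logical and physical plans Catalyst produced for the new DataFrame
df_filtered.explain(True)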
🔄 But You Can Reassign
While the DataFrame object is immutable, you can reassign the variable:
df = df.filter(df["price"] > 100) # now df points to the new result
But this is just Python variable reassignment, not mutation.
✅ Summary
Concept | Mutable? | Notes |
---|---|---|
RDD | ❌ No | Immutable |
DataFrame | ❌ No | Immutable, optimized via Catalyst |
Variable (Python) | ✅ Yes | Can point to a new DataFrame |