✅ What is a DataFrame in PySpark?
A DataFrame in PySpark is a distributed collection of data organized into named columns, similar to a table in a relational database or a Pandas DataFrame.
It is built on top of RDDs and provides:
- Schema awareness (column names & data types)
- High-level APIs such as select, groupBy, and join (a short sketch follows this list)
- Query optimization via the Catalyst engine
- Integration with SQL, Hive, Delta, Parquet, etc.
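A minimal sketch of these high-level APIs, assuming an active SparkSession named spark and two small, made-up DataFrames:
# Hypothetical sample data, built inline just for illustration
people = spark.createDataFrame([("Alice", 30, "NY"), ("Bob", 25, "SF")], ["name", "age", "city"])
cities = spark.createDataFrame([("NY", "New York"), ("SF", "San Francisco")], ["city", "city_name"])
people.select("name", "age").show()                 # projection
people.groupBy("city").count().show()               # aggregation
people.join(cities, on="city", how="inner").show()  # join on a shared column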
📊 DataFrame = RDD + Schema
Under the hood:
DataFrame = RDD[Row] + Schema
So while an RDD is just distributed records with no inherent structure, a DataFrame adds:
- Schema: like a table
- Optimizations: Catalyst query planner
- Better performance: due to SQL-style planning
📎 Example: From RDD to DataFrame
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.appName("rdd-to-df").getOrCreate()
# An RDD of Row objects; each Row carries field names and values
rdd = spark.sparkContext.parallelize([Row(name="Alice", age=30), Row(name="Bob", age=25)])
# Convert to a DataFrame; the schema is inferred from the Row fields
df = spark.createDataFrame(rdd)
df.printSchema()
df.show()
📌 Output:
root
 |-- name: string (nullable = true)
 |-- age: long (nullable = true)

+-----+---+
| name|age|
+-----+---+
|Alice| 30|
|  Bob| 25|
+-----+---+
🆚 DataFrame vs RDD
Feature | RDD | DataFrame |
---|---|---|
Structure | Unstructured | Structured (schema: columns & types) |
APIs | Low-level (map, filter, reduce) | High-level (select, groupBy, SQL) |
Performance | Slower | Faster (optimized via Catalyst) |
Optimization | Manual | Automatic (Catalyst + Tungsten) |
Ease of Use | More code | Concise SQL-like operations |
Use Cases | Fine-grained control, custom ops | Most ETL, Analytics, SQL workloads |
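To make the contrast concrete, here is the same small aggregation written both ways. This is an illustrative sketch: the item/price data and column names are made up, and an active SparkSession named spark is assumed.
# Low-level RDD ops: you describe *how* to compute the totals
sales_rdd = spark.sparkContext.parallelize([("book", 12.0), ("pen", 2.5), ("book", 8.0)])
print(sales_rdd.reduceByKey(lambda a, b: a + b).collect())
# DataFrame API: you describe *what* you want; Catalyst plans the execution
sales_df = spark.createDataFrame(sales_rdd, ["item", "price"])
sales_df.groupBy("item").sum("price").show()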
🧠 When to Use What?
- Use RDD if:
- You need fine-grained control (custom partitioning, complex transformations)
- You’re doing low-level transformations or working with unstructured data
- Use DataFrame if:
- You want performance & ease
- You’re working with structured/semi-structured data such as JSON, CSV, Parquet, or Hive (a short reading sketch follows this list)
- You want to write SQL-like logic
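A minimal sketch of reading structured sources with the DataFrame API; the file paths here are placeholders, not real datasets:
# Each reader returns a DataFrame with an inferred or declared schema
json_df = spark.read.json("events.json")                             # semi-structured JSON
csv_df = spark.read.csv("sales.csv", header=True, inferSchema=True)  # CSV with a header row
parquet_df = spark.read.parquet("warehouse/sales.parquet")           # columnar Parquet
csv_df.printSchema()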
🚀 Bonus: PySpark SQL + DataFrame
DataFrames integrate with SQL:
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 28").show()
✅ Summary
- PySpark DataFrame is the go-to API for structured big data processing.
- It’s built on RDDs, but adds structure and optimization.
- Immutable and distributed just like RDDs, but more powerful for most real-world use cases.
In PySpark, DataFrames are immutable, just like RDDs.
✅ What Does Immutable Mean?
Once a DataFrame is created, it cannot be changed in-place. Any operation you perform returns a new DataFrame with the transformation applied.
🔁 Example:
df = spark.read.csv("data.csv", header=True)
# filter does not change df, it returns a new DataFrame
df_filtered = df.filter(df["price"] > 100)
# df is still the original
df.show()
df_filtered.show()
Here:
- df is unchanged.
- df_filtered is a new DataFrame with a subset of rows.
🧠 Why Are DataFrames Immutable?
- Fault tolerance: Spark builds a lineage graph, so if a node fails, Spark can recompute lost partitions.
- Optimizations: the Catalyst optimizer rewrites queries knowing transformations don’t mutate the source (see the explain() sketch after this list).
- Parallelism: Safe to operate across clusters without race conditions.
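A quick way to see this in action is explain(), which prints the query plan Spark has recorded rather than any mutated data. This small sketch reuses the hypothetical price filter from above:
# Nothing has been mutated or even computed yet; Spark has only recorded a plan
df = spark.read.csv("data.csv", header=True)
df_filtered = df.filter(df["price"] > 100)
# Show the logical and physical plans Catalyst produced for the new DataFrame
df_filtered.explain(True)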
🔄 But You Can Reassign
While the DataFrame object is immutable, you can reassign the variable:
df = df.filter(df["price"] > 100) # now df points to the new result
But this is just Python variable reassignment, not mutation.
✅ Summary
Concept | Mutable? | Notes |
---|---|---|
RDD | ❌ No | Immutable |
DataFrame | ❌ No | Immutable, optimized via Catalyst |
Variable (Python) | ✅ Yes | Can point to a new DataFrame |