Let's explore selectExpr() and groupBy(expr()): two powerful, flexible PySpark patterns that let you use SQL expressions directly inside DataFrame methods.

🧠 What are selectExpr() and groupBy(expr())?
Both are expression-based alternatives to plain select() and groupBy() calls that take SQL-style strings instead of Column objects. One correction up front: selectExpr() is a real DataFrame method, but despite what some tutorials claim, PySpark has no groupByExpr(); the supported way to group by a SQL expression is groupBy(F.expr(...)), and that is the pattern used below.
✅ selectExpr(*exprs: str)

🔹 Definition:

```python
DataFrame.selectExpr(*exprs: str)
```

- Allows you to write SQL-style expressions directly.
- Supports aliases (AS), functions, arithmetic, CASE statements, etc.
🔸 Use Cases:

| Use Case | Example |
|---|---|
| Aliasing columns | "amount as total_amount" |
| Math expressions | "amount * 1.1 as amount_with_tax" |
| Casting | "CAST(salary AS INT) as salary_int" |
| Conditions | "CASE WHEN amount > 1000 THEN 'High' ELSE 'Low' END AS amount_type" |
✅ Example:

```python
df.selectExpr(
    "dept",
    "amount * 0.9 as discounted_amount",
    "CASE WHEN amount > 1000 THEN 'HIGH' ELSE 'LOW' END as category"
)
```
Equivalent to writing:

```python
from pyspark.sql import functions as F

df.select(
    F.col("dept"),
    (F.col("amount") * 0.9).alias("discounted_amount"),
    F.when(F.col("amount") > 1000, "HIGH").otherwise("LOW").alias("category")
)
```
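Here's a self-contained sketch of that equivalence; the SparkSession setup and the sample dept/amount rows are assumptions for illustration:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Tiny sample frame, purely for illustration
df = spark.createDataFrame(
    [("sales", 1200.0), ("sales", 300.0), ("hr", 800.0)],
    ["dept", "amount"],
)

# Same output columns as the select()/F.col() version above
df.selectExpr(
    "dept",
    "amount * 0.9 as discounted_amount",
    "CASE WHEN amount > 1000 THEN 'HIGH' ELSE 'LOW' END as category",
).show()
```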
✅ groupBy(F.expr(...))

🔹 Definition:

DataFrame.groupBy() accepts Column objects as well as column names, so F.expr() turns any SQL-style string into a valid grouping key:

```python
df.groupBy(F.expr("<sql expression>"))
```

- Lets you use complex SQL expressions as groupBy keys.
- Useful for:
  - Casting
  - Extracting parts of strings/dates
  - Bucketing/grouping logic
🔸 Use Cases:

| Use Case | Example |
|---|---|
| Group by year of date | F.expr("year(order_date)") |
| Group by rounded values | F.expr("floor(salary / 1000) * 1000 as salary_band") |
| Group by substring | F.expr("substring(region, 1, 3) as region_prefix") |
| Grouping sets / cube / rollup | df.cube(), df.rollup(), or spark.sql() (see Example 2) |
✅ Example 1: Group by transformed date

```python
df.groupBy(F.expr("year(order_date) as order_year")).agg(F.sum("amount").alias("total"))
```
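Another expression from the use-case table, sketched end to end; the salary column is an assumption for illustration:

```python
from pyspark.sql import functions as F

# Bucket salaries into 1000-unit bands, e.g. 3499 -> 3000
df.groupBy(F.expr("floor(salary / 1000) * 1000 as salary_band")) \
  .agg(F.avg("salary").alias("avg_salary"), F.count("*").alias("n_employees"))
```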
✅ Example 2: Grouping sets

GROUPING SETS is GROUP BY clause syntax, not a column expression, so it cannot be passed to groupBy() or F.expr() at all. Until Spark 4.0, which adds a native df.groupingSets() method, the way to use it from PySpark is spark.sql():

```python
df.createOrReplaceTempView("df")
spark.sql("""SELECT dept, month, SUM(amount) AS total_amount
             FROM df GROUP BY GROUPING SETS ((dept), (month), ())""")
```

You can't do that with .groupBy() directly, though df.cube() and df.rollup() cover the most common subtotal patterns without leaving the DataFrame API.
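As a quick sketch of the rollup() route, assuming the same dept/month/amount columns, with F.grouping() used to flag which output rows are subtotals:

```python
from pyspark.sql import functions as F

# ROLLUP(dept, month) yields (dept, month), (dept), and grand-total levels
(df.rollup("dept", "month")
   .agg(F.sum("amount").alias("total_amount"),
        F.grouping("month").alias("month_rolled_up"))  # 1 when the row aggregates across months
   .show())
```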
🔁 selectExpr() + expr() = SQL Without SQL

Between them, you can mimic most SQL queries without writing full SQL:

✅ Example: Equivalent of

```sql
SELECT dept, month, SUM(amount)
FROM df
GROUP BY CUBE (dept, month)
```

In PySpark:

```python
df.cube("dept", "month").agg(F.sum("amount").alias("total_amount"))
```

(CUBE is the subtotal pattern the DataFrame API covers natively; for arbitrary GROUPING SETS, fall back to spark.sql() as in Example 2.)
📌 When to Use

| Situation | Use |
|---|---|
| You want to write expressions quickly, like SQL | selectExpr() |
| You need to group by computed expressions | groupBy(F.expr(...)) |
| You need grouping sets / cube / rollup | spark.sql(), df.cube(), or df.rollup() |
| You want to avoid importing many functions from pyspark.sql.functions | Both |
| You are translating SQL logic to PySpark programmatically | Both |
✨ Tips

- selectExpr() is great for dynamic SQL-style transformations, e.g., in metadata-driven pipelines.
- F.expr() keeps expression-based grouping inside the DataFrame API; arbitrary grouping sets remain one of the few GROUP BY features that still require spark.sql() (until Spark 4.0's df.groupingSets()).
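To make the first tip concrete, here's a sketch of a metadata-driven transform; column_specs is a hypothetical config you might load from JSON or a control table:

```python
# Hypothetical expression metadata (e.g., loaded from JSON or a control table)
column_specs = [
    "order_id",
    "CAST(amount AS DOUBLE) as amount",
    "upper(region) as region_code",
    "CASE WHEN amount > 1000 THEN 'HIGH' ELSE 'LOW' END as tier",
]

# selectExpr() unpacks the strings directly: no per-column F.* wiring needed
transformed = df.selectExpr(*column_specs)
```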
🧪 Mini Demo

```python
df.selectExpr(
    "dept",
    "amount",
    "amount * 1.05 as taxed_amount",
    "CASE WHEN amount > 1000 THEN 'BIG' ELSE 'SMALL' END as size_flag"
).groupBy("dept", "size_flag").agg(F.sum("amount").alias("total_amount"))
```

Note that size_flag is an alias created by selectExpr(), so the groupBy() that follows can reference it as an ordinary column name; no expression needed.