Let's explore selectExpr() and groupBy(expr()): two powerful, flexible PySpark patterns that let you use SQL expressions directly inside DataFrame methods.

🧠 What are selectExpr() and groupBy(expr())?
Both are expression-based alternatives to plain select() and groupBy() calls that take SQL-style strings instead of Column objects. One correction up front: selectExpr() is a real DataFrame method, but despite what some tutorials claim, PySpark has no groupByExpr(); the supported way to group by a SQL expression is groupBy(F.expr(...)), and that is the pattern used below.
✅ selectExpr(*exprs: str)

🔹 Definition:

```python
DataFrame.selectExpr(*exprs: str)
```

- Allows you to write SQL-style expressions directly.
- Supports aliases (AS), functions, arithmetic, CASE statements, etc.
🔸 Use Cases:

| Use Case | Example |
|---|---|
| Aliasing columns | "amount as total_amount" |
| Math expressions | "amount * 1.1 as amount_with_tax" |
| Casting | "CAST(salary AS INT) as salary_int" |
| Conditions | "CASE WHEN amount > 1000 THEN 'High' ELSE 'Low' END AS amount_type" |
✅ Example:

```python
df.selectExpr(
    "dept",
    "amount * 0.9 as discounted_amount",
    "CASE WHEN amount > 1000 THEN 'HIGH' ELSE 'LOW' END as category"
)
```
Equivalent to writing:

```python
from pyspark.sql import functions as F

df.select(
    F.col("dept"),
    (F.col("amount") * 0.9).alias("discounted_amount"),
    F.when(F.col("amount") > 1000, "HIGH").otherwise("LOW").alias("category")
)
```
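Here's a self-contained sketch of that equivalence; the SparkSession setup and the sample dept/amount rows are assumptions for illustration:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Tiny sample frame, purely for illustration
df = spark.createDataFrame(
    [("sales", 1200.0), ("sales", 300.0), ("hr", 800.0)],
    ["dept", "amount"],
)

# Same output columns as the select()/F.col() version above
df.selectExpr(
    "dept",
    "amount * 0.9 as discounted_amount",
    "CASE WHEN amount > 1000 THEN 'HIGH' ELSE 'LOW' END as category",
).show()
```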
✅ groupBy(F.expr(...))

🔹 Definition:

DataFrame.groupBy() accepts Column objects as well as column names, so F.expr() turns any SQL-style string into a valid grouping key:

```python
df.groupBy(F.expr("<sql expression>"))
```

- Lets you use complex SQL expressions as groupBy keys.
- Useful for:
  - Casting
  - Extracting parts of strings/dates
  - Bucketing/grouping logic
🔸 Use Cases:

| Use Case | Example |
|---|---|
| Group by year of date | F.expr("year(order_date)") |
| Group by rounded values | F.expr("floor(salary / 1000) * 1000 as salary_band") |
| Group by substring | F.expr("substring(region, 1, 3) as region_prefix") |
| Grouping sets / cube / rollup | df.cube(), df.rollup(), or spark.sql() (see Example 2) |
✅ Example 1: Group by transformed date

```python
df.groupBy(F.expr("year(order_date) as order_year")).agg(F.sum("amount").alias("total"))
```
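Another expression from the use-case table, sketched end to end; the salary column is an assumption for illustration:

```python
from pyspark.sql import functions as F

# Bucket salaries into 1000-unit bands, e.g. 3499 -> 3000
df.groupBy(F.expr("floor(salary / 1000) * 1000 as salary_band")) \
  .agg(F.avg("salary").alias("avg_salary"), F.count("*").alias("n_employees"))
```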
✅ Example 2: Grouping sets

GROUPING SETS is GROUP BY clause syntax, not a column expression, so it cannot be passed to groupBy() or F.expr() at all. Until Spark 4.0, which adds a native df.groupingSets() method, the way to use it from PySpark is spark.sql():

```python
df.createOrReplaceTempView("df")
spark.sql("""SELECT dept, month, SUM(amount) AS total_amount
             FROM df GROUP BY GROUPING SETS ((dept), (month), ())""")
```

You can't do that with .groupBy() directly, though df.cube() and df.rollup() cover the most common subtotal patterns without leaving the DataFrame API.
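As a quick sketch of the rollup() route, assuming the same dept/month/amount columns, with F.grouping() used to flag which output rows are subtotals:

```python
from pyspark.sql import functions as F

# ROLLUP(dept, month) yields (dept, month), (dept), and grand-total levels
(df.rollup("dept", "month")
   .agg(F.sum("amount").alias("total_amount"),
        F.grouping("month").alias("month_rolled_up"))  # 1 when the row aggregates across months
   .show())
```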
🔁 selectExpr() + expr() = SQL Without SQL

Between them, you can mimic most SQL queries without writing full SQL:

✅ Example: Equivalent of

```sql
SELECT dept, month, SUM(amount)
FROM df
GROUP BY CUBE (dept, month)
```

In PySpark:

```python
df.cube("dept", "month").agg(F.sum("amount").alias("total_amount"))
```

(CUBE is the subtotal pattern the DataFrame API covers natively; for arbitrary GROUPING SETS, fall back to spark.sql() as in Example 2.)
📌 When to Use

| Situation | Use |
|---|---|
| You want to write expressions quickly, like SQL | selectExpr() |
| You need to group by computed expressions | groupBy(F.expr(...)) |
| You need grouping sets / cube / rollup | spark.sql(), df.cube(), or df.rollup() |
| You want to avoid importing many functions from pyspark.sql.functions | Both |
| You are translating SQL logic to PySpark programmatically | Both |
✨ Tips

- selectExpr() is great for dynamic SQL-style transformations, e.g., in metadata-driven pipelines.
- F.expr() keeps expression-based grouping inside the DataFrame API; arbitrary grouping sets remain one of the few GROUP BY features that still require spark.sql() (until Spark 4.0's df.groupingSets()).
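To make the first tip concrete, here's a sketch of a metadata-driven transform; column_specs is a hypothetical config you might load from JSON or a control table:

```python
# Hypothetical expression metadata (e.g., loaded from JSON or a control table)
column_specs = [
    "order_id",
    "CAST(amount AS DOUBLE) as amount",
    "upper(region) as region_code",
    "CASE WHEN amount > 1000 THEN 'HIGH' ELSE 'LOW' END as tier",
]

# selectExpr() unpacks the strings directly: no per-column F.* wiring needed
transformed = df.selectExpr(*column_specs)
```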
🧪 Mini Demo

```python
df.selectExpr(
    "dept",
    "amount",
    "amount * 1.05 as taxed_amount",
    "CASE WHEN amount > 1000 THEN 'BIG' ELSE 'SMALL' END as size_flag"
).groupBy("dept", "size_flag").agg(F.sum("amount").alias("total_amount"))
```

Note that size_flag is an alias created by selectExpr(), so the groupBy() that follows can reference it as an ordinary column name; no expression needed.