PySpark Coding Practice Questions

Let’s explore selectExpr() and expr(), two flexible PySpark features that let you use SQL expressions directly inside DataFrame methods such as select() and groupBy().


🧠 What are selectExpr() and expr()?

Both accept SQL-style strings instead of Column objects: selectExpr() is an expression-based alternative to select(), and expr() turns a SQL string into a Column you can pass to groupBy(), agg(), and friends. (Note: unlike selectExpr(), PySpark has no built-in groupByExpr() method; groupBy(expr("...")) is the idiomatic equivalent.)
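A minimal, runnable sketch of both idioms. The dept/amount columns and the toy data are illustrative assumptions used throughout the examples below:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Toy data: (dept, amount) rows reused by the examples in this post
df = spark.createDataFrame(
    [("sales", 1200.0), ("sales", 300.0), ("hr", 800.0)],
    ["dept", "amount"],
)

# SQL-style strings via selectExpr()
df.selectExpr("dept", "amount * 1.1 as amount_with_tax").show()

# SQL-style string as a grouping key via expr()
df.groupBy(F.expr("upper(dept)").alias("dept_uc")).agg(F.sum("amount")).show()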


✅ selectExpr(*exprs: str)

🔹 Definition:

DataFrame.selectExpr(*exprs: str)
  • Allows you to write SQL-style expressions directly.
  • Supports aliases (AS), functions, arithmetic, case statements, etc.

🔸 Use Cases:

  • Aliasing columns: "amount as total_amount"
  • Math expressions: "amount * 1.1 as amount_with_tax"
  • Casting: "CAST(salary AS INT) as salary_int"
  • Conditions: "CASE WHEN amount > 1000 THEN 'High' ELSE 'Low' END AS amount_type"

✅ Example:

df.selectExpr(
    "dept",
    "amount * 0.9 as discounted_amount",
    "CASE WHEN amount > 1000 THEN 'HIGH' ELSE 'LOW' END as category"
)

Equivalent to writing:

from pyspark.sql import functions as F
df.select(
    F.col("dept"),
    (F.col("amount") * 0.9).alias("discounted_amount"),
    F.when(F.col("amount") > 1000, "HIGH").otherwise("LOW").alias("category")
)
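There is also a middle ground: wrap individual SQL strings in expr() inside a regular select(). A short sketch using the same assumed columns and the F alias imported above:

df.select(
    "dept",
    F.expr("amount * 0.9 as discounted_amount"),  # expr() accepts inline "as" aliases
    F.expr("CASE WHEN amount > 1000 THEN 'HIGH' ELSE 'LOW' END").alias("category"),
)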

✅ groupBy() with expr()

🔹 Definition:

DataFrame.groupBy(*cols)
  • Accepts Column objects as well as column names, so F.expr("...") lets you use complex SQL expressions as grouping keys.
  • Useful for:
    • Casting
    • Extracting parts of strings/dates
    • Bucketing/grouping logic

🔸 Use Cases:

  • Group by year of date: expr("year(order_date)")
  • Group by rounded values: expr("floor(salary / 1000) * 1000 as salary_band")
  • Group by substring: expr("substring(region, 1, 3) as region_prefix")
  • Grouping sets/cube/rollup: not expressions; use df.cube() / df.rollup(), or spark.sql() for raw GROUPING SETS (see below)

✅ Example 1: Group by transformed date

df.groupBy(F.expr("year(order_date)").alias("order_year")).agg(F.sum("amount").alias("total"))
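And for the salary-band row from the table above, a sketch assuming a numeric salary column:

bands = spark.createDataFrame([(1100,), (1900,), (2500,)], ["salary"])
bands.groupBy(
    F.expr("floor(salary / 1000) * 1000").alias("salary_band")  # 1000-wide buckets
).count().show()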

✅ Example 2: Grouping Sets

GROUPING SETS is GROUP BY syntax rather than a column expression, so neither groupBy() nor expr() can parse it. Register a temp view and run it through spark.sql():

df.createOrReplaceTempView("orders")
spark.sql("""
    SELECT dept, month, SUM(amount) AS total_amount
    FROM orders
    GROUP BY GROUPING SETS ((dept), (month), ())
""")


πŸ” selectExpr() + groupByExpr() = SQL Without SQL

You can mimic SQL queries without writing full SQL:

✅ Example: Equivalent of

SELECT dept, month, SUM(amount)
FROM df
GROUP BY GROUPING SETS ((dept), (month), ())

In PySpark, the grouping-sets clause still goes through spark.sql() on Spark 3.x, exactly as in Example 2 above:

df.createOrReplaceTempView("orders")
spark.sql("""
    SELECT dept, month, SUM(amount) AS total_amount
    FROM orders
    GROUP BY GROUPING SETS ((dept), (month), ())
""")
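On Spark 4.0+, the same query can stay in the DataFrame API via groupingSets(); this is a sketch based on the 4.0 API (the method does not exist on earlier versions, and the dept/month/amount columns are the running example's assumptions):

# Spark 4.0+ only: grouping sets without spark.sql()
df.groupingSets(
    [["dept"], ["month"], []],   # one inner list per grouping set; [] is the grand total
    "dept", "month",             # grouping columns that appear in the output
).agg(F.sum("amount").alias("total_amount"))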

📌 When to Use

  • You want to write expressions quickly, like SQL: selectExpr()
  • You need to group by computed expressions: groupBy(expr("..."))
  • You want to avoid importing many functions from pyspark.sql.functions: both
  • You are translating SQL logic to PySpark programmatically: both

✨ Tips

  • selectExpr() is great for dynamic SQL-style transformations, e.g., in metadata-driven pipelines (see the sketch after this list).
  • expr() lets you keep grouping logic in plain SQL strings, but raw GROUPING SETS still requires spark.sql() on Spark 3.x (Spark 4.0 adds DataFrame.groupingSets()).
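To make the first tip concrete, here is a sketch of a metadata-driven transform where the expressions come from configuration rather than code (the rules list is hypothetical, e.g. loaded from JSON or YAML in a real pipeline):

# Hypothetical config: each entry is a SQL expression string
rules = [
    "dept",
    "amount * 1.05 as taxed_amount",
    "CASE WHEN amount > 1000 THEN 'BIG' ELSE 'SMALL' END as size_flag",
]

# Because selectExpr() takes plain strings, the transform logic can live in data
result = df.selectExpr(*rules)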

🧪 Mini Demo

df.selectExpr(
    "dept",
    "amount",
    "amount * 1.05 as taxed_amount",
    "CASE WHEN amount > 1000 THEN 'BIG' ELSE 'SMALL' END as size_flag"
).groupBy("dept", "size_flag").agg(F.sum("amount").alias("total_amount"))

After selectExpr(), dept and size_flag are ordinary output columns, so a plain groupBy() can take them by name.
