You can calculate both the total sum and the monthly sum per department in the same PySpark query using grouping sets, or by unioning two DataFrames (one for the total, one for per-month-per-dept). Here's how to do both:
✅ Example DataFrame Schema
Assume a DataFrame `df` like:
| dept  | month | amount |
|-------|-------|--------|
| Sales | Jan   | 1000   |
| Sales | Feb   | 1200   |
| HR    | Jan   | 800    |
| HR    | Feb   | 900    |
✅ Option 1: Use grouping sets (single query)
```python
from pyspark.sql import functions as F

# cube() aggregates at every combination of the grouping columns,
# including the all-null grand-total row
result = (
    df.cube("dept", "month")
      .agg(F.sum("amount").alias("total_amount"))
)
```
But cube also emits per-dept and per-month subtotal rows you may not need. For exactly the two levels you want (per dept-month plus the grand total), use GROUPING SETS through Spark SQL, since the DataFrame API has no groupByExpr method:
```python
df.createOrReplaceTempView("df")

result = spark.sql("""
    SELECT dept, month, SUM(amount) AS total_amount
    FROM df
    GROUP BY GROUPING SETS ((dept, month), ())
""")
```
This gives:
| dept  | month | total_amount |
|-------|-------|--------------|
| Sales | Jan   | 1000         |
| Sales | Feb   | 1200         |
| HR    | Jan   | 800          |
| HR    | Feb   | 900          |
| null  | null  | 3900         |

(The cube version additionally returns per-dept and per-month subtotal rows.)
You can filter or label the row where `dept` and `month` are null to show it's the grand total.
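For example, a minimal labeling pass over the `result` from Option 1:
```python
# Replace the grouping nulls with an explicit TOTAL label
labeled = (
    result
    .withColumn("dept", F.coalesce("dept", F.lit("TOTAL")))
    .withColumn("month", F.coalesce("month", F.lit("TOTAL")))
)
```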
✅ Option 2: Union two queries (flexible)
```python
from pyspark.sql import functions as F

grouped = df.groupBy("dept", "month").agg(F.sum("amount").alias("total_amount"))

# Cast the null literals so the union has an explicit, matching schema
total = (
    df.agg(F.sum("amount").alias("total_amount"))
      .withColumn("dept", F.lit(None).cast("string"))
      .withColumn("month", F.lit(None).cast("string"))
)

result = grouped.unionByName(total)
```
This also gives both:
- Aggregated by dept and month
- One row with overall total (null for dept/month)
✅ Option 3: Show totals with window
If you want both total and per-group values side by side, use window functions:
```python
from pyspark.sql.window import Window
from pyspark.sql import functions as F

w_all = Window.partitionBy()                        # one window over the whole DataFrame
w_dept_month = Window.partitionBy("dept", "month")  # one window per (dept, month) group

result = (
    df.withColumn("sum_over_all", F.sum("amount").over(w_all))
      .withColumn("sum_by_dept_month", F.sum("amount").over(w_dept_month))
      .select("dept", "month", "amount", "sum_by_dept_month", "sum_over_all")
)
```
This doesn’t reduce rows but adds columns for the required sums.
🧠 Choose Based On:
| Goal                                   | Method                        |
|----------------------------------------|-------------------------------|
| Reduce to summary rows                 | grouping sets, cube, or union |
| Add extra sum columns to original rows | Window functions              |
| Avoid nulls, get custom labels         | union with lit()              |
🔥 Advanced / Tricky GroupBy & Union Questions in PySpark
1. GroupBy with Multiple Aggregations and Totals at Different Levels
Q: How do you get the sum by `dept`, the sum by `month`, and the overall total, all in one DataFrame?
✅ Solution (with GROUPING SETS):
```python
df.createOrReplaceTempView("df")

spark.sql("""
    SELECT dept, month, SUM(amount) AS total_amount
    FROM df
    GROUP BY GROUPING SETS ((dept), (month), ())
""")
```
Output:
| dept  | month | total_amount |
|-------|-------|--------------|
| HR    | null  | 1700         |
| Sales | null  | 2200         |
| null  | Jan   | 1800         |
| null  | Feb   | 2100         |
| null  | null  | 3900         |
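If genuine nulls in the data could be confused with the subtotal rows, grouping_id() distinguishes the levels (extending the same query):
```python
spark.sql("""
    SELECT dept, month, SUM(amount) AS total_amount,
           grouping_id() AS gid  -- 1 = by dept, 2 = by month, 3 = grand total
    FROM df
    GROUP BY GROUPING SETS ((dept), (month), ())
""")
```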
2. Add Group Total as a Column for Each Row in the Group (Magic Total Column)
Q: For each row, add a column showing the total amount for the entire group (like dept or month).
✅ Solution:
```python
window_spec = Window.partitionBy("dept")
df.withColumn("group_total", F.sum("amount").over(window_spec))
```
3. Add a Row for Each Group with “Total” Label
Q: How do you add a “Total” row per department?
✅ Solution:
```python
grouped = (
    df.groupBy("dept")
      .agg(F.sum("amount").alias("amount"))
      .withColumn("month", F.lit("TOTAL"))
)
df_totals = df.select("dept", "month", "amount").unionByName(
    grouped.select("dept", "month", "amount")
)
```
This adds a `month="TOTAL"` row for each dept.
4. GroupBy with Conditional Aggregation (IF-THEN Style)
Q: How to group by department and calculate:
- total amount
- total only where `month == 'Jan'`
✅ Solution:
```python
df.groupBy("dept").agg(
    F.sum("amount").alias("total_amount"),
    F.sum(F.when(F.col("month") == "Jan", F.col("amount")).otherwise(0)).alias("jan_total"),
)
```
5. Rolling Sum or Cumulative Aggregation per Group
Q: Calculate a running total per department ordered by month.
✅ Solution:
```python
w = Window.partitionBy("dept").orderBy("month")
df.withColumn("running_total", F.sum("amount").over(w))
```
You may need to convert `month` to a sortable value first, since alphabetical order ("Feb" < "Jan") is not calendar order; see the sketch below.
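One minimal way, assuming three-letter month abbreviations as in the example data:
```python
# Map month names to calendar positions so orderBy sorts chronologically
month_num = F.when(F.col("month") == "Jan", 1).when(F.col("month") == "Feb", 2)
# ...extend the chain with the remaining months as needed

w = Window.partitionBy("dept").orderBy(month_num)
df.withColumn("running_total", F.sum("amount").over(w))
```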
6. Union Two Tables with Different Schema (Safe Union)
Q: You have two DataFrames with different columns. How to union them safely, filling missing columns with nulls?
✅ Solution:
```python
from pyspark.sql.functions import lit

def safe_union(df1, df2):
    # Add each missing column to the other DataFrame as nulls, then align by name
    cols1 = set(df1.columns)
    cols2 = set(df2.columns)
    for col in cols1 - cols2:
        df2 = df2.withColumn(col, lit(None))
    for col in cols2 - cols1:
        df1 = df1.withColumn(col, lit(None))
    return df1.select(sorted(df1.columns)).unionByName(df2.select(sorted(df2.columns)))

safe_union(df1, df2)
```
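On Spark 3.1+, unionByName handles this natively through its allowMissingColumns flag, so the helper is only needed on older versions:
```python
# Spark 3.1+: columns missing on either side are filled with nulls
result = df1.unionByName(df2, allowMissingColumns=True)
```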
7. GroupBy and Keep All Columns (non-agg columns)
Q: How do you perform a `groupBy` and still keep non-aggregated columns (like SQL's `ANY_VALUE()`)?
✅ Solution:
```python
df.groupBy("dept").agg(
    F.sum("amount").alias("total"),
    F.first("month").alias("sample_month"),
)
```
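On Spark 3.5+, F.any_value is the direct analogue of SQL's ANY_VALUE (assuming that version is available; on older clusters F.first serves the same purpose):
```python
# Spark 3.5+ only: any_value mirrors SQL's ANY_VALUE
df.groupBy("dept").agg(
    F.sum("amount").alias("total"),
    F.any_value("month").alias("sample_month"),
)
```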
8. Compare Monthly Amount with Dept Average (Flag High or Low)
Q: For each (dept, month), flag if amount is above dept average.
✅ Solution:
```python
w = Window.partitionBy("dept")
df.withColumn("dept_avg", F.avg("amount").over(w)) \
  .withColumn("is_above_avg", F.col("amount") > F.col("dept_avg"))
```
9. Explode List Column, Then GroupBy
Q: You have a column of type `Array[String]` (e.g., multiple tags). How do you get the total amount per tag?
✅ Solution:
```python
df = df.withColumn("tag", F.explode("tags"))
df.groupBy("tag").agg(F.sum("amount").alias("total_amount"))
```
10. Pivot with Aggregation (Convert Rows to Columns)
Q: How to pivot months as columns for each dept?
✅ Solution:
```python
df.groupBy("dept").pivot("month").agg(F.sum("amount"))
```
Output: One row per dept, months as columns.
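If the months are known up front, passing them to pivot skips the extra pass Spark otherwise runs to discover the distinct values:
```python
# Explicit pivot values (here the two months from the example data)
df.groupBy("dept").pivot("month", ["Jan", "Feb"]).agg(F.sum("amount"))
```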
✅ Want Even Trickier?
Here are more you can try:
| Question                                                                              | Trick Used                                           |
|---------------------------------------------------------------------------------------|------------------------------------------------------|
| Show only departments whose monthly total ever exceeded a threshold (see sketch below) | groupBy + agg + filter (the DataFrame form of HAVING) |
| Compute the average of the top 3 months per dept                                      | Window.rank() + filter()                             |
| Concatenate all months per dept into a single string                                  | collect_list() + concat_ws()                         |
| Add "subtotal" rows for dept but keep existing rows                                   | union + groupBy + labeling                           |
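For instance, the first one might look like this (a minimal sketch; the 1000 threshold is illustrative):
```python
from pyspark.sql import functions as F

# HAVING-style filter: aggregate first, then filter on the aggregate
monthly = df.groupBy("dept", "month").agg(F.sum("amount").alias("monthly_total"))
hot_depts = monthly.filter(F.col("monthly_total") > 1000).select("dept").distinct()
```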