You can calculate both the total sum and the monthly sum per department in the same PySpark DataFrame query using groupBy with grouping sets, or by unioning two DataFrames (one for total, one for per-month-per-dept). Here’s how to do both:
✅ Example DataFrame Schema
Assume a DataFrame df like:
| dept | month | amount |
|---|---|---|
| Sales | Jan | 1000 |
| Sales | Feb | 1200 |
| HR | Jan | 800 |
| HR | Feb | 900 |
✅ Option 1: Use grouping sets (single query)
from pyspark.sql import functions as F

result = (
    df.cube("dept", "month")                                   # all grouping combinations
      .agg(F.sum("amount").alias("total_amount"),
           F.grouping_id().alias("gid"))                       # identifies the grouping level
      .where(F.col("gid").isin(0, 3))                          # 0 = (dept, month), 3 = grand total
      .drop("gid")
)
That works, but filtering on grouping_id is indirect. A cleaner approach is to use GROUPING SETS through Spark SQL:
df.createOrReplaceTempView("df")   # spark is the active SparkSession

result = spark.sql("""
    SELECT dept, month, SUM(amount) AS total_amount
    FROM df
    GROUP BY GROUPING SETS ((dept, month), ())
""")
This gives:
| dept | month | total_amount |
|---|---|---|
| Sales | Jan | 1000 |
| Sales | Feb | 1200 |
| HR | Jan | 800 |
| HR | Feb | 900 |
| null | null | 3900 |
You can filter on, or relabel, the row where dept and month are null to mark it as the grand total.
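For example, a minimal sketch of labeling that row (it builds on the result DataFrame above; the "TOTAL" label is just an illustrative choice):
labeled = (
    result.withColumn("dept", F.coalesce(F.col("dept"), F.lit("TOTAL")))     # replace null dept
          .withColumn("month", F.coalesce(F.col("month"), F.lit("TOTAL")))   # replace null month
)

# Or keep only the grand-total row:
grand_total = result.where(F.col("dept").isNull() & F.col("month").isNull())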
✅ Option 2: Union two queries (flexible)
from pyspark.sql import functions as F

grouped = df.groupBy("dept", "month").agg(F.sum("amount").alias("total_amount"))

total = (
    df.agg(F.sum("amount").alias("total_amount"))
      .withColumn("dept", F.lit(None).cast("string"))    # null placeholders, typed to match
      .withColumn("month", F.lit(None).cast("string"))
)

result = grouped.unionByName(total)
This also gives both:
- Aggregated by dept and month
- One row with overall total (null for dept/month)
✅ Option 3: Show totals with window
If you want both total and per-group values side by side, use window functions:
from pyspark.sql.window import Window
from pyspark.sql import functions as F

w_all = Window.partitionBy()                         # one global window over all rows
w_dept_month = Window.partitionBy("dept", "month")

result = (
    df.withColumn("sum_over_all", F.sum("amount").over(w_all))
      .withColumn("sum_by_dept_month", F.sum("amount").over(w_dept_month))
      .select("dept", "month", "amount", "sum_by_dept_month", "sum_over_all")
)
This doesn’t reduce rows but adds columns for the required sums. Note that an unpartitioned window moves all rows to a single partition, so it can be expensive on large data.
🧠 Choose Based On:
| Goal | Method |
|---|---|
| Reduce to summary rows | grouping sets, cube, or union |
| Add extra sum columns to original rows | Window functions |
| Avoid nulls, get custom labels | union with lit() (see the sketch after this table) |
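A minimal sketch of the union-with-labels approach, reusing df and the F import from above (the "ALL" label is just an illustrative choice):
per_dept_month = df.groupBy("dept", "month").agg(F.sum("amount").alias("total_amount"))

grand_total = (
    df.agg(F.sum("amount").alias("total_amount"))
      .withColumn("dept", F.lit("ALL"))       # custom label instead of null
      .withColumn("month", F.lit("ALL"))
)

labeled_result = per_dept_month.unionByName(grand_total)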
🔥 Advanced / Tricky GroupBy & Union Questions in PySpark
1. GroupBy with Multiple Aggregations and Totals at Different Levels
Q: How do you get sum by dept, sum by month, and the overall total, all in one DataFrame?
✅ Solution (with grouping sets):
df.createOrReplaceTempView("df")

spark.sql("""
    SELECT dept, month, SUM(amount) AS total_amount
    FROM df
    GROUP BY GROUPING SETS ((dept), (month), ())
""")
Output:
| dept | month | total_amount |
|---|---|---|
| HR | null | 1700 |
| Sales | null | 2200 |
| null | Jan | 1800 |
| null | Feb | 2100 |
| null | null | 3900 |
2. Add Group Total as a Column for Each Row in the Group (Magic Total Column)
Q: For each row, add a column showing the total amount for the entire group (like dept or month).
✅ Solution:
window_spec = Window.partitionBy("dept")
df.withColumn("group_total", F.sum("amount").over(window_spec))
3. Add a Row for Each Group with “Total” Label
Q: How do you add a “Total” row per department?
✅ Solution:
grouped = (
    df.groupBy("dept")
      .agg(F.sum("amount").alias("amount"))
      .withColumn("month", F.lit("TOTAL"))      # label the per-dept total rows
)

df_totals = df.select("dept", "month", "amount").unionByName(
    grouped.select("dept", "month", "amount")
)
This adds a month = "TOTAL" row for each department.
4. GroupBy with Conditional Aggregation (IF-THEN Style)
Q: How to group by department and calculate:
- total amount
- total only for months = ‘Jan’
✅ Solution:
df.groupBy("dept").agg(
    F.sum("amount").alias("total_amount"),
    F.sum(F.when(F.col("month") == "Jan", F.col("amount")).otherwise(0)).alias("jan_total")
)
5. Rolling Sum or Cumulative Aggregation per Group
Q: Calculate a running total per department ordered by month.
✅ Solution:
w = Window.partitionBy("dept").orderBy("month")
df.withColumn("running_total", F.sum("amount").over(w))
You may need to convert month to a numeric or date value first, since ordering by the month name sorts alphabetically (putting “Feb” before “Jan”); one way to do this is sketched below.
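A minimal sketch, assuming month holds three-letter English abbreviations and spark is the active SparkSession (month_lookup is an illustrative helper, not part of the original example):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
month_lookup = spark.createDataFrame(
    [(m, i + 1) for i, m in enumerate(months)], ["month", "month_num"]
)

w = Window.partitionBy("dept").orderBy("month_num")

running = (
    df.join(month_lookup, "month")                         # attach a sortable month number
      .withColumn("running_total", F.sum("amount").over(w))
)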
6. Union Two Tables with Different Schema (Safe Union)
Q: You have two DataFrames with different columns. How to union them safely, filling missing columns with nulls?
✅ Solution:
from pyspark.sql.functions import lit

def safe_union(df1, df2):
    cols1 = set(df1.columns)
    cols2 = set(df2.columns)
    for col in cols1 - cols2:              # columns only in df1: add as nulls to df2
        df2 = df2.withColumn(col, lit(None))
    for col in cols2 - cols1:              # columns only in df2: add as nulls to df1
        df1 = df1.withColumn(col, lit(None))
    return df1.select(sorted(df1.columns)).unionByName(df2.select(sorted(df2.columns)))
safe_union(df1, df2)
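On Spark 3.1 and later, unionByName can do this directly:
# Spark 3.1+: columns missing on either side are filled with nulls
result = df1.unionByName(df2, allowMissingColumns=True)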
7. GroupBy and Keep All Columns (non-agg columns)
Q: How do you perform a groupBy and still keep all non-aggregated columns (like SQL’s ANY_VALUE())?
✅ Solution:
df.groupBy("dept").agg(
    F.sum("amount").alias("total"),
    F.first("month").alias("sample_month")   # an arbitrary value from the group, like ANY_VALUE()
)
8. Compare Monthly Amount with Dept Average (Flag High or Low)
Q: For each (dept, month), flag if amount is above dept average.
✅ Solution:
w = Window.partitionBy("dept")
df.withColumn("dept_avg", F.avg("amount").over(w)) \
  .withColumn("is_above_avg", F.col("amount") > F.col("dept_avg"))
9. Explode List Column, Then GroupBy
Q: You have a column of type Array[String] (e.g., multiple tags). How do you get total amount per tag?
✅ Solution:
df = df.withColumn("tag", F.explode("tags"))
df.groupBy("tag").agg(F.sum("amount").alias("total_amount"))
10. Pivot with Aggregation (Convert Rows to Columns)
Q: How to pivot months as columns for each dept?
✅ Solution:
df.groupBy("dept").pivot("month").agg(F.sum("amount"))
Output: One row per dept, months as columns.
✅ Want Even Trickier?
Here are more you can try; rough sketches for the first two follow the table:
| Question | Trick Used |
|---|---|
| Show only those departments whose monthly total ever exceeded a threshold | groupBy + agg + filter (HAVING in SQL) |
| Compute average of top 3 months per dept | Window.rank() + filter() |
| Concatenate all months per dept into single string | collect_list() + concat_ws() |
| Add “subtotal” rows for dept but keep existing rows | union + groupBy + labeling |
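Minimal sketches for the first two rows, reusing df, F, and Window from above (the threshold value of 1000 is an illustrative assumption; row_number is used instead of rank to avoid tie duplicates):
# Departments whose monthly total ever exceeded a threshold (HAVING-style filter after agg)
threshold = 1000
over_threshold = (
    df.groupBy("dept", "month")
      .agg(F.sum("amount").alias("monthly_total"))
      .where(F.col("monthly_total") > threshold)
      .select("dept")
      .distinct()
)

# Average of the top 3 months per dept: rank months within each dept, keep the top 3, re-aggregate
w = Window.partitionBy("dept").orderBy(F.col("monthly_total").desc())
top3_avg = (
    df.groupBy("dept", "month")
      .agg(F.sum("amount").alias("monthly_total"))
      .withColumn("rnk", F.row_number().over(w))
      .where(F.col("rnk") <= 3)
      .groupBy("dept")
      .agg(F.avg("monthly_total").alias("avg_top3_months"))
)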