You can calculate both the total sum and the monthly sum per department in the same PySpark query using grouping sets, or by unioning two DataFrames (one for the total, one for per-month-per-dept). Here's how to do both:
✅ Example DataFrame Schema
Assume a DataFrame `df` like:
| dept  | month | amount |
|-------|-------|--------|
| Sales | Jan   | 1000   |
| Sales | Feb   | 1200   |
| HR    | Jan   | 800    |
| HR    | Feb   | 900    |
✅ Option 1: Use grouping sets (single query)
```python
from pyspark.sql import functions as F

# cube() aggregates at every combination of the grouping columns,
# including the all-null grand-total row
result = (
    df.cube("dept", "month")
      .agg(F.sum("amount").alias("total_amount"))
)
```
But cube also emits per-dept and per-month subtotal rows you may not need. For exactly the two levels you want (per dept-month plus the grand total), use GROUPING SETS through Spark SQL, since the DataFrame API has no groupByExpr method:
```python
df.createOrReplaceTempView("df")

result = spark.sql("""
    SELECT dept, month, SUM(amount) AS total_amount
    FROM df
    GROUP BY GROUPING SETS ((dept, month), ())
""")
```
This gives:
| dept  | month | total_amount |
|-------|-------|--------------|
| Sales | Jan   | 1000         |
| Sales | Feb   | 1200         |
| HR    | Jan   | 800          |
| HR    | Feb   | 900          |
| null  | null  | 3900         |

(The cube version additionally returns per-dept and per-month subtotal rows.)
You can filter or label the row where `dept` and `month` are null to show it's the grand total.
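For example, a minimal labeling pass over the `result` from Option 1:
```python
# Replace the grouping nulls with an explicit TOTAL label
labeled = (
    result
    .withColumn("dept", F.coalesce("dept", F.lit("TOTAL")))
    .withColumn("month", F.coalesce("month", F.lit("TOTAL")))
)
```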
✅ Option 2: Union two queries (flexible)
```python
from pyspark.sql import functions as F

grouped = df.groupBy("dept", "month").agg(F.sum("amount").alias("total_amount"))

# Cast the null literals so the union has an explicit, matching schema
total = (
    df.agg(F.sum("amount").alias("total_amount"))
      .withColumn("dept", F.lit(None).cast("string"))
      .withColumn("month", F.lit(None).cast("string"))
)

result = grouped.unionByName(total)
```
This also gives both:
- Aggregated by dept and month
- One row with overall total (null for dept/month)
✅ Option 3: Show totals with window
If you want both total and per-group values side by side, use window functions:
```python
from pyspark.sql.window import Window
from pyspark.sql import functions as F

w_all = Window.partitionBy()                        # one window over the whole DataFrame
w_dept_month = Window.partitionBy("dept", "month")  # one window per (dept, month) group

result = (
    df.withColumn("sum_over_all", F.sum("amount").over(w_all))
      .withColumn("sum_by_dept_month", F.sum("amount").over(w_dept_month))
      .select("dept", "month", "amount", "sum_by_dept_month", "sum_over_all")
)
```
This doesn’t reduce rows but adds columns for the required sums.
🧠 Choose Based On:
| Goal                                   | Method                        |
|----------------------------------------|-------------------------------|
| Reduce to summary rows                 | grouping sets, cube, or union |
| Add extra sum columns to original rows | Window functions              |
| Avoid nulls, get custom labels         | union with lit()              |
🔥 Advanced / Tricky GroupBy & Union Questions in PySpark
1. GroupBy with Multiple Aggregations and Totals at Different Levels
Q: How do you get the sum by `dept`, the sum by `month`, and the overall total, all in one DataFrame?
✅ Solution (with GROUPING SETS):
```python
df.createOrReplaceTempView("df")

spark.sql("""
    SELECT dept, month, SUM(amount) AS total_amount
    FROM df
    GROUP BY GROUPING SETS ((dept), (month), ())
""")
```
Output:
| dept  | month | total_amount |
|-------|-------|--------------|
| HR    | null  | 1700         |
| Sales | null  | 2200         |
| null  | Jan   | 1800         |
| null  | Feb   | 2100         |
| null  | null  | 3900         |
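If genuine nulls in the data could be confused with the subtotal rows, grouping_id() distinguishes the levels (extending the same query):
```python
spark.sql("""
    SELECT dept, month, SUM(amount) AS total_amount,
           grouping_id() AS gid  -- 1 = by dept, 2 = by month, 3 = grand total
    FROM df
    GROUP BY GROUPING SETS ((dept), (month), ())
""")
```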
2. Add Group Total as a Column for Each Row in the Group (Magic Total Column)
Q: For each row, add a column showing the total amount for the entire group (like dept or month).
✅ Solution:
```python
window_spec = Window.partitionBy("dept")
df.withColumn("group_total", F.sum("amount").over(window_spec))
```
3. Add a Row for Each Group with “Total” Label
Q: How do you add a “Total” row per department?
✅ Solution:
```python
grouped = (
    df.groupBy("dept")
      .agg(F.sum("amount").alias("amount"))
      .withColumn("month", F.lit("TOTAL"))
)
df_totals = df.select("dept", "month", "amount").unionByName(
    grouped.select("dept", "month", "amount")
)
```
This adds a `month="TOTAL"` row for each dept.
4. GroupBy with Conditional Aggregation (IF-THEN Style)
Q: How to group by department and calculate:
- total amount
- total only where `month == 'Jan'`
✅ Solution:
```python
df.groupBy("dept").agg(
    F.sum("amount").alias("total_amount"),
    F.sum(F.when(F.col("month") == "Jan", F.col("amount")).otherwise(0)).alias("jan_total"),
)
```
5. Rolling Sum or Cumulative Aggregation per Group
Q: Calculate a running total per department ordered by month.
✅ Solution:
```python
w = Window.partitionBy("dept").orderBy("month")
df.withColumn("running_total", F.sum("amount").over(w))
```
You may need to convert `month` to a sortable value first, since alphabetical order ("Feb" < "Jan") is not calendar order; see the sketch below.
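One minimal way, assuming three-letter month abbreviations as in the example data:
```python
# Map month names to calendar positions so orderBy sorts chronologically
month_num = F.when(F.col("month") == "Jan", 1).when(F.col("month") == "Feb", 2)
# ...extend the chain with the remaining months as needed

w = Window.partitionBy("dept").orderBy(month_num)
df.withColumn("running_total", F.sum("amount").over(w))
```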
6. Union Two Tables with Different Schema (Safe Union)
Q: You have two DataFrames with different columns. How to union them safely, filling missing columns with nulls?
✅ Solution:
```python
from pyspark.sql.functions import lit

def safe_union(df1, df2):
    # Add each missing column to the other DataFrame as nulls, then align by name
    cols1 = set(df1.columns)
    cols2 = set(df2.columns)
    for col in cols1 - cols2:
        df2 = df2.withColumn(col, lit(None))
    for col in cols2 - cols1:
        df1 = df1.withColumn(col, lit(None))
    return df1.select(sorted(df1.columns)).unionByName(df2.select(sorted(df2.columns)))

safe_union(df1, df2)
```
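On Spark 3.1+, unionByName handles this natively through its allowMissingColumns flag, so the helper is only needed on older versions:
```python
# Spark 3.1+: columns missing on either side are filled with nulls
result = df1.unionByName(df2, allowMissingColumns=True)
```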
7. GroupBy and Keep All Columns (non-agg columns)
Q: How do you perform a `groupBy` and still keep non-aggregated columns (like SQL's `ANY_VALUE()`)?
✅ Solution:
```python
df.groupBy("dept").agg(
    F.sum("amount").alias("total"),
    F.first("month").alias("sample_month"),
)
```
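On Spark 3.5+, F.any_value is the direct analogue of SQL's ANY_VALUE (assuming that version is available; on older clusters F.first serves the same purpose):
```python
# Spark 3.5+ only: any_value mirrors SQL's ANY_VALUE
df.groupBy("dept").agg(
    F.sum("amount").alias("total"),
    F.any_value("month").alias("sample_month"),
)
```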
8. Compare Monthly Amount with Dept Average (Flag High or Low)
Q: For each (dept, month), flag if amount is above dept average.
✅ Solution:
```python
w = Window.partitionBy("dept")
df.withColumn("dept_avg", F.avg("amount").over(w)) \
  .withColumn("is_above_avg", F.col("amount") > F.col("dept_avg"))
```
9. Explode List Column, Then GroupBy
Q: You have a column of type `Array[String]` (e.g., multiple tags). How do you get the total amount per tag?
✅ Solution:
```python
df = df.withColumn("tag", F.explode("tags"))
df.groupBy("tag").agg(F.sum("amount").alias("total_amount"))
```
10. Pivot with Aggregation (Convert Rows to Columns)
Q: How to pivot months as columns for each dept?
✅ Solution:
```python
df.groupBy("dept").pivot("month").agg(F.sum("amount"))
```
Output: One row per dept, months as columns.
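If the months are known up front, passing them to pivot skips the extra pass Spark otherwise runs to discover the distinct values:
```python
# Explicit pivot values (here the two months from the example data)
df.groupBy("dept").pivot("month", ["Jan", "Feb"]).agg(F.sum("amount"))
```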
✅ Want Even Trickier?
Here are more you can try:
| Question                                                                              | Trick Used                                           |
|---------------------------------------------------------------------------------------|------------------------------------------------------|
| Show only departments whose monthly total ever exceeded a threshold (see sketch below) | groupBy + agg + filter (the DataFrame form of HAVING) |
| Compute the average of the top 3 months per dept                                      | Window.rank() + filter()                             |
| Concatenate all months per dept into a single string                                  | collect_list() + concat_ws()                         |
| Add "subtotal" rows for dept but keep existing rows                                   | union + groupBy + labeling                           |
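For instance, the first one might look like this (a minimal sketch; the 1000 threshold is illustrative):
```python
from pyspark.sql import functions as F

# HAVING-style filter: aggregate first, then filter on the aggregate
monthly = df.groupBy("dept", "month").agg(F.sum("amount").alias("monthly_total"))
hot_depts = monthly.filter(F.col("monthly_total") > 1000).select("dept").distinct()
```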