PySpark Coding Practice Questions

You can calculate both the total sum and the monthly sum per department in the same PySpark job using grouping sets, by unioning two DataFrames (one for the grand total, one for per-month-per-dept), or by adding the totals as extra columns with window functions. Here’s how each approach works:


✅ Example DataFrame Schema

Assume a DataFrame df like:

dept  | month | amount
------|-------|-------
Sales | Jan   | 1000
Sales | Feb   | 1200
HR    | Jan   | 800
HR    | Feb   | 900

✅ Option 1: Use grouping sets (single query)

from pyspark.sql import functions as F

# rollup emits three grouping levels: (dept, month), (dept), and ()
# keep the detail rows and the grand total, dropping the per-dept subtotals
# (this filter assumes dept and month themselves are never null)
result = (
    df.rollup("dept", "month")
      .agg(F.sum("amount").alias("total_amount"))
      .filter(F.col("month").isNotNull() | F.col("dept").isNull())
)

That works, but it needs an extra filter to discard the per-dept subtotal level that rollup also emits. A cleaner approach is GROUPING SETS, expressed through Spark SQL:

# GROUPING SETS isn't exposed on the DataFrame API in most Spark versions,
# so register a temp view and use SQL
df.createOrReplaceTempView("sales")

result = spark.sql("""
    SELECT dept, month, SUM(amount) AS total_amount
    FROM sales
    GROUP BY GROUPING SETS ((dept, month), ())
""")

This gives:

dept  | month | total_amount
------|-------|-------------
Sales | Jan   | 1000
Sales | Feb   | 1200
HR    | Jan   | 800
HR    | Feb   | 900
null  | null  | 3900

You can filter or label the row where dept and month are null to show it’s the grand total.
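
For example, a minimal sketch of labeling that row (assuming dept and month contain no genuine nulls in the source data):

from pyspark.sql import functions as F

# the grand-total row is the one where the grouping columns are null
labeled = (
    result
    .withColumn("dept", F.coalesce(F.col("dept"), F.lit("ALL")))
    .withColumn("month", F.coalesce(F.col("month"), F.lit("ALL")))
)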


✅ Option 2: Union two queries (flexible)

from pyspark.sql import functions as F

grouped = df.groupBy("dept", "month").agg(F.sum("amount").alias("total_amount"))

# cast the null literals so their types match grouped's columns
total = (
    df.agg(F.sum("amount").alias("total_amount"))
      .withColumn("dept", F.lit(None).cast("string"))
      .withColumn("month", F.lit(None).cast("string"))
)

result = grouped.unionByName(total)

This also gives both:

  • Aggregated by dept and month
  • One row with overall total (null for dept/month)

✅ Option 3: Show totals with window

If you want both total and per-group values side by side, use window functions:

from pyspark.sql.window import Window
from pyspark.sql import functions as F

w_all = Window.partitionBy()
w_dept_month = Window.partitionBy("dept", "month")

result = df.withColumn("sum_over_all", F.sum("amount").over(w_all)) \
           .withColumn("sum_by_dept_month", F.sum("amount").over(w_dept_month)) \
           .select("dept", "month", "amount", "sum_by_dept_month", "sum_over_all")

This doesn’t reduce rows; it adds columns carrying the required sums. Note that an empty partitionBy() moves every row into a single partition, which can be a bottleneck on large data.


🧠 Choose Based On:

Goal                                   | Method
---------------------------------------|-------
Reduce to summary rows                 | GROUPING SETS, cube, or union
Add extra sum columns to original rows | window functions
Avoid nulls, get custom labels         | union with lit()


🔥 Advanced / Tricky GroupBy & Union Questions in PySpark


1. GroupBy with Multiple Aggregations and Totals at Different Levels

Q: How do you get sum by dept, sum by month, and the overall total, all in one DataFrame?

✅ Solution (with grouping sets):

df.createOrReplaceTempView("sales")

spark.sql("""
    SELECT dept, month, SUM(amount) AS total_amount
    FROM sales
    GROUP BY GROUPING SETS ((dept), (month), ())
""")

Output:

dept  | month | total_amount
------|-------|-------------
HR    | null  | 1700
Sales | null  | 2200
null  | Jan   | 1800
null  | Feb   | 2100
null  | null  | 3900

2. Add Group Total as a Column for Each Row in the Group (Magic Total Column)

Q: For each row, add a column showing the total amount for the entire group (like dept or month).

✅ Solution:

from pyspark.sql.window import Window
from pyspark.sql import functions as F

# every row in a dept gets that dept's total alongside its own amount
window_spec = Window.partitionBy("dept")
df.withColumn("group_total", F.sum("amount").over(window_spec))

3. Add a Row for Each Group with “Total” Label

Q: How do you add a “Total” row per department?

✅ Solution:

grouped = (
    df.groupBy("dept")
      .agg(F.sum("amount").alias("amount"))
      .withColumn("month", F.lit("TOTAL"))
)

df_totals = df.select("dept", "month", "amount") \
              .unionByName(grouped.select("dept", "month", "amount"))

This adds a month="TOTAL" row for each dept.
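
If you also want each TOTAL row to appear directly after its department's months, one sketch is an explicit sort key (the is_total column is introduced here just for ordering):

from pyspark.sql import functions as F

# force TOTAL rows to sort after the ordinary month rows within each dept
df_sorted = (
    df_totals
    .withColumn("is_total", (F.col("month") == "TOTAL").cast("int"))
    .orderBy("dept", "is_total", "month")
    .drop("is_total")
)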


4. GroupBy with Conditional Aggregation (IF-THEN Style)

Q: How to group by department and calculate:

  • total amount
  • total for month = ‘Jan’ only

✅ Solution:

df.groupBy("dept").agg(
    F.sum("amount").alias("total_amount"),
    F.sum(F.when(F.col("month") == "Jan", F.col("amount")).otherwise(0)).alias("jan_total")
)
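
With the sample data above this yields:

dept  | total_amount | jan_total
------|--------------|----------
Sales | 2200         | 1000
HR    | 1700         | 800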

5. Rolling Sum or Cumulative Aggregation per Group

Q: Calculate a running total per department ordered by month.

✅ Solution:

w = Window.partitionBy("dept").orderBy("month")
df.withColumn("running_total", F.sum("amount").over(w))

You may need to convert month to a numeric ordinal first: string months sort alphabetically (Feb before Jan), so orderBy("month") won’t give chronological order.
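
A minimal sketch of that conversion, assuming English three-letter month abbreviations (built as a chained CASE WHEN so no date parsing is needed):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# map "Jan".."Dec" to 1..12 so the window sorts chronologically
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
month_num = F.lit(None)
for i, name in enumerate(months, start=1):
    month_num = F.when(F.col("month") == name, i).otherwise(month_num)

w = Window.partitionBy("dept").orderBy("month_num")
result = (
    df.withColumn("month_num", month_num)
      .withColumn("running_total", F.sum("amount").over(w))
)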


6. Union Two Tables with Different Schema (Safe Union)

Q: You have two DataFrames with different columns. How to union them safely, filling missing columns with nulls?

✅ Solution:

from pyspark.sql.functions import lit

def safe_union(df1, df2):
    """Union two DataFrames, filling columns missing on either side with nulls."""
    cols1 = set(df1.columns)
    cols2 = set(df2.columns)
    # add each side's missing columns as null literals
    for col in cols1 - cols2:
        df2 = df2.withColumn(col, lit(None))
    for col in cols2 - cols1:
        df1 = df1.withColumn(col, lit(None))
    # unionByName matches by name; the sorted select just fixes the output order
    return df1.select(sorted(df1.columns)).unionByName(df2.select(sorted(df2.columns)))

safe_union(df1, df2)
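
A quick usage sketch, assuming the usual spark session (the two tiny frames are made up for illustration):

df1 = spark.createDataFrame([(1, "a")], ["id", "name"])
df2 = spark.createDataFrame([(2, 9.5)], ["id", "score"])

# expected: two rows over columns id, name, score,
# with nulls where a column was missing on one side
safe_union(df1, df2).show()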

7. GroupBy and Keep All Columns (non-agg columns)

Q: How do you perform a groupBy and still keep all non-aggregated columns (like SQL’s ANY_VALUE())?

✅ Solution:

df.groupBy("dept").agg(
    F.sum("amount").alias("total"),
    F.first("month").alias("sample_month")
)
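
If you’re on Spark 3.5+, F.any_value makes the intent explicit:

from pyspark.sql import functions as F

# ANY_VALUE picks an arbitrary value per group, same idea as F.first
df.groupBy("dept").agg(
    F.sum("amount").alias("total"),
    F.any_value("month").alias("sample_month"),
)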

8. Compare Monthly Amount with Dept Average (Flag High or Low)

Q: For each (dept, month), flag if amount is above dept average.

✅ Solution:

w = Window.partitionBy("dept")
df.withColumn("dept_avg", F.avg("amount").over(w)) \
  .withColumn("is_above_avg", F.col("amount") > F.col("dept_avg"))

9. Explode List Column, Then GroupBy

Q: You have a column of type Array[String] (e.g., multiple tags). How do you get total amount per tag?

✅ Solution:

df = df.withColumn("tag", F.explode("tags"))
df.groupBy("tag").agg(F.sum("amount").alias("total_amount"))

10. Pivot with Aggregation (Convert Rows to Columns)

Q: How to pivot months as columns for each dept?

✅ Solution:

df.groupBy("dept").pivot("month").agg(F.sum("amount"))

Output: One row per dept, months as columns.
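
Passing the pivot values explicitly skips the extra pass Spark otherwise makes to discover them:

# listing the expected months up front avoids a distinct scan of the column
df.groupBy("dept").pivot("month", ["Jan", "Feb"]).agg(F.sum("amount"))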


✅ Want Even Trickier?

Here are more you can try:

Question                                                                  | Trick Used
--------------------------------------------------------------------------|-----------
Show only those departments whose monthly total ever exceeded a threshold | groupBy + filter after agg (SQL HAVING)
Compute the average of the top 3 months per dept (sketched below)         | rank()/row_number() over a window + filter()
Concatenate all months per dept into a single string                      | collect_list() + concat_ws()
Add “subtotal” rows for dept but keep existing rows                       | union + groupBy + labeling
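
As a worked sketch of the second one, averaging each department's top 3 months by amount (row_number caps each dept at exactly three rows even when amounts tie):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# rank months within each dept, highest amount first
w = Window.partitionBy("dept").orderBy(F.col("amount").desc())

top3_avg = (
    df.withColumn("rnk", F.row_number().over(w))
      .filter(F.col("rnk") <= 3)
      .groupBy("dept")
      .agg(F.avg("amount").alias("avg_top3"))
)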
