How PySpark processes your pipeline – data size, partition count, shuffle impact, memory usage, time estimates, and executor configuration

Here’s a realistic Spark UI profiling walkthrough for your workload, based on your architecture and job profile (heavy pivot, groupBy, window, joins, and per-customer rollups across months):


πŸ” Spark UI Profiling Walkthrough (Real World)

We’ll walk through what you’re likely to see in the Spark UI and what to focus on when tuning.


💡 STAGE OVERVIEW (DAG Tab)

| Stage | Operation | Description | Shuffle? | # Tasks | Notes |
|---|---|---|---|---|---|
| 1 | Load & Filter | Read parquet for 12 months, filter to necessary columns | ❌ | ~600 | Fast stage (~20 sec); parallel read, pushdown filters, column pruning enabled |
| 2 | Pivot on month | .groupBy('cust_id').pivot('month') | ✅ | ~2000 | Wide stage: expensive shuffle, high CPU, causes memory pressure |
| 3 | Rolling aggregates (6M, 12M, Qtr) | .withColumn using Window() over date ranges | ✅ | ~1800 | Caching recommended before this stage; one of the heaviest stages |
| 4 | MDAB summaries per day → month | .groupBy('cust_id', 'month') | ✅ | ~1500 | Medium-heavy; needs partitioning on cust_id, otherwise causes skew |
| 5 | Join with dim tables + fact base | .join(...) to enrich final fact | ✅ | ~1200 | Tune with broadcast for smaller dims; avoid multiple shuffles |
| 6 | Final write | .repartition(100).write.mode("overwrite").parquet(...) | ❌ | 100 | Avoid a large number of files; control with coalesce/repartition |
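
For orientation, here’s a skeletal PySpark version of this six-stage shape. Every name below (paths, balance, dim_customer, the month format) is a placeholder to adapt, and passing an explicit month list to pivot() avoids the extra job Spark otherwise runs to collect distinct pivot values:

from pyspark.sql import functions as F
from pyspark.sql.functions import broadcast
from pyspark.sql.window import Window

months = [f"2024-{m:02d}" for m in range(1, 13)]   # assumption: 12 monthly snapshots

# Stage 1: pruned, filtered scan (path is a placeholder)
df = (spark.read.parquet("/data/fact/")
           .select("cust_id", "month", "balance")
           .filter(F.col("balance").isNotNull()))

# Stage 2: pivot; the explicit month list skips the distinct-values pass
pivoted = df.groupBy("cust_id").pivot("month", months).agg(F.sum("balance"))

# Stage 4 (shown before Stage 3 for simplicity): daily -> monthly MDAB summary
mdab = df.groupBy("cust_id", "month").agg(F.avg("balance").alias("mdab"))

# Stage 3: trailing 6-month rolling aggregate over the monthly grain
w6 = Window.partitionBy("cust_id").orderBy("month").rowsBetween(-5, 0)
rolled = mdab.withColumn("mdab_6m", F.avg("mdab").over(w6))

# Stage 5: enrich with a small dim via an explicit broadcast
final_df = (rolled.join(pivoted, "cust_id")
                  .join(broadcast(spark.table("dim_customer")), "cust_id"))

# Stage 6: controlled file count on write
final_df.repartition(100).write.mode("overwrite").parquet("/data/out/")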

πŸ” EVENT TIMELINE (Per Stage in Spark UI)

  • Executor CPU time spikes in the pivot and window stages
  • GC time spikes due to memory pressure while pivoting
  • Spill to disk visible in large shuffles from .groupBy or .join stages
  • Skewed tasks: some tasks take 5x–10x longer in the join stage due to key skew
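
When GC time dominates, executor sizing is usually the first lever. A minimal sketch with illustrative values only (nothing here is measured from the job above; tune per cluster):

from pyspark.sql import SparkSession

# Illustrative executor sizing; every value below is an assumption, not a recommendation.
spark = (
    SparkSession.builder
    .appName("customer_rollups")                      # hypothetical app name
    .config("spark.executor.memory", "8g")            # heap per executor
    .config("spark.executor.memoryOverhead", "2g")    # off-heap headroom for shuffle buffers
    .config("spark.executor.cores", "4")              # fewer cores per executor eases GC pressure
    .config("spark.sql.shuffle.partitions", "800")    # sized to keep shuffle blocks small
    .getOrCreate()
)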

📊 SQL / EXPLAIN INSIGHTS

final_df.explain(True)

You’ll see:

  • WholeStageCodegen is enabled (good for performance)
  • ShuffleExchange nodes for groupBy, pivot, and window operations
  • If the stage graph is deep, prefer caching intermediate results (see the sketch below)
  • A large logical plan (~30+ operators) may indicate an over-complex job; consider breaking it into separate pipelines
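
A minimal sketch of breaking a deep plan by materializing the pivoted intermediate; monthly_df and the aggregation are hypothetical stand-ins:

from pyspark.sql import functions as F

# Hypothetical intermediate: the customer-level pivot from Stage 2
pivoted_df = monthly_df.groupBy("cust_id").pivot("month").agg(F.sum("balance"))

pivoted_df = pivoted_df.cache()
pivoted_df.count()        # action to materialize the cache before the heavy window stage

pivoted_df.explain(True)  # the physical plan should now start from InMemoryTableScan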

🧠 SPARK UI METRICS TO WATCH

| Metric | Why it Matters | Thresholds |
|---|---|---|
| Task Deserialization | High values mean serialization overhead | Should be < 5% of total time |
| Executor GC Time | High values mean memory pressure | < 10% ideal; > 30% indicates pressure |
| Input/Shuffle Read | Flags shuffle-intensive stages | > 10GB per task indicates partitioning issues |
| Skewed Tasks | Some tasks take 5x longer | Indicates skew → use salting / bucketing |
| Failed/Retried Tasks | Retries point to bad shuffle/memory config | Use speculative execution |
| Spill (Memory/Disk) | Shuffle too large for memory | Try increasing shuffle partitions |
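
The same metrics can also be pulled programmatically from the driver’s monitoring REST API (default port 4040), which is handy for flagging spill and oversized shuffle reads across many stages at once. A sketch, assuming the documented StageData field names; verify them against your Spark version:

import requests

UI = "http://localhost:4040/api/v1"  # driver UI; adjust host/port for your cluster

app_id = requests.get(f"{UI}/applications").json()[0]["id"]
for stage in requests.get(f"{UI}/applications/{app_id}/stages").json():
    spill_gb = stage["diskBytesSpilled"] / 1e9
    shuffle_gb = stage["shuffleReadBytes"] / 1e9
    if spill_gb > 0 or shuffle_gb > 10:  # thresholds from the table above
        print(stage["stageId"], stage["name"],
              f"spill={spill_gb:.1f}GB", f"shuffleRead={shuffle_gb:.1f}GB")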

πŸ” Tips for Live Profiling

| What to Check | How | Expected Signal |
|---|---|---|
| Shuffle size per stage | Executors tab > Shuffle Read | > 5GB per task = split partitions |
| Skewed partition detection | Stage tab > Timeline | Long tails / one task way slower |
| Join strategy | SQL tab > Physical Plan | Prefer broadcast or shuffled hash join |
| Cache utilization | Storage tab | Look for actual in-memory usage |

💡 Suggestions Based on Spark UI Insights

  • Partitioning strategy: Repartition before join-heavy or pivot-heavy stages (use cust_id)
  • Broadcast small dim tables explicitly
  • Cache intermediate customer-level pivot tables before heavy window
  • Avoid nested complex UDFs before shuffle
  • Enable adaptive query execution so Spark can coalesce small partitions and split skewed joins at runtime:

spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
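
A short sketch of the repartition and broadcast suggestions (the caching pattern appears earlier); fact_df and dim_customer are placeholders:

from pyspark.sql.functions import broadcast

# Repartition on the key driving the pivot/join shuffles; 800 is an illustrative count
fact_df = fact_df.repartition(800, "cust_id")

# Broadcast the small dimension explicitly instead of trusting the planner's size estimate
enriched = fact_df.join(broadcast(dim_customer), "cust_id")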

Now let’s break this into three concrete parts to profile a slow Spark stage with skew, using both the Spark UI and the EXPLAIN plan. This will help you identify bottlenecks from the timeline, task distribution, and query plan analysis.


🧭 1. Timeline of a Skewed Stage: What You’d See in Spark UI

When there’s data skew, some tasks get much more data than others in the same stage. Here’s how it typically looks in the Spark UI timeline (Stages tab):

| Time (sec) | Task # | Task Duration (sec) | Shuffle Read | Notes |
|---|---|---|---|---|
| 0–2 | 0–80 | 2 | 100MB | Most tasks finish quickly |
| 3–10 | 81 | 7 | 1.5GB | Skewed partition; long-running |
| 10–25 | 82 | 15 | 2.2GB | Heavily skewed; very slow |
| 26–60+ | 83 | 35+ | 3.1GB | One task drags the stage |

✅ Symptoms in the Spark UI:

  • In Stage Details > Summary Metrics, one task will have high Shuffle Read and long Task Duration.
  • Executors tab will show under-utilization as many executors finish early and sit idle.
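
To confirm the skew at the data level rather than inferring it from task times, a quick key-frequency check is usually enough; cust_id is the assumed skew key:

from pyspark.sql import functions as F

# Top heavy keys: a handful of customers holding most of the rows confirms key skew
df.groupBy("cust_id").count().orderBy(F.desc("count")).show(10)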

📊 2. Spark SQL EXPLAIN Plan Output

Let’s say you have a heavy .groupBy("cust_id").pivot("month").agg(...) transformation in the pipeline.

Here’s a sample df.explain(True) output for your transformation-heavy pipeline:

== Physical Plan ==
*(5) Project [...]
+- *(5) BroadcastHashJoin [cust_id], [cust_id], Inner, BuildRight
   :- *(2) HashAggregate(keys=[cust_id#23], functions=[sum(...)])
   :  +- *(2) Exchange hashpartitioning(cust_id#23, 200)
   :     +- *(1) Filter [...]
   :        +- *(1) FileScan parquet [fields] 
   +- BroadcastExchange HashedRelationBroadcastMode, [cust_id]

🧠 Key things to watch:

| Section | Meaning |
|---|---|
| BroadcastExchange | Broadcasting a lookup/dim table (good for small data, ~10MB default threshold) |
| Exchange hashpartitioning | Shuffle triggered; the expensive step |
| HashAggregate | The groupBy logic |
| FileScan parquet | Source read (watch for many small files) |
| *(1), *(2) | Whole-stage codegen IDs (operators pipelined into one stage) |

If the data is skewed, the plan and its runtime stats will show:

  • One Exchange whose heaviest partitions read GBs more data than the rest
  • A fallback to SortMergeJoin when the build side is too large to broadcast
  • An imbalanced scan step if the files are not bucketed or partitioned properly
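
To verify which join strategy the planner actually picked, hint the broadcast and re-check the plan; dim_customer is a placeholder:

from pyspark.sql.functions import broadcast

# A broadcast hint skips the shuffle on the dim side; confirm it in the physical plan
fact_df.join(broadcast(dim_customer), "cust_id").explain()
# Expect BroadcastHashJoin in the output; SortMergeJoin means Spark could not broadcast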

πŸ” 3. How Tasks Progress Over Time (Visual Representation)

Here’s a typical task timeline sketch of a skewed job:

Executor Timeline (Stage X):

| E1 |███████████████|
| E2 |███████████████|
| E3 |███████████████|
| E4 |█████|
| E5 |█|
| E6 |█████████████████████████████████████████████████| ← Skewed task

Time → → →
  • 5 tasks finish in 10–20 seconds
  • 1 skewed task runs for 5 minutes
  • The whole stage waits on that one task while the other executors sit idle

You’ll find this under:
📌 Spark UI → Stages → Click on Stage ID → Tasks Timeline


🔧 Optional Fixes You Can Use

| Issue | Suggested Fix |
|---|---|
| Skew on cust_id | Use salting before groupBy or join (see the sketch below) |
| Shuffle too large | Increase spark.sql.shuffle.partitions, or repartition with .repartition() |
| Long executor GC time | Monitor GC stats and memory pressure |
| Slow .pivot() | Break into multiple .filter().groupBy() passes |
| Final write slowness | Use .coalesce() to reduce small files |
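
For the first fix, here’s a minimal two-phase salting sketch for a skewed aggregation; the salt width (8) and the column names are assumptions to adapt:

from pyspark.sql import functions as F

SALT_BUCKETS = 8  # assumption: spread each hot cust_id over 8 partial groups

# Phase 1: pre-aggregate on (cust_id, salt) so hot keys split across many tasks
salted = df.withColumn("salt", (F.rand() * SALT_BUCKETS).cast("int"))
partial = salted.groupBy("cust_id", "salt").agg(F.sum("amount").alias("partial_sum"))

# Phase 2: final aggregate on cust_id alone; each key now has at most 8 small rows
result = partial.groupBy("cust_id").agg(F.sum("partial_sum").alias("total_amount"))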

✅ Summary: What to Check in Spark UI

| Spark UI Tab | What to Check |
|---|---|
| Stages | Skewed partitions, task durations |
| Executors | GC time, memory usage |
| SQL | Query plan, duration per operator |
| Jobs | Overall pipeline timing |
| Storage | Whether cached intermediate data is actually in memory |
