Here’s a realistic Spark UI profiling walkthrough for your workload, based on your architecture and job profile (heavy pivot, groupBy, window, joins, rollups per customer across months):
Spark UI Profiling Walkthrough (Real World)
We’ll walk through what you’ll likely see in Spark UI and what to focus on for tuning.
STAGE OVERVIEW (DAG Tab)
Stage | Operation | Description | Shuffle? | # Tasks | Notes |
---|---|---|---|---|---|
1 | Load & Filter | Read parquet for 12 months, filter necessary columns | No | ~600 | Fast stage (~20 sec): parallel read, pushdown filters, column pruning enabled |
2 | Pivot on month | .groupBy('cust_id').pivot('month') | Yes | ~2000 | Wide stage: expensive shuffle, high CPU, causes memory pressure |
3 | Rolling Aggregates (6M, 12M, Qtr) | .withColumn using Window() over date ranges | Yes | ~1800 | Caching recommended before this stage. One of the heaviest stages. |
4 | MDAB Summaries per Day → Month | .groupBy('cust_id', 'month') | Yes | ~1500 | Medium-heavy. Needs partitioning on cust_id, otherwise causes skew. |
5 | Join with dim tables + fact base | .join(...) to enrich final fact | Yes | ~1200 | Tune with broadcast for smaller dims; avoid multiple shuffles |
6 | Final Write | .repartition(100).write.mode("overwrite").parquet(...) | Yes | 100 | Avoid a large number of output files; control with coalesce/repartition |
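For orientation, below is a minimal PySpark sketch of the kind of pipeline that produces these six stages. Paths, column names, and the window width are hypothetical placeholders, not your actual job:

```python
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.appName("customer_rollups").getOrCreate()

# Stage 1: read 12 months of parquet with column pruning and filter pushdown
tx = (spark.read.parquet("s3://bucket/transactions/")            # hypothetical path
        .select("cust_id", "month", "txn_date", "amount")         # column pruning
        .filter(F.col("month").between("2024-01", "2024-12")))

# Stage 2: pivot per customer across months (wide shuffle)
pivoted = tx.groupBy("cust_id").pivot("month").agg(F.sum("amount"))

# Stage 3: rolling aggregate over an ordered window per customer
# (rowsBetween(-5, 0) is a simplification standing in for a 6-month range)
w6 = Window.partitionBy("cust_id").orderBy("txn_date").rowsBetween(-5, 0)
rolling = tx.withColumn("rolling_6m_amt", F.sum("amount").over(w6))

# Stage 4: monthly summaries per customer
monthly = tx.groupBy("cust_id", "month").agg(F.avg("amount").alias("mdab"))

# Stage 5: enrich with a small dimension table, broadcast to avoid a shuffle
dim = spark.read.parquet("s3://bucket/dim_customer/")             # hypothetical path
fact = monthly.join(F.broadcast(dim), "cust_id", "left")

# Stage 6: control the output file count on the final write
fact.repartition(100).write.mode("overwrite").parquet("s3://bucket/output/")
```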
EVENT TIMELINE (Per Stage in Spark UI)
- Executor CPU Time Spike in pivot and window stages
- GC time spikes due to memory pressure when pivoting
- Spill to Disk visible in large shuffles from .groupBy or .join stages
- Skewed tasks: some tasks taking 5x–10x more time in the join stage due to key skew
SQL / EXPLAIN INSIGHTS
final_df.explain(True)
You’ll see:
- WholeStageCodegen is enabled (good for performance)
- ShuffleExchange for groupBy, pivot, window
- If the stage graph is deep, prefer caching intermediate results
- Large logical plan (~30+ operators) may indicate an over-complex job; consider breaking it into pipelines
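As a hedged illustration of the last two bullets, the snippet below inspects the plan, caches one intermediate result, and (for very deep plans) truncates lineage with a checkpoint. It reuses the hypothetical pivoted DataFrame from the earlier sketch; the checkpoint directory is a placeholder:

```python
# Inspect the full plan: parsed, analyzed, optimized, and physical
final_df.explain(True)

# Materialize an intermediate result so downstream stages read from the cache
# instead of recomputing the whole lineage
pivoted.cache()
pivoted.count()            # action that fills the cache before the window stage

# For very deep plans, checkpointing cuts the lineage entirely
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")    # placeholder path
pivoted_ck = pivoted.checkpoint()
```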
SPARK UI METRICS TO WATCH
Metric | Why it Matters | Thresholds |
---|---|---|
Task Deserialization Time | High values indicate serialization overhead | Should be < 5% of total task time |
Executor GC Time | High values indicate memory pressure | < 10% is ideal; > 30% indicates pressure |
Input / Shuffle Read | Flags shuffle-intensive stages | > 10GB per task indicates partitioning issues |
Skewed Tasks | Some tasks take 5x longer than the rest | Indicates skew; use salting / bucketing |
Failed / Retried Tasks | Retries point to bad shuffle or memory config | Consider speculative execution |
Spill (Memory/Disk) | Shuffle too large for memory | Try increasing shuffle partitions |
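Two rows in the table map directly to configuration. A minimal sketch with illustrative values only; tune them against your own shuffle sizes:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("customer_rollups")
         # Failed/retried tasks and stragglers: re-launch slow attempts speculatively
         .config("spark.speculation", "true")
         # Spill and oversized shuffle reads: raise shuffle parallelism so each
         # task handles a smaller slice of the shuffle
         .config("spark.sql.shuffle.partitions", "800")
         .getOrCreate())
```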
Tips for Live Profiling
What to Check | How | Expected Signal |
---|---|---|
Shuffle Size per Stage | Executors tab > Shuffle Read | > 5GB per task means split into more partitions |
Skewed Partition Detection | Stage tab > Timeline | Long tails / one task way slower |
Join Strategy | SQL Tab > Physical Plan | Prefer broadcast or shuffled hash join |
Cache Utilization | Storage Tab | Look for actual in-memory usage |
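A couple of quick programmatic checks complement those tabs. This sketch assumes the hypothetical fact and pivoted DataFrames from the earlier example:

```python
# Join strategy: the physical plan prints BroadcastHashJoin vs SortMergeJoin,
# matching what the SQL tab shows graphically
fact.explain(mode="formatted")

# Cache utilization: confirm the DataFrame is actually marked for caching;
# the Storage tab then shows how much of it is resident in memory vs on disk
pivoted.cache()
print(pivoted.is_cached)        # True once cache() has been called
print(pivoted.storageLevel)     # e.g. Disk Memory Deserialized 1x Replicated
```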
Suggestions Based on Spark UI Insights
- Partitioning strategy: repartition on cust_id before join-heavy or pivot-heavy stages
- Broadcast small dim tables explicitly
- Cache intermediate customer-level pivot tables before the heavy window stage
- Avoid nested complex UDFs before shuffle
- Enable:
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
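Putting these suggestions together, a hedged sketch; the partition count of 400 is illustrative, and the names carry over from the earlier hypothetical pipeline:

```python
from pyspark.sql import functions as F

# Repartition on the join/pivot key so each customer's rows are co-located
tx_by_cust = tx.repartition(400, "cust_id")

# Pivot once, and cache the customer-level result before the heavy window stage
pivoted = tx_by_cust.groupBy("cust_id").pivot("month").agg(F.sum("amount")).cache()

# Broadcast the small dimension explicitly instead of relying on the optimizer
enriched = pivoted.join(F.broadcast(dim), "cust_id", "left")
```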
Great, let’s break this into three concrete parts to profile a slow Spark stage with skew, using both Spark UI and EXPLAIN PLAN. This will help you identify bottlenecks from the timeline, task distribution, and query plan analysis.
1. Timeline of a Skewed Stage: What You’d See in Spark UI
When there’s data skew, some tasks get much more data than others in the same stage. Here’s how it typically looks in the Spark UI timeline (Stages tab):
Time (sec) | Task # | Task Duration (sec) | Shuffle Read | Notes |
---|---|---|---|---|
0–2 | 0–80 | 2 | 100MB | Most tasks finish quickly |
3–10 | 81 | 7 | 1.5GB | Skewed partition; long-running |
10–25 | 82 | 15 | 2.2GB | Heavily skewed; very slow |
26–60+ | 83 | 35+ | 3.1GB | One task drags the stage |
Symptoms in Spark UI
- In Stage Details > Summary Metrics, one task will have high Shuffle Read and long Task Duration.
- The Executors tab will show under-utilization as many executors finish early and sit idle.
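One quick way to confirm what the UI is hinting at is to inspect the row count per key. A minimal sketch against the hypothetical tx DataFrame from earlier:

```python
from pyspark.sql import functions as F

# If one or two cust_id values own a large share of the rows, that explains
# the handful of tasks with 1.5-3GB shuffle reads
(tx.groupBy("cust_id")
   .count()
   .orderBy(F.desc("count"))
   .show(10))
```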
2. Spark SQL EXPLAIN Plan Output
Let’s say you have a heavy .groupBy("cust_id").agg(...) followed by .pivot(). Here’s a sample df.explain(True) output for your transformation-heavy pipeline:
== Physical Plan ==
*(5) Project [...]
+- *(5) BroadcastHashJoin [cust_id], [cust_id], Inner, BuildRight
:- *(2) HashAggregate(keys=[cust_id#23], functions=[sum(...)])
: +- *(2) Exchange hashpartitioning(cust_id#23, 200)
: +- *(1) Filter [...]
: +- *(1) FileScan parquet [fields]
+- BroadcastExchange HashedRelationBroadcastMode, [cust_id]
Key things to watch:
Section | Meaning |
---|---|
BroadcastExchange | Broadcasting lookup/dim table (good for <10MB data) |
Exchange hashpartitioning | Shuffle triggered; expensive step |
HashAggregate | GroupBy logic |
FileScan parquet | Source read (watch if many small files) |
*(1), *(2) | WholeStageCodegen IDs: operators sharing a number are fused into one pipelined stage |
If skewed, Spark will have:
- One Exchange that reads GBs more data than the others
- No SortMergeJoin optimization if the broadcast fails
- A Scan step that shows imbalance if files are not bucketed or partitioned properly
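If the dimension side is small but the plan still falls back to a shuffle join, you can raise the broadcast threshold or hint the join explicitly. A hedged sketch; the 50MB value is illustrative:

```python
from pyspark.sql import functions as F

# Raise the auto-broadcast threshold (default is 10MB) so a slightly larger
# dim table still broadcasts instead of shuffling
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 50 * 1024 * 1024)

# Or force the strategy with an explicit hint and verify it in the plan
joined = fact.join(F.broadcast(dim), "cust_id")
joined.explain()    # expect BroadcastHashJoin rather than SortMergeJoin
```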
3. How Tasks Progress Over Time (Visual Representation)
Here’s a typical task timeline sketch of a skewed job:
Executor Timeline (Stage X):
| E1 |===============|
| E2 |===============|
| E3 |===============|
| E4 |=====|
| E5 |=|
| E6 |=================================================| <- Skewed task
Time ---------------------------------------------------->
- 5 tasks finish in 10–20 seconds
- 1 skewed task runs for 5 minutes
- This delays the entire stage, even though the other executors sit idle
You’ll find this under:
Spark UI → Stages → Click on Stage ID → Tasks Timeline
Optional Fixes You Can Use
Issue | Suggested Fix |
---|---|
Skew on cust_id | Use salting before groupBy or join |
Shuffle too large | Increase spark.sql.shuffle.partitions, or repartition with .repartition() |
Long executor GC time | Monitor GC stats and memory pressure |
Slow .pivot() | Break into multiple .filter().groupBy() passes |
Final write slowness | Use .coalesce() to reduce small files |
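For the first row of the table, a minimal salting sketch; the salt width of 8 is illustrative and should reflect how concentrated the hot keys are:

```python
from pyspark.sql import functions as F

SALT = 8

# Spread each hot cust_id across SALT sub-keys
salted = tx.withColumn("salt", (F.rand() * SALT).cast("int"))

# Pass 1: partial aggregates per (cust_id, salt) so no single task owns a hot key
partial = salted.groupBy("cust_id", "salt").agg(F.sum("amount").alias("part_sum"))

# Pass 2: combine the partials per cust_id (far fewer rows per key now)
totals = partial.groupBy("cust_id").agg(F.sum("part_sum").alias("total_amount"))
```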
Summary: What to Check in Spark UI
Spark UI Tab | What to Check |
---|---|
Stages | Skewed partitions, task durations |
Executors | GC time, memory usage |
SQL Tab | Query Plan, duration per operator |
Jobs | Overall pipeline timing |
Storage | If you’re caching intermediate data |