How PySpark processes your pipeline – data size, partition count, shuffle impact, memory usage, time estimates, and executor configuration

Here’s a realistic Spark UI profiling walkthrough for your workload, based on your architecture and job profile (heavy pivot, groupBy, window, joins, and per-customer rollups across months):


πŸ” Spark UI Profiling Walkthrough (Real World)

We’ll walk through what you’re likely to see in the Spark UI and what to focus on when tuning.


💡 STAGE OVERVIEW (DAG Tab)

| Stage | Operation | Description | Shuffle? | # Tasks | Notes |
|---|---|---|---|---|---|
| 1 | Load & Filter | Read parquet for 12 months, filter to necessary columns | ❌ | ~600 | Fast stage (~20 sec); parallel read, pushdown filters, column pruning enabled |
| 2 | Pivot on month | .groupBy('cust_id').pivot('month') | ✅ | ~2000 | Wide stage: expensive shuffle, high CPU, causes memory pressure |
| 3 | Rolling aggregates (6M, 12M, Qtr) | .withColumn using Window() over date ranges | ✅ | ~1800 | Caching recommended before this stage; one of the heaviest stages |
| 4 | MDAB summaries per day → month | .groupBy('cust_id', 'month') | ✅ | ~1500 | Medium-heavy; needs partitioning on cust_id, otherwise causes skew |
| 5 | Join with dim tables + fact base | .join(...) to enrich final fact | ✅ | ~1200 | Tune with broadcast for smaller dims; avoid multiple shuffles |
| 6 | Final write | .repartition(100).write.mode("overwrite").parquet(...) | ❌ | 100 | Avoid a large number of files; control with coalesce/repartition |
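
For orientation, here’s a skeletal PySpark version of this six-stage shape. Every name below (paths, balance, dim_customer, the month format) is a placeholder to adapt, and passing an explicit month list to pivot() avoids the extra job Spark otherwise runs to collect distinct pivot values:

from pyspark.sql import functions as F
from pyspark.sql.functions import broadcast
from pyspark.sql.window import Window

months = [f"2024-{m:02d}" for m in range(1, 13)]   # assumption: 12 monthly snapshots

# Stage 1: pruned, filtered scan (path is a placeholder)
df = (spark.read.parquet("/data/fact/")
           .select("cust_id", "month", "balance")
           .filter(F.col("balance").isNotNull()))

# Stage 2: pivot; the explicit month list skips the distinct-values pass
pivoted = df.groupBy("cust_id").pivot("month", months).agg(F.sum("balance"))

# Stage 4 (shown before Stage 3 for simplicity): daily -> monthly MDAB summary
mdab = df.groupBy("cust_id", "month").agg(F.avg("balance").alias("mdab"))

# Stage 3: trailing 6-month rolling aggregate over the monthly grain
w6 = Window.partitionBy("cust_id").orderBy("month").rowsBetween(-5, 0)
rolled = mdab.withColumn("mdab_6m", F.avg("mdab").over(w6))

# Stage 5: enrich with a small dim via an explicit broadcast
final_df = (rolled.join(pivoted, "cust_id")
                  .join(broadcast(spark.table("dim_customer")), "cust_id"))

# Stage 6: controlled file count on write
final_df.repartition(100).write.mode("overwrite").parquet("/data/out/")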

πŸ” EVENT TIMELINE (Per Stage in Spark UI)

  • Executor CPU time spikes in the pivot and window stages
  • GC time spikes due to memory pressure while pivoting
  • Spill to disk visible in large shuffles from .groupBy or .join stages
  • Skewed tasks: some tasks take 5x–10x longer in the join stage due to key skew
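
When GC time dominates, executor sizing is usually the first lever. A minimal sketch with illustrative values only (nothing here is measured from the job above; tune per cluster):

from pyspark.sql import SparkSession

# Illustrative executor sizing; every value below is an assumption, not a recommendation.
spark = (
    SparkSession.builder
    .appName("customer_rollups")                      # hypothetical app name
    .config("spark.executor.memory", "8g")            # heap per executor
    .config("spark.executor.memoryOverhead", "2g")    # off-heap headroom for shuffle buffers
    .config("spark.executor.cores", "4")              # fewer cores per executor eases GC pressure
    .config("spark.sql.shuffle.partitions", "800")    # sized to keep shuffle blocks small
    .getOrCreate()
)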

📊 SQL / EXPLAIN INSIGHTS

final_df.explain(True)

You’ll see:

  • WholeStageCodegen is enabled (good for performance)
  • ShuffleExchange nodes for groupBy, pivot, and window operations
  • If the stage graph is deep, prefer caching intermediate results (see the sketch below)
  • A large logical plan (~30+ operators) may indicate an over-complex job; consider breaking it into separate pipelines
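
A minimal sketch of breaking a deep plan by materializing the pivoted intermediate; monthly_df and the aggregation are hypothetical stand-ins:

from pyspark.sql import functions as F

# Hypothetical intermediate: the customer-level pivot from Stage 2
pivoted_df = monthly_df.groupBy("cust_id").pivot("month").agg(F.sum("balance"))

pivoted_df = pivoted_df.cache()
pivoted_df.count()        # action to materialize the cache before the heavy window stage

pivoted_df.explain(True)  # the physical plan should now start from InMemoryTableScan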

🧠 SPARK UI METRICS TO WATCH

| Metric | Why it Matters | Thresholds |
|---|---|---|
| Task Deserialization | High values mean serialization overhead | Should be < 5% of total time |
| Executor GC Time | High values mean memory pressure | < 10% ideal; > 30% indicates pressure |
| Input/Shuffle Read | Flags shuffle-intensive stages | > 10GB per task indicates partitioning issues |
| Skewed Tasks | Some tasks take 5x longer | Indicates skew → use salting / bucketing |
| Failed/Retried Tasks | Retries point to bad shuffle/memory config | Use speculative execution |
| Spill (Memory/Disk) | Shuffle too large for memory | Try increasing shuffle partitions |
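
The same metrics can also be pulled programmatically from the driver’s monitoring REST API (default port 4040), which is handy for flagging spill and oversized shuffle reads across many stages at once. A sketch, assuming the documented StageData field names; verify them against your Spark version:

import requests

UI = "http://localhost:4040/api/v1"  # driver UI; adjust host/port for your cluster

app_id = requests.get(f"{UI}/applications").json()[0]["id"]
for stage in requests.get(f"{UI}/applications/{app_id}/stages").json():
    spill_gb = stage["diskBytesSpilled"] / 1e9
    shuffle_gb = stage["shuffleReadBytes"] / 1e9
    if spill_gb > 0 or shuffle_gb > 10:  # thresholds from the table above
        print(stage["stageId"], stage["name"],
              f"spill={spill_gb:.1f}GB", f"shuffleRead={shuffle_gb:.1f}GB")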

πŸ” Tips for Live Profiling

| What to Check | How | Expected Signal |
|---|---|---|
| Shuffle size per stage | Executors tab > Shuffle Read | > 5GB per task = split partitions |
| Skewed partition detection | Stage tab > Timeline | Long tails / one task way slower |
| Join strategy | SQL tab > Physical Plan | Prefer broadcast or shuffled hash join |
| Cache utilization | Storage tab | Look for actual in-memory usage |

💡 Suggestions Based on Spark UI Insights

  • Partitioning strategy: Repartition before join-heavy or pivot-heavy stages (use cust_id)
  • Broadcast small dim tables explicitly
  • Cache intermediate customer-level pivot tables before heavy window
  • Avoid nested complex UDFs before shuffle
  • Enable adaptive query execution so Spark can coalesce small partitions and split skewed joins at runtime:

spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
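
A short sketch of the repartition and broadcast suggestions (the caching pattern appears earlier); fact_df and dim_customer are placeholders:

from pyspark.sql.functions import broadcast

# Repartition on the key driving the pivot/join shuffles; 800 is an illustrative count
fact_df = fact_df.repartition(800, "cust_id")

# Broadcast the small dimension explicitly instead of trusting the planner's size estimate
enriched = fact_df.join(broadcast(dim_customer), "cust_id")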

Now let’s break this into three concrete parts to profile a slow Spark stage with skew, using both the Spark UI and the EXPLAIN plan. This will help you identify bottlenecks from the timeline, task distribution, and query plan analysis.


🧭 1. Timeline of a Skewed Stage: What You’d See in Spark UI

When there’s data skew, some tasks get much more data than others in the same stage. Here’s how it typically looks in the Spark UI timeline (Stages tab):

| Time (sec) | Task # | Task Duration (sec) | Shuffle Read | Notes |
|---|---|---|---|---|
| 0–2 | 0–80 | 2 | 100MB | Most tasks finish quickly |
| 3–10 | 81 | 7 | 1.5GB | Skewed partition; long-running |
| 10–25 | 82 | 15 | 2.2GB | Heavily skewed; very slow |
| 26–60+ | 83 | 35+ | 3.1GB | One task drags the stage |

✅ Symptoms in the Spark UI:

  • In Stage Details > Summary Metrics, one task will have high Shuffle Read and long Task Duration.
  • Executors tab will show under-utilization as many executors finish early and sit idle.
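
To confirm the skew at the data level rather than inferring it from task times, a quick key-frequency check is usually enough; cust_id is the assumed skew key:

from pyspark.sql import functions as F

# Top heavy keys: a handful of customers holding most of the rows confirms key skew
df.groupBy("cust_id").count().orderBy(F.desc("count")).show(10)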

📊 2. Spark SQL EXPLAIN Plan Output

Let’s say you have a heavy .groupBy("cust_id").pivot("month").agg(...) transformation in the pipeline.

Here’s a sample df.explain(True) output for your transformation-heavy pipeline:

== Physical Plan ==
*(5) Project [...]
+- *(5) BroadcastHashJoin [cust_id], [cust_id], Inner, BuildRight
   :- *(2) HashAggregate(keys=[cust_id#23], functions=[sum(...)])
   :  +- *(2) Exchange hashpartitioning(cust_id#23, 200)
   :     +- *(1) Filter [...]
   :        +- *(1) FileScan parquet [fields] 
   +- BroadcastExchange HashedRelationBroadcastMode, [cust_id]

🧠 Key things to watch:

| Section | Meaning |
|---|---|
| BroadcastExchange | Broadcasting a lookup/dim table (good for small data, ~10MB default threshold) |
| Exchange hashpartitioning | Shuffle triggered; the expensive step |
| HashAggregate | The groupBy logic |
| FileScan parquet | Source read (watch for many small files) |
| *(1), *(2) | Whole-stage codegen IDs (operators pipelined into one stage) |

If the data is skewed, the plan and its runtime stats will show:

  • One Exchange whose heaviest partitions read GBs more data than the rest
  • A fallback to SortMergeJoin when the build side is too large to broadcast
  • An imbalanced scan step if the files are not bucketed or partitioned properly
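
To verify which join strategy the planner actually picked, hint the broadcast and re-check the plan; dim_customer is a placeholder:

from pyspark.sql.functions import broadcast

# A broadcast hint skips the shuffle on the dim side; confirm it in the physical plan
fact_df.join(broadcast(dim_customer), "cust_id").explain()
# Expect BroadcastHashJoin in the output; SortMergeJoin means Spark could not broadcast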

πŸ” 3. How Tasks Progress Over Time (Visual Representation)

Here’s a typical task timeline sketch of a skewed job:

Executor Timeline (Stage X):

| E1 |███████████████|
| E2 |███████████████|
| E3 |███████████████|
| E4 |█████|
| E5 |█|
| E6 |█████████████████████████████████████████████████| ← Skewed task

Time → → →
  • 5 tasks finish in 10–20 seconds
  • 1 skewed task runs for 5 minutes
  • The whole stage waits on that one task while the other executors sit idle

You’ll find this under:
📌 Spark UI → Stages → Click on Stage ID → Tasks Timeline


🔧 Optional Fixes You Can Use

| Issue | Suggested Fix |
|---|---|
| Skew on cust_id | Use salting before groupBy or join (see the sketch below) |
| Shuffle too large | Increase spark.sql.shuffle.partitions, or repartition with .repartition() |
| Long executor GC time | Monitor GC stats and memory pressure |
| Slow .pivot() | Break into multiple .filter().groupBy() passes |
| Final write slowness | Use .coalesce() to reduce small files |
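
For the first fix, here’s a minimal two-phase salting sketch for a skewed aggregation; the salt width (8) and the column names are assumptions to adapt:

from pyspark.sql import functions as F

SALT_BUCKETS = 8  # assumption: spread each hot cust_id over 8 partial groups

# Phase 1: pre-aggregate on (cust_id, salt) so hot keys split across many tasks
salted = df.withColumn("salt", (F.rand() * SALT_BUCKETS).cast("int"))
partial = salted.groupBy("cust_id", "salt").agg(F.sum("amount").alias("partial_sum"))

# Phase 2: final aggregate on cust_id alone; each key now has at most 8 small rows
result = partial.groupBy("cust_id").agg(F.sum("partial_sum").alias("total_amount"))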

✅ Summary: What to Check in Spark UI

| Spark UI Tab | What to Check |
|---|---|
| Stages | Skewed partitions, task durations |
| Executors | GC time, memory usage |
| SQL | Query plan, duration per operator |
| Jobs | Overall pipeline timing |
| Storage | Whether cached intermediate data is actually in memory |
