Let's visualize how Spark schedules tasks when reading files (CSV, Parquet, or Hive tables), based on:
- File size
- File count
- Block size
- Partitions
- Cluster resources
⚙️ Step-by-Step: How Spark Schedules Tasks from Files
🔹 Step 1: Spark reads file metadata
When you call:
df = spark.read.csv("/data/large_file.csv")
- Spark queries the filesystem (HDFS/S3/DBFS) to get:
  - File sizes
  - Number of files
  - Block sizes (128 MB by default in HDFS)
- Spark splits large files into logical input splits.
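You can confirm how many input splits (and therefore read tasks) this produced. A minimal sketch, reusing the example path from above:

```python
# Read the file and check how many partitions (input splits) Spark created.
df = spark.read.csv("/data/large_file.csv")
print(df.rdd.getNumPartitions())  # e.g. 8 for a 1 GB file with 128 MB splits
```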
🔹 Step 2: Input Splits → Tasks
| File Size | Block Size | Input Splits | Resulting Tasks |
|---|---|---|---|
| 1 file, 1 GB | 128 MB | 8 | 8 tasks (Stage 0) |
| 10 files, 100 MB each | 128 MB | 10 | 10 tasks |
| 1 file, 2 GB | 256 MB | 8 | 8 tasks |
💡 Each input split = 1 Spark task in the first stage.
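The split size itself is tunable: for DataFrame file sources, Spark caps each split at `spark.sql.files.maxPartitionBytes` (default 128 MB), so changing it changes the task count of the read stage. A minimal sketch (the Parquet path is hypothetical):

```python
# Allow splits up to 256 MB, roughly halving the number of read tasks
# for large splittable files (default is 134217728 bytes = 128 MB).
spark.conf.set("spark.sql.files.maxPartitionBytes", str(256 * 1024 * 1024))
df = spark.read.parquet("/data/large_table")  # hypothetical path
```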
🔹 Step 3: Spark Schedules Tasks on Executors
- Spark sends 1 task per partition to an available executor core.
- If you have 4 executors × 4 cores each = 16 parallel task slots, but the stage has 32 splits, it runs in 2 waves (see the sketch after this list).
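The number of waves is simple arithmetic. A quick sketch of the calculation:

```python
import math

executors, cores_per_executor = 4, 4
slots = executors * cores_per_executor  # 16 tasks can run concurrently
splits = 32                             # tasks in the stage

waves = math.ceil(splits / slots)       # 32 / 16 -> 2 waves
print(waves)
```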
🧠 Example: 8 Input Splits on 2 Executors (4 cores each)
Stage 0:
[Task 0] → Executor 1 Core 1
[Task 1] → Executor 1 Core 2
[Task 2] → Executor 1 Core 3
[Task 3] → Executor 1 Core 4
[Task 4] → Executor 2 Core 1
[Task 5] → Executor 2 Core 2
[Task 6] → Executor 2 Core 3
[Task 7] → Executor 2 Core 4
Once all tasks finish, Stage 0 completes.
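Where do the "2 executors × 4 cores" come from? They are requested when the application starts. A minimal sketch, assuming a cluster manager (e.g., YARN or standalone) that honors these settings at session creation:

```python
from pyspark.sql import SparkSession

# Request 2 executors with 4 cores each = 8 concurrent task slots,
# matching the example above.
spark = (
    SparkSession.builder
    .appName("file-read-scheduling-demo")  # hypothetical app name
    .config("spark.executor.instances", "2")
    .config("spark.executor.cores", "4")
    .getOrCreate()
)
```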
🔁 Step 4: Shuffle Creates More Tasks
If you apply a wide transformation like groupBy() or join():

df.groupBy("col").count()

- Spark creates a shuffle stage
- Data is repartitioned into spark.sql.shuffle.partitions partitions (default = 200)
- Each shuffle partition → 1 task in the next stage
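The 200-task default is rarely right for every job, and it can be tuned per session. A minimal sketch:

```python
# Shrink the shuffle stage from 200 tasks to 64 (size this to your data).
spark.conf.set("spark.sql.shuffle.partitions", "64")

result = df.groupBy("col").count()  # the shuffle stage now has 64 tasks
result.show()                       # action that triggers the job
```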
📌 Key Concepts
| Concept | What It Does |
|---|---|
| Input Splits | Determines task count on file read |
| Partitions | Determines parallelism for processing |
| Tasks | Atomic units of work (scheduled on cores) |
| Executors/Cores | Where tasks are run |
| Shuffle | Rearranging data by key (e.g., joins) |
📊 Visual Diagram
CSV File (1 GB)
──────────────────────
| Split 0: 0–128 MB    | → Task 0 → Executor 1 Core 1
| Split 1: 128–256 MB  | → Task 1 → Executor 2 Core 2
| Split 2: 256–384 MB  | → Task 2 → Executor 1 Core 2
...
| Split 7: 896–1024 MB | → Task 7 → Executor 2 Core 4
✅ Real Spark UI Checkpoints
In the Spark UI → Stages tab:
- See the number of tasks per stage
- Open a stage's detail page to view each task's input size and records read
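If you are unsure where the UI is running, the driver exposes its URL. A small sketch:

```python
# The Spark UI (Jobs/Stages tabs) runs on the driver, port 4040 by default.
print(spark.sparkContext.uiWebUrl)  # e.g. http://driver-host:4040
```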
📋 Summary Table
| What Affects Tasks from a File Read | Affects Task Count? |
|---|---|
| File size | ✅ Yes |
| File count | ✅ Yes |
| File format (CSV, Parquet) | ✅ Yes |
| HDFS/S3 block size | ✅ Yes |
| repartition(n) | ✅ Yes (after read) |
| coalesce(n) | ✅ Yes (after read) |
| defaultParallelism | ❌ No (not directly) |
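The repartition(n) and coalesce(n) rows deserve a concrete example: both change the partition (and therefore task) count after the read, but in different ways. A minimal sketch:

```python
df = spark.read.csv("/data/large_file.csv")  # say this yields 8 partitions

wider = df.repartition(16)  # full shuffle: exactly 16 partitions, evenly sized
narrower = df.coalesce(2)   # no shuffle: merges existing partitions down to 2

print(wider.rdd.getNumPartitions())     # 16
print(narrower.rdd.getNumPartitions())  # 2
```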