Let’s visualize how Spark schedules tasks when reading files (CSV, Parquet, or Hive tables), based on:

  • File size
  • File count
  • Block size
  • Partitions
  • Cluster resources

βš™οΈ Step-by-Step: How Spark Schedules Tasks from Files


πŸ”Ή Step 1: Spark reads file metadata

When you call:

df = spark.read.csv("/data/large_file.csv")

  • Spark queries the filesystem (HDFS/S3/DBFS) for:
    • File sizes
    • Number of files
    • Block sizes (128 MB by default in HDFS)
  • Spark then splits large files into logical input splits (you can verify the resulting count, as shown below).
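A minimal sketch of that verification (the path and the resulting count are illustrative assumptions):

df = spark.read.csv("/data/large_file.csv")

# Each input split becomes one partition of the underlying RDD,
# so this equals the task count of the read stage.
print(df.rdd.getNumPartitions())   # e.g. 8 for a 1 GB file with 128 MB splits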

πŸ”Ή Step 2: Input Splits β†’ Tasks

| File Size | Block Size | Input Splits | Resulting Tasks |
|-----------|------------|--------------|-----------------|
| 1 file, 1 GB | 128 MB | 8 | 8 tasks (Stage 0) |
| 10 files, 100 MB each | 128 MB | 10 | 10 tasks |
| 1 file, 2 GB | 256 MB | 8 | 8 tasks |

πŸ’‘ Each input split = 1 Spark task in the first stage. (With the DataFrame reader, split size is governed by spark.sql.files.maxPartitionBytes, 128 MB by default, not just the HDFS block size.)
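To get larger or smaller splits at read time, you can tune that setting before reading; a sketch with an illustrative target size:

# Target ~256 MB per input split (the value here is an assumption)
spark.conf.set("spark.sql.files.maxPartitionBytes", str(256 * 1024 * 1024))

# A 2 GB file now yields roughly 2048 MB / 256 MB = 8 splits β†’ 8 tasks
df = spark.read.csv("/data/large_file.csv")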


πŸ”Ή Step 3: Spark Schedules Tasks on Executors

  • Spark schedules one task per partition onto an available executor core.
  • If you have:
    • 4 executors Γ— 4 cores each = 16 parallel task slots
    • but 32 splits β†’ the stage runs in 2 waves (see the sketch below)
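The wave count is simply a ceiling division of splits by available task slots; a minimal sketch of the arithmetic, using the assumed numbers above:

import math

num_executors, cores_per_executor = 4, 4
slots = num_executors * cores_per_executor   # 16 tasks can run at once
splits = 32                                  # tasks in the read stage

waves = math.ceil(splits / slots)            # β†’ 2 waves
print(waves)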

🧠 Example: 8 Input Splits on 2 Executors (4 cores each)

Stage 0:
[Task 0] β†’ Executor 1 Core 1
[Task 1] β†’ Executor 1 Core 2
[Task 2] β†’ Executor 1 Core 3
[Task 3] β†’ Executor 1 Core 4
[Task 4] β†’ Executor 2 Core 1
[Task 5] β†’ Executor 2 Core 2
[Task 6] β†’ Executor 2 Core 3
[Task 7] β†’ Executor 2 Core 4

Once all tasks finish, Stage 0 completes.


πŸ” Step 4: Shuffle Creates More Tasks

If you apply a wide transformation such as groupBy() or join():

df.groupBy("col").count()

  • Spark inserts a shuffle stage
  • The shuffled data is divided into spark.sql.shuffle.partitions partitions (default = 200)
  • Each shuffle partition β†’ 1 task in the next stage (see the sketch below)
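A sketch of how this setting drives the task count of the post-shuffle stage (note: Spark 3.x with adaptive query execution enabled may coalesce small shuffle partitions, so the observed count can be lower):

spark.conf.set("spark.sql.shuffle.partitions", "200")   # the default

# Stage 0: read tasks (one per input split)
# Stage 1: 200 tasks, one per shuffle partition
counts = df.groupBy("col").count()
counts.show()   # triggers the job so both stages actually run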

πŸ” Key Concepts

| Concept | What It Does |
|---------|--------------|
| Input Splits | Determines task count on file read |
| Partitions | Determines parallelism for processing |
| Tasks | Atomic units of work (scheduled on cores) |
| Executors/Cores | Where tasks run |
| Shuffle | Rearranges data by key (e.g., for joins) |

πŸ“Š Visual Diagram

          CSV File (1 GB)
        ─────────────────────
        | Split 0: 0–128MB   | β†’ Task 0 β†’ Executor 1 Core 1
        | Split 1: 128–256MB | β†’ Task 1 β†’ Executor 2 Core 2
        | Split 2: 256–384MB | β†’ Task 2 β†’ Executor 1 Core 2
        ...
        | Split 7: 896–1024MB| β†’ Task 7 β†’ Executor 2 Core 4

βœ… Real Spark UI Checkpoints

In the Spark UI β†’ Stages Tab:

  • See the number of tasks per stage
  • Click into a stage to see each task’s input size and records read (you can also fetch these numbers programmatically, as sketched below)
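A sketch using PySpark’s status tracker; which stages appear depends on what is running at the moment you call it:

tracker = spark.sparkContext.statusTracker()

# Print task counts for every stage that is currently active
for stage_id in tracker.getActiveStageIds():
    info = tracker.getStageInfo(stage_id)
    if info is not None:
        print(stage_id, info.name, info.numTasks, info.numCompletedTasks)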

πŸ”š Summary Table

| What Affects Tasks from a File | Affects Task Count? |
|--------------------------------|---------------------|
| File size | βœ… Yes |
| File count | βœ… Yes |
| File format (CSV, Parquet) | βœ… Yes |
| HDFS/S3 block size | βœ… Yes |
| repartition(n) | βœ… Yes (after read) |
| coalesce(n) | βœ… Yes (after read) |
| defaultParallelism | ❌ No (not directly) |
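Since repartition(n) and coalesce(n) only change the task count after the read, here is a short sketch (the path and numbers are illustrative):

df = spark.read.csv("/data/large_file.csv")   # e.g. 8 read tasks

wide = df.repartition(64)   # full shuffle β†’ downstream stages run 64 tasks
narrow = df.coalesce(4)     # merges partitions without a shuffle β†’ 4 tasks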
