Job execution in PySpark: a step-by-step explanation of how and what happens
Understanding how a PySpark job is executed involves grasping the internal mechanisms of Spark’s architecture and its execution flow. Here’s a step-by-step explanation of what happens when you run a PySpark job:
Step 1: Initialization
- Creating SparkSession:
- The entry point for a Spark application is creating a `SparkSession`. This session connects to a Spark cluster and sets up the necessary configurations.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Example App") \
    .master("local[*]") \
    .getOrCreate()
```
- Initializing SparkContext:
- Internally, `SparkSession` creates a `SparkContext`, which represents the connection to the cluster and initializes the Spark execution environment; a quick way to access it is shown in the sketch below.
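As a small illustration of the relationship between the two, the following sketch (assuming the `spark` session created above) accesses the underlying `SparkContext` and reads a few of its basic properties:

```python
# Assumes the `spark` SparkSession created in the previous snippet.
sc = spark.sparkContext           # the SparkContext wrapped by the SparkSession

print(sc.appName)                 # "Example App"
print(sc.master)                  # "local[*]"
print(sc.defaultParallelism)      # default parallelism for RDD operations
```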
Step 2: Data Loading
- Reading Data:
- Data is loaded into DataFrames or RDDs. Spark abstracts the data source and provides a unified interface for various formats (CSV, JSON, Parquet, etc.).
```python
df = spark.read.csv("path/to/file.csv", header=True, inferSchema=True)
```
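The same `spark.read` interface covers the other formats mentioned above; here is a brief sketch (the file paths are hypothetical):

```python
# Hypothetical paths; the reader API is the same regardless of format.
json_df = spark.read.json("path/to/file.json")
parquet_df = spark.read.parquet("path/to/file.parquet")

# The explicit format()/option()/load() form is equivalent for CSV.
csv_df = (
    spark.read.format("csv")
    .option("header", True)
    .option("inferSchema", True)
    .load("path/to/file.csv")
)
```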
Step 3: Transformations
- Defining Transformations:
- Transformations are operations on DataFrames/RDDs that define a new dataset from an existing one. These operations are lazy, meaning they build up a lineage of transformations but are not executed immediately.
```python
filtered_df = df.filter(df["age"] > 30)
selected_df = filtered_df.select("name", "age")
```
Step 4: Actions
- Triggering Actions:
- Actions are operations that trigger the execution of the transformations and return results to the driver program or write data to an external storage system. Examples include `show()`, `collect()`, `count()`, `write`, etc.

```python
selected_df.show()
```
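For illustration, two more of the actions mentioned above, using the `selected_df` defined earlier:

```python
# count() triggers a job and returns a single number to the driver.
num_rows = selected_df.count()

# take(n) is a lighter-weight alternative to collect(): it returns
# only the first n rows to the driver.
first_five = selected_df.take(5)
```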
Step 5: Job Execution
- Logical Plan:
- When an action is called, Spark builds a logical plan. This plan describes the series of transformations that need to be applied to the data.
- Optimization:
- Spark’s Catalyst optimizer takes the logical plan and optimizes it. This involves applying various optimization rules, such as predicate pushdown, constant folding, and projection pruning.
- Physical Plan:
- The optimized logical plan is converted into a physical plan, which describes the actual execution steps. This plan includes details on how the data will be read, processed, and written.
- DAG Creation:
- Spark constructs a Directed Acyclic Graph (DAG) of stages and tasks based on the physical plan. Each stage consists of tasks that can be executed in parallel.
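You can inspect these plans yourself with `explain()`; with `extended=True`, Spark prints the parsed and analyzed logical plans, the Catalyst-optimized logical plan, and the physical plan, without running the query:

```python
# Prints the parsed/analyzed logical plans, the optimized logical plan,
# and the physical plan; no job is executed by explain().
selected_df.explain(extended=True)

# The default form shows only the physical plan.
selected_df.explain()
```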
Step 6: Task Scheduling
- Task Scheduling:
- The DAG scheduler divides the job into stages based on data shuffling. Each stage is a set of tasks that can be executed concurrently. Tasks are the smallest unit of work in Spark, and each task is assigned to a partition of the data.
- Task Execution:
- The task scheduler sends tasks to the executor nodes. Executors are worker processes responsible for executing the tasks and returning the results to the driver.
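The number of tasks in a stage follows the number of partitions of the data. A quick way to see the relevant numbers, assuming the DataFrames defined earlier:

```python
# One task is launched per partition of the stage's data.
print(df.rdd.getNumPartitions())   # partitions (and thus tasks) for the scan stage

# Number of partitions used after a shuffle (the config default is 200).
print(spark.conf.get("spark.sql.shuffle.partitions"))
```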
Step 7: Execution on Executors
- Task Execution:
- Each executor runs the tasks on its partitions of the data. If a task involves reading data, the executor reads the data from the storage system (e.g., HDFS, S3, local file system).
- Shuffle Operations:
- For operations requiring data to be redistributed across nodes (e.g., join, groupBy), Spark performs a shuffle. This involves sending data over the network from one executor to another.
- In-Memory Computation:
- Executors perform computations in memory whenever possible to speed up processing. Intermediate results may be written to disk if they do not fit in memory.
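As a rough sketch of these two points, using the `df` loaded earlier: a `groupBy` aggregation forces a shuffle, and `cache()` asks Spark to keep a DataFrame in executor memory for reuse:

```python
# groupBy needs all rows with the same key on the same executor,
# so Spark inserts a shuffle (an Exchange) before the aggregation.
counts_by_age = df.groupBy("age").count()
counts_by_age.explain()   # the physical plan shows the Exchange step

# cache() marks the DataFrame for in-memory storage; it is materialized
# the first time an action runs and reused by subsequent actions.
df.cache()
df.count()                # triggers the actual caching
```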
Step 8: Collecting Results
- Result Collection:
- Once all tasks in the final stage complete, the results are sent back to the driver program. For actions like `collect()`, the results are collected in the driver’s memory. For actions like `write`, the results are saved to the specified storage system.

```python
result = selected_df.collect()  # Collects the results to the driver
```
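For write-style actions, a minimal sketch (output paths are hypothetical). Note that executors write their partitions directly to the target storage, so the data never has to fit in the driver’s memory:

```python
# Each executor writes its own partition files to the target path.
selected_df.write.mode("overwrite").parquet("path/to/output_parquet/")

# Writing CSV works the same way through the DataFrameWriter API.
selected_df.write.mode("overwrite").csv("path/to/output_csv/", header=True)
```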
Step 9: Post-Execution
- Resource Cleanup:
- After the job completes, the resources allocated for the job (executors, memory, etc.) are released. This is automatically managed by Spark.
- Stopping SparkSession:
- It is good practice to stop the `SparkSession` to release the resources explicitly.

```python
spark.stop()
```
Summary of Execution Flow
- Initialization: Create a `SparkSession` and `SparkContext`.
- Data Loading: Load data into DataFrames/RDDs.
- Transformations: Define transformations (lazy operations).
- Actions: Trigger actions to start execution.
- Job Execution: Spark constructs logical plans, optimizes them, creates physical plans, and builds a DAG.
- Task Scheduling: Tasks are scheduled and sent to executors.
- Execution on Executors: Executors run tasks, perform computations, and handle shuffling.
- Collecting Results: Results are collected and returned to the driver or written to storage.
- Post-Execution: Clean up resources and stop the `SparkSession`.
This flow highlights the distributed nature of Spark and the lazy evaluation mechanism that lets it optimize and efficiently execute big data processing tasks.