Let's deep-dive into PySpark architecture, covering both theory and code-level behavior: Driver Node vs Worker Node, Executors, and how PySpark behaves during `spark-submit` or `.py` script execution.
PySpark Architecture: Overview
When you run a PySpark application (via `spark-submit` or a script), the system is composed of the following major components:
Driver Program
- This is your main `.py` file, the entry point of the application (a minimal driver sketch follows this list).
- It runs on the Driver Node.
- It contains your Spark code: the `SparkSession`, `DataFrame` logic, and actions.
- It is responsible for:
  - Creating the SparkContext
  - Generating the DAG (logical + physical plan)
  - Requesting resources from the Cluster Manager
  - Collecting job results or status
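For reference, a minimal driver program looks something like the sketch below; the file name, app name, and example logic are illustrative, not tied to any particular project.

```python
# minimal_driver.py (illustrative name)
from pyspark.sql import SparkSession

# The driver's first responsibility: create the SparkSession (and with it the SparkContext)
spark = SparkSession.builder.appName("MinimalDriver").getOrCreate()

# Transformations only build up the DAG; nothing runs on the executors yet
df = spark.range(1_000_000).filter("id % 2 = 0")

# The action triggers the job -> stages -> tasks flow on the executors,
# and the final count comes back to the driver
print(df.count())

spark.stop()
```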
Driver Node vs Driver Program
Term | Meaning |
---|---|
Driver Program | Your Python `.py` script |
Driver Node | The machine/server where the driver program runs |

There is only one driver per Spark application.
Worker Nodes
- These are the nodes that run the actual code (tasks).
- Each worker node can host one or more Executors.
- The Cluster Manager allocates resources on the worker nodes; the driver then schedules tasks on the executors running there.
Executors (JVM Processes)
- Spark spins up JVM processes called executors on the worker nodes.
- Each executor:
  - Executes tasks on data partitions
  - Holds cached data (see the caching sketch below)
  - Returns results to the driver
  - Is bound to a single application (no reuse across applications)
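To make the "holds cached data" point concrete, here is a small sketch (the data size and app name are arbitrary) showing that cached partitions live in executor memory rather than in the Python driver process:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CacheDemo").getOrCreate()

df = spark.range(10_000_000)

df.cache()    # marks the DataFrame for caching on the executors
df.count()    # the first action materializes the cache (visible under the Spark UI "Storage" tab)
df.count()    # later actions reuse the cached partitions instead of recomputing them
```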
Example: Running a PySpark Script via spark-submit
Let's say you run:

```bash
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 4 \
  --executor-cores 2 \
  --executor-memory 4G \
  app.py
```
Here's what happens, step by step:
- Your `.py` script (`app.py`) is the driver program.
- In cluster deploy mode, the driver program is shipped to a Driver Node in the cluster.
- The driver requests 4 executors from YARN.
- Executors are launched on worker nodes (possibly 4 different ones, depending on how YARN places them).
- The driver:
  - Divides the logic into jobs → stages → tasks
  - Sends tasks to the executors
- Executors read data, perform transformations, and return results.
- The driver collects results or writes output to storage (a quick sanity-check snippet follows).
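As a rough sanity check of what the cluster actually granted, the driver can report its master and default parallelism. On YARN, `defaultParallelism` typically equals the total executor cores (here 4 executors x 2 cores = 8), though the exact value depends on configuration; a small sketch:

```python
# Inside app.py, after the SparkSession has been created
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

print(sc.master)              # e.g. "yarn"
print(sc.defaultParallelism)  # typically num-executors * executor-cores on YARN
```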
Does the Driver Node Act as a Worker Node?
It depends on the deploy mode:
Deploy Mode | Where the Driver Runs | Does That Node Run Tasks? |
---|---|---|
client mode | The machine you launch from (e.g., your laptop or an edge node) | No |
cluster mode | On a node inside the cluster | It can, if executors are also scheduled on that node |

In cluster mode, the node hosting the driver can also host executors, so it often does double duty as a worker.
Example 2: Running the PySpark Shell (pyspark CLI)

```bash
pyspark --master local[4]
```

- `local[4]` means 4 worker threads on your local machine.
- Driver and executor logic all run in a single JVM on that one node.
- Good for testing, learning, and development, but not for production (a script-based equivalent follows below).
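The same local setup can also be created from a plain Python script or notebook instead of the shell; a minimal sketch (the app name is arbitrary):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[4]")   # 4 worker threads inside a single local JVM
    .appName("LocalDev")
    .getOrCreate()
)

print(spark.sparkContext.master)  # "local[4]"
spark.stop()
```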
Summary Table
Component | Runs On | Role |
---|---|---|
Driver Program | Driver Node (Python process) | Defines the application logic |
Driver Node | Local machine or a cluster node | Hosts the driver |
Worker Node | Cluster nodes | Host the executors |
Executor | Worker node (as a JVM process) | Runs tasks, holds cached data |
Cluster Manager | YARN, Kubernetes, Standalone | Allocates executors |
Visual Diagram

```
        +------------------+
        |  Driver Program  |
        | (app.py / shell) |
        +--------+---------+
                 |
                 v
 +------------------------------+
 |         Driver Node          |
 |  - Builds DAG                |
 |  - Sends tasks to workers    |
 +--------------+---------------+
                |
    +-----------+------------+
    |                        |
    v                        v
+------------+        +------------+
| Worker Node|        | Worker Node|
|  Executor  |        |  Executor  |
|  JVM Task  |        |  JVM Task  |
+------------+        +------------+
```
Interview Insight
"In PySpark, the driver program is a Python process, but the actual work runs inside JVM-based executors. In cluster mode, the node hosting the driver can also act as a worker. Tasks are executed on the worker nodes by the executors, and Py4J handles the Python-to-JVM interaction."
Now let's consolidate everything discussed so far into one deep, full-scope PySpark architecture walkthrough that covers:
- The PySpark execution lifecycle
- Py4J interaction between Python and the JVM
- Cluster manager, driver, and executor roles
- Behavior under `spark-submit`, in the shell, and across deploy modes
- The code-to-cluster flow, end to end
Full PySpark Architecture with Py4J & Execution Flow
What Is PySpark?
PySpark is the Python API for Apache Spark. It lets you write Python code that drives Spark's JVM-based distributed engine.
PySpark applications are not executed in native Python. Instead, your code controls Spark through a bridge (Py4J), and the transformations run on JVM executors across the cluster.
Full Architecture Overview

```
+--------------------------+
|   Your PySpark Program   |  <- Python (.py or pyspark shell)
|     (Driver Program)     |
+------------+-------------+
             |
             | Py4J bridge (socket communication)
             v
+--------------------------+
|    JVM Driver Process    |  <- SparkContext, Catalyst planner
|     (on Driver Node)     |
+------------+-------------+
             |
             | Requests resources
             v
   +----------------------+
   |   Cluster Manager    |  <- YARN, Kubernetes, Mesos, Standalone
   +----------------------+
             |
             | Launches Executors on Worker Nodes
             v
+-------------------+   +-------------------+
|  Executor JVM 1   |   |  Executor JVM 2   |  <- Tasks, cached data
| (on Worker Node)  |   | (on Worker Node)  |
+-------------------+   +-------------------+
```
Key Components and Responsibilities

Driver Program (Python Layer)
- Your `.py` script or notebook
- Creates the SparkSession and SparkContext
- Triggers jobs via actions like `.show()`, `.write()`, `.collect()`

Driver Node (JVM Layer)
- Converts the logical plan into a physical plan using Catalyst
- Requests executors from the Cluster Manager and schedules tasks on them
- Uses Py4J to communicate with the Python side (see the peek below)
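You can peek at this bridge from the Python side. The attributes below are internal PySpark implementation details, shown purely to illustrate that a Py4J gateway sits underneath the session:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[2]").appName("BridgePeek").getOrCreate()
sc = spark.sparkContext

# Internal attributes -- implementation details, not a public API
print(type(sc._gateway))  # the Py4J gateway object connecting Python to the driver JVM
print(type(sc._jsc))      # Py4J proxy for the JVM-side JavaSparkContext

spark.stop()
```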
Py4J: How Python Talks to the Spark JVM
When you call something like:

```python
df = spark.read.parquet("file.parquet")
```

this happens under the hood:
- `spark.read` is a proxy object in Python.
- Py4J sends the method call (`read.parquet`) as a command over a socket to the JVM.
- The JVM executes the call (e.g., the Parquet reader) and returns a reference ID.
- Python stores this reference as a proxy object.
- Further transformations (like `df.filter(...)`) are just deferred commands.
- Actual execution happens only on `.show()`, `.count()`, etc.

No real data moves to Python unless you `.collect()` or use Python UDFs (see the sketch below).
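A short sketch of that deferral in action; `_jdf` is an internal attribute, used here only to show that the Python `DataFrame` is a thin wrapper around a JVM object (the tiny in-memory dataset is made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[2]").appName("LazyDemo").getOrCreate()

df = spark.createDataFrame([(5,), (15,), (25,)], ["value"])
filtered = df.filter("value > 10")   # deferred: just another plan node on the JVM side

# The Python DataFrame holds a Py4J reference to the JVM-side Dataset
print(type(filtered._jdf))           # a Py4J proxy object (py4j.java_gateway.JavaObject)

filtered.show()                      # action: the JVM executes the plan and returns rows for display
spark.stop()
```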
Cluster Components
Component | Role |
---|---|
Driver Node | Runs the SparkContext (JVM) + Python script (via Py4J) |
Cluster Manager | Allocates executors (YARN, Kubernetes, etc.) |
Worker Node | Hosts one or more Executors (JVM) |
Executor | JVM process: runs tasks, caches data, writes output |
Execution Flow in Practice (spark-submit or script)
Let's say you run:

```bash
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 3 \
  app.py
```

Here's what happens:
- Your Python script (`app.py`) is shipped to the Driver Node.
- Spark starts the JVM driver process and the Python driver process.
- Spark asks the Cluster Manager (YARN) to allocate 3 Executors.
- Each Executor is a JVM process on a Worker Node.
- The SparkContext divides the work into jobs → stages → tasks.
- Tasks are sent to the Executors, and results are streamed back to the Driver.

In cluster mode the driver runs inside the cluster; in client mode it runs on the machine you launch from (e.g., your laptop).
Real Example with a .py Script

```python
# app.py
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ExampleApp").getOrCreate()

df = spark.read.json("people.json")
df = df.filter("age > 25")
df.show()
```

When you run this with `spark-submit`:
- The Python layer builds the logical plan (read → filter → show).
- The plan is sent via Py4J to the JVM.
- The JVM compiles it into a physical plan.
- Tasks are distributed to the executors.
- `df.show()` is the action that triggers the actual execution (see the `.explain()` sketch below).
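If you want to see the plans Spark builds before anything executes, `.explain()` prints them from the driver. A small sketch, assuming the same `people.json` input as in `app.py` above:

```python
df = spark.read.json("people.json").filter("age > 25")

# Prints the parsed, analyzed, and optimized logical plans plus the physical plan,
# all produced on the JVM side by Catalyst
df.explain(True)
```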
When Does the Driver Node Act Like a Worker?
Mode | Driver Location | Does That Node Execute Tasks? |
---|---|---|
client | The machine you launch from | No (driver logic only) |
cluster | Inside the cluster | It can, if executors land on the same node |

In cluster mode, the node hosting the driver can also run executors (and therefore tasks) unless you exclude it through resource configuration.
UDFs & Serialization in the Architecture
- Python UDFs force row-by-row data movement between the JVM and Python.
- This adds serialization cost on every crossing (JVM to Python and back).
- Prefer built-in Spark SQL functions, or pandas UDFs, to avoid or reduce this cost (see the sketch below).
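A minimal sketch contrasting the three options; the data and column names are made up, and the pandas UDF variant assumes pyarrow is installed:

```python
import pandas as pd
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.master("local[2]").appName("UdfDemo").getOrCreate()
df = spark.createDataFrame([(20,), (30,), (40,)], ["age"])

# 1) Plain Python UDF: each row is serialized from the executor JVM to a
#    Python worker process and back
add_one_udf = F.udf(lambda age: age + 1, IntegerType())
df.withColumn("age_plus_one", add_one_udf("age")).show()

# 2) Built-in column expression: stays entirely inside the JVM, no Python hop
df.withColumn("age_plus_one", F.col("age") + 1).show()

# 3) pandas UDF: still crosses into Python, but in vectorized Arrow batches
#    rather than row by row (requires pyarrow)
@pandas_udf(IntegerType())
def add_one_vec(age: pd.Series) -> pd.Series:
    return age + 1

df.withColumn("age_plus_one", add_one_vec("age")).show()

spark.stop()
```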
Interview-Worthy Summary
PySpark architecture is built on top of Spark's JVM core. Your Python code acts as the driver via Py4J, sending commands to Spark's JVM engine. The JVM driver creates the execution plan and works with the Cluster Manager to launch executors on worker nodes. Those executors run the actual tasks in parallel. With `spark-submit`, the driver program may live on your launching machine or on a cluster node, depending on the deploy mode. Python UDFs introduce extra JVM-to-Python data movement, which is handled through serialization.
Outside links to refer to (Spark architecture):
- https://0x0fff.com/spark-architecture
- https://0x0fff.com/spark-architecture-shuffle
- Salesforce Engineering post: https://engineering.salesforce.com/how-to-optimize-your-apache-spark-application-with-partitions-257f2c1bb414/