PySpark Wholesome Tutorial - Links to Refer, PDFs

Let’s deep-dive into PySpark architecture, covering both theory and code-level behavior: Driver Node vs Worker Node, Executors, and how PySpark behaves during spark-submit or .py script execution.


🔧 PySpark Architecture – Overview

When you run a PySpark application (spark-submit or script), the system is composed of the following major components:


🔹 Driver Program

  • This is your main .py file, the entry point of the application.
  • It runs on the Driver Node.
  • It contains your Spark code: SparkSession, DataFrame logic, actions.
  • Responsible for (a minimal sketch follows this list):
    • Creating the SparkContext
    • Generating DAG (logical + physical plan)
    • Requesting resources from Cluster Manager
    • Collecting job results or status
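A minimal sketch of such a driver program (the input path and column name are hypothetical): creating the SparkSession and its SparkContext is exactly what makes this Python process the driver.

# driver_program.py - hypothetical minimal driver program
from pyspark.sql import SparkSession

# Creating the SparkSession (and its SparkContext) is what makes this process the driver
spark = SparkSession.builder.appName("MyApp").getOrCreate()

df = spark.read.csv("data.csv", header=True)   # hypothetical input path
result = df.groupBy("country").count()         # transformations only build the DAG
result.show()                                  # the action makes the driver schedule a job
spark.stop()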

🔹 Driver Node vs Driver Program

Term           | Meaning
Driver Program | Your Python .py script
Driver Node    | The machine/server where the driver program runs

🧠 Only 1 driver per Spark application


🔹 Worker Nodes

  • These are the nodes that run the actual code (tasks)
  • Each worker node can host one or more Executors
  • The Cluster Manager allocates executors on worker nodes; the driver then sends tasks directly to those executors

🔹 Executors (JVM Processes)

  • Spark spins up JVM processes called executors on the worker nodes
  • Each executor:
    • Executes tasks on data partitions
    • Holds cached data
    • Returns results to driver
    • Is bound to a single application (no reuse across applications; see the config sketch after this list for how executor resources are requested)
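As a rough sketch of how those per-application executor resources are requested: the same knobs exposed by spark-submit flags can also be set from the driver through standard config keys (assuming the cluster manager honours them; dynamic allocation can override the instance count).

from pyspark.sql import SparkSession

# Programmatic equivalent of --num-executors / --executor-cores / --executor-memory.
# These apply to this application only; the cluster manager may override them
# (e.g. when dynamic allocation is enabled).
spark = (
    SparkSession.builder
    .appName("executor-config-demo")
    .config("spark.executor.instances", "4")
    .config("spark.executor.cores", "2")
    .config("spark.executor.memory", "4g")
    .getOrCreate()
)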

✅ Example: Running a PySpark Script via spark-submit

Let’s say you run:

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 4 \
  --executor-cores 2 \
  --executor-memory 4G \
  app.py

Here’s What Happens:

✅ Step-by-Step Architecture Flow:

  1. Your .py script (app.py) is the driver program
  2. The driver program is sent to a Driver Node in the cluster
  3. The driver requests 4 executors from YARN
  4. Executors are launched on worker nodes (not necessarily 4 distinct nodes; one node can host multiple executors)
  5. Driver:
    • Divides logic into jobs → stages → tasks (tasks map to data partitions; see the sketch after this list)
    • Sends tasks to executors
  6. Executors read data, perform transformations, and return results
  7. Driver collects results or writes to disk
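To make steps 5–6 concrete: a stage launches one task per partition of the data it processes. A quick way to see those counts from the driver (the Parquet path and column name here are hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("task-count-demo").getOrCreate()

df = spark.read.parquet("events.parquet")   # hypothetical dataset
print(df.rdd.getNumPartitions())            # partitions -> tasks for the scan stage

# A groupBy adds a shuffle, i.e. a new stage with its own task count
counts = df.groupBy("user_id").count()      # hypothetical column
print(counts.rdd.getNumPartitions())        # usually spark.sql.shuffle.partitions (200 by default)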

🧠 Does the Driver Node Act as a Worker Node?

It depends on the deploy mode:

Deploy Mode  | Where the Driver Runs                  | Does That Node Also Run Tasks?
client mode  | Your local machine (e.g., your laptop) | ❌ No
cluster mode | On a node inside the cluster           | ✅ Yes (the node can also host executors)

In cluster mode, the node hosting the driver is itself a cluster node, so it can also host executors and run tasks like any other worker.


🧪 Example 2: Running in PySpark Shell (pyspark CLI)

pyspark --master local[4]
  • local[4] means: 4 threads on your local machine
  • In local mode there are no separate executor processes; the driver JVM runs the tasks on those 4 threads
  • Good for testing, learning, and dev, not production (the equivalent SparkSession setup is sketched below)
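The same local[4] setup from a plain Python script rather than the shell looks like this (a minimal sketch; in local mode there is no separate executor process to launch):

from pyspark.sql import SparkSession

# Local mode: the driver and the task threads share a single JVM on this machine
spark = (
    SparkSession.builder
    .master("local[4]")   # 4 worker threads
    .appName("local-dev")
    .getOrCreate()
)

print(spark.sparkContext.master)               # local[4]
print(spark.sparkContext.defaultParallelism)   # typically 4 here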

✅ Summary Table

Component       | Runs On                      | Role
Driver Program  | Python script                | Defines logic
Driver Node     | Local or cluster node        | Hosts the driver
Worker Node     | Cluster nodes                | Run executors
Executor        | On a worker (as JVM)         | Runs tasks, holds memory
Cluster Manager | YARN, Kubernetes, Standalone | Allocates executors

🚀 Visual Diagram

                +------------------+
                |  Driver Program  |
                |  (app.py / shell)| 
                +--------+---------+
                         |
                         v
         +------------------------------+
         |        Driver Node           |
         |  - Builds DAG                |
         |  - Sends tasks to workers    |
         +------------------------------+
                         |
    +---------------------+----------------------+
    |                                            |
    v                                            v
+------------+                              +------------+
| Worker Node|                              | Worker Node|
|  Executor  |                              |  Executor  |
|  JVM Task  |                              |  JVM Task  |
+------------+                              +------------+

🔍 Interview Insight

“In PySpark, the driver program is a Python process, but the actual code runs inside JVM-based executors. In cluster mode, the node hosting the driver can also act as a worker. Tasks are executed on the worker nodes by executors, and Py4J is used for Python↔JVM interaction.”

Now let’s consolidate everything discussed so far into one deep, full-scope PySpark architecture walkthrough that includes:

  • PySpark execution lifecycle
  • Py4J interaction between Python and JVM
  • Cluster manager, driver, executor roles
  • Behavior in spark-submit, shell, deploy modes
  • Code-to-cluster flow with clarity

🧠 Full PySpark Architecture with Py4J & Execution Flow


What Is PySpark?

PySpark is the Python API for Apache Spark. It lets you write Python code to interact with Spark’s JVM-based distributed engine.

PySpark applications are not executed in native Python; instead, they control Spark via a bridge (Py4J) and run transformations on JVM executors across a cluster.


📦 Full Architecture Overview

+--------------------------+
|  Your PySpark Program    |   <- Python (.py or pyspark shell)
|  (Driver Program)        |
+------------+-------------+
             |
             |  Py4J bridge (socket communication)
             v
+--------------------------+
|   JVM Driver Process     |   <- SparkContext, Catalyst Planner
|   (on Driver Node)       |
+------------+-------------+
             |
             |  Requests resources
             v
     +----------------------+
     |  Cluster Manager     |   <- YARN, Kubernetes, Mesos, Standalone
     +----------------------+
             |
             |  Launches Executors on Worker Nodes
             v
+-------------------+    +-------------------+
|  Executor JVM 1   |    |  Executor JVM 2   |  <- Tasks, Cached Data
| (on Worker Node)  |    | (on Worker Node)  |
+-------------------+    +-------------------+

🔧 Key Components and Responsibilities

▶️ Driver Program (Python Layer)

  • Your .py script or notebook
  • Creates the SparkSession & SparkContext
  • Triggers jobs via actions like .show(), .write(), .collect()

βš™οΈ Driver Node (JVM Layer)

  • Converts logical plan to physical plan using Catalyst
  • Requests executors from the Cluster Manager and then schedules tasks on them directly
  • Uses Py4J to communicate with Python

🔁 Py4J: How Python Talks to the Spark JVM

When you call something like:

df = spark.read.parquet("file.parquet")

This happens under the hood:

  1. spark.read is a proxy object in Python
  2. Py4J sends the method call (read.parquet) as a command string over a socket to the JVM
  3. JVM executes the call (e.g. Parquet reader) and returns a reference ID
  4. Python stores this reference as a proxy object
  5. Further transformations (like df.filter(...)) are just deferred commands
  6. Actual execution happens only on .show(), .count(), etc.

✅ No real data moves to Python unless you .collect() or use UDFs
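A small way to observe both the proxy objects and the deferred execution from the Python side (the underscore attribute is internal PySpark/Py4J plumbing, shown here only for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("py4j-demo").getOrCreate()

df = spark.range(1_000_000)           # returns immediately: only a JVM-side plan is built
filtered = df.filter("id > 500000")   # still no job; the call is forwarded to the JVM and deferred

# The Python DataFrame is a thin wrapper around a JVM Dataset reference
print(type(filtered._jdf))            # py4j.java_gateway.JavaObject

print(filtered.count())               # the action finally triggers distributed execution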


🖥️ Cluster Components

Component       | Role
Driver Node     | Runs the SparkContext (JVM) + the Python script (via Py4J)
Cluster Manager | Allocates executors (YARN, Kubernetes, etc.)
Worker Node     | Hosts one or more Executors (JVM)
Executor        | JVM process: runs tasks, caches data, writes output

🚀 Execution Flow in Practice (spark-submit or script)

Let’s say you run:

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 3 \
  app.py

Here’s what happens:

  1. Your Python script (app.py) is sent to the Driver Node
  2. Spark starts the JVM driver and Python driver process
  3. Spark uses Cluster Manager (YARN) to allocate 3 Executors
  4. Each Executor is a JVM process on a Worker Node
  5. The SparkContext divides the application into jobs → stages → tasks
  6. Tasks are sent to Executors, results streamed back to Driver

💡 In cluster mode, the driver is in the cluster
💡 In client mode, the driver is on your laptop/machine
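If you want to confirm at runtime which of the two applies, the deploy mode is exposed as a regular Spark config property (a small sketch; the fallback assumes the property is unset, as in local runs):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("deploy-mode-check").getOrCreate()

# "client" or "cluster"; falls back to "client" when the property is not set
mode = spark.sparkContext.getConf().get("spark.submit.deployMode", "client")
print(f"Deploy mode: {mode}")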


💡 Real Example with .py Script

# app.py
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ExampleApp").getOrCreate()
df = spark.read.json("people.json")
df = df.filter("age > 25")
df.show()

When you run this with spark-submit, the Python layer controls:

  • Creating the logical plan (read → filter → show)
  • Sending it via Py4J → JVM
  • JVM compiles it to physical plan
  • Tasks are distributed to executors
  • df.show() triggers actual execution (the explain() sketch below shows both plans)
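To see both sides of that hand-off, explain() prints the plans the JVM driver builds for the app.py pipeline above (a sketch; people.json and the age column come from the example script):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ExampleApp").getOrCreate()
df = spark.read.json("people.json").filter("age > 25")

df.explain(True)   # extended output: parsed, analyzed, and optimized logical plans + the physical plan
df.show()          # only this action actually launches tasks on the executors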

📊 When Does the Driver Node Act Like a Worker?

Mode    | Driver Location              | Does That Node Run Tasks?
client  | Your local machine           | ❌ No (just driver logic)
cluster | On a node inside the cluster | ✅ Yes, it may also host executors

💡 In cluster mode, the node hosting the driver can also host executors (and so run tasks), unless it is explicitly excluded from scheduling


βš™οΈ UDF & Serialization in Architecture

  • Python UDFs force row-by-row data movement from JVM to Python
  • Adds serialization cost (JVM ↔ Python)
  • Use Spark SQL functions or pandas UDFs to avoid this (compared in the sketch below)
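A hedged comparison of the three options (the column name and data are made up; the point is the data-movement pattern, not the exact numbers):

import pandas as pd
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.functions import udf, pandas_udf
from pyspark.sql.types import LongType

spark = SparkSession.builder.appName("udf-demo").getOrCreate()
df = spark.range(1_000_000).withColumnRenamed("id", "age")

# Plain Python UDF: every row is serialized JVM -> Python and back
@udf(returnType=LongType())
def add_one_py(age):
    return age + 1

# pandas UDF: data crosses the bridge in Arrow batches, with far less overhead
@pandas_udf(LongType())
def add_one_pandas(age: pd.Series) -> pd.Series:
    return age + 1

# Built-in column expression: stays entirely inside the JVM
df.select(
    add_one_py("age").alias("python_udf"),
    add_one_pandas("age").alias("pandas_udf"),
    (F.col("age") + 1).alias("builtin"),
).show(5)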

🧠 Interview-Worthy Summary

PySpark architecture is built on top of Spark’s JVM core. Your Python code acts as a driver via Py4J, sending commands to Spark’s JVM engine. The JVM driver creates the execution plan and communicates with the Cluster Manager to launch executors on worker nodes. These executors run the actual tasks in parallel. When using spark-submit, the Driver Program may live locally or on a cluster node, depending on deploy mode. UDFs introduce JVM↔Python communication, which is handled through serialization and Py4J.


Our Posts on PySpark Architecture:

Outside Links to Refer:

For Spark Architecture

https://0x0fff.com/spark-architecture

https://0x0fff.com/spark-architecture-shuffle

https://www.linkedin.com/pulse/deep-dive-spark-internals-architecture-jayvardhan-reddy-vanchireddy?trk=public_profile_article_view

Salesforce post: https://engineering.salesforce.com/how-to-optimize-your-apache-spark-application-with-partitions-257f2c1bb414/
