PySpark, as part of the Apache Spark ecosystem, follows a master-slave architecture (also called the driver-executor architecture) and provides a structured approach to distributed data processing.
Here’s a breakdown of the PySpark architecture with diagrams to illustrate the key components and their interactions.
Contents
- 1. Overview of PySpark Architecture
- 2. Diagram of PySpark Architecture
- 3. Components Explained
- 4. Detailed Diagram with Data Flow
- Detailed Component Descriptions
- Data Flow
- 1. Jobs
- 2. Stages
- 3. Tasks
- 4. Storage
- 5. Environment
- 6. Executors
- 7. SQL (for Spark SQL queries)
- 8. Streaming (for Spark Streaming jobs)
- 9. JDBC/ODBC Server (for SQL interactions via JDBC or ODBC)
- 10. Structured Streaming
- Summary
1. Overview of PySpark Architecture
The architecture of PySpark involves the following main components:
- Driver Program: The main control program that manages the entire Spark application. When the driver program runs, it executes the application's main code and creates a SparkContext.
- Cluster Manager: Manages the resources and schedules tasks on the cluster. Examples include YARN, Mesos, or Spark’s standalone cluster manager.
- Workers: Execute tasks on the cluster. Each worker runs one or more executors.
- Executors: Run the tasks assigned by the driver on the worker nodes and store the data partitions.
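To make these roles concrete, here is a minimal sketch of a driver program. The application name and master URL are illustrative placeholders; in practice the master usually comes from your cluster setup or from spark-submit.

```python
from pyspark.sql import SparkSession

# Creating the SparkSession (and its underlying SparkContext) is what turns
# this script into a driver program and connects it to a cluster manager.
spark = (
    SparkSession.builder
    .appName("architecture-demo")   # placeholder name, shown in the Spark UI
    .master("local[*]")             # placeholder; could be "yarn", "spark://host:7077", etc.
    .getOrCreate()
)

sc = spark.sparkContext             # the SparkContext created by the driver
print(sc.master, sc.applicationId)  # basic information about this application

spark.stop()
```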
2. Diagram of PySpark Architecture
Here’s a visual representation of the PySpark architecture:
+------------------+        +--------------------+        +----------------------+
|      Driver      |        |  Cluster Manager   |        |     Worker Nodes     |
|                  |        | (Standalone, YARN, |        |                      |
| +--------------+ | <----> |       Mesos)       | <----> | +------------------+ |
| | SparkContext | |        |                    |        | |   Executor(s)    | |
| +--------------+ |        +--------------------+        | +------------------+ |
+------------------+                                      +----------------------+
3. Components Explained
- Driver Program: The entry point for the Spark application. It contains the SparkContext, the main interface for interacting with Spark; you can think of the SparkContext as the gateway to all of Spark's functionality. The driver is responsible for:
- Creating RDDs, DataFrames, Datasets.
- Defining transformations and actions.
- Managing the lifecycle of Spark applications.
- Cluster Manager: Manages cluster resources and schedules work. The SparkContext connects to the cluster manager to negotiate resources and submit tasks, and the cluster manager oversees the execution of jobs across the cluster. The cluster manager can be:
- Standalone: Spark’s built-in cluster manager.
- YARN: Hadoop’s resource manager.
- Mesos: A distributed systems kernel.
- Workers: Nodes in the cluster that execute the tasks. Each worker node hosts one or more executors.
- Executors: Run on worker nodes and are responsible for:
- Executing code assigned by the driver.
- Storing data for in-memory processing and disk storage.
- Reporting the status and results of computations back to the driver.
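A rough illustration of these driver responsibilities, using made-up sample data:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("driver-responsibilities").getOrCreate()
sc = spark.sparkContext

# The driver creates RDDs and DataFrames (sample values are made up).
rdd = sc.parallelize([1, 2, 3, 4, 5])
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])

# Transformations are lazy: they only describe work, nothing runs yet.
doubled = rdd.map(lambda x: x * 2)
filtered = df.filter(df.id > 1)

# Actions make the driver build a DAG and send tasks to the executors.
print(doubled.collect())   # [2, 4, 6, 8, 10]
print(filtered.count())    # 1

spark.stop()
```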
4. Detailed Diagram with Data Flow
Here’s a more detailed diagram showing the data flow and interaction between components:
+---------------------------+ +-----------------------+
| Driver | | Cluster Manager |
| | | |
| +---------------------+ | | +------------------+ |
| | SparkContext | | | | Resource Manager | |
| +---------+-----------+ | | +--------+---------+ |
| | | | | |
| v | | v |
| +-------------------+ | | +------------------+ |
| | DAG Scheduler |<-------------------->| Task Scheduler | |
| +---------+---------+ | | +--------+---------+ |
| | | | | |
| v | | v |
| +----------+------------+ | | +------------------+ |
| | Task Scheduler |<-------------------->| Worker Manager | |
| +----------+------------+ | | +------------------+ |
| | | | |
| v | +-----------------------+
| +----------+------------+
| | Executors |
| +-----------------------+
| |
+---------------------------+
|
v
+----------------------------+
| Worker Nodes |
| |
| +----------------------+ |
| | Executor 1 | |
| +----------------------+ |
| | Executor 2 | |
| +----------------------+ |
| |
+----------------------------+
Detailed Component Descriptions
- Driver Program:
- SparkContext: Initializes Spark application, connects to cluster manager, and creates RDDs.
- DAG Scheduler: Translates logical plans into a physical execution plan, creating a Directed Acyclic Graph (DAG) of stages.
- Task Scheduler: Schedules tasks to run on executors, handles retries on failure.
- Cluster Manager:
- Resource Manager: Manages cluster resources and assigns them to applications.
- Task Scheduler: Assigns tasks to executors based on available resources.
- Worker Nodes:
- Executors: Run the tasks, store the intermediate results in memory or disk, and communicate results back to the driver.
Data Flow
1. Submit Application: The driver program is submitted to the cluster.
2. Initialize SparkContext: The SparkContext connects to the cluster manager.
3. Resource Allocation: The cluster manager allocates resources to the application.
4. Task Scheduling: The driver schedules tasks through the DAG scheduler and task scheduler.
5. Execution: Executors on worker nodes execute the tasks.
6. Data Storage: Intermediate results are stored in memory or on disk.
7. Completion: Executors return the results to the driver, which processes them and produces the final output (a small end-to-end sketch follows below).
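As a sketch of how these steps map onto a small application (the input path and the submit command are placeholders for your own environment):

```python
# wordcount_app.py - submitted with something like:
#   spark-submit --master yarn wordcount_app.py
from pyspark.sql import SparkSession

# Steps 1-3: the application starts, the SparkContext connects to the
# cluster manager, and resources are allocated.
spark = SparkSession.builder.appName("data-flow-demo").getOrCreate()
sc = spark.sparkContext

lines = sc.textFile("hdfs:///tmp/input.txt")   # placeholder input path
counts = (
    lines.flatMap(lambda line: line.split())
         .map(lambda word: (word, 1))
         .reduceByKey(lambda a, b: a + b)      # shuffle boundary between stages
)

# Steps 4-7: the action below makes the driver schedule tasks, executors run
# them and store intermediate results, and the final values return to the driver.
for word, count in counts.take(10):
    print(word, count)

spark.stop()
```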
This architecture allows PySpark to efficiently process large-scale data in a distributed environment, leveraging the power of parallel computation and fault tolerance.
Here is an alternative way to break down the PySpark architecture, again using diagrams:
1. High-Level Overview:
+--------------------+ +--------------------+ +---------------------+
| Driver | | Cluster Manager | | Worker Nodes (N) |
+--------------------+ +--------------------+ +---------------------+
| | | (YARN, Mesos, | | (Executor Processes) |
| | | Standalone) | | |
| Submits application | +--------------------+ | |
| and coordinates | | |
| tasks | | Spark Tasks |
+--------------------+ +--------------------+ +---------------------+
| (SparkContext) | | | | (on each Executor) |
| | | | | |
|-----------------| | | |-----------------|
| Libraries (SQL, | | | | Data Processing |
| MLlib, Streaming) | | | | (RDDs, DataFrames) |
|-----------------| | | |-----------------|
- Driver: The program running your PySpark application. It submits the application to the cluster manager, coordinates tasks, and interacts with Spark libraries.
- Cluster Manager: Manages resources in the cluster, allocating resources (machines) to applications like PySpark. Examples include YARN (Hadoop), Mesos, or Spark’s standalone mode.
- Worker Nodes: Machines in the cluster that run Spark applications. Each node has an Executor process that executes Spark tasks.
2. Data Processing Flow:
+--------------------+ +--------------------+ +---------------------+
| Driver | | Cluster Manager | | Worker Nodes (N) |
+--------------------+ +--------------------+ +---------------------+
| Submits job | | | | (Executor Processes) |
| (transforms) | | | | |
|-----------------| | | |-----------------|
| SparkContext | | | | RDD Operations |
|-----------------| | | | (map, filter, etc) |
| Transform Data | | | | (on each partition) |
| (RDDs) | | | |-----------------|
|-----------------| | | | Shuffle & Aggregation |
| Shuffle Data | | | | (if needed) |
| (if needed) | | | |-----------------|
|-----------------| | | | Write Results |
| Save Results | +--------------------+ | (to storage) |
+--------------------+ +---------------------+
- The driver submits a Spark job with transformations to be applied to the data.
- SparkContext in the driver translates the job into tasks for each partition of the data.
- Executor processes on worker nodes execute these tasks on their assigned data partitions.
- Shuffling (data exchange) might occur between executors if operations require data from different partitions (e.g., joins).
- Finally, the results are written to storage or used for further processing.
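A hedged sketch of that flow with the DataFrame API (the data, column names, and output path are made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shuffle-demo").getOrCreate()

# Two small DataFrames with made-up data.
orders = spark.createDataFrame(
    [(1, "laptop"), (2, "phone"), (1, "mouse")], ["customer_id", "item"]
)
customers = spark.createDataFrame(
    [(1, "Alice"), (2, "Bob")], ["customer_id", "name"]
)

# The filter runs independently on each partition; the join needs matching
# keys brought together, so data may be shuffled between executors.
joined = orders.filter(orders.item != "mouse").join(customers, "customer_id")

# Writing the result triggers execution; the output path is a placeholder.
joined.write.mode("overwrite").parquet("/tmp/joined_orders")

spark.stop()
```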
3. Spark Libraries:
+--------------------+
| Driver | (imports libraries)
+--------------------+
|
|-----------------|
| SparkContext |
|-----------------|
| Spark SQL |
| (DataFrame/SQL) |
|-----------------|
| MLlib |
| (Machine Learning)|
|-----------------|
| Spark Streaming |
| (Real-time) |
|-----------------|
- PySpark provides various libraries accessible through the SparkContext:
- Spark SQL: Enables SQL-like operations on DataFrames and Datasets.
- MLlib: Offers machine learning algorithms and tools for building and deploying models.
- Spark Streaming: Allows processing of continuous data streams.
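For example, a small Spark SQL snippet (with made-up data) that exercises the DataFrame and SQL APIs:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("libraries-demo").getOrCreate()

# Spark SQL: register a DataFrame as a temporary view and query it with SQL.
df = spark.createDataFrame([(1, 10.0), (2, 20.0), (3, 30.0)], ["id", "value"])
df.createOrReplaceTempView("measurements")

result = spark.sql(
    "SELECT id, value * 2 AS doubled FROM measurements WHERE value > 10"
)
result.show()

spark.stop()
```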
These diagrams provide a visual representation of PySpark’s architecture, highlighting the key components and data processing flow. As you delve deeper into PySpark, these visuals can serve as a foundation for understanding its functionalities.
The Spark UI provides a web interface that gives insight into the execution of your Spark jobs. It's a valuable tool for monitoring and debugging your Spark applications. The UI is accessible through a web browser at http://<driver-node>:4040 for a standalone Spark application, but the port may vary depending on your cluster configuration.
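If the default port is already taken (Spark falls back to 4041, 4042, and so on), you can pin it explicitly; the port value below is an arbitrary example:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("ui-demo")
    .config("spark.ui.port", "4050")   # arbitrary example port
    .getOrCreate()
)

# The actual UI address for this application, as reported by the driver.
print(spark.sparkContext.uiWebUrl)
```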
Here’s a breakdown of the key tabs available in the Spark UI:
1. Jobs
- Overview: This tab displays a list of all the Spark jobs that have been executed or are currently running.
- Key Information:
- Job ID: A unique identifier for each job.
- Description: A brief description of the job, often showing the operation performed.
- Submitted: The time when the job was submitted.
- Duration: How long the job took to run.
- Stages: Number of stages in the job and their completion status.
- Tasks: Number of tasks in the job and their status (e.g., succeeded, failed).
2. Stages
- Overview: Shows details about each stage of your Spark job. A stage corresponds to a set of tasks that can be executed together without shuffling.
- Key Information:
- Stage ID: A unique identifier for each stage.
- Description: Describes the operation performed in the stage.
- Tasks: Number of tasks in the stage.
- Input/Shuffle Read/Write: Amount of data read/written during the stage.
- Duration: Time taken by the stage.
- Aggregated Metrics: Detailed metrics like task time, GC time, input size, and more.
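For instance, a job with a single shuffle typically shows up as two stages on this tab. A small sketch using synthetic data:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stages-demo").getOrCreate()

df = spark.range(1_000_000)   # a synthetic DataFrame with a single "id" column

# groupBy introduces a shuffle: the partial aggregation before the exchange is
# one stage, and the final aggregation after the exchange is another.
agg = df.groupBy((df.id % 10).alias("bucket")).count()

agg.show()   # the action that actually submits the job

spark.stop()
```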
3. Tasks
- Overview: Provides detailed information about individual tasks within a stage.
- Key Information:
- Task ID: A unique identifier for each task.
- Launch Time: The time when the task was started.
- Executor ID: The ID of the executor that ran the task.
- Host: The node where the task was executed.
- Duration: Time taken by the task.
- GC Time: Time spent in garbage collection.
- Input/Output Metrics: Detailed input/output data metrics, including the number of records read or written.
4. Storage
- Overview: Displays information about RDDs and DataFrames that are cached or persisted.
- Key Information:
- RDD ID: Unique identifier for the RDD/DataFrame.
- Name: The name of the cached RDD/DataFrame.
- Storage Level: How the data is stored (e.g., MEMORY_ONLY, DISK_ONLY).
- Size in Memory/On Disk: Amount of data stored in memory and/or on disk.
- Partitions: Number of partitions and where they are stored.
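Entries appear on this tab only after cached data has been materialized by an action, for example:

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("storage-demo").getOrCreate()

df = spark.range(100_000)

# persist() only marks the data for caching; the Storage tab shows it once an
# action has actually computed and stored the partitions.
df.persist(StorageLevel.MEMORY_ONLY)
df.count()

# Remove it from the cache when it is no longer needed.
df.unpersist()

spark.stop()
```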
5. Environment
- Overview: Shows the environment settings and configurations used by the Spark application.
- Key Information:
- Runtime Information: Details about the Spark version, JVM version, Scala version, etc.
- Spark Properties: Configuration properties (e.g., spark.executor.memory, spark.serializer).
- Hadoop Properties: Configuration properties related to Hadoop and HDFS.
- JVM Information: Details about the JVM settings.
- Classpath Entries: List of all the libraries and their locations used by the Spark application.
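The same properties shown on this tab can also be set and inspected from the driver; the configuration values below are only examples:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("environment-demo")
    .config("spark.executor.memory", "2g")   # example value
    .config("spark.serializer",
            "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

# Print the Spark properties for this application, as listed on the Environment tab.
for key, value in sorted(spark.sparkContext.getConf().getAll()):
    print(key, "=", value)
```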
6. Executors
- Overview: Provides detailed information about the executors running on the cluster.
- Key Information:
- Executor ID: Unique identifier for each executor.
- Host: The node on which the executor is running.
- RDD Blocks: Number of RDD blocks stored on the executor.
- Storage Memory: Memory used by the executor for storage.
- Task Time: Total time spent by the executor in executing tasks.
- Failed Tasks: Number of tasks that failed on this executor.
- Logs: Links to the executor logs for further debugging.
7. SQL (for Spark SQL queries)
- Overview: Displays the details of SQL queries executed by the Spark SQL engine.
- Key Information:
- Execution ID: A unique identifier for each SQL query.
- Description: SQL query string or a description of the operation.
- Duration: Time taken to execute the query.
- Job IDs: Jobs associated with the query.
- Physical Plan: A visualization of the physical execution plan of the query.
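Each query executed through the SQL engine gets an entry here; the physical plan shown in the UI can also be printed from the driver with explain(). A small example with made-up data:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-tab-demo").getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b"), (2, "c")], ["id", "tag"])
df.createOrReplaceTempView("events")

result = spark.sql("SELECT id, COUNT(*) AS n FROM events GROUP BY id")
result.show()      # execution shows up on the SQL tab with its physical plan

result.explain()   # the same physical plan, printed on the driver
```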
8. Streaming (for Spark Streaming jobs)
- Overview: Displays information related to Spark Streaming jobs.
- Key Information:
- Batch Duration: The time interval of each batch in Spark Streaming.
- Scheduling Delay: The delay between the scheduled and actual start of the batch.
- Processing Time: Time taken to process each batch.
- Total Delay: Sum of the scheduling and processing delays.
- Input Rate: Rate at which data is being ingested.
- Processing Rate: Rate at which data is being processed.
9. JDBC/ODBC Server (for SQL interactions via JDBC or ODBC)
- Overview: Displays the details of queries run through the Spark SQL Thrift Server, which enables connectivity with JDBC/ODBC clients.
- Key Information:
- Session ID: Identifier for each JDBC/ODBC session.
- Start Time: The time when the session/query started.
- User: The user who initiated the query.
- Statement: The SQL query statement executed.
10. Structured Streaming
- Overview: Provides insights into Structured Streaming queries, their status, and performance metrics.
- Key Information:
- Query Name: Name of the streaming query.
- Batch ID: Identifier for each processed micro-batch.
- Input Rows: Number of rows ingested in the current batch.
- Processing Time: Time taken to process the batch.
- Watermark: The watermark time used for event-time processing.
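A minimal streaming sketch using the built-in rate source; the query name is what appears on this tab (the name and rate below are arbitrary):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("structured-streaming-demo").getOrCreate()

# The built-in "rate" source generates rows continuously, which is convenient
# for a demo; rowsPerSecond is an arbitrary example value.
stream = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

query = (
    stream.writeStream
          .queryName("rate_demo")     # shown as the query name in the UI
          .format("console")
          .outputMode("append")
          .start()
)

query.awaitTermination(30)   # let it run for about 30 seconds
query.stop()
spark.stop()
```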
Summary
The Spark UI is a comprehensive tool for monitoring, troubleshooting, and optimizing Spark applications. Each tab provides detailed insights into various aspects of Spark’s execution, from high-level job summaries to granular details about tasks, storage, and SQL queries. By using the Spark UI, you can gain a better understanding of your application’s performance and identify areas for improvement.