PySpark Architecture Cheat Sheet: How to Know Which Parts of Your PySpark ETL Script Are Executed on the Driver, Master (YARN), or Executors



PySpark Architecture Cheat Sheet

1. Core Components of PySpark

| Component | Description | Key Features |
| --- | --- | --- |
| Spark Core | The foundational Spark component for scheduling, memory management, and fault tolerance. | Task scheduling, data partitioning, RDD APIs. |
| Spark SQL | Enables interaction with structured data via SQL, DataFrames, and Datasets. | Supports SQL queries, schema inference, integration with Hive. |
| Spark Streaming | Allows real-time data processing through micro-batching. | DStreams, integration with Kafka, Flume, etc. |
| Spark MLlib | Provides scalable machine learning algorithms. | Algorithms for classification, clustering, regression, etc. |
| Spark GraphX | Supports graph processing and analysis for complex networks. | Graph algorithms, graph-parallel computation. |

2. PySpark Layered Architecture

| Layer | Description | Key Functions |
| --- | --- | --- |
| Application Layer | Contains user applications and custom PySpark code. | Custom data workflows and business logic. |
| Spark API Layer | Provides PySpark APIs to interact with Spark Core and other components. | High-level abstractions for data manipulation, SQL, streaming. |
| Spark Core Layer | Provides core functionalities including task scheduling and fault tolerance. | Data locality, memory management, RDD transformations. |
| Execution Layer | Manages the execution of Spark tasks across the cluster. | Task scheduling, load balancing, error handling. |
| Storage Layer | Manages data storage with distributed storage solutions. | Supports HDFS, S3, Cassandra, and other storage integrations. |

3. Spark Cluster Architecture

| Component | Description | Key Role |
| --- | --- | --- |
| Driver Node | Runs the Spark application, creates the SparkContext, and coordinates task execution. | Manages jobs, scheduling, and resource allocation. |
| Executor Nodes | Run tasks assigned by the driver node and process data on each partition. | Execute tasks, store intermediate results, and return output. |
| Cluster Manager | Manages the resources of the Spark cluster. | Allocates resources, manages executor lifecycle. |
| Distributed File System | Stores data across the cluster, ensuring high availability. | HDFS, S3, or other compatible storage for data sharing. |

4. Spark Execution Process

| Step | Description | Key Points |
| --- | --- | --- |
| 1. Application Submission | User submits a Spark application via command line or a UI. | Initializes job creation in Spark. |
| 2. Job Creation | Spark creates jobs and splits them into stages and tasks based on transformations. | Directed Acyclic Graph (DAG) creation based on data flow. |
| 3. Task Assignment | Driver assigns tasks to executors based on data locality and resource availability. | Utilizes data partitioning for parallelism. |
| 4. Task Execution | Executors run assigned tasks on data partitions. | Processes data in parallel across the cluster. |
| 5. Result Collection | Driver collects results from all executors, aggregates, and returns the final output. | Outputs final results to the user or designated storage. |
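
As a rough illustration of steps 2 through 5, the sketch below (the app name, file path, and column names are made up for the example) chains lazy transformations into a DAG; nothing executes until the final action, which triggers stage and task scheduling on the executors and result collection on the driver.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("execution-process-demo").getOrCreate()

# Transformations only build the logical plan (DAG); no tasks run yet.
orders = spark.read.parquet("/data/orders")            # illustrative input path
large_orders = orders.filter(F.col("amount") > 100)
by_country = large_orders.groupBy("country").count()   # will require a shuffle stage

# The action triggers job creation, stage/task scheduling, execution on
# executors, and result collection back to the driver.
by_country.show()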

5. Spark RDD Architecture

| Component | Description | Key Characteristics |
| --- | --- | --- |
| RDD (Resilient Distributed Dataset) | Immutable, distributed collection of objects across the cluster. | Fault-tolerant, lineage-based recovery, in-memory processing. |
| Partition | Subset of data within an RDD stored on a single node. | Enables parallel processing of data. |
| Task | Smallest unit of work that operates on a single partition. | Executes transformations or actions on data. |
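
A minimal sketch of how these pieces relate (the app name and numbers are arbitrary): the RDD is split into partitions, and each partition is processed by one task when an action runs.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

# An RDD split into 4 partitions; each partition is processed by one task.
rdd = sc.parallelize(range(1_000_000), numSlices=4)
print(rdd.getNumPartitions())   # 4

# The lambda runs inside one task per partition when the action below fires.
squares = rdd.map(lambda x: x * x)
print(squares.take(5))          # [0, 1, 4, 9, 16]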

6. Spark DataFrame Architecture

| Component | Description | Key Characteristics |
| --- | --- | --- |
| DataFrame | Distributed collection of data organized into named columns. | Schema-based data handling, optimized storage, SQL compatibility. |
| Dataset | Strongly-typed distributed collection of data (Java/Scala). | Type safety, combines features of RDDs and DataFrames. |
| Encoder | Converts data between JVM objects and Spark’s internal format for optimized serialization. | Efficient serialization/deserialization for faster processing. |
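
For example, a small DataFrame sketch (the column names and values are illustrative). Note that the strongly-typed Dataset API and Encoders are available only in Scala/Java; in PySpark you work with DataFrames.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dataframe-demo").getOrCreate()

# A DataFrame is a distributed collection of rows with named, typed columns.
df = spark.createDataFrame(
    [(1, "Alice", 34), (2, "Bob", 45)],
    schema="id INT, name STRING, age INT",
)
df.printSchema()
df.select("name").where("age > 40").show()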

7. Spark SQL Architecture

| Component | Description | Key Functions |
| --- | --- | --- |
| Catalyst Optimizer | Optimizes Spark SQL queries for enhanced performance. | Logical plan and physical plan optimization. |
| Query Planner | Plans the execution of SQL queries by selecting the best execution strategy. | Converts the optimized logical plan into a physical execution plan. |
| Execution Engine | Executes SQL queries using Spark’s distributed computing framework. | Leverages cluster resources for parallel query execution. |
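
You can watch the Catalyst optimizer and query planner at work by asking Spark for its query plans. A minimal sketch (the view and column names are made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("catalyst-demo").getOrCreate()

df = spark.createDataFrame([(1, "US"), (2, "IN")], "id INT, country STRING")
df.createOrReplaceTempView("users")

result = spark.sql("SELECT country, COUNT(*) AS n FROM users GROUP BY country")

# Prints the parsed, analyzed, and optimized logical plans plus the
# physical plan chosen for the execution engine.
result.explain(True)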

8. Spark Streaming Architecture

| Component | Description | Key Features |
| --- | --- | --- |
| DStream (Discretized Stream) | Continuous data stream split into micro-batches for processing. | Batch processing in near real-time. |
| Receiver | Ingests data from external sources like Kafka, Flume, etc. | Acts as the data source for streaming jobs. |
| Processor | Processes data within a DStream by applying transformations and actions. | Provides transformations similar to RDDs. |
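
A minimal DStream sketch using the legacy streaming API (the host and port are placeholders; newer applications typically use Structured Streaming instead):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="dstream-demo")
ssc = StreamingContext(sc, batchDuration=5)   # 5-second micro-batches

# The receiver ingests lines from a socket; every batch interval produces
# one RDD inside the DStream.
lines = ssc.socketTextStream("localhost", 9999)

# RDD-like transformations are applied to each micro-batch.
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()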

9. Spark Master-Slave Architecture

| Component | Description | Key Role |
| --- | --- | --- |
| Master Node | Coordinates resource allocation, task scheduling, and overall job management. | Central controller for the Spark cluster. |
| Worker Nodes | Execute tasks on assigned data partitions as directed by the Master. | Run computations, store data, and handle intermediate results. |
| Executor | A process launched on each worker node, responsible for executing tasks. | Runs task code, caches data, and sends results to the driver. |
| Task | Smallest unit of work, operating on a data partition and assigned to executors by the driver. | Executes transformations or actions on data partitions. |
| Driver Program | Initiates the Spark application and coordinates overall job execution. | Submits tasks to the master and receives results. |
| Cluster Manager | Manages resources for the entire Spark cluster (YARN, Mesos, Kubernetes). | Manages the lifecycle of executors and resource distribution. |
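
The driver program declares which cluster manager to use when it builds its session. A minimal sketch (the executor settings shown are placeholders; in practice they are usually supplied via spark-submit or spark-defaults.conf):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("cluster-demo")
    .master("yarn")                           # or "local[*]", "spark://host:7077", "k8s://..."
    .config("spark.executor.instances", "4")  # executors launched on worker nodes
    .config("spark.executor.memory", "4g")
    .config("spark.executor.cores", "2")
    .getOrCreate()
)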

10. Key Concepts in PySpark

| Concept | Description | Benefits |
| --- | --- | --- |
| Lazy Evaluation | Transformations are not executed until an action is called. | Optimizes query execution by grouping operations. |
| Fault Tolerance | Spark recovers lost RDDs using lineage information when nodes fail. | Increases reliability in distributed environments. |
| In-Memory Processing | Stores intermediate data in memory instead of writing to disk. | Enables faster data processing by avoiding I/O overhead. |
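
A short sketch tying these three concepts together (the input path and column names are illustrative):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("concepts-demo").getOrCreate()

events = spark.read.parquet("/data/events")          # illustrative input path
ok_events = events.filter(F.col("status") == "ok")   # lazy: nothing runs yet

# In-memory processing: keep the intermediate result cached for reuse.
ok_events.cache()

# Each action triggers execution; the second reuses the cached partitions.
print(ok_events.count())
ok_events.groupBy("country").count().show()

# Fault tolerance: if a cached partition is lost, Spark recomputes it from
# the lineage (read parquet -> filter) recorded in the plan.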

11. Common Use Cases for PySpark

  • Batch Processing: Large-scale ETL (Extract, Transform, Load) jobs.
  • Stream Processing: Real-time analytics, monitoring systems.
  • Machine Learning: Training models at scale using Spark MLlib.
  • Graph Processing: Social network analysis, recommendation systems.
  • Data Warehousing: Leveraging Spark SQL for querying structured datasets.

Which Parts of Your PySpark ETL Script Are Executed on the Driver, Master (YARN), or Executors

Understanding how PySpark scripts execute across different nodes in a cluster is crucial for optimization and debugging. Here’s a breakdown of how to identify which parts of your script run on the driver, master/YARN, executors, or the HDFS NameNode:

Driver:

  1. Script initialization: SparkSession creation, configuration, and setting up the SparkContext.
  2. Data ingestion definitions: Declaring reads from sources (e.g., files, databases); the data itself is loaded by executors when an action runs.
  3. Data transformation definitions: Defining transformations (e.g., maps, filters, joins), which only build the execution plan.
  4. Action invocation: Calling actions (e.g., count(), show(), write()) that trigger job execution and collect results.

Executor:

  1. Task execution: Executing tasks assigned by the driver (e.g., mapping, filtering, joining); see the sketch after this list.
  2. Data processing: Processing data in parallel across executor nodes.
  3. Shuffle operations: Exchanging data between executors during wide transformations (e.g., joins, aggregations).
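
To make the split concrete: code at the top level of your script runs on the driver, while functions you pass into transformations are shipped to executors and run inside tasks. A minimal sketch (the app name and data are arbitrary):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("driver-vs-executor").getOrCreate()
sc = spark.sparkContext

print("runs on the driver")                      # appears in the driver's console/log

def process_partition(rows):
    # Shipped to executors and run once per partition; anything printed here
    # goes to the executor logs, not the driver console.
    for r in rows:
        yield r * 2

rdd = sc.parallelize(range(8), numSlices=4)
doubled = rdd.mapPartitions(process_partition)   # defined on the driver, run on executors
print(doubled.collect())                         # action: results come back to the driver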

Master/YARN:

  1. Resource management: Managing resources (e.g., memory, CPU) for the Spark application.
  2. Job scheduling: Scheduling jobs and tasks on executor nodes.
  3. Monitoring: Tracking application progress and performance.

NameNode (HDFS):

  1. Data access: Locating the HDFS blocks (stored on DataNodes) that your job reads and writes.
  2. Metadata management: Managing file system metadata such as the namespace and block locations.

To identify which parts of your script run on each component, follow these steps:

Step 1: Enable Spark UI

Add the following configuration to your SparkSession:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Your App Name") \
    .config("spark.ui.enabled", "true") \
    .config("spark.ui.port", "4040") \
    .getOrCreate()

Step 2: Analyze Spark UI

Access the Spark UI at http://driver-node-ip:4040:

  1. Jobs: View job execution details, including task execution times.
  2. Stages: Analyze stage execution, including shuffle operations.
  3. Tasks: Examine task execution on individual executor nodes.
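
A small sketch for correlating sections of your script with what you see in the UI (the app name and job descriptions are arbitrary): uiWebUrl gives the exact UI address, and setJobDescription labels the jobs each section triggers.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ui-demo").getOrCreate()
sc = spark.sparkContext

print(sc.uiWebUrl)   # exact URL of this application's Spark UI

# Label the jobs each section triggers so they are easy to spot in the Jobs tab.
sc.setJobDescription("load and filter input")
df = spark.range(1_000_000).filter("id % 2 = 0")
df.count()

sc.setJobDescription("aggregate results")
df.groupBy((df.id % 10).alias("bucket")).count().show()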

Step 3: Use Debugging Tools

  1. Spark logs: Inspect logs for driver, executor, and master nodes.
  2. Print statements: Add print statements to your script to track execution.

Real-Life Use Case:

Suppose you have a PySpark script like the following:

# Driver: initialize the SparkSession and define the input reads (lazy)
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ETL").getOrCreate()
data = spark.read.parquet("input_data")
data2 = spark.read.parquet("lookup_data")   # second input for the join (illustrative path)

# Driver: define transformations; these only build the execution plan
transformed_data = data.filter(data["age"] > 18).join(data2, "id")
transformed_data = transformed_data.repartition(10)

# Driver triggers the write action; executors run the filter, join,
# shuffle (repartition), and write the output files
transformed_data.write.parquet("output_data")

In this example:

  • The driver initializes the SparkSession, defines the reads and transformations, and triggers the action.
  • The executor executes tasks assigned by the driver, including filtering, joining, and repartitioning.
  • The master/YARN manages resources, schedules jobs, and monitors application progress.

By analyzing the Spark UI, logs, and debugging tools, you can gain insights into which parts of your script run on each component.

