πŸš€ PySpark Architecture & Execution Engine β€” Complete Guide



πŸ”₯ 1. Spark Evolution Recap

  • 2009: AMPLab (UC Berkeley)
  • Goal: Overcome Hadoop MapReduce slowness
  • 2013: Donated to Apache Foundation
  • Built using Scala, runs on JVM, supports Python (via PySpark)

βš”οΈ 2. Spark vs Hadoop (Core Comparison)

| Feature | Hadoop MapReduce | Apache Spark |
|---|---|---|
| Engine | Disk-based | In-memory |
| Languages | Primarily Java | Scala, Python, R, SQL |
| Iterative Support | Poor (writes to disk between steps) | Native (in-memory) |
| Speed | Slow (I/O bound) | Fast (RAM usage) |
| Ecosystem | Limited | Unified stack |

🧱 3. Spark Design Goals

  • Speed: Up to 100x faster than Hadoop MapReduce for in-memory workloads
  • Unified Engine: Batch, Streaming, MLlib, GraphX
  • Language APIs: Python (PySpark), Scala, Java, SQL
  • Fault-Tolerant: Lineage tracking, RDDs
  • Scalability: Tested with 8000+ nodes

🐍 4. Why PySpark? (Python + Spark Bridge)

PySpark is the Python API to the Spark engine, which is natively written in Scala/JVM.

πŸ”„ How Python Code Reaches JVM:

[Python Code]
    ↓
[Python Driver Process]
    ↕ py4j Gateway
[JVM Driver (SparkContext)]
    ↓
[Executors (JVM Workers)]

πŸ”§ What is py4j?

  • A Java library embedded inside Spark that exposes JVM objects to Python.
  • Python calls Java methods as if they were native β€” SparkContext, RDDs, etc.

βœ… One-liner: PySpark uses py4j to route Python commands to the JVM-based Spark engine.
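A minimal sketch of how this looks in practice (assuming a local PySpark installation): building a SparkSession starts (or attaches to) the JVM and opens the py4j gateway underneath.

from pyspark.sql import SparkSession

# Building the session launches the JVM-side driver and the py4j gateway
spark = SparkSession.builder.appName("py4j-bridge-demo").getOrCreate()

# The Python SparkContext object is a proxy for a JVM SparkContext
print(spark.sparkContext.master)   # e.g. local[*]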


βš™οΈ 5. Master-Slave Architecture β€” Driver & Executors

🧠 Core Components:

| Component | Role |
|---|---|
| Driver | Orchestrates everything, holds the SparkContext |
| SparkContext | Entry point; connects to the Cluster Manager |
| Cluster Manager | Allocates Executors (YARN, K8s, Standalone) |
| Executor(s) | JVM processes on worker nodes; perform tasks + hold data |
| Task | Atomic unit of execution (mapped to a partition) |
| Stage | A set of tasks with no shuffle boundary |
| Job | Triggered by an action (e.g. .show(), .count()) |

πŸ” Diagram: Simplified Flow

Python Code
   ↓
SparkSession β†’ SparkContext
   ↓
DAG Scheduler β†’ Stage Plan
   ↓
Task Scheduler β†’ Tasks
   ↓
Executors (on Worker Nodes)
   ↓
Reads/Writes Data β†’ Returns Result
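A rough illustration of the job/task mapping above (a sketch, assuming an active SparkSession named spark): each action launches a job, and the number of tasks in a stage matches the number of partitions.

# 8 partitions -> the count() action runs one job whose single stage has 8 tasks
rdd = spark.sparkContext.parallelize(range(1000), numSlices=8)
print(rdd.getNumPartitions())   # 8
print(rdd.count())              # action: triggers a job on the executors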

πŸ” 6. DAG Scheduler & Task Lifecycle

πŸ”Ή What Happens When You Run .show()?

  1. Logical Plan β€” based on DataFrame transformations
  2. Optimized Plan β€” Catalyst optimizes for pushdowns, projections
  3. Physical Plan β€” translates to stages
  4. Stages β€” split at shuffle boundaries: narrow transformations (e.g., map) stay within a stage, wide ones (e.g., join) start a new stage
  5. Tasks β€” 1 per partition, assigned to executors

βœ… Lazy Evaluation: Transformations build a plan; actions trigger execution
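A quick way to see lazy evaluation in action (a sketch, assuming an active SparkSession named spark):

df = spark.range(1_000_000)             # transformation chain starts; no job yet
doubled = df.selectExpr("id * 2 AS v")  # still no job
doubled.explain()                       # prints the physical plan; still no job
print(doubled.count())                  # action -> a job is scheduled and executed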


πŸ”Œ 7. py4j + JVM Bridge Deep Dive

πŸ”§ JVM Roles

  • PySpark driver runs in Python
  • Real execution happens on the JVM via py4j
  • Each RDD/DataFrame operation is wrapped and sent over

πŸ” py4j Functions

| Role | Example |
|---|---|
| JVM β†’ Python callbacks | To access Python lambdas |
| Python β†’ JVM proxy | All Spark API calls |
| Serialization layer | Auto-translates Python β†’ JVM |
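A tiny demonstration of the bridge (assuming an active SparkSession named spark) β€” these underscored attributes are internal and shown only to make the mechanism visible, not a public API:

# Internal (underscored) attributes; for illustration only
jvm = spark.sparkContext._jvm                               # py4j JVMView
print(jvm.java.lang.System.getProperty("java.version"))     # a Java call made from Python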

🧡 8. Cluster Managers: Modes of Deployment

| Mode | Use Case | Description |
|---|---|---|
| Local | Dev/Test | Run Spark on 1 machine |
| Standalone | Small Prod | Spark’s built-in manager |
| YARN | Hadoop Cluster | Industry-standard scheduler |
| Kubernetes | Cloud-Native | Elastic containers, scalable, cloud-friendly |

Examples:

spark-submit --master yarn --deploy-mode cluster script.py
spark-submit --master k8s://host --conf spark.kubernetes...

πŸ—ƒοΈ 9. Spark Memory Hierarchy

  • Storage Memory: Caches RDDs/DataFrames
  • Execution Memory: Used for shuffles, sorts, joins, and aggregations
  • Unified Memory Model: Execution and storage share one pool inside each executor
  • Broadcast variables: Small datasets replicated to every executor (held in storage memory)

βœ… Tune using:

--executor-memory 4g
--driver-memory 2g
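The same settings can also be supplied when the session is built (a sketch; values are illustrative, and in client/local mode driver memory should still come from spark-submit because the driver JVM is already running):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("memory-tuning-demo")
         .config("spark.executor.memory", "4g")
         .config("spark.memory.fraction", "0.6")   # unified (execution + storage) share of the heap
         .getOrCreate())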

πŸ“¦ 10. Monitoring via Spark UI

| Tab | Insight Provided |
|---|---|
| Jobs | Each action triggers a job |
| Stages | Stage DAGs, parallelism, shuffles |
| SQL | Logical and physical plans |
| Executors | Memory usage, task failures, GC time |
| Storage | Cached RDDs/DataFrames |

Access via:

  • http://localhost:4040
  • Or Spark History Server (for cluster jobs)
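For a running application you can also ask the SparkContext for its UI address (a sketch, assuming an active SparkSession named spark):

# Returns the live UI URL for this application, or None if the UI is disabled
print(spark.sparkContext.uiWebUrl)   # e.g. http://<driver-host>:4040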

πŸ’Ύ 11. Delta Lake + Databricks Essentials

Delta Lake runs on top of Spark + Parquet and adds:

| Feature | Purpose |
|---|---|
| ACID Transactions | Safe multi-writer concurrency |
| Schema Evolution | Add/modify columns automatically |
| MERGE INTO | SCD1/SCD2 upserts |
| Time Travel | Read past versions using version/timestamp |

Example:

df.write.format("delta").save("/mnt/delta_table")

# Upsert
from delta.tables import DeltaTable

(DeltaTable.forPath(spark, "/mnt/table").alias("t")
    .merge(df.alias("s"), "t.id = s.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())

πŸ§ͺ 12. Complete Execution Flow (Real Job)

# PySpark Job
df = spark.read.csv("hdfs://cluster/sales.csv")
df_filtered = df.filter("region = 'Asia'")
df_grouped = df_filtered.groupBy("product").sum("amount")
df_grouped.write.parquet("hdfs://results")

Internal Execution:

1. SparkSession β†’ SparkContext initialized
2. DAG created (3 transformations)
3. Catalyst optimizes it
4. DAG β†’ Stages (filter, groupBy)
5. Task Scheduler assigns tasks
6. YARN allocates executors
7. Executors run tasks per partition
8. Shuffle for groupBy
9. Result written to HDFS
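To see what Catalyst actually produced for this job, you can print the plans before writing (a sketch using the df_grouped DataFrame from the snippet above):

# Prints parsed, analyzed, optimized logical plans plus the physical plan
df_grouped.explain(True)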

🧾 13. Interview Cheat Sheet

βœ… Terminologies to Memorize:

| Concept | Explanation |
|---|---|
| RDD | Immutable, fault-tolerant distributed collection |
| DataFrame | Distributed table with a schema |
| Stage | Set of tasks between shuffle boundaries |
| Task | Unit of work executed per partition |
| Shuffle | Data movement across nodes |
| Broadcast Join | Ships a small table to every executor to join with a large one |
| Catalyst | Query optimizer |
| Tungsten | Memory + code generation engine |
| py4j | Python ↔ JVM communication |
| Cluster Manager | Allocates compute resources |

🧠 One-Line Summary

PySpark wraps Python logic and routes it to Spark’s JVM backend using py4j. The Driver constructs the job, builds a DAG of transformations, breaks it into stages, and assigns tasks to Executors across the cluster, which execute tasks in parallel, read/write from HDFS, and return results via the SparkContext β€” all monitored using the Spark UI.


Interview-Ready Cheat Sheet

Core Concepts

  • RDD is the base abstraction β†’ fault-tolerant, distributed
  • DataFrame is a distributed table with schema
  • Wide Transformations (e.g., join) cause shuffle
  • Actions trigger execution (count(), show())

PySpark Interview One-Liners

  • SparkSession is the unified entry point in PySpark β‰₯ 2.0
  • Broadcast joins are optimal for small lookup tables
  • Executor memory is not the same as driver memory
  • Shuffle is expensive β†’ avoid unless necessary
  • persist() vs cache() β†’ cache() uses the default storage level (MEMORY_AND_DISK for DataFrames, MEMORY_ONLY for RDDs); persist() lets you choose a storage level β€” see the sketch below
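A minimal caching sketch (assuming an active SparkSession named spark):

from pyspark import StorageLevel

df = spark.range(1_000_000)
df.cache()                           # default storage level
df.count()                           # an action materializes the cache

df2 = spark.range(1_000_000)
df2.persist(StorageLevel.DISK_ONLY)  # persist() lets you pick the level explicitly
df2.count()
df2.unpersist()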

🧠 One-Liner Architecture Summary

Apache Spark uses a Driver-Executor model, where the Driver acts as the master to coordinate work, Executors run distributed tasks, and a Cluster Manager allocates resources. Spark builds a DAG of transformations, splits it into stages and tasks, and executes them in parallel on executors, using memory/disk/shuffle/broadcast optimizations. Storage can be HDFS, S3, Delta Lake, or JDBC, with monitoring via the Spark UI and event logs.


πŸ”š Bonus: Real Job Flow Trace (Condensed for Interview)

  1. Spark code is submitted β†’ Driver starts
  2. SparkContext connects to cluster
  3. Transformations β†’ DAG β†’ Optimized Plan
  4. Actions trigger DAG Scheduler
  5. Cluster Manager allocates Executors
  6. Tasks are assigned and run
  7. Data read/shuffled/written by Executors
  8. Results sent to Driver or written to output
  9. Progress tracked in Web UI or History Server

Below is the complete Spark Architecture mind map (in text outline form) and a mock interview round focused entirely on PySpark architecture and internals.


🧠 Mind Map: Spark Architecture (Text Outline Style)

Apache Spark Architecture
β”‚
β”œβ”€β”€ 1. High-Level Abstractions
β”‚   β”œβ”€β”€ RDD: Resilient Distributed Dataset
β”‚   β”œβ”€β”€ DataFrame: Schema-aware RDD
β”‚   └── Dataset (Scala/Java only)
β”‚
β”œβ”€β”€ 2. Language Bindings
β”‚   β”œβ”€β”€ Scala (Native)
β”‚   β”œβ”€β”€ Java
β”‚   β”œβ”€β”€ Python (PySpark via py4j)
β”‚   └── R
β”‚
β”œβ”€β”€ 3. SparkSession
β”‚   └── Entry point for DataFrame/SQL API
β”‚
β”œβ”€β”€ 4. Execution Components
β”‚   β”œβ”€β”€ Driver
β”‚   β”‚   β”œβ”€β”€ Runs user code
β”‚   β”‚   β”œβ”€β”€ Hosts SparkContext
β”‚   β”‚   └── Builds DAG, manages job lifecycle
β”‚   β”œβ”€β”€ Executors
β”‚   β”‚   β”œβ”€β”€ JVM processes on worker nodes
β”‚   β”‚   β”œβ”€β”€ Execute tasks
β”‚   β”‚   └── Cache data & shuffle blocks
β”‚   └── Cluster Manager
β”‚       β”œβ”€β”€ Standalone
β”‚       β”œβ”€β”€ YARN
β”‚       └── Kubernetes
β”‚
β”œβ”€β”€ 5. Internal Engines
β”‚   β”œβ”€β”€ Catalyst Optimizer (logical/physical plan)
β”‚   └── Tungsten (memory management & codegen)
β”‚
β”œβ”€β”€ 6. Execution Flow
β”‚   β”œβ”€β”€ Transformations (lazy)
β”‚   β”œβ”€β”€ Actions (trigger execution)
β”‚   β”œβ”€β”€ Logical Plan β†’ Optimized Plan β†’ Physical Plan
β”‚   β”œβ”€β”€ DAG Scheduler β†’ Stages
β”‚   └── Task Scheduler β†’ Tasks β†’ Executors
β”‚
β”œβ”€β”€ 7. Memory Management
β”‚   β”œβ”€β”€ Execution Memory (shuffles, joins)
β”‚   β”œβ”€β”€ Storage Memory (caching)
β”‚   └── Reserved Memory
β”‚
β”œβ”€β”€ 8. Storage & IO
β”‚   β”œβ”€β”€ HDFS / S3 / Azure / GCS
β”‚   β”œβ”€β”€ JDBC / Hive Metastore
β”‚   └── Delta Lake
β”‚
β”œβ”€β”€ 9. Monitoring Tools
β”‚   β”œβ”€β”€ Spark UI (4040)
β”‚   β”œβ”€β”€ History Server
β”‚   └── Event Logs
β”‚
β”œβ”€β”€ 10. PySpark Internals
β”‚   β”œβ”€β”€ Python API via py4j
β”‚   β”œβ”€β”€ Python β†’ JVM bridging
β”‚   └── Serialization overhead considerations
β”‚
└── 11. Optimizations
    β”œβ”€β”€ Broadcast joins
    β”œβ”€β”€ Caching / Persistence
    β”œβ”€β”€ Repartitioning / Coalesce
    └── AQE (Adaptive Query Execution)

You can visualize this as a circular or radial mind map with Apache Spark in the center.


🎯 Mock Interview: PySpark Architecture & Internals

Level: Mid to Senior
Format: Q&A + Suggested Ideal Answers

Q1. What is the role of the Driver in Spark?

βœ… Answer:
The Driver is the master process that runs the user’s main function and coordinates Spark jobs. It contains the SparkContext, builds the DAG, converts it into stages and tasks, and sends tasks to executors.


Q2. What happens when you call an action like .count() on a DataFrame?

βœ… Answer:

  • A logical plan is built based on prior transformations
  • Catalyst optimizes it into an optimized and physical plan
  • DAG Scheduler splits it into stages
  • Task Scheduler assigns tasks to executors
  • Results are aggregated and returned to the driver

Q3. How does PySpark communicate with the Spark engine?

βœ… Answer:
PySpark uses a library called py4j to send Python commands to the JVM backend, which executes them using Spark’s Scala-based engine. The results are passed back via py4j.


Q4. What is the difference between an Executor and a Task?

βœ… Answer:
An Executor is a JVM process launched on worker nodes responsible for executing code and storing data.
A Task is the smallest unit of work sent to an executor, and corresponds to one data partition.


Q5. What are narrow and wide transformations?

βœ… Answer:

  • Narrow transformations (e.g., map, filter) do not require shuffling β€” data stays on the same node.
  • Wide transformations (e.g., groupBy, join) require shuffle, moving data across the network.
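You can see the difference in the physical plan: wide transformations add an Exchange (shuffle) node (a sketch, assuming an active SparkSession named spark):

from pyspark.sql.functions import col

df = spark.range(1_000_000).withColumn("bucket", col("id") % 10)
df.filter("id > 100").explain()          # narrow: no Exchange node in the plan
df.groupBy("bucket").count().explain()   # wide: plan contains Exchange hashpartitioning(bucket, ...)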

Q6. How is memory managed in Spark Executors?

βœ… Answer:
Memory in an executor is split into:

  • Execution memory: Used for operations like joins/sorts.
  • Storage memory: For caching RDD/DataFrame.
  • Reserved memory: For internal Spark needs.
    Unified Memory Model (default since Spark 1.6) shares space between execution and storage.

Q7. What is Catalyst Optimizer?

βœ… Answer:
Catalyst is Spark’s query optimization engine for transforming logical plans into efficient physical plans. It supports constant folding, predicate pushdown, projection pruning, etc.


Q8. How do you handle small lookup tables during joins?

βœ… Answer:
Use Broadcast Join β€” small DataFrame is broadcast to all executors to avoid shuffle.

from pyspark.sql.functions import broadcast
df.join(broadcast(df_small), "key").show()

Q9. What is the role of py4j in PySpark?

βœ… Answer:
py4j is the gateway that allows Python code to invoke JVM methods. It bridges the Python driver with Spark’s Scala backend.


Q10. Explain the complete life of a Spark job.

βœ… Answer:

  1. SparkSession starts; SparkContext initialized
  2. Lazy transformations build a logical plan
  3. Catalyst optimizes it β†’ physical plan
  4. DAG scheduler breaks it into stages
  5. Task scheduler schedules tasks per partition
  6. Executors run tasks and return results
  7. Spark UI shows status of all components

Bonus: ⚠️ Common Pitfalls

| Mistake | Solution |
|---|---|
| Too many small files (S3/HDFS) | Use .coalesce() or Delta OPTIMIZE |
| Shuffles on large datasets | Use broadcast joins or repartitioning |
| Overuse of UDFs | Prefer built-in Spark SQL functions |
| Caching everything | Cache only reused data |
| Executor OOMs | Tune memory and partition sizes |
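For the small-files and shuffle rows above, a quick repartition/coalesce sketch (the DataFrame, column name, path, and numbers are illustrative):

# repartition() does a full shuffle and can rebalance skewed data
df_balanced = df.repartition(200, "region")

# coalesce() only merges partitions (no full shuffle) β€” useful to reduce output file count
df_balanced.coalesce(10).write.mode("overwrite").parquet("/tmp/sales_output")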

The next sections go deeper into three topics: Delta Lake architecture and time travel internals, the Raw β†’ Bronze β†’ Silver β†’ Gold lakehouse pipeline design, and a second mock interview round.


🧱 DELTA LAKE ARCHITECTURE & TIME TRAVEL β€” INTERNALS


βœ… What is Delta Lake?

Delta Lake is a storage layer on top of Parquet that brings ACID transactions, schema enforcement, time travel, and merge operations to Apache Spark.


🧰 Delta Lake Key Features

| Feature | Description |
|---|---|
| ACID Transactions | Atomicity, Consistency, Isolation, Durability across Spark jobs |
| Time Travel | Query data as of previous versions (via version or timestamp) |
| Schema Enforcement | Enforce column names/types on write |
| Schema Evolution | Automatically handle schema changes when enabled |
| Merge (Upserts) | Handle SCD Type 1 & Type 2 use cases |
| OPTIMIZE / ZORDER | Improve scan speed via file compaction and indexing |
| VACUUM | Clean obsolete files after the retention period |

πŸ”¬ Internals: How Delta Lake Works

  1. Each write creates a new version (JSON + Parquet files).
  2. Transaction log is stored in _delta_log directory.
  3. Read operations refer to the latest version (or past if versionAsOf is used).
  4. All changes are append-only β€” no overwrite of existing data files.
  5. Delta maintains sequentially numbered JSON commit files (e.g., 00000000000000000000.json) plus periodic Parquet checkpoints for fast log replay.
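You can inspect those commits directly from the table history (a sketch; the path assumes an existing Delta table):

from delta.tables import DeltaTable

# One row per commit recorded in _delta_log
history_df = DeltaTable.forPath(spark, "/mnt/sales").history()
history_df.select("version", "timestamp", "operation").show(truncate=False)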

πŸ“¦ Delta Time Travel

# Read latest
df = spark.read.format("delta").load("/mnt/sales")

# Read version 5
df_v5 = spark.read.format("delta").option("versionAsOf", 5).load("/mnt/sales")

# Read as of timestamp
df_time = spark.read.format("delta").option("timestampAsOf", "2024-12-01 12:00:00").load("/mnt/sales")

πŸ§ͺ Merge (Upsert) Example

from delta.tables import DeltaTable

delta_table = DeltaTable.forPath(spark, "/mnt/sales")

delta_table.alias("t").merge(
    updates_df.alias("u"),
    "t.id = u.id"
).whenMatchedUpdateAll().whenNotMatchedInsertAll().execute()

🧹 Optimize & Vacuum

spark.sql("OPTIMIZE delta.`/mnt/sales` ZORDER BY (region, product)")

spark.sql("VACUUM delta.`/mnt/sales` RETAIN 168 HOURS")

πŸ“ RAW β†’ BRONZE β†’ SILVER β†’ GOLD: LAKEHOUSE PIPELINE DESIGN


πŸ”Ά 1. Raw Layer (Landing)

  • Source: CSVs, JSON, Kafka, APIs, RDBMS
  • Storage: Landed as-is, no schema enforcement; kept for audit/replay
  • Example path: /datalake/raw/sales/

raw_df = spark.read.option("multiLine", True).json("/raw/sales/")

🟫 2. Bronze Layer (Cleaned & Ingested)

  • Validated, minimally cleaned
  • Add metadata: _ingest_date, _source
  • Write to Delta Table
  • Partitioned by key like date or region

from pyspark.sql.functions import current_date

bronze_df = raw_df.withColumn("_ingest_date", current_date())
bronze_df.write.format("delta").mode("overwrite").save("/bronze/sales/")

βšͺ 3. Silver Layer (Validated + Transformed)

  • Business-ready format
  • Joins with lookup tables
  • Conform dimensions (e.g., DateDim, ProductDim)
  • Column renaming, derived metrics
from pyspark.sql.functions import col

silver_df = bronze_df \
    .filter("amount > 0") \
    .join(region_dim, "region_id") \
    .withColumn("profit", col("amount") - col("cost"))

silver_df.write.format("delta").mode("overwrite").save("/silver/sales/")

🟑 4. Gold Layer (Aggregated/Reporting)

  • Dashboard-ready format
  • KPI aggregations, rollups
  • Materialized views
from pyspark.sql import functions as F

gold_df = silver_df.groupBy("region", "month").agg(
    F.sum("profit").alias("monthly_profit"),
    F.count("*").alias("txn_count")
)

gold_df.write.format("delta").mode("overwrite").save("/gold/sales_summary/")

πŸ’‘ BONUS: Real-Time Trigger (Auto Loader + Streaming)

from pyspark.sql.functions import current_timestamp

raw_stream = spark.readStream.format("cloudFiles") \
    .option("cloudFiles.format", "json") \
    .load("/raw/sales")

bronze_stream = raw_stream.withColumn("_ingest_time", current_timestamp())

bronze_stream.writeStream.format("delta") \
    .option("checkpointLocation", "/checkpoints/bronze/") \
    .start("/bronze/sales")

πŸ“‹ MOCK INTERVIEW ROUND 2 β€” Delta Lake, Lakehouse, and Pipelines


Q1. How does Delta Lake achieve ACID transactions in Spark?

βœ… Answer:
Delta uses a transaction log (_delta_log) that records every commit as a JSON file, similar to a write-ahead log (WAL). Operations are atomic, and conflicting concurrent writes are detected and retried via optimistic concurrency control.


Q2. What is the difference between Delta Lake and regular Parquet files?

βœ… Answer:

| Feature | Delta Lake | Parquet |
|---|---|---|
| ACID transactions | βœ… | ❌ |
| Time travel | βœ… | ❌ |
| Schema evolution | βœ… (with control) | Limited |
| Merge/Update/Delete | βœ… | ❌ |
| Performance tuning | OPTIMIZE, ZORDER | Manual file management |
| Reliability | Higher | Lower |

Q3. How would you implement SCD Type 2 using Delta Merge?

βœ… Answer:
Use merge() with whenMatchedUpdate and whenNotMatchedInsert, and maintain current_flag, start_date, and end_date columns to track changes over time (see the sketch below).
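A simplified two-step SCD Type 2 sketch (the table path, updates_df, and column names are hypothetical; a production version would also handle unchanged rows and late-arriving data):

from pyspark.sql import functions as F
from delta.tables import DeltaTable

dim = DeltaTable.forPath(spark, "/mnt/customer_dim")

# Step 1: expire the currently active rows that have an incoming change
(dim.alias("t")
    .merge(updates_df.alias("s"), "t.id = s.id AND t.current_flag = true")
    .whenMatchedUpdate(set={
        "current_flag": F.lit(False),
        "end_date": F.current_date()})
    .execute())

# Step 2: append the new versions as the current rows
(updates_df
    .withColumn("current_flag", F.lit(True))
    .withColumn("start_date", F.current_date())
    .withColumn("end_date", F.lit(None).cast("date"))
    .write.format("delta").mode("append").save("/mnt/customer_dim"))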


Q4. What is the purpose of Bronze/Silver/Gold architecture?

βœ… Answer:

  • Bronze: Raw cleaned data, audit trails
  • Silver: Business logic applied, conform dimensions
  • Gold: Aggregated, report-ready data
    It enforces data quality, version control, and modularity.

Q5. How is Time Travel internally maintained in Delta Lake?

βœ… Answer:
Each write creates a new snapshot/version in _delta_log. The log stores metadata and the exact set of files for that version. versionAsOf or timestampAsOf fetches that version during read.


Q6. How do you prevent VACUUM from deleting old data prematurely?

βœ… Answer:
Delta has a default 7-day retention (168 hours). Set lower manually only if absolutely safe:

SET spark.databricks.delta.retentionDurationCheck.enabled = false;
VACUUM ... RETAIN 1 HOURS

Q7. How do you ensure schema consistency in Bronze β†’ Silver flow?

βœ… Answer:
Use mergeSchema=True when writing to allow additive schema changes, or rely on Delta's default schema enforcement to reject incompatible writes and protect downstream pipelines.

df.write.option("mergeSchema", True).format("delta")...
