Here’s a clear and concise breakdown of multiprocessing vs multithreading in Python, with differences, real-world data engineering use cases, and code illustrations.
🧠 Core Difference

| Feature | Multithreading | Multiprocessing |
| --- | --- | --- |
| Concurrency type | I/O-bound | CPU-bound |
| Threads/processes | Multiple threads in the same process (share memory) | Multiple processes (each with its own memory) |
| GIL impact | Affected by Python's GIL (Global Interpreter Lock) | Bypasses the GIL; true parallelism |
| Memory usage | Shared memory (less RAM used) | Separate memory (more RAM used) |
| Best for | I/O-bound tasks (API calls, file reads/writes) | CPU-bound tasks (data crunching, large transformations) |
| Stability | Less stable if a thread crashes | More isolated and fault-tolerant |
🔧 When to Use What?

| Task Type | Use |
| --- | --- |
| Reading many CSVs from disk | Multithreading (I/O-bound) |
| Fetching multiple APIs | Multithreading (network-bound) |
| Parsing XML/JSON | Multithreading |
| Large data transformations | Multiprocessing (CPU-intensive) |
| Complex calculations (e.g., ML) | Multiprocessing |
| Cleaning large 1–10 GB chunks | Multiprocessing |
🧪 Example: Multithreading (I/O-bound)

```python
import requests
from concurrent.futures import ThreadPoolExecutor

urls = ["https://httpbin.org/delay/1"] * 5  # simulates a slow API

def fetch(url):
    response = requests.get(url)
    return url, response.status_code

with ThreadPoolExecutor(max_workers=5) as executor:
    results = executor.map(fetch, urls)

for r in results:
    print(r)
```
✅ This runs in ~1 second total (instead of 5 sec sequentially).
🧪 Example: Multiprocessing (CPU-bound)

```python
from multiprocessing import Pool
import time

def square(n):
    time.sleep(1)  # stand-in for one second of heavy computation
    return n * n

numbers = [1, 2, 3, 4, 5]

if __name__ == "__main__":  # guard needed when run as a script with the "spawn" start method
    with Pool(processes=5) as pool:
        results = pool.map(square, numbers)
    print(results)
```
✅ Also runs in ~1 second total, because each number is squared in a separate process.
🧱 Real-World Use Case Comparison

| Scenario | Recommended | Why? |
| --- | --- | --- |
| ETL: fetch from 10 APIs | Multithreading | Low CPU, high wait time |
| ETL: transform 5 large CSVs | Multiprocessing | High CPU, parallel processing |
| Web scraper fetching 100 URLs | Multithreading | Mostly network delay |
| ML training or parallel prediction | Multiprocessing | CPU-intensive work |
| Batch read/write to disk | Multithreading | I/O-bound operation |
| Pandas chunked transform | Multiprocessing | Heavy CPU usage on large data |
🔐 Note on GIL (Global Interpreter Lock)
CPython's GIL prevents more than one native thread from executing Python bytecode at a time, so threads give no speedup for CPU-bound pure-Python work. They still help for I/O-bound work, because the GIL is released while a thread waits on I/O.
Below is a fully in-line Jupyter-style notebook (Markdown + code cells) that you can copy-paste straight into any .ipynb or run cell-by-cell in an interactive notebook. It covers shared imports, a retry decorator with back-off, OOP loader classes with a custom error, sample-CSV generation, and a threading-vs-multiprocessing ETL comparison.
```python
# Cell 1 ─ Imports
import os, time, math, random, requests, csv, pathlib
from pathlib import Path
from concurrent.futures import ThreadPoolExecutor
from multiprocessing import Pool, cpu_count
import pandas as pd
```
```python
# Cell 2 ─ Retry decorator (with optional back-off)
def retry(n=3, delay=1, backoff=1):
    """
    Retry a function up to *n* times.
    :param delay: initial wait in seconds
    :param backoff: back-off multiplier (1 = constant delay)
    """
    def decorator(fn):
        def wrapper(*args, **kwargs):
            wait = delay
            for attempt in range(1, n + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception as e:
                    if attempt == n:
                        raise
                    print(f"[retry] attempt {attempt} failed: {e} → waiting {wait}s")
                    time.sleep(wait)
                    wait *= backoff
        return wrapper
    return decorator
```
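A quick sanity check of the decorator's back-off behaviour (the flaky() helper below is illustrative, not part of the original notebook):

```python
# Illustrative only: a function that fails twice, then succeeds,
# to show the retry/back-off behaviour of the Cell-2 decorator.
calls = {"count": 0}

@retry(n=3, delay=0.1, backoff=2)
def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ValueError("transient failure")
    return "ok"

print(flaky())  # prints two [retry] messages, then "ok"
```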
```python
# Cell 3 ─ OOP: loaders & custom error
class DataLoadError(Exception):
    """Raised when a data source cannot be loaded."""
    pass

class DataLoader:
    def __init__(self, source):
        self.source = source

    def load(self):
        raise NotImplementedError

class CSVLoader(DataLoader):
    def load(self):
        if not self.source.endswith(".csv"):
            raise DataLoadError("Only CSV supported")
        return pd.read_csv(self.source)

class APILoader(DataLoader):
    @retry(n=2, delay=2)
    def load(self):
        print(f"Fetching {self.source}")
        r = requests.get(self.source, timeout=2)
        r.raise_for_status()
        return r.json()
```
```python
# Cell 4 ─ Helper: generate sample CSVs (~10 MB total, adjustable)
def generate_csv_files(n_files=5, rows=200_000, out_dir="sample_csv"):
    Path(out_dir).mkdir(exist_ok=True)
    files = []
    for i in range(n_files):
        path = Path(out_dir) / f"data_{i}.csv"
        if path.exists():          # avoid regenerating
            files.append(str(path))
            continue
        df = pd.DataFrame({
            "id": range(rows),
            "value": [random.random() for _ in range(rows)]
        })
        df.to_csv(path, index=False)
        files.append(str(path))
    return files

csv_files = generate_csv_files()
csv_files[:2]  # peek
```
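Cells 5–7 (the timing helper and the two ETL functions that Cell 8 expects) are not reproduced here. Below is a minimal sketch of what they might look like; the names timer, light_transform, heavy_transform, threaded_etl, and multiproc_etl and their bodies are assumptions for illustration, not the original notebook's code.

```python
# Cells 5–7 (sketch) ─ helpers assumed by Cell 8; names and transforms are illustrative.
from contextlib import contextmanager

@contextmanager
def timer(label):
    start = time.perf_counter()
    yield
    print(f"{label}: {time.perf_counter() - start:.2f}s")

def light_transform(path):
    # I/O-heavy: load via the Cell-3 loader, then apply a cheap column transform
    df = CSVLoader(path).load()
    df["value"] = df["value"] * 2
    return df

def heavy_transform(path):
    # CPU-heavy: an artificial per-element pure-Python math workload
    df = pd.read_csv(path)
    df["value"] = [math.sin(v) ** 2 + math.cos(v) ** 2 for v in df["value"]]
    return df

def threaded_etl(files, max_workers=5):
    with ThreadPoolExecutor(max_workers=max_workers) as ex:
        frames = list(ex.map(light_transform, files))
    return pd.concat(frames, ignore_index=True)

def multiproc_etl(files, workers=5):
    # Note: on spawn-based platforms (Windows/macOS), worker functions defined in a
    # notebook may need to live in an importable module to be picklable.
    with Pool(processes=workers) as pool:
        frames = pool.map(heavy_transform, files)
    return pd.concat(frames, ignore_index=True)
```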
```python
# Cell 8 ─ Run and compare
with timer("THREADING"):
    thread_out = threaded_etl(csv_files, max_workers=5)

with timer("MULTIPROCESSING"):
    mp_out = multiproc_etl(csv_files, workers=5)

display(thread_out.head())
display(mp_out.head())
```
## Observations
* **THREADING** excels when the bottleneck is disk/network I/O
(CSV read → light transform).
* **MULTIPROCESSING** shines when each task is CPU‑intensive
(heavy math transform), bypassing the GIL for true parallelism.
Tune `max_workers` / `workers` based on CPU cores, file count, and memory.
Can the multithreading / multiprocessing patterns be “lifted” to a distributed PySpark environment?
Yes — but you need to translate where the parallelism happens and who owns what work. Below is a concise guide that maps each element of the earlier notebook to its Spark‑native equivalent, highlights the “gotchas,” and finishes with a minimal PySpark pipeline skeleton you can paste into a Databricks / EMR / local‐Spark notebook.
1. What changes and what stays the same?

| Earlier Pattern | Role in a single-node script | Spark-Native Equivalent | Key Adaptation Points |
| --- | --- | --- | --- |
| ThreadPool / multiprocessing | Gave parallelism on one machine | Spark executors & tasks | Don't spawn extra threads/processes inside executors; Spark already parallelises. |
| Pandas read_csv | I/O-bound file reads | spark.read.csv (wildcards / partition discovery) | Spark splits input files into many partitions automatically. |
| Custom loader classes (CSVLoader, APILoader) | Encapsulate source-specific logic | Keep them! | They now run on the driver to configure Spark jobs, or on each executor via UDF / mapPartitions. |
| Retry decorator | Deals with flaky I/O | Still valuable for: • remote API fetch on the driver • per-record API calls via @pandas_udf or mapPartitions (sketched below) | Be sure the decorator is serialisable (pure-Python functions serialise fine). |
| CPU-heavy transform | Illustrates bypassing the GIL | Spark SQL / DataFrame API or vectorised Pandas UDFs (Arrow) for true distributed CPU use | |
| Timing helper | Measures local runtime | Spark UI / spark.time(), or wrap actions in Python time.perf_counter() | |
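Here is a sketch of the per-record API-call pattern from the retry row above; the endpoint URL and the fetch_status / enrich_partition names are illustrative assumptions, not part of the original notebook.

```python
# Sketch: reusing the Cell-2 retry decorator inside mapPartitions on executors.
from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.appName("retry-on-executors").getOrCreate()

@retry(n=3, delay=2, backoff=2)      # plain Python function: serialises fine
def fetch_status(url):
    import requests                   # import inside the worker
    r = requests.get(url, timeout=5)
    r.raise_for_status()
    return r.status_code

def enrich_partition(rows):
    # Runs once per partition on the executor; one retried API call per record.
    for row in rows:
        yield Row(id=row.id, status=fetch_status("https://httpbin.org/get"))

df = spark.createDataFrame([Row(id=i) for i in range(10)])
df.rdd.mapPartitions(enrich_partition).toDF().show()
```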
Driver:
* Houses global OOP orchestration, retry decorators, and external-system calls that do not need cluster scale.

Executors:
* Automatically handle data-parallel work (CSV reading, DataFrame transformations, pandas UDFs).
* Don't start concurrent.futures pools inside them unless you really know what you're doing: extra processes contend with Spark's own thread pools and can starve resources.
* Each partition is already handled by a separate worker process; no extra pool is needed.
5. Common Pitfalls
* Nested multiprocessing inside Spark executors → often hangs or OOMs.
* Large Python objects in broadcast variables → serialisation overhead.
* Logging / retry code that isn't serialisable (non-global lambdas, open sockets).
* Ignoring data skew: one "giant" CSV or partition can bottleneck an otherwise parallel job.
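To close out the guide, here is one possible shape for the minimal pipeline skeleton mentioned earlier. It is a sketch: the paths, column names, and the pandas UDF body are illustrative assumptions, not a definitive implementation.

```python
# Minimal PySpark pipeline skeleton (sketch). Paths, column names, and the UDF body
# are illustrative assumptions; adapt them to your data.
import pandas as pd
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("etl-skeleton").getOrCreate()

# Extract: Spark parallelises the read across partitions; no thread pool required.
df = spark.read.csv("sample_csv/*.csv", header=True, inferSchema=True)

# Transform (CPU-heavy): prefer the DataFrame API or a vectorised pandas UDF (Arrow).
@F.pandas_udf("double")
def heavy_math(v: pd.Series) -> pd.Series:
    # Executes on executors in Arrow batches as vectorised pandas arithmetic.
    return ((v * 2.0) ** 2 + 1.0) ** 0.5

out = df.withColumn("value_t", heavy_math(F.col("value")))

# Load: write partitioned output; Spark again handles the parallel I/O.
out.write.mode("overwrite").parquet("output/etl_result")

spark.stop()
```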
Global Interpreter Lock (GIL) — in Plain English and Under-the-Hood

| What it is | Why it exists | Practical impact |
| --- | --- | --- |
| A mutual-exclusion (mutex) lock inside the CPython interpreter that ensures only one OS thread at a time executes Python byte-code. | CPython's memory model uses reference counting for garbage collection. Incrementing/decrementing those counters must be thread-safe; the simplest guarantee is to wrap byte-code execution in a single lock. | • CPU-bound threads can't run in parallel on multiple cores. • I/O-bound threads often benefit anyway (the GIL is released while waiting on I/O). • True multi-core execution needs processes (multiprocessing, Spark, Ray) or native extensions that release the GIL. |
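A tiny illustration of the reference counting the GIL protects (exact counts are implementation details and vary; sys.getrefcount itself adds one temporary reference to its argument):

```python
import sys

x = []                     # one reference: the name x
print(sys.getrefcount(x))  # e.g. 2 (x itself + the temporary argument reference)

y = x                      # a second reference to the same list
print(sys.getrefcount(x))  # e.g. 3

del y                      # dropping the reference decrements the count
print(sys.getrefcount(x))  # back to e.g. 2
```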
1. Where the GIL Lives

```
┌───────────── CPython Runtime ──────────────┐
│  Thread-1        Thread-2        Thread-3  │
│     ▼               ▼               ▼      │
│ ┌─────── Global Interpreter Lock ───────┐  │
│ │     Only **one** thread at a time     │  │
│ └────────────────────────────────────────┐ │
│    ... executes Python byte-code ...       │
└─────────────────────────────────────────────┘
```
2. When the GIL Is Held vs. Released

| State | GIL | Examples |
| --- | --- | --- |
| Executing byte-code | 🔒 | Pure-Python loops, math, string ops |
| Blocking on I/O | 🔓 | socket.recv(), time.sleep(), file read/write |
| Inside C-extension code that calls Py_BEGIN_ALLOW_THREADS … Py_END_ALLOW_THREADS | 🔓 | NumPy vector ops, pandas C-level routines, requests.get() waiting on the network |
Result: Python threads can overlap I/O and C‑level numerics, but not CPU‑bound pure‑Python work.
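To see the "released" rows in action, here is a small I/O-bound demo (a sketch; the sleep stands in for a network or disk wait):

```python
import threading, time

def io_bound():
    time.sleep(1)  # the GIL is released while the thread sleeps/waits on I/O

start = time.perf_counter()
threads = [threading.Thread(target=io_bound) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"4 I/O-bound threads: {time.perf_counter() - start:.2f}s")  # ≈ 1 s, not 4 s
```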
3. Demonstrating the Bottleneck (conceptual)

```python
import threading, time, math

def cpu_bound(n):
    s = 0
    for i in range(n):          # pure-Python loop: holds the GIL the whole time
        s += math.sin(i)
    return s

start = time.perf_counter()
threads = [threading.Thread(target=cpu_bound, args=(10_000_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"4 threads: {time.perf_counter() - start:.2f}s")  # ≈ same as one thread
```
Running four threads is not faster than one because each thread waits for the GIL.
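The same workload with processes does scale across cores (a sketch; the exact speedup depends on how many cores you have):

```python
from multiprocessing import Pool
import time, math

def cpu_bound(n):
    s = 0
    for i in range(n):
        s += math.sin(i)
    return s

if __name__ == "__main__":
    start = time.perf_counter()
    with Pool(processes=4) as pool:
        pool.map(cpu_bound, [10_000_000] * 4)  # each call runs in its own process, with its own GIL
    print(f"4 processes: {time.perf_counter() - start:.2f}s")  # ≈ a single thread's time on ≥4 cores
```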
4. Ways to Work Around the GIL

| Strategy | How it helps | Typical Tools |
| --- | --- | --- |
| Multiprocessing / joblib / Ray / Spark | Each child is a new interpreter process → own GIL → true multi-core | multiprocessing.Pool, spark-submit, ray |
| Native extensions that release the GIL | Heavy work done in C/C++/Fortran while the GIL is released | NumPy, pandas, TensorFlow, PyTorch |
| Cython "nogil" blocks / Rust + PyO3 | Manually drop the GIL inside compiled code regions | Cython, PyO3 |
| Sub-interpreters | Separate interpreters in one process with per-interpreter GILs; still early-stage | interpreters.create() |
5. When It Doesn't Matter
* I/O-bound workloads (web scrapers, Kafka consumers, ETL reading S3): threads overlap network/disk waits; the GIL is released while blocking.
* Vectorised numeric workloads: NumPy, pandas, and many ML libs push work to BLAS/CUDA, which run outside the GIL.
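A sketch of the second point; how much it scales depends on your BLAS build, and pinning BLAS to one thread (e.g. OMP_NUM_THREADS=1) makes the effect easiest to observe:

```python
import threading, time
import numpy as np

a = np.random.rand(2000, 2000)

def matmul():
    np.dot(a, a)  # runs in C/BLAS with the GIL released, so threads can use separate cores

start = time.perf_counter()
threads = [threading.Thread(target=matmul) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"4 NumPy threads: {time.perf_counter() - start:.2f}s")
```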
6. Mental Model Cheat-Sheet
* Job is CPU-bound ➔ use processes or Spark.
* Job is I/O-bound ➔ threads are fine (the GIL is mostly released).
* Job is numeric and C-backed ➔ threads + NumPy are OK.
* Need micro-parallel loops in pure Python ➔ Cython/Rust, or rewrite the algorithm.
Key Take-Aways
* The GIL is implementation-specific (CPython only); it is imposed by the interpreter, not by the language spec.
* It's a trade-off: simplicity and C-extension speed versus multi-core threading.
* Data engineers typically bypass it with multiprocessing, distributed engines (PySpark, Dask, Ray), or C-accelerated libraries, so you can still scale Python pipelines to TB-scale data.