Great question: understanding SparkSession vs SparkContext is essential, especially when dealing with RDDs, DataFrames, or any Spark internals.
TL;DR Difference
| Feature | SparkContext | SparkSession (since Spark 2.0+) |
|---|---|---|
| Purpose | Low-level entry point to Spark functionality | Unified entry point to Spark: SQL, Streaming, Hive, RDD |
| API Focus | RDDs only | DataFrames, Datasets, SQL, RDDs |
| Usage (Modern) | Accessed through SparkSession.sparkContext | Recommended for all modern Spark apps |
| Standalone? | Used directly in early Spark apps | Wraps SparkContext, SQLContext, HiveContext, etc. |
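To make the table concrete, here is a minimal sketch (assuming a local PySpark installation; the app name is arbitrary) that computes the same sum twice: once with the RDD API through sparkContext, and once with the DataFrame API through the session itself.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("TldrDemo").getOrCreate()

# RDD API: goes through the low-level SparkContext
rdd_sum = spark.sparkContext.parallelize([1, 2, 3, 4]).sum()

# DataFrame API: goes through the SparkSession itself
df_sum = spark.range(1, 5).agg(F.sum("id")).collect()[0][0]

print(rdd_sum, df_sum)  # 10 10
```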
Internal Creation: How They Work Together

1. SparkSession includes SparkContext

When you create a SparkSession, it internally creates a SparkContext (or uses an existing one):
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("MyApp") \
    .getOrCreate()

sc = spark.sparkContext  # <- This is your SparkContext
So:

- SparkSession is a wrapper/factory for SparkContext, SQLContext, HiveContext, etc.
- SparkContext is still used under the hood, especially when working with RDDs.
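As a quick illustration (assuming the spark session created above), the wrapped pieces are all reachable from the one session object:

```python
# All of these hang off the single SparkSession instance
sc = spark.sparkContext               # the underlying SparkContext
print(sc.appName, sc.master)          # e.g. "MyApp" and the master URL

spark.sql("SELECT 1 AS one").show()   # SQL runs through the same session
print(spark.catalog.listDatabases())  # catalog / Hive metadata access
```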
Dependency Graph

SparkSession
 |
 +-- sparkContext -> instance of SparkContext
 +-- sqlContext   -> for SQL/DataFrame APIs
 +-- catalog

RDD API <--- uses ---> SparkContext
DF API  <--- uses ---> SQLContext & SparkContext
RDD Context

If you need to work with RDDs, you still use the SparkContext:
rdd = spark.sparkContext.parallelize([1, 2, 3])
rdd.map(lambda x: x * 2).collect()
Even in a DataFrame-based app, RDD actions are routed via sparkContext, which runs in the driver and communicates with the cluster manager.
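The same applies when you drop from a DataFrame down to its RDD representation; the conversion is just a view over the session's single SparkContext. A small sketch, assuming the spark session from above:

```python
df = spark.range(3)   # DataFrame built via the SparkSession
rdd = df.rdd          # its underlying RDD of Row objects

# The RDD is bound to the very same SparkContext the session wraps
print(rdd.context is spark.sparkContext)           # True
print(rdd.map(lambda row: row.id * 2).collect())   # [0, 2, 4]
```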
Real Use Case Difference

| Task | Which One? | Why? |
|---|---|---|
| RDD transformation | SparkContext | RDDs are SparkContext-based |
| DataFrame read/write | SparkSession | Unified entry point for I/O |
| SQL queries | SparkSession | Has the .sql() method |
| Streaming (Structured) | SparkSession | Handles streaming DataFrames |
| Spark on Hive tables | SparkSession | Has Hive catalog support |
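For example, the DataFrame and SQL rows of this table all go through the session object. A hedged sketch (the file paths and view name below are hypothetical placeholders):

```python
# DataFrame read via the SparkSession (path is a placeholder)
df = spark.read.csv("/tmp/example.csv", header=True, inferSchema=True)

# SQL queries via the SparkSession
df.createOrReplaceTempView("example")
spark.sql("SELECT COUNT(*) AS n FROM example").show()

# DataFrame write via the SparkSession (path is a placeholder)
df.write.mode("overwrite").parquet("/tmp/example_parquet")
```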
Creation Example
Older (Pre-Spark 2.0)
from pyspark import SparkContext
sc = SparkContext(appName="OldApp")
rdd = sc.parallelize([1, 2, 3])
Modern (Spark 2.0+)
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("ModernApp").getOrCreate()
rdd = spark.sparkContext.parallelize([1, 2, 3])
Note

- You cannot have more than one active SparkContext per JVM.
- But you can call SparkSession.builder.getOrCreate() multiple times; it reuses the existing SparkContext.
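A quick way to see this reuse (assuming a fresh Python process; the app names are arbitrary):

```python
from pyspark.sql import SparkSession

spark_a = SparkSession.builder.appName("FirstApp").getOrCreate()
spark_b = SparkSession.builder.appName("SecondApp").getOrCreate()

# getOrCreate() returned the existing session, so both wrap the
# same single SparkContext object.
print(spark_a.sparkContext is spark_b.sparkContext)  # True

# newSession() gives separate session state (temp views, SQL conf)
# but still shares the one SparkContext per JVM.
spark_c = spark_a.newSession()
print(spark_c.sparkContext is spark_a.sparkContext)  # True
```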
Summary

| Concept | Explanation |
|---|---|
| SparkContext | Backbone of Spark, low-level API, needed for RDDs |
| SparkSession | Higher-level abstraction that includes SparkContext |
| RDD dependency | All RDD operations go through sparkContext |
| Internally | SparkSession creates or wraps a SparkContext |