Great question: understanding SparkSession vs SparkContext is essential, especially when dealing with RDDs, DataFrames, or Spark internals in general.


🔍 TL;DR Difference

| Feature | SparkContext | SparkSession (Spark 2.0+) |
|---|---|---|
| Purpose | Low-level entry point to Spark functionality | Unified entry point to Spark: SQL, Streaming, Hive, RDDs |
| API focus | RDDs only | DataFrames, Datasets, SQL, RDDs |
| Usage (modern) | Accessed through SparkSession.sparkContext | Recommended for all modern Spark apps |
| Standalone? | Used directly in early Spark apps | Wraps SparkContext, SQLContext, HiveContext, etc. |

🧠 Internal Creation: How They Work Together

✅ 1. SparkSession includes SparkContext

When you create a SparkSession, it internally creates a SparkContext (or uses an existing one):

spark = SparkSession.builder \
    .appName("MyApp") \
    .getOrCreate()

sc = spark.sparkContext  # <- This is your SparkContext

So:

  • SparkSession is a wrapper/factory for SparkContext, SQLContext, HiveContext, etc.
  • SparkContext is still used under the hood, especially when working with RDDs.
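A quick way to see that relationship (a minimal sketch, assuming a local PySpark installation; the app name is just an example):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WrapperDemo").getOrCreate()

sc = spark.sparkContext    # the SparkContext the session created (or reused)
print(sc.appName)          # "WrapperDemo"
print(sc.master)           # e.g. "local[*]" when running locally
print(spark.catalog)       # Catalog interface for table/view metadata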

🧱 Dependency Graph

SparkSession
   |
   +-- sparkContext     -> instance of SparkContext
   +-- sqlContext       -> for SQL/DataFrame APIs
   +-- catalog          -> table/view metadata

RDD API   <--- uses ---> SparkContext
DF API    <--- uses ---> SQLContext & SparkContext

🌀 RDD Context

If you need to work with RDDs, you still use the SparkContext:

rdd = spark.sparkContext.parallelize([1, 2, 3])
rdd.map(lambda x: x * 2).collect()

Even in a DataFrame-based app, RDD actions are routed through the sparkContext, which runs on the driver and talks to the cluster manager to schedule work on the executors.
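A small sketch of that round trip (the column names here are made up for illustration): a DataFrame is built on top of an RDD, then dropped back down to the RDD layer.

rdd = spark.sparkContext.parallelize([(1, "a"), (2, "b")])   # RDD API -> SparkContext
df = spark.createDataFrame(rdd, ["id", "label"])             # DataFrame API -> SparkSession
rows = df.rdd                                                 # underlying RDD of Row objects
print(rows.map(lambda r: r.id * 10).collect())               # [10, 20]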


🎯 Real Use Case Difference

| Task | Which one? | Why? |
|---|---|---|
| RDD transformations | SparkContext | RDDs are SparkContext-based |
| DataFrame read/write | SparkSession | Unified entry point for I/O |
| SQL queries | SparkSession | Has the .sql() method |
| Structured Streaming | SparkSession | Handles streaming DataFrames |
| Spark on Hive tables | SparkSession | Has Hive catalog support |
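A hedged sketch of how those tasks look in code; the file path and view name below are placeholders, not from the original example:

df = spark.read.json("events.json")            # DataFrame read: SparkSession
df.createOrReplaceTempView("events")           # register the DataFrame for SQL

spark.sql("SELECT COUNT(*) AS n FROM events").show()    # SQL query: SparkSession

lines = spark.sparkContext.textFile("events.json")      # raw RDD read: SparkContext
print(lines.count())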

🛠️ Creation Example

Older (Pre-Spark 2.0)

from pyspark import SparkContext
sc = SparkContext(appName="OldApp")
rdd = sc.parallelize([1, 2, 3])

Modern (Spark 2.0+)

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("ModernApp").getOrCreate()
rdd = spark.sparkContext.parallelize([1, 2, 3])

โš ๏ธ Note

  • You cannot have more than one active SparkContext per JVM.
  • But you can call SparkSession.builder.getOrCreate() multiple times; it reuses the existing SparkContext (shown in the sketch below).
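A minimal sketch of that reuse behaviour (assumes a fresh Python process):

from pyspark.sql import SparkSession

s1 = SparkSession.builder.appName("First").getOrCreate()
s2 = SparkSession.builder.appName("Second").getOrCreate()

# The second call does not start another context; both sessions
# sit on the single SparkContext allowed in this JVM.
print(s1.sparkContext is s2.sparkContext)   # True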

📌 Summary

| Concept | Explanation |
|---|---|
| SparkContext | Backbone of Spark: low-level API, needed for RDDs |
| SparkSession | Higher-level abstraction that includes SparkContext |
| RDD dependency | All RDD operations go through sparkContext |
| Internally | SparkSession → creates or wraps → SparkContext |
