Great! You’re absolutely right that compressed columnar formats like Parquet and ORC are preferred in Spark for performance, schema awareness, and column pruning.
Let’s answer your question:
✅ Q: “How do I enable compression when writing files in Spark (Parquet/ORC)?”
Spark actually compresses Parquet output with Snappy by default, but you can confirm or change the codec for both Parquet and ORC with a simple config setting.
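If you are not sure what your current session will use, you can read the active settings back from the Spark conf; a minimal sketch, assuming an existing SparkSession named spark:

# Defaults vary by Spark version; Parquet is typically "snappy" out of the box
print(spark.conf.get("spark.sql.parquet.compression.codec"))
print(spark.conf.get("spark.sql.orc.compression.codec"))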
🔧 1. Enable Compression Globally in Your Spark App
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("EnableCompression") \
    .config("spark.sql.parquet.compression.codec", "snappy") \
    .config("spark.sql.orc.compression.codec", "zlib") \
    .getOrCreate()
This ensures all Parquet files written from this session are compressed with Snappy (recommended) and all ORC files with Zlib.
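With the session-level codecs set, a plain write picks them up automatically; no per-write option is needed. A minimal sketch, assuming df is an existing DataFrame and the output paths are illustrative:

# Session-level codecs apply: Parquet parts come out as Snappy, ORC parts as Zlib
df.write.mode("overwrite").parquet("output_parquet_default_codec/")
df.write.mode("overwrite").orc("output_orc_default_codec/")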
🧪 2. Enable Compression Per Write
# Parquet
df.write.option("compression", "snappy").parquet("output_parquet/")
# ORC
df.write.option("compression", "zlib").orc("output_orc/")
🔍 Supported Compression Codecs in Spark
| Format | Option Key | Recommended Codecs |
| --- | --- | --- |
| Parquet | spark.sql.parquet.compression.codec | snappy, gzip, lz4 |
| ORC | spark.sql.orc.compression.codec | zlib, snappy, none |
| Text/CSV | compression (or codec) option in .write() | gzip, bzip2, deflate |
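The Text/CSV row works the same way as Parquet and ORC, just with row-oriented output. A minimal sketch of a gzip-compressed CSV write (output path is illustrative):

# Each CSV part file is written gzip-compressed (with a .gz extension)
df.write \
    .option("header", True) \
    .option("compression", "gzip") \
    .mode("overwrite") \
    .csv("output_csv_gzip/")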
🔥 Why Use Parquet/ORC (compressed)?
| Feature | Parquet/ORC |
| --- | --- |
| Compression | Block-level, high ratio |
| Schema-aware | Yes |
| Column pruning | Yes |
| Predicate pushdown | Yes |
| Read speed | Fast (only required columns are read) |
| Storage size | Small (Snappy-compressed Parquet is often ~70% smaller than CSV) |
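Column pruning and predicate pushdown are what make the read side fast: Spark reads only the columns it needs and can skip row groups that cannot match the filter. A minimal sketch, assuming the compressed output from above and hypothetical columns id and amount:

# Only 'id' and 'amount' are read from disk (column pruning); the filter on
# 'amount' appears as PushedFilters in the physical plan (predicate pushdown)
sales = spark.read.parquet("output/snappy_compressed/")
sales.select("id", "amount").where("amount > 100").explain()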
💡 Real Use Case Example
spark.read.csv("data.csv", header=True, inferSchema=True) \
    .write \
    .option("compression", "snappy") \
    .mode("overwrite") \
    .parquet("output/snappy_compressed/")
✅ Verifying Compression (Optional)
Use parquet-tools (or hdfs dfs -ls) to check metadata and size:
parquet-tools meta part-0000.snappy.parquet
Or read it back and check the schema:
spark.read.parquet("output/snappy_compressed/").printSchema()
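If you have pyarrow installed, you can also inspect the codec recorded in a part file's footer; a minimal sketch, with the file path being illustrative:

import pyarrow.parquet as pq

# Each column chunk records the codec it was written with
meta = pq.ParquetFile("output/snappy_compressed/part-0000.snappy.parquet").metadata
print(meta.row_group(0).column(0).compression)  # e.g. SNAPPY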