Great! You’re absolutely right that compressed columnar formats like Parquet and ORC are preferred in Spark for performance, schema awareness, and column pruning.

Let’s answer your question:


✅ Q: “How do I enable compression when writing files in Spark (Parquet/ORC)?”

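Spark already compresses Parquet output by default: spark.sql.parquet.compression.codec defaults to snappy. Setting the codec explicitly is still useful when you want a different trade-off between write speed and file size, or when you want the choice to be visible in your code. You can set it for the whole session or for a single write.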


🔧 1. Enable Compression Globally in Your Spark App

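from pyspark.sql import SparkSession

# Set the default codecs for every Parquet and ORC write in this session
spark = SparkSession.builder \
    .appName("EnableCompression") \
    .config("spark.sql.parquet.compression.codec", "snappy") \
    .config("spark.sql.orc.compression.codec", "zlib") \
    .getOrCreate()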

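With this configuration, every Parquet file written from this session uses Snappy (a common choice that favors speed over compression ratio) and every ORC file uses Zlib, unless an individual write overrides the codec with its own option.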
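Both codec settings are runtime SQL options, so they can also be changed on a session that already exists. A minimal sketch, assuming `spark` is an active SparkSession; the helper name `set_write_codecs` and the two constants are hypothetical, not part of Spark:

```python
PARQUET_KEY = "spark.sql.parquet.compression.codec"
ORC_KEY = "spark.sql.orc.compression.codec"


def set_write_codecs(spark, parquet_codec="snappy", orc_codec="zlib"):
    """Set the session-wide codecs and return the values now in effect."""
    spark.conf.set(PARQUET_KEY, parquet_codec)
    spark.conf.set(ORC_KEY, orc_codec)
    # Read the values back so the caller can log or assert on them.
    return {key: spark.conf.get(key) for key in (PARQUET_KEY, ORC_KEY)}
```

Calling `set_write_codecs(spark, parquet_codec="gzip")` would switch later Parquet writes in that session to gzip and leave ORC on zlib.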


🧪 2. Enable Compression Per Write

# Parquet
df.write.option("compression", "snappy").parquet("output_parquet/")

# ORC
df.write.option("compression", "zlib").orc("output_orc/")

🔍 Supported Compression Codecs in Spark

| Format | Option Key | Common Codecs |
|---|---|---|
| Parquet | spark.sql.parquet.compression.codec | snappy, gzip, lz4 |
| ORC | spark.sql.orc.compression.codec | zlib, snappy, none |
| Text/CSV | compression (or codec) option in .write() | gzip, bzip2, deflate |
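A typo in the codec name typically surfaces only when the write is executed, so it can help to check the name first. The sketch below is one way to do that. `KNOWN_CODECS` and `write_compressed` are hypothetical names, and the codec sets are copied from the table above, not Spark's full list for each format.

```python
KNOWN_CODECS = {
    "parquet": {"snappy", "gzip", "lz4"},
    "orc": {"zlib", "snappy", "none"},
    "csv": {"gzip", "bzip2", "deflate"},
}


def write_compressed(df, path, fmt="parquet", codec="snappy", mode="overwrite"):
    """Write df to path in the given format after checking the codec name."""
    allowed = KNOWN_CODECS.get(fmt)
    if allowed is None:
        raise ValueError(f"unsupported format: {fmt!r}")
    if codec.lower() not in allowed:
        raise ValueError(
            f"codec {codec!r} is not in the known set for {fmt}: {sorted(allowed)}"
        )
    # The per-write option takes precedence over the session-level config.
    df.write.format(fmt).mode(mode).option("compression", codec).save(path)
```

For example, `write_compressed(df, "output_orc/", fmt="orc", codec="zlib")` behaves like the per-write ORC example, while a misspelled codec such as "snapy" raises a ValueError before any Spark job starts.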

🔥 Why Use Parquet/ORC (compressed)?

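| Feature | Parquet/ORC |
|---|---|
| Compression | Block-level, high ratio |
| Schema-aware | Yes |
| Column pruning | Yes |
| Predicate pushdown | Yes |
| Read speed | Fast (only required columns are read) |
| Storage size | Small (Snappy-compressed Parquet is often much smaller than the equivalent CSV; the exact saving depends on the data) |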

💡 Real Use Case Example

spark.read.csv("data.csv", header=True, inferSchema=True) \
     .write \
     .option("compression", "snappy") \
     .mode("overwrite") \
     .parquet("output/snappy_compressed/")
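To see what the conversion saves on a local filesystem, compare the input file size with the total size of the output directory. A minimal sketch using only the standard library; `dir_size_bytes` is a hypothetical helper, and it skips the bookkeeping files Spark writes next to the data (`_SUCCESS` markers and hidden `.crc` checksums).

```python
import os


def dir_size_bytes(path):
    """Sum the sizes of data files under path, skipping Spark bookkeeping files."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            # Skip _SUCCESS markers and hidden .crc checksum files.
            if name.startswith(("_", ".")):
                continue
            total += os.path.getsize(os.path.join(root, name))
    return total
```

With the paths from the example above, `dir_size_bytes("output/snappy_compressed/") / os.path.getsize("data.csv")` gives the output-to-input size ratio for your own data.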

✅ Verifying Compression (Optional)

Use parquet-tools to inspect the file metadata, which lists the codec for each column chunk, or hdfs dfs -ls to check file sizes:

parquet-tools meta part-0000.snappy.parquet

Or read it back to confirm the files are readable and the schema survived (this does not show which codec was used):

spark.read.parquet("output/snappy_compressed/").printSchema()
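Spark normally embeds a codec tag in each part file name (for example `.snappy.parquet`), so a quick check is to list the output directory. The sketch below relies on that naming convention, which is an assumption about the writer's behavior and not a guarantee. `codecs_in_output` is a hypothetical helper and it looks only at the top-level directory, not at partition subdirectories. For an authoritative answer, inspect the file metadata with parquet-tools.

```python
import os
from collections import Counter


def codecs_in_output(path, extension=".parquet"):
    """Count part files directly under path by the codec tag in their names."""
    counts = Counter()
    for name in os.listdir(path):
        if not name.startswith("part-") or not name.endswith(extension):
            continue
        stem = name[: -len(extension)]
        # "part-0000.snappy" -> "snappy"; a stem without a dot has no tag.
        _, dot, tag = stem.rpartition(".")
        counts[tag if dot else "untagged"] += 1
    return dict(counts)
```

If every file in `output/snappy_compressed/` is counted under "snappy", the write used the codec you asked for. Files counted as "untagged" carry no codec tag in their names and are worth a closer look.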
