Data Engineer Interview Questions Set5

Compressed columnar formats like Parquet and ORC are the preferred output formats in Spark because of their performance, schema awareness, and column pruning. A common follow-up question:


✅ Q: “How do I enable compression when writing files in Spark (Parquet/ORC)?”

Spark already compresses Parquet output with Snappy by default (spark.sql.parquet.compression.codec defaults to snappy). You can switch the codec, or disable compression entirely, with a config setting or a per-write option.


🔧 1. Set the Compression Codec Globally in Your Spark App

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("EnableCompression") \
    .config("spark.sql.parquet.compression.codec", "snappy") \
    .config("spark.sql.orc.compression.codec", "zlib") \
    .getOrCreate()

This ensures all Parquet files written from this session are compressed with Snappy (the usual recommendation) and all ORC files with Zlib.
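The codec can also be switched at runtime on an existing session; later writes pick up the new value. A minimal sketch:

# Switch the Parquet codec mid-session; subsequent writes use gzip
spark.conf.set("spark.sql.parquet.compression.codec", "gzip")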


🧪 2. Enable Compression Per Write

# Parquet
df.write.option("compression", "snappy").parquet("output_parquet/")

# ORC
df.write.option("compression", "zlib").orc("output_orc/")

A compression option set at write time overrides the session-level config for that write.

🔍 Supported Compression Codecs in Spark

| Format   | Option key                           | Recommended codecs   |
|----------|--------------------------------------|----------------------|
| Parquet  | spark.sql.parquet.compression.codec  | snappy, gzip, lz4    |
| ORC      | spark.sql.orc.compression.codec      | zlib, snappy, none   |
| Text/CSV | compression option in .write()       | gzip, bzip2, deflate |
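For text-based formats the same write-time option applies. A minimal sketch, assuming a DataFrame df:

# Gzip-compress CSV output; part files get a .csv.gz extension
df.write.option("compression", "gzip").mode("overwrite").csv("output_csv/")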

🔥 Why Use Parquet/ORC (compressed)?

| Feature            | Parquet/ORC                                      |
|--------------------|--------------------------------------------------|
| Compression        | Block-level, high ratio                          |
| Schema-aware       | Yes                                              |
| Column pruning     | Yes                                              |
| Predicate pushdown | Yes                                              |
| Read speed         | Fast (only required columns are read)            |
| Storage size       | Small (Snappy Parquet is often ~70% smaller than CSV) |
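Column pruning and predicate pushdown are visible in the physical plan. A quick sketch (the column names id and amount are hypothetical):

# ReadSchema lists only the referenced columns, and the filter
# appears under PushedFilters in the Parquet scan node
df = spark.read.parquet("output_parquet/")
df.select("id", "amount").filter("amount > 100").explain()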

💡 Real Use Case Example

spark.read.csv("data.csv", header=True, inferSchema=True) \
     .write \
     .option("compression", "snappy") \
     .mode("overwrite") \
     .parquet("output/snappy_compressed/")

✅ Verifying Compression (Optional)

Use parquet-tools (or hdfs dfs -ls) to check metadata and size:

parquet-tools meta part-0000.snappy.parquet

Or read it back and check the schema:

spark.read.parquet("output/snappy_compressed/").printSchema()
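printSchema confirms the files read back cleanly, but the codec itself is recorded in the Parquet footer rather than the schema. A PyArrow sketch can read it directly (assumes pyarrow is installed; the part-file name below is illustrative):

import pyarrow.parquet as pq

# The codec is stored per column chunk in the file footer metadata
meta = pq.ParquetFile("output/snappy_compressed/part-00000.snappy.parquet").metadata
print(meta.row_group(0).column(0).compression)  # e.g. SNAPPY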
