```python
# Write DataFrame to Parquet while controlling file size
# Step 1: Estimate desired file size per file (in bytes)
# For example, target ~128MB per file
target_file_size = 128 * 1024 * 1024 # 128 MB
# Step 2: Use df.rdd.getNumPartitions() and the data size to tune the number of partitions
# Alternatively, use coalesce/repartition if size is known or can be estimated
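# (Illustration) Inspect how the data is currently split before repartitioning;
# this is just a read-only check and does not change the DataFrame.
current_partitions = df.rdd.getNumPartitions()
print(f"Current partitions: {current_partitions}")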
# Suppose you want 10 files around 128MB each
df = df.repartition(10)
# Step 3: Write with compression to reduce actual storage size
df.write \
    .mode("overwrite") \
    .option("compression", "snappy") \
    .parquet("s3://your-bucket/target-path/")

# Optional: use maxRecordsPerFile to cap records per file if you know the average row size
df.write \
    .mode("overwrite") \
    .option("compression", "snappy") \
    .option("maxRecordsPerFile", 1_000_000) \
    .parquet("s3://your-bucket/target-path/")
```
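If you can estimate the total data size up front (for example from the combined size of the source files), the partition count does not have to be hard-coded. The sketch below derives it from a placeholder `estimated_size_bytes` value; both that variable and the ~10 GB figure are assumptions for illustration, not something Spark reports for you.

```python
import math

# Sketch: derive the repartition count from an estimated total data size.
target_file_size = 128 * 1024 * 1024              # ~128 MB per output file
estimated_size_bytes = 10 * 1024 * 1024 * 1024    # assumption: ~10 GB of data

num_files = max(1, math.ceil(estimated_size_bytes / target_file_size))

(df.repartition(num_files)
   .write
   .mode("overwrite")
   .option("compression", "snappy")
   .parquet("s3://your-bucket/target-path/"))
```

Because snappy-compressed Parquet is usually much smaller than the raw input, the resulting files tend to land under the target; treat the estimate as an upper bound and adjust after inspecting a first run.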
**Key options used:**
* `.repartition(n)` to control the number of output files (roughly one file per partition).
* `.option("maxRecordsPerFile", n)` to cap the number of rows per file.
* Compression (`snappy`) to shrink the files on disk with little CPU overhead.
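A rough way to pick a `maxRecordsPerFile` value is to divide the target file size by an average serialized row size measured from an earlier write (total output bytes / total rows). The 200-byte figure below is purely illustrative:

```python
# Sketch: derive maxRecordsPerFile from an assumed average row size.
target_file_size = 128 * 1024 * 1024   # ~128 MB per output file
avg_row_bytes = 200                    # assumption: measured from a previous write

max_records = target_file_size // avg_row_bytes   # ~670k rows per file here

(df.write
   .mode("overwrite")
   .option("compression", "snappy")
   .option("maxRecordsPerFile", int(max_records))
   .parquet("s3://your-bucket/target-path/"))
```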