```python
# Write DataFrame to Parquet while controlling file size
# Step 1: Estimate desired file size per file (in bytes)
# For example, target ~128MB per file
target_file_size = 128 * 1024 * 1024 # 128 MB
# Step 2: Use df.rdd.getNumPartitions() and the data size to tune the number of partitions
# Alternatively, use coalesce/repartition if size is known or can be estimated
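# (Illustration) Inspect how the data is currently split before repartitioning;
# this is just a read-only check and does not change the DataFrame.
current_partitions = df.rdd.getNumPartitions()
print(f"Current partitions: {current_partitions}")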
# Suppose you want 10 files around 128MB each
df = df.repartition(10)
# Step 3: Write with compression to reduce actual storage size
df.write \
    .mode("overwrite") \
    .option("compression", "snappy") \
    .parquet("s3://your-bucket/target-path/")

# Optional: use maxRecordsPerFile to cap records per file if you know the average row size
df.write \
    .mode("overwrite") \
    .option("compression", "snappy") \
    .option("maxRecordsPerFile", 1_000_000) \
    .parquet("s3://your-bucket/target-path/")
```
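If you can estimate the total data size up front (for example from the combined size of the source files), the partition count does not have to be hard-coded. The sketch below derives it from a placeholder `estimated_size_bytes` value; both that variable and the ~10 GB figure are assumptions for illustration, not something Spark reports for you.

```python
import math

# Sketch: derive the repartition count from an estimated total data size.
target_file_size = 128 * 1024 * 1024              # ~128 MB per output file
estimated_size_bytes = 10 * 1024 * 1024 * 1024    # assumption: ~10 GB of data

num_files = max(1, math.ceil(estimated_size_bytes / target_file_size))

(df.repartition(num_files)
   .write
   .mode("overwrite")
   .option("compression", "snappy")
   .parquet("s3://your-bucket/target-path/"))
```

Because snappy-compressed Parquet is usually much smaller than the raw input, the resulting files tend to land under the target; treat the estimate as an upper bound and adjust after inspecting a first run.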
**Key options used:**
* `.repartition(n)` to control the number of output files (roughly one file per partition).
* `.option("maxRecordsPerFile", n)` to cap the number of rows per file.
* Compression (`snappy`) to shrink the files on disk with little CPU overhead.
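A rough way to pick a `maxRecordsPerFile` value is to divide the target file size by an average serialized row size measured from an earlier write (total output bytes / total rows). The 200-byte figure below is purely illustrative:

```python
# Sketch: derive maxRecordsPerFile from an assumed average row size.
target_file_size = 128 * 1024 * 1024   # ~128 MB per output file
avg_row_bytes = 200                    # assumption: measured from a previous write

max_records = target_file_size // avg_row_bytes   # ~670k rows per file here

(df.write
   .mode("overwrite")
   .option("compression", "snappy")
   .option("maxRecordsPerFile", int(max_records))
   .parquet("s3://your-bucket/target-path/"))
```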