Reply To: SET 1

#6495
lochan2014
Keymaster

```python
# Write a DataFrame to Parquet while controlling output file size
# (df is assumed to be an existing DataFrame)

# Step 1: Decide on a target size per output file (in bytes),
# for example ~128 MB per file
target_file_size = 128 * 1024 * 1024  # 128 MB

# Step 2: Use df.rdd.getNumPartitions() and the estimated data size to tune partitioning.
# Alternatively, use coalesce()/repartition() if the total size is known or can be estimated
# (a sketch for deriving the count from an estimated size follows after this block).
# Suppose you want 10 files of roughly 128 MB each:
df = df.repartition(10)

# Step 3: Write with compression to reduce the actual storage size
df.write \
    .mode("overwrite") \
    .option("compression", "snappy") \
    .parquet("s3://your-bucket/target-path/")

# Optional: cap the number of records per file with maxRecordsPerFile
# if the average row size is known
df.write \
    .mode("overwrite") \
    .option("compression", "snappy") \
    .option("maxRecordsPerFile", 1_000_000) \
    .parquet("s3://your-bucket/target-path/")
```
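If the total data size can be estimated up front (for example from the size of the source files), the hard-coded partition count above can be derived instead of guessed. A minimal sketch, where `estimated_total_bytes` is an assumed figure you supply from your own knowledge of the data:

```python
import math

# Assumption: you can estimate the total data size yourself (e.g. from the
# source files on S3/HDFS); Spark does not expose this figure cheaply.
estimated_total_bytes = 10 * 1024 * 1024 * 1024  # e.g. ~10 GB (assumed)
target_file_size = 128 * 1024 * 1024             # ~128 MB per output file

# One partition per target-sized chunk, never fewer than one
num_files = max(1, math.ceil(estimated_total_bytes / target_file_size))
df = df.repartition(num_files)
```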

**Key options used:**

* `.repartition(n)` to control the number of output files (it triggers a full shuffle; `.coalesce(n)` avoids the shuffle but can only reduce the partition count).
* `.option("maxRecordsPerFile", n)` to cap the number of rows written to each file (a session-level alternative is shown below).
* Snappy compression to reduce on-disk size with little CPU overhead (it is Spark's default Parquet codec).
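If the same record cap should apply to every file-based write in the session, the equivalent Spark SQL setting can be configured once instead of per writer. A short sketch, assuming an active `spark` session:

```python
# spark.sql.files.maxRecordsPerFile applies to all file-based writes (0 means no limit)
spark.conf.set("spark.sql.files.maxRecordsPerFile", 1_000_000)

df.write \
    .mode("overwrite") \
    .option("compression", "snappy") \
    .parquet("s3://your-bucket/target-path/")
```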