
Reply To: SET 1

#6493
lochan2014
Keymaster

    To **optimize reading only specific partitions** with `spark.read.parquet()`, you can leverage **partition pruning**, which lets Spark skip files that cannot match your filters and load only the relevant partitions.

    ### ✅ Here’s how:

    ### 💡 **Assume your dataset is partitioned like this:**

    ```
    /data/events/year=2023/month=01/
    /data/events/year=2023/month=02/
    /data/events/year=2024/month=01/
    ```

    ### ✅ Option 1: **Partition Pruning via Filter**

    If the Parquet data is partitioned by year and month, Spark will **automatically prune partitions** when you filter on those columns **after reading**:

    ```python
    df = spark.read.parquet("/data/events")
    df_filtered = df.filter((df.year == 2023) & (df.month == 1))
    ```

    ➡️ Spark uses the directory structure to read **only the matching folders**, without scanning the full dataset.
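    To see why this works, here is a plain-Python sketch (no Spark required) of the pruning idea: filter values are matched against the `key=value` segments in each partition path, and non-matching directories are skipped. Spark performs this internally; the helper names below are hypothetical.

    ```python
    # Illustrative sketch of directory-based partition pruning -- these helpers
    # are hypothetical, not part of Spark's API.

    def parse_partition(path):
        """Extract Hive-style key=value segments from a path into a dict."""
        parts = {}
        for segment in path.strip("/").split("/"):
            if "=" in segment:
                key, value = segment.split("=", 1)
                parts[key] = value
        return parts

    def prune(paths, **filters):
        """Keep only the paths whose partition values match every filter."""
        kept = []
        for path in paths:
            parts = parse_partition(path)
            ok = True
            for key, wanted in filters.items():
                raw = parts.get(key)
                if raw is None:
                    ok = False
                    break
                # Partition values live in the path as strings; compare numerically
                # when both sides are numeric, mirroring Spark's type inference.
                if raw.isdigit() and isinstance(wanted, int):
                    if int(raw) != wanted:
                        ok = False
                        break
                elif raw != str(wanted):
                    ok = False
                    break
            if ok:
                kept.append(path)
        return kept

    paths = [
        "/data/events/year=2023/month=01",
        "/data/events/year=2023/month=02",
        "/data/events/year=2024/month=01",
    ]
    print(prune(paths, year=2023, month=1))  # only the year=2023/month=01 path survives
    ```
    
    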

    ### ✅ Option 2: **Partition Directory Path Filtering (Manual)**

    You can also directly specify the partition path to **only read that portion** of the dataset:

    ```python
    df = spark.read.parquet("/data/events/year=2023/month=01")
    ```

    ➡️ This skips the full directory scan and reads only data from the specified partition path. Note that when you point directly at a partition directory, the partition columns (year, month) will not appear in the DataFrame unless you also set the basePath option, e.g. `spark.read.option("basePath", "/data/events").parquet(...)`.

    ### ✅ Option 3: **Passing multiple paths**

    If you want to read a few selected partitions:

    ```python
    df = spark.read.parquet(
        "/data/events/year=2023/month=01",
        "/data/events/year=2024/month=01",
    )
    ```

    ➡️ Only the specified paths are read, which is efficient when you know the exact partitions to load.
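    When the wanted partitions are known up front, building the path list programmatically keeps the call tidy. A minimal sketch in plain Python (the helper name is hypothetical, and it assumes the year/month layout shown above):

    ```python
    # Hypothetical helper that builds Hive-style partition paths for
    # spark.read.parquet() from (year, month) pairs.

    def partition_paths(base, year_months):
        """Build paths like <base>/year=YYYY/month=MM from (year, month) pairs."""
        return [f"{base}/year={year}/month={month:02d}" for year, month in year_months]

    paths = partition_paths("/data/events", [(2023, 1), (2024, 1)])
    print(paths)

    # With a SparkSession available, you would then unpack the list:
    # df = spark.read.parquet(*paths)
    ```
    
    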

    ### 🛠️ Best Practices

    * **Filter on partition columns early** so Spark can prune lazily before the scan.
    * Always **partition your data** to match query patterns (e.g., time-based keys like year/month/day).
    * Remember that filters on non-partition columns cannot prune directories, so keep that in mind when read performance matters.
    * **Don’t cast or transform partition columns in filters**, as this can disable pruning (e.g., `df.filter(df.year.cast("string") == "2023")` may break it).
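    One more practical tip when specifying paths manually (Options 2 and 3): it helps to confirm which partition directories actually exist and contain data before reading, since passing a nonexistent path to `spark.read.parquet()` raises an error. A sketch of such a pre-flight check in plain Python (the helper is hypothetical; the demo builds a throwaway tree mirroring the layout above):

    ```python
    # Hypothetical pre-flight check: list partition directories under `base`
    # that actually hold .parquet files, before passing them to Spark.
    import os
    import tempfile

    def discover_partitions(base):
        """Return partition directories (relative to base) containing .parquet files."""
        found = []
        for root, _dirs, files in os.walk(base):
            if any(name.endswith(".parquet") for name in files):
                found.append(os.path.relpath(root, base))
        return sorted(found)

    # Demo against a throwaway tree that mirrors the layout above.
    with tempfile.TemporaryDirectory() as base:
        for part in ("year=2023/month=01", "year=2023/month=02", "year=2024/month=01"):
            os.makedirs(os.path.join(base, part))
            open(os.path.join(base, part, "part-00000.parquet"), "w").close()
        print(discover_partitions(base))
    ```
    
    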

    ### 🔍 Check if pruning is applied:

    You can verify partition pruning by inspecting the physical plan:

    ```python
    df_filtered.explain(True)
    ```

    In the FileSourceScan node, look for `PartitionFilters:` (directory-level pruning) and `PushedFilters:` (row-level predicate pushdown), and confirm that only the necessary partitions are listed.