To **optimize reading only specific partitions** with `spark.read.parquet()`, you can leverage **partition pruning**, which allows Spark to skip unnecessary files and load only the relevant partitions.
### ✅ Here’s how:

---

### 💡 **Assume your dataset is partitioned like this:**

```
/data/events/year=2023/month=01/
/data/events/year=2023/month=02/
/data/events/year=2024/month=01/
```
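Such a layout is usually produced by writing with `partitionBy`. A minimal sketch of how that might look (the sample data and paths are illustrative only):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitioned-write").getOrCreate()

# Illustrative events data; in practice this comes from your real source.
events = spark.createDataFrame(
    [(2023, 1, "click"), (2023, 2, "view"), (2024, 1, "click")],
    ["year", "month", "event_type"],
)

# partitionBy creates the year=/month= directory layout shown above,
# which is what makes partition pruning possible on read.
events.write.mode("overwrite").partitionBy("year", "month").parquet("/data/events")
```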

---

### ✅ Option 1: **Partition Pruning via Filter**
If the Parquet data is partitioned by `year` and `month`, Spark will **automatically prune partitions** when you filter on those columns **after reading**:

```python
df = spark.read.parquet("/data/events")
df_filtered = df.filter((df.year == 2023) & (df.month == 1))
```
➡️ Spark reads **only the matching folders**, using the directory structure to avoid scanning the full dataset.

---

### ✅ Option 2: **Partition Directory Path Filtering (Manual)**
You can also directly specify a partition path to **read only that portion** of the dataset:

```python
df = spark.read.parquet("/data/events/year=2023/month=01")
```

➡️ This avoids scanning the full dataset and reads only the data under the specified partition path.

---

### ✅ Option 3: **Passing multiple paths**
If you want to read a few selected partitions:
```python
df = spark.read.parquet(
    "/data/events/year=2023/month=01",
    "/data/events/year=2024/month=01",
)
```
➡️ Only the specified paths are read, which is efficient when you know exactly which partitions to load.
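One caveat with Options 2 and 3: when `spark.read.parquet()` is pointed directly at partition subdirectories, the partition columns (`year`, `month`) are not inferred and will be missing from the DataFrame. The `basePath` option tells Spark where partition discovery should start. A small sketch, assuming the layout above:

```python
# Without basePath, "year" and "month" would not appear as columns,
# because Spark only sees the directories below the paths it was given.
df = (
    spark.read
    .option("basePath", "/data/events")  # root of the partitioned dataset
    .parquet(
        "/data/events/year=2023/month=01",
        "/data/events/year=2024/month=01",
    )
)

df.select("year", "month").distinct().show()
```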

---

### 🛠️ Best Practices
* **Filter on partition columns** as early as possible so Spark can prune partitions when planning the scan.
* Always **partition your data** based on query patterns (e.g., time-based columns like `year`/`month`/`day`).
* Don't rely on filters over **non-partition columns** to speed up reads; they cannot prune partitions.
* **Don't cast partition columns** in filters, since casting can disable pruning (e.g., `df.filter(df.year.cast("string") == "2023")` may break pruning); see the sketch after this list.
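A quick illustration of the casting pitfall (whether the cast actually defeats pruning can depend on the Spark version, so check the plan as described in the next section):

```python
df = spark.read.parquet("/data/events")

# Comparing the partition column directly keeps the predicate usable
# as a partition filter.
good = df.filter(df.year == 2023)

# Casting the partition column first may stop Spark from recognizing
# the predicate as a partition filter.
risky = df.filter(df.year.cast("string") == "2023")

# Compare the FileScan nodes: the pruned plan keeps a non-empty
# PartitionFilters entry.
good.explain(True)
risky.explain(True)
```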

---

### 🔍 Check if pruning is applied:
You can verify partition pruning by printing the physical plan:

```python
df_filtered.explain(True)
```

In the `FileScan parquet` node, look for `PartitionFilters:` (partition pruning) and `PushedFilters:` (predicate pushdown), and confirm that only the necessary partitions are being accessed.
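If you want to check this programmatically, one option is to capture what `explain()` prints (PySpark's `explain()` writes the plan to stdout) and look for a `PartitionFilters` entry. A rough sketch, not an official API for plan inspection:

```python
import io
from contextlib import redirect_stdout

def physical_plan_text(df):
    """Capture the output of df.explain(True) as a string."""
    buf = io.StringIO()
    with redirect_stdout(buf):
        df.explain(True)
    return buf.getvalue()

plan = physical_plan_text(df_filtered)

# A non-empty PartitionFilters entry on the FileScan node indicates that
# partition pruning was applied for this query.
print("PartitionFilters" in plan)
```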