This section looks at the practicality of partitioning or bucketing on a high-cardinality column like a primary key (id).

Let’s break this down:


🧠 Scenario Recap:

You have a join on id, which is a primary key, i.e., unique for every row in a million-row table. You wonder:

  • ❓ What happens if we partition by id before join?
  • ❓ Will it create 1 million partitions/files?
  • ❓ Is it logical-only or physically applied?

πŸ” 1. Partitioning by Unique Key (e.g., id)

What Happens:

df.write.partitionBy("id").parquet(".../output")
  • ✅ This is not just logical – it physically creates one folder per distinct id.
  • ❌ In your case, this would create ~1 million folders/files!

Why That’s Bad:

  • Huge file explosion (small files problem)
  • Overhead for metadata handling in Spark/Hive/Delta
  • Slower performance during read due to file system pressure
  • Wastes storage and memory
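
A quick way to gauge the blast radius before writing is to check the column's cardinality. Below is a minimal sketch (the input path is illustrative; spark and df stand in for your own session and DataFrame):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("/data/events")  # illustrative input path

# Each distinct value of the partition column becomes one folder under the
# output path, so ~1 million distinct ids means ~1 million folders.
distinct_ids = df.select(F.countDistinct("id")).first()[0]
print(f"partitionBy('id') would create ~{distinct_ids} folders")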

πŸ” 2. Bucketing by Unique Key (e.g., id)

df.write.bucketBy(100, "id").sortBy("id").saveAsTable("my_table")
  • ✅ Bucketing does not create one file per value.
  • ✅ You define the number of buckets (e.g., 100).
  • Spark will hash the id and place rows into one of the 100 buckets.
  • Still works well for joins on id, even if it’s unique.

Why That’s Better:

  • Controlled number of output files
  • Join shuffles avoided if both sides are bucketed on same key & same bucket count
  • Works even when join key is unique
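
A minimal sketch of the bucketed-join pattern is shown below (table names and sizes are illustrative, and a Hive-compatible catalog is assumed):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .enableHiveSupport()   # bucketed tables need a table catalog
         .getOrCreate())

# Two illustrative tables sharing the unique join key "id".
orders = spark.range(1_000_000).selectExpr("id", "id % 7 AS amount")
customers = spark.range(1_000_000).selectExpr("id", "concat('user_', cast(id AS string)) AS name")

# Write both sides with the same bucket count and the join key as the bucket column.
orders.write.bucketBy(100, "id").sortBy("id").mode("overwrite").saveAsTable("orders_b")
customers.write.bucketBy(100, "id").sortBy("id").mode("overwrite").saveAsTable("customers_b")

# Read back through the catalog so Spark sees the bucketing metadata; the plan
# should show a sort-merge join with no Exchange (shuffle) on either side.
spark.table("orders_b").join(spark.table("customers_b"), "id").explain()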

🧠 Summary Table

| Feature | Partition by id | Bucket by id |
|---|---|---|
| One folder per key? | ✅ Yes | ❌ No (fixed buckets) |
| Good for unique keys? | ❌ No | ✅ Yes |
| Shuffle-free joins? | ❌ Not guaranteed | ✅ Yes (if same buckets/sort key) |
| Filtering optimization | ✅ Yes (on small sets) | ❌ No |
| File explosion risk | 🔥 Extremely high | ❌ No |

✅ Final Recommendation

🔹 For joins on a high-cardinality column like id,
✅ Use Bucketing, NOT Partitioning.

Partitioning should only be used on:

  • Columns with low cardinality (e.g., country, region, month)
  • Used often in WHERE clauses for filtering
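
For example, a minimal sketch of the recommended pattern (paths and column names are illustrative; spark and df are assumed from the earlier snippets):

# Partition by a low-cardinality column that is used often in WHERE clauses.
df.write.partitionBy("country").mode("overwrite").parquet("/warehouse/users_by_country")

# A filter on the partition column lets Spark prune folders at scan time,
# reading only .../country=IN/ instead of the whole dataset.
spark.read.parquet("/warehouse/users_by_country").filter("country = 'IN'").show()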

Absolutely! Below is a set of interview-style trick questions based on Partitioning vs Bucketing, focusing on join optimization, storage behavior, and edge cases like partitioning on high-cardinality columns.


🎯 Trick & Conceptual Interview Questions – Partitioning vs Bucketing in Spark

| # | Question | Answer / Trap |
|---|---|---|
| 1️⃣ | Can you partition a table by a column with 1 million unique values? | ⚠️ Yes, but you shouldn't. It creates 1 million folders, leading to the small files problem. |
| 2️⃣ | What happens if I partition by a primary key before a join? | Spark writes one folder per key, creating a massive file explosion and poor performance. |
| 3️⃣ | If id is unique per row, which is better for join optimization – partitioning or bucketing? | ✅ Bucketing is better – controlled number of files, avoids shuffle. |
| 4️⃣ | Will bucketing avoid shuffle if two tables are bucketed by the same column? | ✅ Yes – only if the bucket count is the same, the bucket column is the join key, and both tables are sorted by that column. |
| 5️⃣ | Can you partition and bucket the same table? | ✅ Yes – e.g., partition by dt and bucket by id. Helps both filter and join performance. |
| 6️⃣ | Is bucketing a logical concept like partitioning? | ❌ No – bucketing is physical and requires saveAsTable() in a Hive-compatible format. |
| 7️⃣ | What if you bucket one table by id and the other by user_id, but both hold the same values? | ❌ Won't optimize the join – Spark matches the bucket column to the join key, not just value compatibility. |
| 8️⃣ | How many files will be created if you bucket a table with 1 million rows and use 8 buckets? | ✅ Roughly 8 – rows are hashed into 8 buckets (Spark may write one file per bucket per write task). |
| 9️⃣ | Is bucketing supported natively in Delta Lake write operations? | ❌ No native bucketBy() support in Delta writes – only in Hive tables. |
| 🔟 | When does partitioning help most – during joins or filters? | ✅ Filters – partitioning prunes data during the scan, not necessarily during the join. |

🧠 Follow-up Deep-Dive Questions

  1. “If both tables are bucketed on the same column but have different bucket counts, will Spark optimize the join?”
    ❌ No. For bucket join optimization, bucket counts must match.
  2. “How does Spark know if it can avoid shuffle during a join?”
    ✅ It looks at physical layout metadata (e.g., sorted + bucketed + same partitioning).
  3. “How do you fix small file problems caused by over-partitioning?”
    ✅ Repartition before write or compact files using .coalesce(n) or OPTIMIZE in Delta.
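
A minimal compaction sketch for point 3 (paths and file counts are illustrative; spark is an active session):

# Read the over-partitioned output and rewrite it as fewer, larger files.
small_files_df = spark.read.parquet("/warehouse/over_partitioned")

# coalesce(n) narrows to n output files without a full shuffle; use
# repartition(n) instead if the data is skewed and needs rebalancing first.
small_files_df.coalesce(16).write.mode("overwrite").parquet("/warehouse/compacted")

# On a Delta table, compaction can instead be done in place:
# spark.sql("OPTIMIZE my_delta_table")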

Great follow-up!

🔴 No, you cannot apply bucketBy() or partitionBy() on temporary views or temp tables directly.
These optimizations are applied only when physically writing data to disk or saving as a permanent Hive/Delta/Parquet table.


πŸ” Breakdown

| Operation | Supported for Partitioning/Bucketing? | Why? |
|---|---|---|
| Temp view via .createOrReplaceTempView() | ❌ No | Temp views are logical, in-memory only. |
| bucketBy() on a DataFrame without a write | ❌ No | Bucketing requires a physical layout, enforced only on write to a table. |
| partitionBy() + .parquet() or .saveAsTable() | ✅ Yes | Applies physical partitioning at write time. |
| bucketBy() + .saveAsTable() | ✅ Yes (Hive-compatible) | Only works when writing to a Hive table. |
| bucketBy() + .parquet() | ❌ No | Bucketing does not work with .parquet(); only with saveAsTable(). |

✅ Correct Usage Examples

✅ Partitioning (works on files like Parquet, Delta)

df.write.partitionBy("country").parquet("/path/to/data")

✅ Bucketing (must use saveAsTable)

df.write.bucketBy(8, "user_id").sortBy("user_id").saveAsTable("user_bucketed")

❌ Invalid (logical temp view – no effect)

df.createOrReplaceTempView("temp_users")
# Cannot bucket or partition a temp view

🧠 Why This Matters

  • Partitioning: Affects physical layout – helps filter pruning and scan performance.
  • Bucketing: Affects physical layout + join planning – requires a Hive table for effectiveness.
  • Temp views: Only for logical queries, no physical optimizations.

✅ Summary

| Can You Apply On | partitionBy() | bucketBy() |
|---|---|---|
| Temp view | ❌ No | ❌ No |
| Hive table | ✅ Yes | ✅ Yes |
| Parquet file | ✅ Yes | ❌ No |
| Delta table | ✅ Yes | ⚠️ Not natively |
| In-memory only | ❌ No | ❌ No |

Great nuance you're bringing in – let's clarify how defaultParallelism, repartition, and partitioning/bucketing relate and where they do or do not apply.


🔧 TL;DR:

| Concept | Scope | Affects Write? | Affects Join? | Persisted? | Main Use |
|---|---|---|---|---|---|
| defaultParallelism | Cluster-level | ❌ No | ❌ No | ❌ No | Controls task parallelism |
| df.repartition(n) | DataFrame-level | ❌ No (in-memory only) | ✅ Yes (shuffle-aware) | ❌ No | Improves load balancing |
| partitionBy("col") | Write-level | ✅ Yes | ❌ Not directly | ✅ Yes | Write optimization (scan/filter) |
| bucketBy(n, "col") | Write-level | ✅ Yes | ✅ Yes (avoids shuffle) | ✅ Yes | Join optimization |

πŸ” 1. spark.default.parallelism

  • Sets default number of partitions for RDDs created via parallelize, joins, groupBy, etc.
  • Does not affect how DataFrames are written or bucketed.
  • Set via:
spark.sparkContext.defaultParallelism

πŸ‘‰ Use to understand parallelism but not for layout control.
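
A quick way to see these values in a running session (a minimal sketch):

# Cluster-level default parallelism for RDD operations (read-only at runtime;
# set with the spark.default.parallelism config at submit time).
print(spark.sparkContext.defaultParallelism)

# DataFrame shuffles (joins, aggregations) use a separate knob:
print(spark.conf.get("spark.sql.shuffle.partitions"))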


🔄 2. df.repartition(n) or .repartition("col")

  • Triggers a full shuffle to evenly redistribute data.
  • Helps if you have a skewed DataFrame before a join.
  • Use before .join() to balance shuffle size:
df = df.repartition("join_key")

✅ Works well before joins to balance skew.
❌ But this does not result in persisted partitions on disk.


🧱 3. partitionBy() – Only affects writes

df.write.partitionBy("col").parquet(".../path")
  • Affects layout on disk, not in-memory.
  • Helps filter pruning (e.g., scan only data for month = '2024-05').
  • No effect on shuffle or joins unless the filter is pushed down.
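
One way to confirm the pruning is to inspect the physical plan; a quick sketch (the path is illustrative):

# The filter on the partition column shows up under PartitionFilters in the
# file scan node, meaning only the month=2024-05 folder is read.
spark.read.parquet("/warehouse/events_by_month").filter("month = '2024-05'").explain()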

📦 4. bucketBy() – Join Optimization (but only on save)

df.write.bucketBy(8, "user_id").sortBy("user_id").saveAsTable("my_table")
  • Must be written as a Hive table.
  • Required for shuffle-free joins.
  • Doesn't affect DataFrame behavior unless the table is read back via spark.table().
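
To check that the bucketing spec actually landed in the catalog, a quick sketch (using the table name from the snippet above):

# DESCRIBE FORMATTED lists "Num Buckets" and "Bucket Columns" for the table.
spark.sql("DESCRIBE FORMATTED my_table").show(50, truncate=False)

# Reading through the catalog (not the raw files) lets the planner use that spec.
bucketed = spark.table("my_table")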

🧠 So, To Answer Your Question:

❓ Can defaultParallelism or repartition() replace partitioning or bucketing?

✅ They help in:

  • Improving in-memory join performance by balancing data.
  • Avoiding skew during shuffle stage.

❌ They do not:

  • Persist data layout
  • Help with long-term read/join optimizations
  • Enable partition pruning or bucket-based joins

🧪 Practical Tip

Before joining a skewed dataset:

df_large = df_large.repartition("user_id")
joined_df = df_large.join(df_small, "user_id")

Before writing to storage:

df.write.partitionBy("dt").parquet("...")   # Optimized for filters
df.write.bucketBy(8, "user_id").sortBy("user_id").saveAsTable("bucketed_tbl")  # Optimized for joins
