# Hints and Answers for Everything
Here's a crisp, snippet-style answer for each of these commonly asked PySpark questions:
---
### 🔹 **How do you handle corrupt records while reading JSON/CSV?**
```python
# For JSON: redirect corrupt records to a separate location
# (badRecordsPath is a Databricks-specific option)
df = spark.read.option("badRecordsPath", "s3://path/to/badRecords").json("s3://path/to/json")

# For CSV: keep corrupt rows in a dedicated column
df = spark.read.option("mode", "PERMISSIVE") \
    .option("columnNameOfCorruptRecord", "_corrupt_record") \
    .csv("s3://path/to/csv")

# Read modes: PERMISSIVE (default), DROPMALFORMED, FAILFAST
```
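In PERMISSIVE mode it pays to actually inspect what failed to parse. A minimal sketch, assuming the CSV read above and that the `_corrupt_record` column is present in the parsed schema:

```python
from pyspark.sql.functions import col

# Cache first: Spark blocks queries that reference only the internal
# corrupt-record column on a raw JSON/CSV read, so materialise the parse
df.cache()

# Rows that failed to parse
df.filter(col("_corrupt_record").isNotNull()).show(truncate=False)

# Clean rows for downstream processing
clean_df = df.filter(col("_corrupt_record").isNull()).drop("_corrupt_record")
```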
### 🔹 **How do you handle duplicate rows in a DataFrame?**
```python
# Remove exact duplicates (all columns must match)
df = df.dropDuplicates()

# Remove duplicates based on specific columns
df = df.dropDuplicates(["col1", "col2"])
```
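`dropDuplicates` keeps an arbitrary row per key. If you need a deterministic rule such as "keep the latest record per key", here is a window-function sketch; the `id` key and `ts` timestamp columns are illustrative:

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Rank rows within each key by timestamp, newest first, and keep the top row
w = Window.partitionBy("id").orderBy(F.col("ts").desc())

latest_df = (
    df.withColumn("rn", F.row_number().over(w))
      .filter(F.col("rn") == 1)
      .drop("rn")
)
```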
---
### 🔹 **How to handle nulls before aggregation?**
```python
# Replace nulls before groupBy
df = df.fillna(0)  # or per column: df.fillna({"col1": 0, "col2": "unknown"})

# Then safely aggregate
df.groupBy("group_col").agg({"metric_col": "sum"})
```
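An alternative sketch that handles nulls inside the aggregation itself, using explicit functions and aliases (column names are illustrative):

```python
from pyspark.sql import functions as F

# coalesce() turns nulls into 0 at aggregation time, so sums and averages
# treat missing values as zero without rewriting the source DataFrame
agg_df = df.groupBy("group_col").agg(
    F.sum(F.coalesce(F.col("metric_col"), F.lit(0))).alias("metric_sum"),
    F.count("*").alias("row_count"),
)
```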
---
### 🔹 **How do you read only specific columns from a Parquet file?**
```python
# Column pruning: only col1 and col2 are read from the Parquet files
df = spark.read.parquet("s3://path/to/data").select("col1", "col2")
```
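Column pruning happens at the scan, so only the selected columns are read from the Parquet files. A quick sketch for confirming this (and any predicate pushdown) in the physical plan; the filter is illustrative:

```python
df = (
    spark.read.parquet("s3://path/to/data")
         .select("col1", "col2")
         .filter("col1 > 100")
)

# The FileScan node should show ReadSchema with only col1/col2
# and a PushedFilters entry for the predicate
df.explain(True)
```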
---
### 🔹 **How do you optimize wide transformations like joins and groupBy?**
```python
# Use a broadcast join if one side is small enough to fit in executor memory
from pyspark.sql.functions import broadcast
df = large_df.join(broadcast(small_df), "key")

# Repartition on the group key to reduce shuffle pressure
df = df.repartition("group_key")

# Cache intermediate results if they are reused
df.cache()
```
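On Spark 3.x, adaptive query execution can take care of much of this automatically. A configuration sketch; the broadcast threshold value is illustrative:

```python
# Let AQE coalesce shuffle partitions and split skewed join partitions at runtime
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

# Raise the automatic broadcast threshold (default 10 MB) if the small side
# of the join comfortably fits in executor memory
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(50 * 1024 * 1024))
```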
---
### 🔹 **How do you write partitioned Parquet files with overwrite behavior?**
```python
df.write.mode("overwrite") \
    .partitionBy("year", "month") \
    .parquet("s3://path/output/")
```
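Note that `mode("overwrite")` replaces the entire output path by default. If you only want to replace the partitions present in the incoming DataFrame, here is a sketch using dynamic partition overwrite (Spark 2.3+):

```python
# Only partitions that appear in df are overwritten; others are left untouched
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

df.write.mode("overwrite") \
    .partitionBy("year", "month") \
    .parquet("s3://path/output/")
```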
---
### 🔹 **How do you check for skew in Spark join keys?**
```python
# Top join-key frequencies; a heavy head indicates skew
df.groupBy("join_key").count().orderBy("count", ascending=False).show(10)
```
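If a handful of keys dominate the counts, one common mitigation is key salting: spread the hot keys of the large side across several buckets and replicate the small side to match. A sketch; the bucket count and column names are illustrative:

```python
from pyspark.sql import functions as F

SALT_BUCKETS = 16

# Large, skewed side: append a random salt to the join key
salted_large = large_df.withColumn(
    "salted_key",
    F.concat_ws("_", F.col("join_key"), (F.rand() * SALT_BUCKETS).cast("int").cast("string")),
)

# Small side: replicate each key once per possible salt value
salted_small = small_df.withColumn(
    "salt", F.explode(F.array([F.lit(str(i)) for i in range(SALT_BUCKETS)]))
).withColumn(
    "salted_key", F.concat_ws("_", F.col("join_key"), F.col("salt")),
)

# Join on the salted key; drop or rename the duplicate join_key columns
# before selecting them downstream
joined = salted_large.join(salted_small, "salted_key")
```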
---