# Hints and Answers for Everything
Here's a crisp, snippet-style answer for each of these commonly asked PySpark questions:
---
### 🔹 **How do you handle corrupt records while reading JSON/CSV?**
```python
# For JSON: redirect corrupt records to a separate location
# (badRecordsPath is a Databricks-specific option)
df = spark.read.option("badRecordsPath", "s3://path/to/badRecords").json("s3://path/to/json")

# For CSV: keep corrupt rows in a dedicated column
df = spark.read.option("mode", "PERMISSIVE") \
    .option("columnNameOfCorruptRecord", "_corrupt_record") \
    .csv("s3://path/to/csv")

# Read modes: PERMISSIVE (default), DROPMALFORMED, FAILFAST
```
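In PERMISSIVE mode it pays to actually inspect what failed to parse. A minimal sketch, assuming the CSV read above and that the `_corrupt_record` column is present in the parsed schema:

```python
from pyspark.sql.functions import col

# Cache first: Spark blocks queries that reference only the internal
# corrupt-record column on a raw JSON/CSV read, so materialise the parse
df.cache()

# Rows that failed to parse
df.filter(col("_corrupt_record").isNotNull()).show(truncate=False)

# Clean rows for downstream processing
clean_df = df.filter(col("_corrupt_record").isNull()).drop("_corrupt_record")
```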
### 🔹 **How do you handle duplicate rows in a DataFrame?**
```python
# Remove exact duplicates (all columns must match)
df = df.dropDuplicates()

# Remove duplicates based on specific columns
df = df.dropDuplicates(["col1", "col2"])
```
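`dropDuplicates` keeps an arbitrary row per key. If you need a deterministic rule such as "keep the latest record per key", here is a window-function sketch; the `id` key and `ts` timestamp columns are illustrative:

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Rank rows within each key by timestamp, newest first, and keep the top row
w = Window.partitionBy("id").orderBy(F.col("ts").desc())

latest_df = (
    df.withColumn("rn", F.row_number().over(w))
      .filter(F.col("rn") == 1)
      .drop("rn")
)
```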
---
### 🔹 **How to handle nulls before aggregation?**
```python
# Replace nulls before groupBy
df = df.fillna(0)  # or per column: df.fillna({"col1": 0, "col2": "unknown"})

# Then safely aggregate
df.groupBy("group_col").agg({"metric_col": "sum"})
```
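An alternative sketch that handles nulls inside the aggregation itself, using explicit functions and aliases (column names are illustrative):

```python
from pyspark.sql import functions as F

# coalesce() turns nulls into 0 at aggregation time, so sums and averages
# treat missing values as zero without rewriting the source DataFrame
agg_df = df.groupBy("group_col").agg(
    F.sum(F.coalesce(F.col("metric_col"), F.lit(0))).alias("metric_sum"),
    F.count("*").alias("row_count"),
)
```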
---
### 🔹 **How do you read only specific columns from a Parquet file?**
```python
# Column pruning: only col1 and col2 are read from the Parquet files
df = spark.read.parquet("s3://path/to/data").select("col1", "col2")
```
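Column pruning happens at the scan, so only the selected columns are read from the Parquet files. A quick sketch for confirming this (and any predicate pushdown) in the physical plan; the filter is illustrative:

```python
df = (
    spark.read.parquet("s3://path/to/data")
         .select("col1", "col2")
         .filter("col1 > 100")
)

# The FileScan node should show ReadSchema with only col1/col2
# and a PushedFilters entry for the predicate
df.explain(True)
```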
---
### 🔹 **How do you optimize wide transformations like joins and groupBy?**
```python
# Use a broadcast join if one side is small enough to fit in executor memory
from pyspark.sql.functions import broadcast
df = large_df.join(broadcast(small_df), "key")

# Repartition on the group key to reduce shuffle pressure
df = df.repartition("group_key")

# Cache intermediate results if they are reused
df.cache()
```
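On Spark 3.x, adaptive query execution can take care of much of this automatically. A configuration sketch; the broadcast threshold value is illustrative:

```python
# Let AQE coalesce shuffle partitions and split skewed join partitions at runtime
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

# Raise the automatic broadcast threshold (default 10 MB) if the small side
# of the join comfortably fits in executor memory
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(50 * 1024 * 1024))
```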
---
### 🔹 **How do you write partitioned Parquet files with overwrite behavior?**
```python
df.write.mode("overwrite") \
    .partitionBy("year", "month") \
    .parquet("s3://path/output/")
```
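Note that `mode("overwrite")` replaces the entire output path by default. If you only want to replace the partitions present in the incoming DataFrame, here is a sketch using dynamic partition overwrite (Spark 2.3+):

```python
# Only partitions that appear in df are overwritten; others are left untouched
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

df.write.mode("overwrite") \
    .partitionBy("year", "month") \
    .parquet("s3://path/output/")
```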
---
### 🔹 **How do you check for skew in Spark join keys?**
```python
# Top join-key frequencies; a heavy head indicates skew
df.groupBy("join_key").count().orderBy("count", ascending=False).show(10)
```
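If a handful of keys dominate the counts, one common mitigation is key salting: spread the hot keys of the large side across several buckets and replicate the small side to match. A sketch; the bucket count and column names are illustrative:

```python
from pyspark.sql import functions as F

SALT_BUCKETS = 16

# Large, skewed side: append a random salt to the join key
salted_large = large_df.withColumn(
    "salted_key",
    F.concat_ws("_", F.col("join_key"), (F.rand() * SALT_BUCKETS).cast("int").cast("string")),
)

# Small side: replicate each key once per possible salt value
salted_small = small_df.withColumn(
    "salt", F.explode(F.array([F.lit(str(i)) for i in range(SALT_BUCKETS)]))
).withColumn(
    "salted_key", F.concat_ws("_", F.col("join_key"), F.col("salt")),
)

# Join on the salted key; drop or rename the duplicate join_key columns
# before selecting them downstream
joined = salted_large.join(salted_small, "salted_key")
```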
---