Reply To: SET 1

#6499
lochan2014
Keymaster

    Absolutely! Here's a crisp, snippet-style answer to each of the commonly asked questions:

    ### 🔹 **How do you handle corrupt records while reading JSON/CSV?**

    ```python
    # For JSON (badRecordsPath is Databricks-specific: it routes
    # unparseable records to the given path instead of failing the job)
    df = spark.read.option("badRecordsPath", "s3://path/to/badRecords").json("s3://path/to/json")

    # For CSV: PERMISSIVE keeps malformed rows and stores the raw line in
    # _corrupt_record (add that column to an explicit schema to retain it)
    df = spark.read.option("mode", "PERMISSIVE") \
        .option("columnNameOfCorruptRecord", "_corrupt_record") \
        .csv("s3://path/to/csv")

    # Parser modes: PERMISSIVE (default), DROPMALFORMED, FAILFAST
    ```
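    To inspect what was captured, filter on the corrupt-record column. One caveat: Spark rejects queries that reference only that internal column straight off the raw read, so cache first.

    ```python
    # Cache first; Spark disallows selecting only the internal
    # corrupt-record column on an uncached raw file read
    df.cache()
    df.filter(df["_corrupt_record"].isNotNull()).show(truncate=False)
    ```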

    ### 🔹 **How do you handle duplicate rows in a DataFrame?**

    ```python
    # Remove exact duplicates
    df = df.dropDuplicates()

    # Remove duplicates based on specific columns
    df = df.dropDuplicates(["col1", "col2"])
    ```
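    If you need to keep a particular row per key rather than an arbitrary one, the usual pattern is a row_number window; a minimal sketch, assuming an `updated_at` timestamp column:

    ```python
    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    # Keep the newest row per (col1, col2); updated_at is an assumed column
    w = Window.partitionBy("col1", "col2").orderBy(F.col("updated_at").desc())
    df = (df.withColumn("rn", F.row_number().over(w))
            .filter(F.col("rn") == 1)
            .drop("rn"))
    ```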

    ### 🔹 **How to handle nulls before aggregation?**

    ```python
    # Replace nulls before groupBy
    df = df.fillna(0)  # or df.fillna({"col1": 0, "col2": "unknown"})

    # Then safely aggregate
    df.groupBy("group_col").agg({"metric_col": "sum"})
    ```
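    Worth noting: built-in aggregates like sum and avg already skip nulls, so fillna matters most for null group keys or when a null metric should count as zero. An inline alternative (column names illustrative):

    ```python
    from pyspark.sql import functions as F

    # Bucket null group keys explicitly and treat null metrics as 0
    df.groupBy(F.coalesce(F.col("group_col"), F.lit("unknown")).alias("group_col")) \
      .agg(F.sum(F.coalesce(F.col("metric_col"), F.lit(0))).alias("total"))
    ```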

    ### 🔹 **How do you read only specific columns from a Parquet file?**

    ```python
    # Parquet is columnar, so selecting columns prunes I/O at the source
    df = spark.read.parquet("s3://path/to/data").select("col1", "col2")
    ```
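    To confirm the pruning took effect, check the physical plan:

    ```python
    # The FileScan node's ReadSchema should list only col1 and col2
    df.explain()
    ```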

    ### 🔹 **How do you optimize wide transformations like joins and groupBy?**

    ```python
    # Use a broadcast join if one side is small enough to ship to every executor
    from pyspark.sql.functions import broadcast
    df = large_df.join(broadcast(small_df), "key")

    # Repartition on the group key to reduce shuffle pressure
    df = df.repartition("group_key")

    # Cache intermediate results if reused
    df.cache()
    ```
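    On Spark 3.x, Adaptive Query Execution handles much of this automatically, including coalescing shuffle partitions and splitting skewed joins:

    ```python
    # Spark 3.x AQE settings (on by default since Spark 3.2)
    spark.conf.set("spark.sql.adaptive.enabled", "true")
    spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
    ```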

    ### 🔹 **How do you write partitioned Parquet files with overwrite behavior?**

    ```python
    df.write.mode("overwrite") \
        .partitionBy("year", "month") \
        .parquet("s3://path/output/")
    ```
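    Caveat: with the default static overwrite, this replaces the entire output path. To overwrite only the partitions present in the incoming DataFrame:

    ```python
    # Replace only the partitions being written; leave the rest intact
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
    df.write.mode("overwrite").partitionBy("year", "month").parquet("s3://path/output/")
    ```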

    ### 🔹 **How do you check for skew in Spark join keys?**

    ```python
    # A few keys with outsized counts relative to the rest indicate skew
    df.groupBy("join_key").count().orderBy("count", ascending=False).show(10)
    ```
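    If a handful of keys dominate, key salting is the classic mitigation; a minimal sketch, assuming two large DataFrames joined on `key` (names and the salt factor are illustrative):

    ```python
    from pyspark.sql import functions as F

    N = 10  # salt factor (assumption; tune to the observed skew)
    # Spread the big side's rows across N salted sub-keys
    left = left_df.withColumn("salt", (F.rand() * N).cast("long"))
    # Replicate the other side once per salt so every sub-key finds a match
    salts = spark.range(N).withColumnRenamed("id", "salt")
    right = right_df.crossJoin(salts)
    joined = left.join(right, ["key", "salt"]).drop("salt")
    ```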

    Would you like a few more on file formats, cache vs persist, or checkpointing?