Category: Interview Prep

  • PySpark Interview Q&A Cheat Sheet, covering all questions so far and formatted consistently for flashcards, Excel, or cheat sheet use. Question: How do you handle schema mismatch when reading multiple JSON/Parquet files with different structures? Answer: Use .option("mergeSchema", "true") when reading Parquet files; for JSON, unify schemas by selecting common…

  • Explain a scenario on schema evolution in data pipelines. Here’s an automated Python script using PySpark that performs schema evolution between two datasets (e.g., two Parquet files or DataFrames). Sections: Features, Prerequisites, Script (Schema Evolution Handler), Output, Notes. Automated script for schema evolution: first to check what fields are missing or…

  • Challenging Question on Website Visits / Pageviews. You are given a table website_visits with the following columns (name, data type, description): visit_id (int, unique visit ID); user_id (int, unique user ID); page_id (int, unique page ID); visit_date (date, date of visit); visit_time (timestamp, timestamp of visit); page_view_time (int, time spent on page in seconds). Your…

  • How to find the second-highest salary for each department in SQL? Here are multiple methods, using different approaches. 1. Using Correlated Subquery: This approach uses a subquery to find the highest salary for each department and then excludes it. The subquery in SQL statement: is…
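
The correlated-subquery idea in the last excerpt (find each department's highest salary, exclude it, then take the maximum of what remains) can be sketched as follows. This is a minimal illustration, not the post's own query: the `employees` table, its columns, and the sample rows are all assumptions made up for the example, and SQLite is used only because it ships with Python.

```python
import sqlite3

# Hypothetical table and sample data, invented for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [
        ("Ana", "Eng", 120), ("Bo", "Eng", 150), ("Cy", "Eng", 150),
        ("Di", "Ops", 90), ("Ed", "Ops", 70), ("Flo", "HR", 80),
    ],
)

# The inner query is correlated with the outer row through e.department:
# it returns that department's top salary, which the WHERE clause excludes.
# MAX over the remaining rows is then the second-highest distinct salary.
query = """
SELECT e.department, MAX(e.salary) AS second_highest
FROM employees AS e
WHERE e.salary < (
    SELECT MAX(m.salary) FROM employees AS m WHERE m.department = e.department
)
GROUP BY e.department
ORDER BY e.department
"""
second_highest = conn.execute(query).fetchall()
print(second_highest)
```

With this sample data, ties at the top (two Eng salaries of 150) count as one value, and a department with a single distinct salary (HR) does not appear in the result at all. Whether that is the desired behavior depends on how the interview question defines "second-highest".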