This role targets advanced-level PySpark, Big Data systems, and backend engineering. Here is a breakdown of the questions you can expect, based on industry trends.
✅ Topic-wise Breakdown of Likely Questions
🔹 PySpark & Big Data (Core Focus)
Area | Sample Questions |
---|---|
PySpark DataFrame APIs | – How is `selectExpr` different from `select`? – Use `withColumn`, `explode`, and `filter` in one chain (see the sketch after this table). – Convert nested JSON to a flat table. – Difference between `collect()`, `show()`, and `toPandas()`? |
Performance Optimization | – When to use caching vs. checkpointing? – What is a broadcast join, and when is it risky? – How do you reduce shuffle? – What is the impact of increasing partitions? |
Partitioning | – How do you write partitioned Parquet in PySpark? – Static vs. dynamic partitioning use cases? – How does partition pruning behave? |
File Formats | – Compare Avro, ORC, and Parquet. – Why use Delta Lake? – How do you handle corrupt JSON files? |
PySpark vs SQL | – When to prefer the PySpark DataFrame API over Spark SQL? – Write both SQL and DataFrame logic to solve the same task. |
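To make the DataFrame API questions concrete, here is a minimal sketch covering `selectExpr`, a chained `withColumn`/`explode`/`filter`, and the same logic in Spark SQL. The column names (`order_id`, `items`, `qty`) and the sample data are hypothetical, purely to illustrate the calls.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data: one order with an array of item structs
df = spark.createDataFrame(
    [(1, [{"product": "A", "qty": 2}, {"product": "B", "qty": 1}])],
    "order_id INT, items ARRAY<STRUCT<product: STRING, qty: INT>>",
)

# select takes column objects/names; selectExpr accepts SQL expression strings
df.select(F.col("order_id")).show()
df.selectExpr("order_id", "size(items) AS item_count").show()

# withColumn + explode + filter in one chain: one output row per array element
flat = (
    df.withColumn("item", F.explode("items"))
      .withColumn("qty", F.col("item.qty"))
      .filter(F.col("qty") > 1)
      .select("order_id", "item.product", "qty")
)
flat.show()

# Same logic expressed in Spark SQL instead of the DataFrame API
df.createOrReplaceTempView("orders")
spark.sql("""
    SELECT order_id, item.product, item.qty
    FROM (SELECT order_id, explode(items) AS item FROM orders)
    WHERE item.qty > 1
""").show()
```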
🔹 Delta Lake & Azure Databricks
Area | Sample Questions |
---|---|
Delta Lake | – What is the purpose of Delta Lake? – Explain `MERGE`, `OPTIMIZE`, and `ZORDER`. – Difference between a Delta table and a managed Parquet table? |
Azure Databricks | – What is Auto Loader and how does it work (see the sketch after this table)? – How do you implement schema evolution in Databricks? – Explain Unity Catalog and data lineage. |
DevOps | – How do you deploy a PySpark job in Azure Databricks? – Describe the role of notebooks, jobs, and workflows. |
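For the Auto Loader question, here is a minimal streaming-ingest sketch. It assumes a Databricks notebook environment (where `spark` is pre-defined); the storage paths and table name are placeholders, and the `cloudFiles` options shown are the commonly used ones for JSON ingestion with schema evolution.

```python
# Auto Loader: incremental file ingestion (Databricks only)
stream_df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/mnt/checkpoints/orders_schema")  # placeholder path
    .option("cloudFiles.schemaEvolutionMode", "addNewColumns")  # evolve schema on new fields
    .load("/mnt/raw/orders/")  # placeholder source path; a _rescued_data column is added automatically
)

(
    stream_df.writeStream
    .option("checkpointLocation", "/mnt/checkpoints/orders")  # placeholder path
    .trigger(availableNow=True)  # process everything available, then stop
    .toTable("bronze.orders")    # placeholder Delta table
)
```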
🔹 Backend Systems & REST APIs
Area | Sample Questions |
---|---|
API Integration | – Have you exposed PySpark jobs via a REST API? – How do you trigger an ETL run via an API (see the sketch after this table)? – Have you used Flask/FastAPI for job orchestration? |
Data Pipelines | – Explain a typical backend ETL pipeline you have designed. – How do you ensure pipelines are fault tolerant? – What are your retry and recovery strategies? |
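As one way to answer the "trigger ETL via API" question, here is a hedged FastAPI sketch that starts a Databricks job through the Jobs REST API (`run-now`). The endpoint path, environment variable names, and job id are placeholders you would replace with your own workspace configuration.

```python
# Minimal sketch: a FastAPI endpoint that triggers a Databricks job run.
import os

import requests
from fastapi import FastAPI, HTTPException

app = FastAPI()

DATABRICKS_HOST = os.environ["DATABRICKS_HOST"]    # e.g. https://<workspace>.azuredatabricks.net (placeholder)
DATABRICKS_TOKEN = os.environ["DATABRICKS_TOKEN"]  # personal access token (placeholder)


@app.post("/etl/{job_id}/run")
def trigger_etl(job_id: int):
    """Trigger a Databricks job via the Jobs API (run-now) and return the run id."""
    resp = requests.post(
        f"{DATABRICKS_HOST}/api/2.1/jobs/run-now",
        headers={"Authorization": f"Bearer {DATABRICKS_TOKEN}"},
        json={"job_id": job_id},
        timeout=30,
    )
    if resp.status_code != 200:
        raise HTTPException(status_code=502, detail=resp.text)
    return {"run_id": resp.json().get("run_id")}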
🔹 SQL & Data Modeling
Area | Sample Questions |
---|---|
SQL Skills | – Write a window function to find the top 2 orders per customer (see the sketch after this table). – Use `GROUPING SETS`, `ROLLUP`, and `CUBE` in an example. – Join 3+ tables while handling edge cases (nulls, outer joins). |
Data Modeling | – Star vs. Snowflake schema? – Normalize a denormalized product-order table. |
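A sketch of the "top 2 orders per customer" pattern, assuming a hypothetical `orders` DataFrame with columns `customer_id`, `order_id`, and `order_ts`:

```python
from pyspark.sql import functions as F, Window

# Rank each customer's orders by recency, then keep the two most recent
w = Window.partitionBy("customer_id").orderBy(F.col("order_ts").desc())

top2 = (
    orders
    .withColumn("rn", F.row_number().over(w))
    .filter(F.col("rn") <= 2)
    .drop("rn")
)

# Equivalent Spark SQL
orders.createOrReplaceTempView("orders")
top2_sql = spark.sql("""
    SELECT customer_id, order_id, order_ts
    FROM (
        SELECT *, ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_ts DESC) AS rn
        FROM orders
    )
    WHERE rn <= 2
""")
```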
🔹 Behavioral / Citi-Specific Questions
Type | Sample Questions |
---|---|
Behavioral | – Tell me about a time you improved a system's performance. – Describe a failure in a pipeline you built and how you resolved it. – How do you balance innovation with risk in data engineering? |
Team Fit | – What is your experience working in a regulated environment (such as banking)? – How do you manage code review and version control in PySpark projects? |
📌 Recommended Preparation Tips
- Revise your recent PySpark/Databricks projects – focus on challenges, performance tuning, and business impact.
- Practice coding in PySpark (not just SQL) – especially complex transformations, joins, and UDFs.
- Go through Delta Lake commands and syntax (`MERGE`, `UPDATE`, `OPTIMIZE`, `VACUUM`, etc.).
- Rehearse behavioral answers using the STAR format (Situation, Task, Action, Result).
- If there's a timed assessment, expect a mix of:
  - PySpark + SQL code problems
  - MCQs on Spark internals, joins, and performance
  - Logical or backend system design questions
Below is a complete prep package for the Citi Sr. Back-End Developer (PySpark / Big Data) assessment:
✅ 1. Mock Test / Quiz (40 mins)
🔸 Section A: MCQs (10 Questions)
Choose the correct option. (1 mark each)
Q1. What’s the default join type in PySpark DataFrame API?
A. Left Outer Join
B. Inner Join
C. Full Outer Join
D. Cross Join
Q2. Which of the following increases shuffle?
A. Caching
B. Filtering
C. GroupBy
D. Repartition with fewer partitions
Q3. What does `.coalesce(1)` do?
A. Increases partitions
B. Forces one partition
C. Adds cache
D. Converts RDD to DataFrame
Q4. Which file format supports schema evolution in Delta Lake?
A. CSV
B. ORC
C. Delta
D. Avro
Q5. Which of the following is NOT a benefit of broadcast joins?
A. Reduces data shuffle
B. Suitable for small tables
C. Always memory-efficient
D. Faster joins with big + small tables
🔸 Section B: PySpark Code Challenge (Short Answers)
Q6. Write PySpark code to explode a column named `items` (array) and keep the rest of the columns unchanged.
Q7. How would you cast a column `price` to `FloatType` in a DataFrame?
Q8. Given a nested JSON file with structure:
```json
{
  "order": {
    "id": 123,
    "items": [
      {"product": "A", "qty": 2},
      {"product": "B", "qty": 1}
    ]
  }
}
```
Write code to flatten it into columns: `order_id`, `product`, `qty`.
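One possible answer, as a hedged sketch: assuming the file is read with `spark.read.json` (with `multiLine` enabled, since the JSON spans multiple lines) and Spark infers `order` as a struct, the nested fields can be flattened like this. The input path is a placeholder.

```python
from pyspark.sql import functions as F

raw = spark.read.option("multiLine", True).json("/path/to/orders.json")  # placeholder path

flat = (
    raw.select(
        F.col("order.id").alias("order_id"),
        F.explode("order.items").alias("item"),  # one row per array element
    )
    .select(
        "order_id",
        F.col("item.product").alias("product"),
        F.col("item.qty").alias("qty"),
    )
)
```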
🔸 Section C: SQL Design (1–2 line answers)
Q9. Write a query to fetch the top 2 recent transactions per user.
Q10. What's the use of `GROUPING SETS`? Give an example.
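For reference, a small `GROUPING SETS` example run through Spark SQL; the `sales` table and its columns (`region`, `product`, `amount`) are hypothetical:

```python
# GROUPING SETS: compute several group-by combinations in a single pass
spark.sql("""
    SELECT region, product, SUM(amount) AS total_amount
    FROM sales
    GROUP BY GROUPING SETS ((region), (product), (region, product))
""").show()
```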
✅ 2. Cheat Sheet – Last Minute Revision
🔹 PySpark Transformations
df.select("col1", "col2")
df.withColumn("new_col", df.old_col + 1)
df.filter(df.age > 30)
df.groupBy("dept").agg(F.avg("salary"))
df.join(df2, "id", "left")
df.orderBy("timestamp", ascending=False)
🔹 Performance Tips
- Broadcast join: `broadcast(df_small)` (see the sketch below)
- Repartition: `df.repartition("col")`
- Cache: `df.cache()` before reusing a DataFrame in multiple actions
- Coalesce: use before writing to reduce the number of output files
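A minimal broadcast-join sketch, assuming `df_large` and `df_small` are existing DataFrames that share a `customer_id` key:

```python
from pyspark.sql.functions import broadcast

# Ship the small DataFrame to every executor instead of shuffling both sides
joined = df_large.join(broadcast(df_small), on="customer_id", how="left")
```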
🔹 Delta Lake (Databricks)
```sql
MERGE INTO target USING source ON target.id = source.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

OPTIMIZE my_table ZORDER BY (customer_id);

VACUUM my_table RETAIN 168 HOURS;
```
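The same upsert can also be expressed with the Python Delta Lake API; a sketch, assuming the `delta-spark` package is available and that `updates_df` and the `target` table are yours:

```python
from delta.tables import DeltaTable

target = DeltaTable.forName(spark, "target")  # or DeltaTable.forPath(spark, "/path/to/target")

(
    target.alias("t")
    .merge(updates_df.alias("s"), "t.id = s.id")  # join condition for the upsert
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```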
🔹 Spark File Formats
Format | Compression | Schema Support | Best Use Case |
---|---|---|---|
Parquet | Yes | Yes | Analytics |
Delta | Yes | Yes + ACID | Slowly changing data |
Avro | Yes | Yes | Kafka pipelines |
✅ 3. Hands-on PySpark Notebook for Databricks
Here’s what the notebook will cover:
- ✅ Read JSON/CSV/Parquet from mounted storage
- ✅ Explode nested JSONs
- ✅ Broadcast Join example
- ✅ Delta Lake `MERGE`, `OPTIMIZE`, `VACUUM`
- ✅ Caching, Repartitioning
- ✅ Schema evolution & rescued data column
- ✅ Notebook Widgets to simulate ad-hoc parameter inputs
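As a reminder of the widget pattern, here is a minimal sketch using `dbutils.widgets` (available only inside Databricks notebooks; the widget name, default value, and table/column names are placeholders):

```python
# Databricks only: dbutils is injected into the notebook environment
dbutils.widgets.text("run_date", "2024-01-01", "Run date")  # create a text widget (placeholder default)
run_date = dbutils.widgets.get("run_date")                  # read the current widget value

# Use the parameter in a query (placeholder table and column)
df = spark.table("bronze.orders").filter(f"order_date = '{run_date}'")
```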