This role targets advanced-level PySpark, Big Data systems, and backend engineering. Here is a breakdown of the questions you can expect, based on industry trends.
✅ Topic-wise Breakdown of Likely Questions
🔹 PySpark & Big Data (Core Focus)
Area | Sample Questions |
---|---|
PySpark DataFrame APIs | – How is `selectExpr` different from `select`? – Use `withColumn`, `explode`, and `filter` in one chain (see the sketch after this table). – Convert nested JSON to a flat table. – Difference between `collect()`, `show()`, and `toPandas()`? |
Performance Optimization | – When to use caching vs. checkpointing? – What is a broadcast join, and when is it risky? – How do you reduce shuffle? – What is the impact of increasing partitions? |
Partitioning | – How do you write partitioned Parquet in PySpark? – Static vs. dynamic partitioning use cases? – How does partition pruning behave? |
File Formats | – Compare Avro, ORC, and Parquet. – Why use Delta Lake? – How do you handle corrupt JSON files? |
PySpark vs SQL | – When to prefer the PySpark DataFrame API over Spark SQL? – Write both SQL and DataFrame logic to solve the same task. |
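To make the DataFrame API questions concrete, here is a minimal sketch covering `selectExpr`, a chained `withColumn`/`explode`/`filter`, and the same logic in Spark SQL. The column names (`order_id`, `items`, `qty`) and the sample data are hypothetical, purely to illustrate the calls.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data: one order with an array of item structs
df = spark.createDataFrame(
    [(1, [{"product": "A", "qty": 2}, {"product": "B", "qty": 1}])],
    "order_id INT, items ARRAY<STRUCT<product: STRING, qty: INT>>",
)

# select takes column objects/names; selectExpr accepts SQL expression strings
df.select(F.col("order_id")).show()
df.selectExpr("order_id", "size(items) AS item_count").show()

# withColumn + explode + filter in one chain: one output row per array element
flat = (
    df.withColumn("item", F.explode("items"))
      .withColumn("qty", F.col("item.qty"))
      .filter(F.col("qty") > 1)
      .select("order_id", "item.product", "qty")
)
flat.show()

# Same logic expressed in Spark SQL instead of the DataFrame API
df.createOrReplaceTempView("orders")
spark.sql("""
    SELECT order_id, item.product, item.qty
    FROM (SELECT order_id, explode(items) AS item FROM orders)
    WHERE item.qty > 1
""").show()
```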
🔹 Delta Lake & Azure Databricks
Area | Sample Questions |
---|---|
Delta Lake | – What is the purpose of Delta Lake? – Explain `MERGE`, `OPTIMIZE`, and `ZORDER`. – Difference between a Delta table and a managed Parquet table? |
Azure Databricks | – What is Auto Loader and how does it work (see the sketch after this table)? – How do you implement schema evolution in Databricks? – Explain Unity Catalog and data lineage. |
DevOps | – How do you deploy a PySpark job in Azure Databricks? – Describe the role of notebooks, jobs, and workflows. |
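For the Auto Loader question, here is a minimal streaming-ingest sketch. It assumes a Databricks notebook environment (where `spark` is pre-defined); the storage paths and table name are placeholders, and the `cloudFiles` options shown are the commonly used ones for JSON ingestion with schema evolution.

```python
# Auto Loader: incremental file ingestion (Databricks only)
stream_df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/mnt/checkpoints/orders_schema")  # placeholder path
    .option("cloudFiles.schemaEvolutionMode", "addNewColumns")  # evolve schema on new fields
    .load("/mnt/raw/orders/")  # placeholder source path; a _rescued_data column is added automatically
)

(
    stream_df.writeStream
    .option("checkpointLocation", "/mnt/checkpoints/orders")  # placeholder path
    .trigger(availableNow=True)  # process everything available, then stop
    .toTable("bronze.orders")    # placeholder Delta table
)
```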
🔹 Backend Systems & REST APIs
Area | Sample Questions |
---|---|
API Integration | – Have you exposed PySpark jobs via a REST API? – How do you trigger an ETL run via an API (see the sketch after this table)? – Have you used Flask/FastAPI for job orchestration? |
Data Pipelines | – Explain a typical backend ETL pipeline you have designed. – How do you ensure pipelines are fault tolerant? – What are your retry and recovery strategies? |
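As one way to answer the "trigger ETL via API" question, here is a hedged FastAPI sketch that starts a Databricks job through the Jobs REST API (`run-now`). The endpoint path, environment variable names, and job id are placeholders you would replace with your own workspace configuration.

```python
# Minimal sketch: a FastAPI endpoint that triggers a Databricks job run.
import os

import requests
from fastapi import FastAPI, HTTPException

app = FastAPI()

DATABRICKS_HOST = os.environ["DATABRICKS_HOST"]    # e.g. https://<workspace>.azuredatabricks.net (placeholder)
DATABRICKS_TOKEN = os.environ["DATABRICKS_TOKEN"]  # personal access token (placeholder)


@app.post("/etl/{job_id}/run")
def trigger_etl(job_id: int):
    """Trigger a Databricks job via the Jobs API (run-now) and return the run id."""
    resp = requests.post(
        f"{DATABRICKS_HOST}/api/2.1/jobs/run-now",
        headers={"Authorization": f"Bearer {DATABRICKS_TOKEN}"},
        json={"job_id": job_id},
        timeout=30,
    )
    if resp.status_code != 200:
        raise HTTPException(status_code=502, detail=resp.text)
    return {"run_id": resp.json().get("run_id")}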
🔹 SQL & Data Modeling
Area | Sample Questions |
---|---|
SQL Skills | – Write a window function to find the top 2 orders per customer (see the sketch after this table). – Use `GROUPING SETS`, `ROLLUP`, and `CUBE` in an example. – Join 3+ tables while handling edge cases (nulls, outer joins). |
Data Modeling | – Star vs. Snowflake schema? – Normalize a denormalized product-order table. |
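A sketch of the "top 2 orders per customer" pattern, assuming a hypothetical `orders` DataFrame with columns `customer_id`, `order_id`, and `order_ts`:

```python
from pyspark.sql import functions as F, Window

# Rank each customer's orders by recency, then keep the two most recent
w = Window.partitionBy("customer_id").orderBy(F.col("order_ts").desc())

top2 = (
    orders
    .withColumn("rn", F.row_number().over(w))
    .filter(F.col("rn") <= 2)
    .drop("rn")
)

# Equivalent Spark SQL
orders.createOrReplaceTempView("orders")
top2_sql = spark.sql("""
    SELECT customer_id, order_id, order_ts
    FROM (
        SELECT *, ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_ts DESC) AS rn
        FROM orders
    )
    WHERE rn <= 2
""")
```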
🔹 Behavioral / Citi-Specific Questions
Type | Sample Questions |
---|---|
Behavioral | – Tell me about a time you improved a system's performance. – Describe a failure in a pipeline you built and how you resolved it. – How do you balance innovation with risk in data engineering? |
Team Fit | – What is your experience working in a regulated environment (such as banking)? – How do you manage code review and version control in PySpark projects? |
📌 Recommended Preparation Tips
- Revise your recent PySpark/Databricks projects – focus on challenges, performance tuning, and business impact.
- Practice coding in PySpark (not just SQL) – especially complex transformations, joins, and UDFs.
- Go through Delta Lake commands and syntax (`MERGE`, `UPDATE`, `OPTIMIZE`, `VACUUM`, etc.).
- Rehearse behavioral answers using the STAR format (Situation, Task, Action, Result).
- If there's a timed assessment, expect a mix of:
  - PySpark + SQL code problems
  - MCQs on Spark internals, joins, and performance
  - Logical or backend system design questions
Below is a complete prep package for the Citi Sr. Back-End Developer (PySpark / Big Data) assessment:
✅ 1. Mock Test / Quiz (40 mins)
🔸 Section A: MCQs (10 Questions)
Choose the correct option. (1 mark each)
Q1. What’s the default join type in PySpark DataFrame API?
A. Left Outer Join
B. Inner Join
C. Full Outer Join
D. Cross Join
Q2. Which of the following increases shuffle?
A. Caching
B. Filtering
C. GroupBy
D. Repartition with fewer partitions
Q3. What does `.coalesce(1)` do?
A. Increases partitions
B. Forces one partition
C. Adds cache
D. Converts RDD to DataFrame
Q4. Which file format supports schema evolution in Delta Lake?
A. CSV
B. ORC
C. Delta
D. Avro
Q5. Which of the following is NOT a benefit of broadcast joins?
A. Reduces data shuffle
B. Suitable for small tables
C. Always memory-efficient
D. Faster joins with big + small tables
🔸 Section B: PySpark Code Challenge (Short Answers)
Q6. Write PySpark code to explode a column named `items` (array) and keep the rest of the columns unchanged.
Q7. How would you cast a column `price` to `FloatType` in a DataFrame?
Q8. Given a nested JSON file with structure:
```json
{
  "order": {
    "id": 123,
    "items": [
      {"product": "A", "qty": 2},
      {"product": "B", "qty": 1}
    ]
  }
}
```
Write code to flatten it into columns: `order_id`, `product`, `qty`.
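One possible answer, as a hedged sketch: assuming the file is read with `spark.read.json` (with `multiLine` enabled, since the JSON spans multiple lines) and Spark infers `order` as a struct, the nested fields can be flattened like this. The input path is a placeholder.

```python
from pyspark.sql import functions as F

raw = spark.read.option("multiLine", True).json("/path/to/orders.json")  # placeholder path

flat = (
    raw.select(
        F.col("order.id").alias("order_id"),
        F.explode("order.items").alias("item"),  # one row per array element
    )
    .select(
        "order_id",
        F.col("item.product").alias("product"),
        F.col("item.qty").alias("qty"),
    )
)
```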
🔸 Section C: SQL Design (1–2 line answers)
Q9. Write a query to fetch the top 2 recent transactions per user.
Q10. What's the use of `GROUPING SETS`? Give an example.
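For reference, a small `GROUPING SETS` example run through Spark SQL; the `sales` table and its columns (`region`, `product`, `amount`) are hypothetical:

```python
# GROUPING SETS: compute several group-by combinations in a single pass
spark.sql("""
    SELECT region, product, SUM(amount) AS total_amount
    FROM sales
    GROUP BY GROUPING SETS ((region), (product), (region, product))
""").show()
```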
✅ 2. Cheat Sheet – Last Minute Revision
🔹 PySpark Transformations
df.select("col1", "col2")
df.withColumn("new_col", df.old_col + 1)
df.filter(df.age > 30)
df.groupBy("dept").agg(F.avg("salary"))
df.join(df2, "id", "left")
df.orderBy("timestamp", ascending=False)
🔹 Performance Tips
- Broadcast join: `broadcast(df_small)` (see the sketch below)
- Repartition: `df.repartition("col")`
- Cache: `df.cache()` before reusing a DataFrame in multiple actions
- Coalesce: use before writing to reduce the number of output files
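A minimal broadcast-join sketch, assuming `df_large` and `df_small` are existing DataFrames that share a `customer_id` key:

```python
from pyspark.sql.functions import broadcast

# Ship the small DataFrame to every executor instead of shuffling both sides
joined = df_large.join(broadcast(df_small), on="customer_id", how="left")
```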
🔹 Delta Lake (Databricks)
```sql
MERGE INTO target USING source ON target.id = source.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

OPTIMIZE my_table ZORDER BY (customer_id);

VACUUM my_table RETAIN 168 HOURS;
```
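The same upsert can also be expressed with the Python Delta Lake API; a sketch, assuming the `delta-spark` package is available and that `updates_df` and the `target` table are yours:

```python
from delta.tables import DeltaTable

target = DeltaTable.forName(spark, "target")  # or DeltaTable.forPath(spark, "/path/to/target")

(
    target.alias("t")
    .merge(updates_df.alias("s"), "t.id = s.id")  # join condition for the upsert
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```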
🔹 Spark File Formats
Format | Compression | Schema Support | Best Use Case |
---|---|---|---|
Parquet | Yes | Yes | Analytics |
Delta | Yes | Yes + ACID | Slowly changing data |
Avro | Yes | Yes | Kafka pipelines |
✅ 3. Hands-on PySpark Notebook for Databricks
Here’s what the notebook will cover:
- ✅ Read JSON/CSV/Parquet from mounted storage
- ✅ Explode nested JSONs
- ✅ Broadcast Join example
- ✅ Delta Lake `MERGE`, `OPTIMIZE`, `VACUUM`
- ✅ Caching, Repartitioning
- ✅ Schema evolution & rescued data column
- ✅ Notebook Widgets to simulate ad-hoc parameter inputs
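As a reminder of the widget pattern, here is a minimal sketch using `dbutils.widgets` (available only inside Databricks notebooks; the widget name, default value, and table/column names are placeholders):

```python
# Databricks only: dbutils is injected into the notebook environment
dbutils.widgets.text("run_date", "2024-01-01", "Run date")  # create a text widget (placeholder default)
run_date = dbutils.widgets.get("run_date")                  # read the current widget value

# Use the parameter in a query (placeholder table and column)
df = spark.table("bronze.orders").filter(f"order_date = '{run_date}'")
```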