This role focuses on advanced PySpark, Big Data systems, and backend engineering. Here is a breakdown of the questions you can expect, based on industry trends.


Topic-wise Breakdown of Likely Questions


🔹 PySpark & Big Data (Core Focus)

| Area | Sample Questions |
| --- | --- |
| PySpark DataFrame APIs | How is selectExpr different from select? – Use withColumn, explode, filter in one chain. – Convert nested JSON to a flat table. – Difference between collect(), show(), toPandas()? (see the code sketch after this table) |
| Performance Optimization | When to use caching, checkpointing? – What is a broadcast join? When is it risky? – How to reduce shuffle? – What is the impact of increasing partitions? |
| Partitioning | How to write partitioned Parquet in PySpark? – Static vs dynamic partitioning use case? – Partition pruning behavior? |
| File Formats | Compare Avro, ORC, and Parquet. – Why use Delta Lake? – How to handle corrupt JSON files? |
| PySpark vs SQL | When to prefer the PySpark DataFrame API over Spark SQL? – Write both SQL and DataFrame logic to solve the same task. |
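
A minimal sketch of a few of these patterns, assuming a hypothetical DataFrame df with an array-of-struct column items (each struct holding product and qty) and an event_date column; the output path is illustrative:

from pyspark.sql import functions as F

# withColumn + explode + filter in one chain: one output row per array element
flat_df = (
    df.withColumn("item", F.explode("items"))
      .withColumn("qty", F.col("item.qty"))
      .filter(F.col("qty") > 0)
)

# selectExpr takes SQL expressions; select takes column names / Column objects
flat_df.selectExpr("item.product AS product", "qty * 2 AS double_qty")
flat_df.select(F.col("item.product").alias("product"), (F.col("qty") * 2).alias("double_qty"))

# partitioned Parquet write; filters on event_date can be pruned at read time
flat_df.write.mode("overwrite").partitionBy("event_date").parquet("/mnt/data/flat_orders")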

🔹 Delta Lake & Azure Databricks

| Area | Sample Questions |
| --- | --- |
| Delta Lake | What is the use of Delta Lake? – Explain MERGE, OPTIMIZE, ZORDER. – Difference between a Delta table and managed Parquet? |
| Azure Databricks | What is Auto Loader and how does it work? – How do you implement schema evolution in Databricks? – Explain Unity Catalog and data lineage. (see the Auto Loader sketch after this table) |
| DevOps | How do you deploy a PySpark job in Azure Databricks? – Describe the role of notebooks, jobs, and workflows. |
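
As a quick illustration of Auto Loader with schema evolution, here is a minimal sketch assuming a Databricks runtime; the storage paths and table name are hypothetical:

# Auto Loader: incremental file discovery via the cloudFiles source
raw = (
    spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", "/mnt/schemas/orders")   # tracks the inferred schema
        .load("/mnt/landing/orders")
)

# Write to a Delta table; mergeSchema lets new columns evolve the target schema
(raw.writeStream
    .option("checkpointLocation", "/mnt/checkpoints/orders")
    .option("mergeSchema", "true")
    .trigger(availableNow=True)
    .toTable("bronze.orders"))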

🔹 Backend Systems & REST APIs

| Area | Sample Questions |
| --- | --- |
| API Integration | Have you exposed PySpark jobs via a REST API? – How to trigger ETL via an API? – Use of Flask/FastAPI for job orchestration? (see the sketch after this table) |
| Data Pipelines | Explain a typical backend ETL you’ve designed. – How do you ensure fault-tolerant pipelines? – What are retry and recovery strategies? |
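
A minimal sketch of triggering an ETL run over REST, assuming FastAPI in front of the Databricks Jobs API; the endpoint path, environment variable names, and job id are hypothetical:

import os
import requests
from fastapi import FastAPI

app = FastAPI()
HOST = os.environ["DATABRICKS_HOST"]    # e.g. workspace URL
TOKEN = os.environ["DATABRICKS_TOKEN"]

@app.post("/etl/run/{job_id}")
def trigger_etl(job_id: int):
    # Databricks Jobs API: start an existing job by id
    resp = requests.post(
        f"{HOST}/api/2.1/jobs/run-now",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={"job_id": job_id},
    )
    resp.raise_for_status()
    return resp.json()   # contains run_id, which can be polled for status and retries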

🔹 SQL & Data Modeling

| Area | Sample Questions |
| --- | --- |
| SQL Skills | Window function to find top 2 orders per customer. – Use GROUPING SETS, ROLLUP, CUBE in an example. – Join 3+ tables with edge cases (nulls, outer joins). (see the join sketch after this table) |
| Data Modeling | Star vs Snowflake schema? – Normalize a denormalized product-order table. |
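
For the multi-table join with null handling, a possible sketch using spark.sql over hypothetical customers, orders, and regions tables:

# outer joins plus COALESCE so customers with no region or no orders still appear
result = spark.sql("""
    SELECT c.customer_id,
           COALESCE(r.region_name, 'UNKNOWN') AS region_name,
           SUM(o.amount)                      AS total_amount
    FROM customers c
    LEFT JOIN orders  o ON o.customer_id = c.customer_id
    LEFT JOIN regions r ON r.region_id   = c.region_id
    GROUP BY c.customer_id, COALESCE(r.region_name, 'UNKNOWN')
""")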

🔹 Behavioral / Citi-Specific Questions

| Type | Sample Questions |
| --- | --- |
| Behavioral | Tell me about a time you improved a system’s performance. – Describe a failure in a pipeline you built and how you resolved it. – How do you balance innovation with risk in data engineering? |
| Team Fit | Experience working in a regulated environment (like banking)? – How do you manage code review and version control in PySpark projects? |

📌 Recommended Preparation Tips

  1. Revise your recent PySpark/Databricks projects – focus on challenges, performance tuning, and business impact.
  2. Practice coding in PySpark (not just SQL) – especially complex transformations, joins, and UDFs.
  3. Go through Delta Lake commands and syntax (MERGE, UPDATE, OPTIMIZE, VACUUM, etc.)
  4. Rehearse behavioral answers using STAR format (Situation, Task, Action, Result).
  5. If there’s a timed assessment, expect a mix of:
    • PySpark + SQL code problems
    • MCQs on Spark internals, joins, and performance
    • Logical or backend system design questions

Here is a complete prep package for the Citi Sr. Back-End Developer (PySpark Big Data) assessment:


✅ 1. Mock Test / Quiz (40 mins)

🔸 Section A: MCQs (5 Questions)

Choose the correct option. (1 mark each)

Q1. What’s the default join type in PySpark DataFrame API?
A. Left Outer Join
B. Inner Join
C. Full Outer Join
D. Cross Join

Q2. Which of the following increases shuffle?
A. Caching
B. Filtering
C. GroupBy
D. Repartition with fewer partitions

Q3. What does .coalesce(1) do?
A. Increases partitions
B. Forces one partition
C. Adds cache
D. Converts RDD to DataFrame

Q4. Which file format supports schema evolution in Delta Lake?
A. CSV
B. ORC
C. Delta
D. Avro

Q5. Which of the following is NOT a benefit of broadcast joins?
A. Reduces data shuffle
B. Suitable for small tables
C. Always memory-efficient
D. Faster joins with big + small tables


🔸 Section B: PySpark Code Challenge (Short Answers)

Q6. Write PySpark code to explode a column named items (array) and keep the rest of the columns unchanged.
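
One possible answer, assuming df is the input DataFrame:

from pyspark.sql import functions as F

# explode() emits one row per array element; all other columns are carried along unchanged
exploded_df = df.withColumn("items", F.explode("items"))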

Q7. How would you cast column price to FloatType in a DataFrame?
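
A possible answer:

from pyspark.sql.functions import col
from pyspark.sql.types import FloatType

df = df.withColumn("price", col("price").cast(FloatType()))   # or equivalently .cast("float")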

Q8. Given a nested JSON file with structure:

{
  "order": {
    "id": 123,
    "items": [
      {"product": "A", "qty": 2},
      {"product": "B", "qty": 1}
    ]
  }
}

Write code to flatten it into columns: order_id, product, qty.
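
A possible solution; the file path is illustrative, and multiLine is needed because the JSON record spans multiple lines:

from pyspark.sql import functions as F

df = spark.read.option("multiLine", True).json("/path/to/orders.json")

flat_df = (
    df.select(F.col("order.id").alias("order_id"),
              F.explode("order.items").alias("item"))
      .select("order_id",
              F.col("item.product").alias("product"),
              F.col("item.qty").alias("qty"))
)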


🔸 Section C: SQL Design (1–2 line answers)

Q9. Write a query to fetch the top 2 recent transactions per user.
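
A possible answer via spark.sql, assuming a hypothetical transactions(user_id, txn_id, txn_ts, amount) table:

top2 = spark.sql("""
    SELECT user_id, txn_id, txn_ts, amount
    FROM (
        SELECT t.*,
               ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY txn_ts DESC) AS rn
        FROM transactions t
    )
    WHERE rn <= 2
""")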

Q10. What’s the use of GROUPING SETS? Give an example.
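
GROUPING SETS computes several GROUP BY combinations in a single pass. A sketch over a hypothetical sales(region, product, amount) table:

summary = spark.sql("""
    SELECT region, product, SUM(amount) AS total_amount
    FROM sales
    GROUP BY region, product GROUPING SETS ((region, product), (region), ())
""")

Rows where product is NULL are region-level subtotals, and the empty set () yields the grand total.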


✅ 2. Cheat Sheet – Last Minute Revision

🔹 PySpark Transformations

df.select("col1", "col2")
df.withColumn("new_col", df.old_col + 1)
df.filter(df.age > 30)
df.groupBy("dept").agg(F.avg("salary"))
df.join(df2, "id", "left")
df.orderBy("timestamp", ascending=False)

🔹 Performance Tips

  • Broadcast join: broadcast(df_small), imported from pyspark.sql.functions
  • Repartition: df.repartition("col")
  • Cache: df.cache() before reusing a DataFrame in multiple actions
  • Coalesce: use before writing to reduce the number of output files (a combined sketch follows below)
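
Putting a few of these together, a minimal sketch assuming hypothetical fact_df and dim_df DataFrames and an illustrative output path:

from pyspark.sql.functions import broadcast

joined = fact_df.join(broadcast(dim_df), "id", "left")   # small dim_df is shipped to executors, so the large side avoids a shuffle
joined.cache()                                            # materialized on the first action, reused afterwards
joined.count()
joined.coalesce(8).write.mode("overwrite").parquet("/mnt/out/joined")   # fewer, larger output files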

🔹 Delta Lake (Databricks)

MERGE INTO target USING source ON target.id = source.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *

OPTIMIZE my_table ZORDER BY (customer_id)
VACUUM my_table RETAIN 168 HOURS
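
The same operations through the Delta Lake Python API, as a sketch assuming delta-spark is available and source_df is a hypothetical source DataFrame:

from delta.tables import DeltaTable

target = DeltaTable.forName(spark, "target")
(target.alias("t")
       .merge(source_df.alias("s"), "t.id = s.id")
       .whenMatchedUpdateAll()
       .whenNotMatchedInsertAll()
       .execute())

target.optimize().executeZOrderBy("customer_id")   # Delta Lake 2.0+ / Databricks
target.vacuum(168)                                  # retention in hours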

🔹 Spark File Formats

| Format | Compression | Schema Support | Best Use Case |
| --- | --- | --- | --- |
| Parquet | Yes | Yes | Analytics |
| Delta | Yes | Yes + ACID | Slowly changing data |
| Avro | Yes | Yes | Kafka pipelines |

✅ 3. Hands-on PySpark Notebook for Databricks

Here’s what the notebook will cover:

  • ✅ Read JSON/CSV/Parquet from mounted storage
  • ✅ Explode nested JSONs
  • ✅ Broadcast Join example
  • ✅ Delta Lake MERGE, OPTIMIZE, VACUUM
  • ✅ Caching, Repartitioning
  • ✅ Schema evolution & rescued data column
  • ✅ Notebook widgets to simulate ad-hoc parameter inputs (see the sketch below)
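
A minimal widgets sketch with illustrative names, defaults, and paths:

# Databricks notebook widgets for ad-hoc parameters
dbutils.widgets.text("run_date", "2024-01-01")
dbutils.widgets.dropdown("env", "dev", ["dev", "qa", "prod"])

run_date = dbutils.widgets.get("run_date")
env = dbutils.widgets.get("env")

df = spark.read.parquet(f"/mnt/{env}/orders").filter(f"order_date = '{run_date}'")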
