🎥 Curated Video Playlist (Azure Databricks + Spark)
🟢 Beginner Videos
- Databricks Introduction for Beginners – freeCodeCamp
- What is Delta Lake? | Databricks
- Apache Spark Fundamentals – Simplilearn
🟡 Intermediate
- Delta Lake Deep Dive – Databricks
- Building Medallion Architecture – Advancing Analytics
- Auto Loader Explained
🔴 Advanced + Use Cases
- Optimize Delta Tables – Databricks
- Databricks Workflow (Jobs) Tutorial
- End-to-End Pipeline Using Azure Databricks
Here's an expanded Mock Interview Q&A Sheet with 40 high-impact questions across core Databricks topics, tailored specifically for Data Engineer interviews in India.
📝 Azure Databricks Data Engineer – Mock Interview Q&A Sheet (Extended)
| # | Category | Question | Short Answer |
|---|---|---|---|
| 1 | Basics | What is Azure Databricks? | A unified analytics platform built on Apache Spark, optimized for Azure, enabling big data processing, ML, and data engineering pipelines. |
| 2 | Architecture | Explain Lakehouse architecture in Databricks. | Combines data warehouse reliability (schema, ACID) with data lake scalability, using Delta Lake as the storage layer. |
| 3 | Cluster | Types of clusters in Databricks? | Standard (all-purpose), High Concurrency, and Job clusters. |
| 4 | Cluster Config | Key configurations for a production Spark cluster? | Autoscaling, Photon enabled, spot instance policy, init scripts for dependencies. |
| 5 | Workspace | What are DBFS and Repos in Databricks? | DBFS is the Databricks File System; Repos provides Git-based source control for notebooks. |
| 6 | Spark | Difference between RDD, DataFrame, Dataset? | RDD is the low-level API; DataFrame is schema-aware and optimized by Catalyst; Dataset adds compile-time type safety (Scala/Java only). |
| 7 | Spark | What is lazy evaluation in Spark? | Transformations are not executed until an action (e.g., count, write) is triggered. |
| 8 | Delta | What is Delta Lake? | An open-source storage layer that adds ACID transactions, versioning, and schema enforcement on top of Parquet files. |
| 9 | Delta | How does Delta support time travel? | Via versioning, using `versionAsOf` or `timestampAsOf` (see the time-travel snippet below). |
| 10 | Delta | How does MERGE INTO work in Delta? | Allows UPSERT logic based on matching keys (see the MERGE snippet below). |
| 11 | Delta | How do you handle schema evolution? | Enable `.option("mergeSchema", "true")` when writing (see the snippet below). |
| 12 | Delta | What's the role of `_delta_log`? | Stores the transaction log (JSON commit files plus checkpoint Parquet files). |
| 13 | Performance | How do you optimize large Delta tables? | Run `OPTIMIZE` with `ZORDER BY`, manage file sizes, use caching (see the snippet below). |
| 14 | Performance | How do you reduce the small-file problem? | Repartition input data, run `OPTIMIZE` on Delta tables, tune ingestion batch sizes. |
| 15 | Performance | When to use broadcast joins? | When one side is small (default threshold ~10 MB); use the `broadcast(df)` hint (see the snippet below). |
| 16 | Streaming | Difference: Auto Loader vs Structured Streaming? | Auto Loader incrementally ingests files from cloud storage; Structured Streaming processes data in motion (Kafka, Event Hubs). Auto Loader itself runs on Structured Streaming (see the snippet below). |
| 17 | Streaming | What is a checkpoint directory in streaming? | Stores progress metadata so a stream can resume after failures (see the snippet below). |
| 18 | Streaming | How does a trigger work in Structured Streaming? | Controls execution frequency: `once`, `availableNow`, `processingTime` (see the snippet below). |
| 19 | SQL | Can you use SQL in Databricks? | Yes, via the `%sql` magic command or the Spark SQL APIs. |
| 20 | Orchestration | How do you automate pipelines in Databricks? | Using Workflows (Jobs) with Tasks, or external orchestrators such as ADF or Airflow. |
| 21 | Workflows | How do you handle task dependencies in Workflows? | Define tasks with `depends_on` relationships and retry logic. |
| 22 | GitOps | How does Git integration work in Databricks? | Repos supports GitHub, GitLab, and Azure DevOps for versioned notebooks. |
| 23 | DevOps | Best practices for CI/CD with Databricks? | Keep notebooks in Git, run automated tests, deploy via Workflows, store credentials in Secrets. |
| 24 | Integration | How do you connect Databricks to ADLS Gen2? | Mount the storage or use `abfss://` paths with SAS or service principal authentication (see the snippet below). |
| 25 | Integration | How do you read/write a SQL database from Databricks? | Use a JDBC connection via `.read.format("jdbc")` (see the snippet below). |
| 26 | Unity Catalog | What is Unity Catalog? | A governance layer for managing data access, metadata, and auditing. |
| 27 | Unity Catalog | Difference: Unity Catalog vs Hive metastore? | Unity Catalog is centralized, secure, and spans all workspaces; the Hive metastore is workspace-local. |
| 28 | Secrets | How do you manage secrets in Databricks? | With `dbutils.secrets.get()` against Secret Scopes (see the JDBC snippet below). |
| 29 | Cost Mgmt | How do you optimize Databricks cost? | Use job clusters, spot instances, autoscaling, and auto-termination of idle clusters. |
| 30 | Logs | Where do you check job failures? | In Workflows → Runs → task logs, plus the cluster's driver logs and Spark UI; raise verbosity with `spark.sparkContext.setLogLevel()` when debugging. |
| 31 | ML | How is ML handled in Databricks? | With MLflow for experiment tracking, the model registry, and deployment. |
| 32 | Libraries | How do you install external packages? | Use `%pip install` in a notebook or the cluster's Libraries tab. |
| 33 | UDF | When to use UDFs and when to avoid them? | Only when built-in functions aren't enough; Python UDFs bypass Catalyst optimizations and hurt performance. |
| 34 | Data Skew | How do you handle data skew in joins? | Salt the join keys, broadcast the small table, or repartition (see the salting snippet below). |
| 35 | File Types | Supported file formats in Databricks? | CSV, Parquet, Delta, Avro, JSON, ORC. |
| 36 | Security | How is data secured in Databricks? | Role-based access control, Unity Catalog permissions, encryption at rest and in transit. |
| 37 | Versioning | What is Delta log checkpointing? | Delta writes a Parquet checkpoint of the log every N commits (10 by default) so readers can reconstruct table state without replaying every JSON commit. |
| 38 | Catalog | Difference between catalog, schema, and table? | A catalog contains schemas (databases); schemas contain tables and views. |
| 39 | Use Case | Build a 3-tier pipeline for clickstream logs? | Ingest to Bronze via Auto Loader → clean in Silver → aggregate in Gold, all on Delta Lake (see the medallion sketch below). |
| 40 | Use Case | Daily 1 TB ingestion: how would you design it? | Auto Loader into Bronze, partitioned Delta tables, scheduled Workflows with `OPTIMIZE`/`ZORDER`, and CI/CD for maintenance. |
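🧩 Code Snippets Referenced in the Q&A

For Q9, a minimal time-travel sketch, assuming a hypothetical Delta table at /mnt/lake/events; `spark` is the SparkSession Databricks notebooks provide:

```python
# Read the table as of a past version number (path is illustrative).
df_v5 = (spark.read.format("delta")
         .option("versionAsOf", 5)
         .load("/mnt/lake/events"))

# Or read it as of a timestamp.
df_snapshot = (spark.read.format("delta")
               .option("timestampAsOf", "2024-01-01")
               .load("/mnt/lake/events"))
```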
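For Q10, one way to express the UPSERT, assuming hypothetical `target` and `updates` tables that share an `id` key:

```python
# Update matched rows, insert unmatched ones (table names are illustrative).
spark.sql("""
    MERGE INTO target AS t
    USING updates AS u
    ON t.id = u.id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```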
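For Q11, a sketch of appending a DataFrame whose schema has gained new columns, against the same hypothetical table path:

```python
# mergeSchema lets this write add the new columns to the table schema.
(new_df.write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .save("/mnt/lake/events"))
```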
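For Q13, compaction plus Z-ordering on frequently filtered columns (the table and column names are assumptions):

```python
# Compact small files and co-locate rows by the columns most used in filters.
spark.sql("OPTIMIZE events ZORDER BY (user_id, event_date)")
```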
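For Q15, the broadcast-join hint, assuming a large `facts` DataFrame and a small `dim_country` dimension:

```python
from pyspark.sql.functions import broadcast

# Ship the small dimension to every executor instead of shuffling the fact table.
joined = facts.join(broadcast(dim_country), "country_code")
```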
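For Q16, a minimal Auto Loader read; the ADLS landing path and schema location are made up:

```python
# cloudFiles = Auto Loader; it tracks which files have already been ingested.
raw = (spark.readStream.format("cloudFiles")
       .option("cloudFiles.format", "json")
       .option("cloudFiles.schemaLocation", "/mnt/lake/_schemas/clicks")
       .load("abfss://landing@myaccount.dfs.core.windows.net/clicks/"))
```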
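For Q17 and Q18 together, writing that stream with a checkpoint and an `availableNow` trigger (paths and table name are illustrative; `availableNow` requires a recent Spark/DBR version):

```python
# The checkpoint lets the stream resume where it left off after a failure;
# availableNow processes the current backlog and then stops.
(raw.writeStream.format("delta")
    .option("checkpointLocation", "/mnt/lake/_chk/clicks_bronze")
    .trigger(availableNow=True)
    .toTable("bronze_clicks"))
```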
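For Q24, the service-principal route sets the standard ABFS OAuth options; the storage account, scope, key, app ID, and tenant ID below are all placeholders:

```python
# Hypothetical storage account and service principal values.
account = "myaccount"
spark.conf.set(f"fs.azure.account.auth.type.{account}.dfs.core.windows.net", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{account}.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{account}.dfs.core.windows.net",
               "<app-id>")
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{account}.dfs.core.windows.net",
               dbutils.secrets.get(scope="prod-scope", key="sp-secret"))
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{account}.dfs.core.windows.net",
               "https://login.microsoftonline.com/<tenant-id>/oauth2/token")
```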
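For Q25 and Q28 together: pull credentials from a Secret Scope, then read over JDBC (server, database, scope, and key names are assumptions):

```python
# Fetch credentials from a secret scope rather than hardcoding them.
user = dbutils.secrets.get(scope="prod-scope", key="sql-user")
pwd = dbutils.secrets.get(scope="prod-scope", key="sql-password")

orders = (spark.read.format("jdbc")
          .option("url", "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydb")
          .option("dbtable", "dbo.orders")
          .option("user", user)
          .option("password", pwd)
          .load())
```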
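For Q34, one salting sketch: spread hot keys on the skewed side across N buckets and replicate the other side to match. DataFrame names, the join key, and the bucket count are illustrative:

```python
from pyspark.sql import functions as F

SALT_BUCKETS = 8

# Skewed side: assign each row a random salt bucket.
left = skewed_df.withColumn("salt", (F.rand() * SALT_BUCKETS).cast("int"))

# Other side: replicate every row once per bucket so all salted keys find a match.
salts = spark.range(SALT_BUCKETS).select(F.col("id").cast("int").alias("salt"))
right = other_df.crossJoin(salts)

# Joining on (key, salt) splits each hot key across SALT_BUCKETS partitions.
joined = left.join(right, ["join_key", "salt"])
```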
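For Q39, a compressed Bronze → Silver → Gold flow on Delta, reusing the Auto Loader stream above; table and column names are made up, and a production version would add a watermark to bound the deduplication state:

```python
from pyspark.sql import functions as F

# Silver: deduplicate and clean the raw Bronze events.
silver = (spark.readStream.table("bronze_clicks")
          .dropDuplicates(["event_id"])
          .filter(F.col("url").isNotNull()))
(silver.writeStream.format("delta")
    .option("checkpointLocation", "/mnt/lake/_chk/clicks_silver")
    .toTable("silver_clicks"))

# Gold: batch aggregate for reporting.
spark.sql("""
    CREATE OR REPLACE TABLE gold_daily_clicks AS
    SELECT date(event_ts) AS day, url, count(*) AS clicks
    FROM silver_clicks
    GROUP BY date(event_ts), url
""")
```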