Azure Databricks tutorial roadmap (Beginner β†’ Advanced), tailored for Data Engineering interviews in India



πŸŽ₯ Curated Video Playlist (Azure Databricks + Spark)

🟒 Beginner Videos

🟑 Intermediate

πŸ”΄ Advanced + Use Cases


Here’s an expanded Mock Interview Q&A Sheet with 40 high-impact questions across core Databricks topics, tailored specifically for Data Engineer interviews in India. Rows marked β€œsketch below” point to short code sketches after the table.


πŸ“„ Azure Databricks Data Engineer – Mock Interview Q&A Sheet (Extended)

| # | Category | Question | Short Answer |
|---|---|---|---|
| 1 | Basics | What is Azure Databricks? | A unified analytics platform built on Apache Spark, optimized for Azure, enabling big data processing, ML, and data engineering pipelines. |
| 2 | Architecture | Explain the Lakehouse architecture in Databricks. | Combines data warehouse reliability (schema, ACID) with data lake scalability using Delta Lake. |
| 3 | Cluster | Types of clusters in Databricks? | Standard (all-purpose), High Concurrency, and Job clusters. |
| 4 | Cluster Config | Key configurations for a production Spark cluster? | Autoscaling, Photon enabled, spot-instance policy, init scripts for dependencies. |
| 5 | Workspace | What are DBFS and Repos in Databricks? | DBFS is the Databricks File System; Repos provides Git-based source control. |
| 6 | Spark | Difference between RDD, DataFrame, and Dataset? | RDD is the low-level API; DataFrame is schema-aware and optimized by Catalyst; Dataset adds compile-time type safety (Scala/Java only). |
| 7 | Spark | What is lazy evaluation in Spark? | Transformations are not executed until an action is triggered. |
| 8 | Delta | What is Delta Lake? | An open-source storage layer that adds ACID transactions, versioning, and schema enforcement on top of Parquet files. |
| 9 | Delta | How does Delta support time travel? | Via versioning with `versionAsOf` or `timestampAsOf` (Q9 sketch below). |
| 10 | Delta | How does MERGE INTO work in Delta? | Enables UPSERT logic based on matching keys (Q10 sketch below). |
| 11 | Delta | How to handle schema evolution? | Enable the `mergeSchema` option when writing (Q11 sketch below). |
| 12 | Delta | What’s the role of `_delta_log`? | Stores the transaction log (JSON commits plus checkpoint Parquet files). |
| 13 | Performance | How do you optimize large Delta tables? | Use `OPTIMIZE` with `ZORDER BY`, manage file sizes, use caching (Q13 sketch below). |
| 14 | Performance | How to reduce the small-file problem? | Repartition input data, run `OPTIMIZE` in Delta, tune ingestion logic. |
| 15 | Performance | When to use broadcast joins? | When one side is small (default threshold 10 MB); use the `broadcast(df)` hint (Q15 sketch below). |
| 16 | Streaming | Difference: Auto Loader vs Structured Streaming? | Auto Loader incrementally ingests files from cloud storage; Structured Streaming processes data in motion (Kafka, Event Hubs). Q16-18 share one sketch below. |
| 17 | Streaming | What is a checkpoint directory in streaming? | Stores offsets and state so a stream can resume after failures. |
| 18 | Streaming | How does trigger work in Structured Streaming? | Controls execution frequency: `once`, `availableNow`, `processingTime`. |
| 19 | SQL | Can you use SQL in Databricks? | Yes, via the `%sql` magic or Spark SQL APIs. |
| 20 | Orchestration | How to automate pipelines in Databricks? | Using Workflows (Jobs), Tasks, or external orchestrators (ADF, Airflow). |
| 21 | Workflows | How do you handle task dependencies in Workflows? | Define tasks with depends-on relationships and retry logic. |
| 22 | GitOps | How does Git integration work in Databricks? | Repos supports GitHub, GitLab, and Azure DevOps for versioned notebooks. |
| 23 | DevOps | Best practices for CI/CD with Databricks? | Keep notebooks in Git, run automated tests, deploy via Workflows, store credentials in Secrets. |
| 24 | Integration | How to connect Databricks with ADLS Gen2? | Mount storage or use ABFSS paths with SAS or service-principal authentication (Q24 sketch below, shared with Q28). |
| 25 | Integration | Read/write data from a SQL DB in Databricks? | Use a JDBC connection with `.read.format("jdbc")` (Q25 sketch below). |
| 26 | Unity Catalog | What is Unity Catalog? | A governance layer for managing data access, metadata, and auditing. |
| 27 | Unity Catalog | Difference: Unity Catalog vs Hive metastore? | Unity Catalog is centralized, secure, and shared across workspaces; the Hive metastore is workspace-local. |
| 28 | Secrets | How do you manage secrets in Databricks? | Using `dbutils.secrets.get()` with secret scopes (see the Q24 sketch below). |
| 29 | Cost Mgmt | How to optimize Databricks cost? | Use job clusters, spot instances, autoscaling, and auto-termination to limit cluster lifetime. |
| 30 | Logs | Where do you check job failures? | In Workflows β†’ Runs β†’ task logs, plus the Spark UI and driver logs; `spark.sparkContext.setLogLevel()` only adjusts log verbosity while debugging. |
| 31 | ML | How is ML handled in Databricks? | With MLflow for experiment tracking, model registry, and deployment. |
| 32 | Libraries | How to install external packages? | Use `%pip install` in a notebook or the cluster’s Libraries tab. |
| 33 | UDF | When to use UDFs and when to avoid them? | Use only when built-in functions aren’t enough; UDFs bypass Catalyst optimization and hurt performance. |
| 34 | Data Skew | How to handle data skew in joins? | Salt the keys, broadcast the small table, or repartition (Q34 sketch below). |
| 35 | File Types | Supported file formats in Databricks? | CSV, Parquet, Delta, Avro, JSON, ORC. |
| 36 | Security | How is data secured in Databricks? | Role-based access control, Unity Catalog, and encryption at rest and in transit. |
| 37 | Versioning | What is Delta log checkpointing? | Parquet checkpoint files written every N commits (10 by default) so readers don’t have to replay every JSON commit. |
| 38 | Catalog | Difference between catalog, schema, and table? | A catalog contains schemas (databases); schemas contain tables and views. |
| 39 | Use Case | Build a 3-tier pipeline for clickstream logs? | Ingest via Auto Loader β†’ clean into Silver β†’ aggregate into Gold using Delta Lake (Q39 sketch below). |
| 40 | Use Case | Daily 1 TB ingestion: how would you design it? | Auto Loader into Bronze, partitioned Delta tables, optimized Workflows, `ZORDER`, and CI/CD for maintenance. |
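
πŸ§ͺ Code Sketches for Selected Questions

The snippets below are minimal PySpark sketches, not production code. They assume a Databricks notebook, where `spark` and `dbutils` are predefined; every table name, path, storage account, and secret scope in them is a hypothetical placeholder.

Q9: Delta time travel. A sketch of reading older snapshots of a Delta table by version and by timestamp, using the `versionAsOf`/`timestampAsOf` options from the answer; the path is assumed.

```python
# Hypothetical Delta table path
events_path = "/mnt/bronze/events"

# Read a past snapshot by commit version
df_v5 = (
    spark.read.format("delta")
    .option("versionAsOf", 5)
    .load(events_path)
)

# ...or by timestamp
df_jan1 = (
    spark.read.format("delta")
    .option("timestampAsOf", "2024-01-01 00:00:00")
    .load(events_path)
)

# SQL equivalent
spark.sql("SELECT * FROM delta.`/mnt/bronze/events` VERSION AS OF 5")
```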
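Q10: MERGE INTO. A sketch of an UPSERT into an assumed existing Delta table `silver.customers`, keyed on a hypothetical `customer_id` column.

```python
# Hypothetical incoming changes (in practice these come from Bronze)
incoming = spark.createDataFrame(
    [(1, "a@example.com"), (2, "b@example.com")],
    "customer_id INT, email STRING",
)
incoming.createOrReplaceTempView("customer_updates")

# Update matching rows, insert the rest
spark.sql("""
    MERGE INTO silver.customers AS t
    USING customer_updates AS u
    ON t.customer_id = u.customer_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```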
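Q11: Schema evolution on write. A sketch assuming a DataFrame `new_df` that carries an extra column the target table does not have yet; `mergeSchema` evolves the table schema instead of failing the write.

```python
# new_df and the target path are hypothetical
(
    new_df.write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")  # allow new columns to be added
    .save("/mnt/silver/customers")
)
```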
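Q13: OPTIMIZE with ZORDER. A sketch against a hypothetical `silver.events` table; Z-ordering by the columns most often used in filters improves data skipping.

```python
# Compact small files and co-locate rows by common filter columns
spark.sql("OPTIMIZE silver.events ZORDER BY (event_date, user_id)")

# Clean up files no longer referenced (default retention is 7 days)
spark.sql("VACUUM silver.events")
```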
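Q15: Broadcast join. A self-contained sketch with toy DataFrames; the hint ships the small side to every executor so the join avoids a shuffle.

```python
from pyspark.sql.functions import broadcast

# Toy data: a large fact table and a tiny dimension table
facts = spark.range(1_000_000).withColumnRenamed("id", "dim_id")
dims = spark.createDataFrame(
    [(0, "zero"), (1, "one")],
    "dim_id LONG, label STRING",
)

# Force a broadcast of the small side (auto-broadcast kicks in under ~10 MB)
result = facts.join(broadcast(dims), "dim_id")
```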
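Q16-18: Auto Loader with checkpoint and trigger. One sketch covering all three streaming questions; the landing path, schema location, checkpoint path, and target table are hypothetical.

```python
# Q16: Auto Loader incrementally discovers new files in cloud storage
raw = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/mnt/bronze/_schemas/clicks")
    .load("abfss://landing@myaccount.dfs.core.windows.net/clicks/")
)

(
    raw.writeStream
    # Q17: the checkpoint stores offsets/state so the stream resumes after failure
    .option("checkpointLocation", "/mnt/bronze/_checkpoints/clicks")
    # Q18: process the current backlog, then stop (good for scheduled jobs)
    .trigger(availableNow=True)
    .toTable("bronze.clicks")
)
```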
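Q24 + Q28: ADLS Gen2 via a service principal, with credentials from a secret scope. A sketch of the service-principal path mentioned in the answer; the storage account, secret scope, and key names are placeholders.

```python
# Q28: never hardcode credentials; pull them from a secret scope
client_id = dbutils.secrets.get(scope="kv-scope", key="sp-client-id")
client_secret = dbutils.secrets.get(scope="kv-scope", key="sp-client-secret")
tenant_id = dbutils.secrets.get(scope="kv-scope", key="tenant-id")

# Q24: OAuth config for direct ABFSS access (no mount needed)
acct = "myaccount.dfs.core.windows.net"
spark.conf.set(f"fs.azure.account.auth.type.{acct}", "OAuth")
spark.conf.set(
    f"fs.azure.account.oauth.provider.type.{acct}",
    "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
)
spark.conf.set(f"fs.azure.account.oauth2.client.id.{acct}", client_id)
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{acct}", client_secret)
spark.conf.set(
    f"fs.azure.account.oauth2.client.endpoint.{acct}",
    f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
)

df = spark.read.format("delta").load(f"abfss://data@{acct}/gold/sales")
```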
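Q25: JDBC read from Azure SQL. A sketch with a hypothetical server, database, and table; credentials again come from a secret scope rather than the notebook.

```python
jdbc_url = (
    "jdbc:sqlserver://myserver.database.windows.net:1433;"
    "database=salesdb;encrypt=true"
)

orders = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "dbo.orders")
    .option("user", dbutils.secrets.get("kv-scope", "sql-user"))
    .option("password", dbutils.secrets.get("kv-scope", "sql-password"))
    .load()
)
```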
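Q34: Key salting for skewed joins. A sketch reusing the toy `facts`/`dims` DataFrames from the Q15 sketch; the bucket count is a tuning assumption.

```python
from pyspark.sql import functions as F

SALT = 8  # number of salt buckets; tune to the observed skew

# Spread each hot key on the large side across SALT pseudo-keys
facts_salted = facts.withColumn("salt", (F.rand() * SALT).cast("int"))

# Replicate every small-side row once per salt value so matches still occur
salts = spark.range(SALT).select(F.col("id").cast("int").alias("salt"))
dims_salted = dims.crossJoin(salts)

joined = facts_salted.join(dims_salted, ["dim_id", "salt"]).drop("salt")
```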
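Q39: Bronze β†’ Silver β†’ Gold. A sketch continuing from the Q16-18 Bronze stream; the table names, checkpoint path, and the `user_id`/`event_ts` columns are assumptions. With an `availableNow` trigger, the Silver step can run to completion before the Gold aggregate is rebuilt.

```python
# Silver: cleanse the Bronze clickstream incrementally
query = (
    spark.readStream.table("bronze.clicks")
    .where("user_id IS NOT NULL")  # basic data-quality rule
    .writeStream
    .option("checkpointLocation", "/mnt/silver/_checkpoints/clicks")
    .trigger(availableNow=True)
    .toTable("silver.clicks")
)
query.awaitTermination()  # let the incremental batch finish before Gold

# Gold: business-level daily aggregate for reporting
spark.sql("""
    CREATE OR REPLACE TABLE gold.daily_clicks AS
    SELECT user_id,
           date(event_ts) AS event_date,
           count(*)       AS clicks
    FROM silver.clicks
    GROUP BY user_id, date(event_ts)
""")
```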
