Here's a crisp explanation of core technical terms in Azure Databricks, tailored for interviews and hands-on clarity:
Databricks Key Technical Terms
Workspace
UI and environment where users organize notebooks, jobs, data, repos, and libraries.
- Like a folder system for all Databricks assets.
- Shared across users for collaboration.
- Can link to GitHub/DevOps via Repos tab.
Cluster
A Spark compute environment managed by Databricks.
- Has one Driver and multiple Workers
- Types:
  - Interactive (all-purpose): for development and notebooks
  - Job: for scheduled/automated pipelines
- You can configure autoscaling, libraries, runtime version, spot vs on-demand instances.
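These knobs map directly onto fields in a cluster spec used with the Databricks Clusters/Jobs REST API. A minimal sketch, assuming Azure and placeholder VM/runtime values:

# Hypothetical cluster spec illustrating the configuration options above (values are placeholders)
new_cluster = {
    "spark_version": "13.3.x-scala2.12",            # Databricks Runtime version
    "node_type_id": "Standard_DS3_v2",              # Azure VM size for driver/workers
    "autoscale": {"min_workers": 2, "max_workers": 8},
    "azure_attributes": {
        "availability": "SPOT_WITH_FALLBACK_AZURE"  # spot VMs with on-demand fallback
    },
}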
DBFS (Databricks File System)
Databricks-managed distributed storage layer on top of your cloud (e.g., ADLS/Blob in Azure).
- Paths: Spark APIs use dbfs:/... (DBFS is the default filesystem); local file APIs use the FUSE mount /dbfs/...
- Upload files via the UI or code.
- Use for temporary or staging data.
- Example:
spark.read.csv("dbfs:/tmp/data.csv")
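For context, a minimal sketch of the two path styles side by side (it assumes /tmp/data.csv already exists in DBFS):

# Spark APIs address DBFS with the dbfs:/ scheme (or a plain absolute path, since DBFS is the default filesystem)
spark_df = spark.read.option("header", True).csv("dbfs:/tmp/data.csv")

# Local file APIs such as pandas go through the /dbfs FUSE mount instead
import pandas as pd
pandas_df = pd.read_csv("/dbfs/tmp/data.csv")

# dbutils.fs also uses the dbfs:/ style
display(dbutils.fs.ls("dbfs:/tmp/"))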
Notebooks
Interactive code documents in Databricks.
- Support Python, SQL, Scala, R, Markdown
- Execute code per-cell
- Track versions, output, and visualizations
- Can be parameterized for jobs and workflows
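Parameterization is typically done with widgets. A minimal sketch (the widget name run_date is just an illustration):

# Define a text widget with a default value; a job run can override it via base_parameters
dbutils.widgets.text("run_date", "2024-01-01")

# Read the current value inside the notebook
run_date = dbutils.widgets.get("run_date")
print(f"Processing data for {run_date}")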
Databricks Runtime
Pre-configured Spark environment with optimized libraries and performance tuning.
- Comes in several flavors:
  - Standard
  - ML (with scikit-learn, TensorFlow, etc.)
  - Photon-enabled (vectorized engine for faster SQL/DataFrame workloads)
- Delta Lake support is built into all current runtimes
- Versioned releases (e.g., 13.3 LTS, 14.0 ML)
Libraries
External packages you can install on a cluster.
- Types:
  - Maven: for Scala/Java
  - PyPI: Python packages like pandas, nltk
  - Jar/Egg/Wheel: custom uploads
- Can be cluster-scoped or notebook-scoped
- Installed via the UI or with %pip install inside a notebook
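For example, a notebook-scoped install (a minimal sketch; the package and version are arbitrary):

%pip install nltk==3.8.1

# The package is now available in this notebook's Python environment
import nltk
print(nltk.__version__)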
Driver vs Worker
Component | Role | Characteristics |
---|---|---|
Driver | Master node | Manages SparkContext, coordinates workers, returns results |
Worker | Executor node | Executes tasks, stores shuffle data, processes partitions |
The Driver runs on one machine, and Workers scale based on cluster size.
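A quick sketch of this split: transformations are executed as tasks on the workers, while actions such as collect() bring results back to the driver:

# The range/filter work is split into partition-level tasks that run on the workers
df = spark.range(0, 1_000_000, numPartitions=8).filter("id % 2 = 0")
print(df.rdd.getNumPartitions())  # number of partitions processed in parallel

# collect() is an action: results come back to the driver, so keep them small
print(df.limit(5).collect())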
Jobs
A scheduled or triggered pipeline that runs a notebook, JAR, Python file, or Delta Live Table.
- Can have parameters, dependencies, retry logic, alerts
- Supports multi-task workflows
- Runs on job clusters or existing clusters
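As a sketch, a multi-task job definition looks roughly like this (field names follow the Databricks Jobs API 2.1; the notebook paths and parameter values are placeholders):

job_spec = {
    "name": "daily-sales-pipeline",
    "tasks": [
        {
            "task_key": "ingest",
            "notebook_task": {
                "notebook_path": "/Repos/project/ingest",
                "base_parameters": {"run_date": "2024-01-01"},
            },
            "max_retries": 2,  # per-task retry logic
        },
        {
            "task_key": "transform",
            "depends_on": [{"task_key": "ingest"}],  # task dependency
            "notebook_task": {"notebook_path": "/Repos/project/transform"},
        },
    ],
}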
Pools
Instance pools reduce cluster startup time and cost by pre-warming VMs.
- Create once → reuse across multiple jobs
- Ideal for high-frequency jobs
- Saves cost on job cluster creation
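A sketch of how a pool is defined and then referenced from a cluster spec (Instance Pools / Clusters API field names; IDs and sizes are placeholders):

pool_spec = {
    "instance_pool_name": "etl-pool",
    "node_type_id": "Standard_DS3_v2",
    "min_idle_instances": 2,  # pre-warmed VMs kept ready
    "idle_instance_autotermination_minutes": 30,
}

# A cluster (or job cluster) then points at the pool instead of a node type
cluster_spec = {
    "spark_version": "13.3.x-scala2.12",
    "instance_pool_id": "<pool-id-returned-on-creation>",
    "autoscale": {"min_workers": 2, "max_workers": 8},
}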
Summary Table
Term | Key Idea |
---|---|
Workspace | Dev UI and project folder system |
Cluster | Spark runtime compute environment |
DBFS | Built-in cloud-backed file system |
Notebook | Code + output document |
Runtime | Pre-installed Spark + tools |
Library | Packages added to cluster/notebook |
Driver | Orchestrates job |
Worker | Executes Spark tasks |
Job | Automated pipeline/task |
Pool | Reusable instance group for fast spin-up |
# Databricks Notebook: Getting Started with PySpark
# COMMAND ----------
# 1. Read CSV & Parquet from DBFS
# DBFS Path (upload a file via sidebar > Data > Add Data)
# Spark APIs use the dbfs:/ scheme; /dbfs/... is the local FUSE path for non-Spark file access
csv_path = "dbfs:/tmp/sample.csv"
parquet_path = "dbfs:/tmp/sample.parquet"
# Read CSV
csv_df = spark.read.option("header", True).csv(csv_path)
csv_df.show()
# Write as Parquet
csv_df.write.mode("overwrite").parquet(parquet_path)
# COMMAND ----------
# 2. Connect to Azure Data Lake (ADLS Gen2) securely
# Step 1: Use Azure Key Vault-backed secret scope for credentials
# You must create secret scope via UI or CLI, example:
# dbutils.secrets.get(scope="kv-scope", key="adls-key")
storage_account = "your_storage_account_name"
container = "your_container"
mount_point = "/mnt/adls_mount"
configs = {
    "fs.azure.account.key.%s.dfs.core.windows.net" % storage_account: dbutils.secrets.get(scope="kv-scope", key="adls-key")
}
# Mount ADLS Gen2 (run once per cluster)
dbx_path = "abfss://%s@%s.dfs.core.windows.net/" % (container, storage_account)
try:
    dbutils.fs.mount(source=dbx_path, mount_point=mount_point, extra_configs=configs)
except Exception:
    print("Already mounted")
# List files
display(dbutils.fs.ls(mount_point))
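# COMMAND ----------
# (Sketch) Alternative: direct access without mounting. Setting the account key in the
# Spark conf lets you read abfss:// paths directly; the file name below is a placeholder.
spark.conf.set(
    "fs.azure.account.key.%s.dfs.core.windows.net" % storage_account,
    dbutils.secrets.get(scope="kv-scope", key="adls-key")
)
direct_df = spark.read.option("header", True).csv(dbx_path + "raw/sample.csv")
direct_df.show()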
# COMMAND ----------
# 3. Simple PySpark Transformations
# Add column, filter, groupBy
df = csv_df.withColumnRenamed("amount", "sales_amount")
df_filtered = df.filter(df.sales_amount > 100)
df_filtered.groupBy("region").count().show()
# COMMAND ----------
# END OF HANDS-ON WALKTHROUGH
Here's the full hands-on Databricks notebook (First Databricks Notebook) with:
- Read/write CSV & Parquet from DBFS
- Securely connect to ADLS Gen2 via Key Vault
- Perform simple PySpark transformations
Interview Questions & Answers
1. What is Databricks and how is it different from Azure Synapse?
Feature | Databricks | Azure Synapse |
---|---|---|
Engine | Apache Spark (Optimized) | T-SQL engine + Spark + Pipelines |
Language Support | Python, Scala, SQL, R, Java | Primarily SQL + some Spark support |
ML Integration | MLflow, Notebooks, AutoML | Limited |
Ideal Use Cases | Big Data, AI/ML, Streaming | BI, SQL DW, Reporting |
Delta Lake Support | Native (by Databricks) | Supported but not tightly integrated |
Databricks excels at big data & AI; Synapse is stronger in SQL + BI integration.
2. Explain the Databricks Workspace structure.
- Workspace: UI to organize assets
- Notebooks: Code documents
- Repos: Git-backed version control
- Jobs: Automation pipelines
- DBFS: Built-in cloud-backed file store
- Secrets: Secure credential storage
3. What are the components of a Databricks cluster?
- Driver Node: Manages SparkContext, job coordination
- Worker Nodes: Execute Spark tasks
- Libraries: Installed per cluster (PyPI, Maven)
- Runtime: Pre-built Spark environment (e.g., ML, Photon)
A cluster = 1 driver + N workers running Spark jobs.
4. How do you connect ADLS with Databricks securely?
- Use an Azure Key Vault-backed secret scope:
  - Store keys/secrets in Azure Key Vault
  - Reference them with dbutils.secrets.get(...)
- Mount ADLS Gen2 using configs:
configs = {f"fs.azure.account.key.{account}.dfs.core.windows.net": dbutils.secrets.get(scope="kv-scope", key="adls-key")}
dbutils.fs.mount(source="abfss://...", mount_point="/mnt/...", extra_configs=configs)
- Use the ABFS URI or the /mnt/ mount in Spark read/write.
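For example, reading data after the mount (a minimal sketch; the container, account, and file names are placeholders):

# Via the mount point
df = spark.read.option("header", True).csv("/mnt/adls_mount/raw/sales.csv")

# Or directly via the ABFS URI once credentials are configured
df = spark.read.option("header", True).csv(
    "abfss://your_container@your_storage_account_name.dfs.core.windows.net/raw/sales.csv"
)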