Here’s a crisp explanation of core technical terms in Azure Databricks, tailored for interviews and hands-on clarity:


πŸš€ Databricks Key Technical Terms


🧭 Workspace

UI and environment where users organize notebooks, jobs, data, repos, and libraries.

  • Like a folder system for all Databricks assets.
  • Shared across users for collaboration.
  • Can link to GitHub/DevOps via Repos tab.
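Workspace contents can also be listed programmatically via the Workspace REST API; below is a minimal sketch in Python (the host URL and token are placeholders you would supply):

import requests

# Placeholders: fill in your workspace URL and a personal access token
host = "https://<your-workspace>.azuredatabricks.net"
token = "<personal-access-token>"

# List objects (notebooks, folders, repos) under /Users
resp = requests.get(
    f"{host}/api/2.0/workspace/list",
    headers={"Authorization": f"Bearer {token}"},
    params={"path": "/Users"},
)
for obj in resp.json().get("objects", []):
    print(obj["object_type"], obj["path"])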

βš™οΈ Cluster

A Spark compute environment managed by Databricks.

  • Has one Driver and multiple Workers
  • Types:
    • Interactive: For dev & notebooks
    • Job: For scheduled/automated pipelines
  • You can configure autoscaling, libraries, runtime version, spot vs on-demand instances.
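These settings map directly onto the cluster spec you would send to the Clusters REST API. A minimal sketch with illustrative values, not a prescription:

# Illustrative payload for POST /api/2.0/clusters/create; all values are examples
cluster_spec = {
    "cluster_name": "etl-interactive",                  # any descriptive name
    "spark_version": "13.3.x-scala2.12",                # Databricks Runtime version
    "node_type_id": "Standard_DS3_v2",                  # Azure VM size for driver and workers
    "autoscale": {"min_workers": 2, "max_workers": 8},  # autoscaling range
    "azure_attributes": {
        "availability": "SPOT_WITH_FALLBACK_AZURE"      # spot VMs with on-demand fallback
    },
    "autotermination_minutes": 30,                      # stop idle interactive clusters
}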

🧱 DBFS (Databricks File System)

Databricks-managed distributed storage layer on top of your cloud (e.g., ADLS/Blob in Azure).

  • Paths: Spark APIs use dbfs:/..., while local file APIs see the same data under /dbfs/...
  • Upload files via UI or code.
  • Use for temporary or staging data.
  • Example: spark.read.csv("dbfs:/tmp/data.csv"); a short sketch of common DBFS operations follows below.
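A few common DBFS operations as a minimal sketch (the /tmp paths are placeholders):

# Write a small file to DBFS, list the directory, then read it back with Spark
dbutils.fs.put("dbfs:/tmp/example.csv", "id,amount\n1,120\n2,80\n", True)  # True = overwrite

display(dbutils.fs.ls("dbfs:/tmp/"))

df = spark.read.option("header", True).csv("dbfs:/tmp/example.csv")
df.show()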

πŸ““ Notebooks

Interactive code documents in Databricks.

  • Support Python, SQL, Scala, R, Markdown
  • Execute code per-cell
  • Track versions, output, and visualizations
  • Can be parameterized for jobs and workflows
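Parameterization is typically done with widgets, which jobs can override at run time. A minimal sketch (the widget name, default value, and table name are placeholders):

# Define a text widget and read its value inside the notebook
dbutils.widgets.text("run_date", "2024-01-01")
run_date = dbutils.widgets.get("run_date")

# Use the parameter in a query (assumes a table called "sales" with an order_date column)
df = spark.read.table("sales")
df.filter(df.order_date == run_date).show()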

πŸ› οΈ Databricks Runtime

Pre-configured Spark environment with optimized libraries and performance tuning.

  • Comes in various types:
    • Standard
    • ML (with scikit-learn, TensorFlow, etc.)
    • Photon-enabled (vectorized engine for SQL/DataFrame performance)
  • Delta Lake ships with every runtime, so Lakehouse features work out of the box (there is no separate "Delta" runtime)
  • Versioned (e.g., 13.3 LTS, 14.0 ML); you can check the version from a notebook, as sketched below
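A small sketch for checking the runtime from a notebook; it relies on the DATABRICKS_RUNTIME_VERSION environment variable that the runtime sets on the driver:

import os

print("Databricks Runtime:", os.environ.get("DATABRICKS_RUNTIME_VERSION"))  # e.g., "13.3"
print("Apache Spark version:", spark.version)                               # bundled Spark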

πŸ“¦ Libraries

External packages you can install on a cluster.

  • Types:
    • Maven: For Scala/Java
    • PyPI: Python packages like pandas, nltk
    • Jar/Egg/Wheel: Custom uploads
  • Can be cluster-scoped or notebook-scoped
  • Installed via UI or %pip install in notebook
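For example, a notebook-scoped install looks like this (the package and version pin are illustrative); cluster-scoped libraries are attached from the cluster's Libraries tab or the API instead:

%pip install nltk==3.8.1

# In a later cell: %pip restarts the Python interpreter, so re-run imports afterwards
import nltk
print(nltk.__version__)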

πŸš— Driver vs Worker

| Component | Role | Characteristics |
| --- | --- | --- |
| Driver | Master node | Manages SparkContext, coordinates workers, returns results |
| Worker | Executor node | Executes tasks, stores shuffle data, processes partitions |

πŸ“Œ The Driver runs on 1 machine, and Workers scale based on cluster size.
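A quick way to see the split: the Driver builds the plan and collects results, while Workers execute one task per partition. A minimal sketch:

# The driver defines the DataFrame; count() runs as tasks on the workers
df = spark.range(0, 1_000_000)
print("Partitions:", df.rdd.getNumPartitions())
print("Total worker cores (default parallelism):", spark.sparkContext.defaultParallelism)
print("Rows counted on workers:", df.count())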


πŸ› οΈ Jobs

A scheduled or triggered unit of work that runs a notebook, JAR, Python file, or Delta Live Tables pipeline.

  • Can have parameters, dependencies, retry logic, alerts
  • Supports multi-task workflows
  • Runs on job clusters or existing clusters
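A multi-task workflow with retries can be expressed as a Jobs API payload. A minimal sketch; the job name, notebook paths, and schedule are illustrative:

# Illustrative payload for POST /api/2.1/jobs/create; paths and names are placeholders
job_spec = {
    "name": "daily-sales-pipeline",
    "job_clusters": [{
        "job_cluster_key": "etl_cluster",
        "new_cluster": {
            "spark_version": "13.3.x-scala2.12",
            "node_type_id": "Standard_DS3_v2",
            "num_workers": 2,
        },
    }],
    "tasks": [
        {
            "task_key": "ingest",
            "notebook_task": {"notebook_path": "/Repos/team/etl/ingest"},
            "job_cluster_key": "etl_cluster",
            "max_retries": 2,                        # retry logic
        },
        {
            "task_key": "transform",
            "depends_on": [{"task_key": "ingest"}],  # task dependency
            "notebook_task": {"notebook_path": "/Repos/team/etl/transform"},
            "job_cluster_key": "etl_cluster",
        },
    ],
    "schedule": {"quartz_cron_expression": "0 0 2 * * ?", "timezone_id": "UTC"},
}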

🌊 Pools

Instance pools reduce cluster startup time and cost by pre-warming VMs.

  • Create once β†’ Reuse across multiple jobs
  • Ideal for high-frequency jobs
  • Saves cost on job cluster creation
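Conceptually you define the pool once, then clusters reference it by ID instead of a node type. A minimal sketch with illustrative values:

# Illustrative pool definition (POST /api/2.0/instance-pools/create)
pool_spec = {
    "instance_pool_name": "etl-pool",
    "node_type_id": "Standard_DS3_v2",
    "min_idle_instances": 2,                        # VMs kept warm
    "idle_instance_autotermination_minutes": 60,
}

# A cluster or job cluster then points at the pool instead of a node type
cluster_from_pool = {
    "cluster_name": "etl-from-pool",
    "spark_version": "13.3.x-scala2.12",
    "instance_pool_id": "<pool-id-returned-by-create>",  # placeholder
    "num_workers": 2,
}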

Summary Table

| Term | Key Idea |
| --- | --- |
| Workspace | Dev UI and project folder system |
| Cluster | Spark runtime compute environment |
| DBFS | Built-in cloud-backed file system |
| Notebook | Code + output document |
| Runtime | Pre-installed Spark + tools |
| Library | Packages added to cluster/notebook |
| Driver | Orchestrates the job |
| Worker | Executes Spark tasks |
| Job | Automated pipeline/task |
| Pool | Reusable instance group for fast spin-up |

# Databricks Notebook: Getting Started with PySpark

# COMMAND ----------
# πŸ“˜ 1. Read CSV & Parquet from DBFS

# DBFS paths (upload a file via sidebar > Data > Add Data)
# Note: Spark APIs take dbfs:/ paths; the /dbfs/... prefix is only for local file access
csv_path = "dbfs:/tmp/sample.csv"
parquet_path = "dbfs:/tmp/sample.parquet"

# Read CSV
csv_df = spark.read.option("header", True).csv(csv_path)
csv_df.show()

# Write as Parquet
csv_df.write.mode("overwrite").parquet(parquet_path)

# COMMAND ----------
# πŸ“˜ 2. Connect to Azure Data Lake (ADLS Gen2) securely

# Step 1: Use Azure Key Vault-backed secret scope for credentials
# You must create secret scope via UI or CLI, example:
# dbutils.secrets.get(scope="kv-scope", key="adls-key")

storage_account = "your_storage_account_name"
container = "your_container"
mount_point = "/mnt/adls_mount"

configs = {
    f"fs.azure.account.key.{storage_account}.dfs.core.windows.net":
        dbutils.secrets.get(scope="kv-scope", key="adls-key")
}

# Mount ADLS Gen2 (mounts are workspace-wide and persist, so this only needs to run once)
dbx_path = f"abfss://{container}@{storage_account}.dfs.core.windows.net/"

# Skip the mount if the mount point already exists instead of swallowing all errors
if any(m.mountPoint == mount_point for m in dbutils.fs.mounts()):
    print("Already mounted")
else:
    dbutils.fs.mount(source=dbx_path, mount_point=mount_point, extra_configs=configs)

# List files
display(dbutils.fs.ls(mount_point))

# COMMAND ----------
# πŸ“˜ 3. Simple PySpark Transformations

# Rename a column, filter, and group by (assumes the sample CSV has 'amount' and 'region' columns)
df = csv_df.withColumnRenamed("amount", "sales_amount")
df_filtered = df.filter(df.sales_amount > 100)
df_filtered.groupBy("region").count().show()

# COMMAND ----------
# βœ… END OF HANDS-ON WALKTHROUGH

The hands-on notebook above (First Databricks Notebook) covers:

  • βœ… Read/write CSV & Parquet from DBFS
  • πŸ” Securely connect to ADLS Gen2 via Key Vault
  • πŸ”„ Perform simple PySpark transformations

🎯 Interview Questions & Answers


1. What is Databricks and how is it different from Azure Synapse?

| Feature | Databricks | Azure Synapse |
| --- | --- | --- |
| Engine | Apache Spark (optimized) | T-SQL engine + Spark + Pipelines |
| Language Support | Python, Scala, SQL, R, Java | Primarily SQL + some Spark support |
| ML Integration | MLflow, Notebooks, AutoML | Limited |
| Ideal Use Cases | Big Data, AI/ML, Streaming | BI, SQL DW, Reporting |
| Delta Lake Support | Native (by Databricks) | Supported but not tightly integrated |

🟑 Databricks excels in big data & AI, while Synapse is stronger in SQL + BI integration.


2. Explain the Databricks Workspace structure.

  • Workspace: UI to organize assets
    • πŸ““ Notebooks: Code documents
    • πŸ›  Repos: Git-backed version control
    • πŸ—ƒ Jobs: Automation pipelines
    • πŸ“ DBFS: Built-in cloud-backed file store
    • πŸ” Secrets: Secure credential storage

3. What are the components of a Databricks cluster?

  • Driver Node: Manages SparkContext, job coordination
  • Worker Nodes: Execute Spark tasks
  • Libraries: Installed per cluster (PyPI, Maven)
  • Runtime: Pre-built Spark environment (e.g., ML, Photon)

πŸ“Œ A cluster = 1 driver + N workers running Spark jobs.


4. How do you connect ADLS with Databricks securely?

  1. Use Azure Key Vault-backed Secret Scope:
    • Store keys/secrets in Azure Key Vault
    • Reference using dbutils.secrets.get(...)
  2. Mount ADLS Gen2 using configs:
     configs = {
         f"fs.azure.account.key.{account}.dfs.core.windows.net":
             dbutils.secrets.get(scope="kv-scope", key="adls-key")
     }
     dbutils.fs.mount(source="abfss://...", mount_point="/mnt/...", extra_configs=configs)
  3. Use ABFS URI or /mnt/ mount in Spark read/write.
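Besides mounting, you can set the key on the Spark session and read with the abfss:// URI directly; a minimal sketch (account, container, and file path are placeholders):

# Direct access without a mount: configure the account key for this session,
# then read using the abfss:// URI
account = "your_storage_account_name"
key = dbutils.secrets.get(scope="kv-scope", key="adls-key")

spark.conf.set(f"fs.azure.account.key.{account}.dfs.core.windows.net", key)

df = spark.read.option("header", True).csv(
    f"abfss://your_container@{account}.dfs.core.windows.net/raw/sample.csv"
)
df.show()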
