📓 Post 3: Creating Your First Notebook, Loading Data & Performing EDA with Spark & Delta

Welcome to Post 3 of our Databricks Tutorial Series. Now that your workspace and cloud storage (S3 or ADLS) are ready, it’s time to jump into real work—building your first notebook, loading raw data, and doing basic EDA (Exploratory Data Analysis) using Spark + Delta Lake.

This hands-on post will:

  • Walk you through the UI
  • Load real data (CSV → Delta format)
  • Run EDA: schema, nulls, value counts, distributions
  • Save data in Delta format with versioning

🎯 Goals for This Post

  • 🔧 Set up a Databricks notebook (Python or SQL)
  • 📥 Load sample CSV/JSON from S3 or ADLS
  • 🔎 Perform EDA with PySpark
  • 💾 Save clean data as a Delta Table
  • 🧬 Track changes using Delta features like Time Travel

🛠️ Step 1: Create a New Notebook in Databricks

  1. Go to your Databricks Workspace
  2. Click on Workspace > Users > your_user
  3. Click Create → Notebook
    • Name: eda_customer_data
    • Language: Python (or SQL)
    • Attach to: Your running cluster

✅ Notebook ready!
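
Before moving on, it can help to run one quick sanity cell to confirm the notebook is attached to a live cluster. A minimal sketch using built-in Databricks objects:

# Confirm the Spark session is available on the attached cluster
print(spark.version)

# List the root of the Databricks File System (DBFS)
display(dbutils.fs.ls("/"))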


📂 Step 2: Upload Sample Data (CSV or JSON)

We’ll use a simple customer dataset. You can either upload manually or mount it from cloud storage.

🔹 Option A: Upload Manually

  1. Go to Data > Create Table > Upload File
  2. Choose a local CSV file (e.g., customers.csv)
  3. Click “Create Table With UI” or “Create Table in Notebook”

✅ File gets stored at /FileStore/tables/customers.csv
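
To confirm the upload landed where expected, list the FileStore directory:

# Verify the uploaded file is present in DBFS
display(dbutils.fs.ls("/FileStore/tables/"))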

🔹 Option B: Load from ADLS/S3

# Read the CSV from the mounted cloud path into a Spark DataFrame
df = spark.read.csv("/mnt/data/customers.csv", header=True, inferSchema=True)
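
If you haven't mounted the storage, Spark can also read straight from a cloud URI. This is only a sketch: the bucket, container, and account names below are placeholders, and it assumes the cluster already has credentials configured for that storage.

# Direct read from S3 (placeholder bucket name)
df = spark.read.csv("s3a://my-bucket/raw/customers.csv", header=True, inferSchema=True)

# Direct read from ADLS Gen2 (placeholder container and storage account)
df = spark.read.csv(
    "abfss://raw@mystorageaccount.dfs.core.windows.net/customers.csv",
    header=True,
    inferSchema=True,
)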

🔍 Step 3: Perform Basic EDA (Exploratory Data Analysis)

Let’s start examining our dataset.

🔹 Read the CSV into a DataFrame

# Read the uploaded CSV, letting Spark infer column types from the data
df = spark.read.option("header", True).option("inferSchema", True).csv("/FileStore/tables/customers.csv")
df.printSchema()   # column names and inferred types
df.show(5)         # preview the first five rows

🔹 Check Row Count

df.count()

🔹 Null Check

from pyspark.sql.functions import col, count, when

# Count the nulls in each column
df.select([count(when(col(c).isNull(), c)).alias(c) for c in df.columns]).show()

🔹 Summary Statistics

df.describe().show()

🔹 Value Distribution (e.g., by Gender)

df.groupBy("gender").count().show()

🔹 Unique Values

df.select("city").distinct().show()

💾 Step 4: Save as Delta Table

Convert your cleaned data to Delta format for reliability, schema enforcement, and time travel.

df.write.format("delta").mode("overwrite").save("/delta/customers")

✅ This creates Delta transaction logs in /delta/customers/_delta_log
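
You can confirm this by listing the log directory, which holds one JSON commit file per table version:

# Inspect the Delta transaction log
display(dbutils.fs.ls("/delta/customers/_delta_log"))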


🔁 Step 5: Register Delta Table in Metastore

Make your table queryable in SQL:

spark.sql("CREATE TABLE IF NOT EXISTS customers USING DELTA LOCATION '/delta/customers'")
spark.sql("SELECT * FROM customers LIMIT 10").show()

Now, it can be used like any SQL table!
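
For example, a SQL cell can aggregate directly on the registered table (the city column comes from our sample dataset):

%sql
-- Top cities by number of customers
SELECT city, COUNT(*) AS customer_count
FROM customers
GROUP BY city
ORDER BY customer_count DESC
LIMIT 10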


🔄 Step 6: Delta Features in Action (Time Travel + Upserts)

🔹 Check Table History

spark.sql("DESCRIBE HISTORY customers").show()

🔹 Make Changes (Overwrite, Update or Delete)

The simplest change is an overwrite. The example below rewrites the table so only male customers remain, which Delta records as a new version:

# Overwrite the Delta table with a filtered subset (logged as a new version)
df.filter("gender = 'Male'").write.format("delta").mode("overwrite").save("/delta/customers")

🔹 Time Travel (Access Older Version)

df_old = spark.read.format("delta").option("versionAsOf", 0).load("/delta/customers")
df_old.show()
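
The step title also promises upserts: Delta's MERGE combines updates and inserts in one operation. A sketch, assuming the sample data has a customer_id key column (swap in whatever uniquely identifies a row in your file):

from delta.tables import DeltaTable
from pyspark.sql.functions import lit

# Build a tiny "updates" DataFrame with the same schema (two existing rows, city changed)
updates_df = df.limit(2).withColumn("city", lit("Austin"))

target = DeltaTable.forPath(spark, "/delta/customers")

(target.alias("t")
    .merge(updates_df.alias("s"), "t.customer_id = s.customer_id")  # customer_id is an assumed key
    .whenMatchedUpdateAll()        # rows whose key already exists get updated
    .whenNotMatchedInsertAll()     # genuinely new rows get inserted
    .execute())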

📊 Bonus: Display EDA Graphs in Databricks

Try this for interactive visuals:

display(df.groupBy("city").count().orderBy("count", ascending=False))

Databricks renders the result as an interactive table; use the chart controls under the output to switch it to a bar chart.
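
If you want full control over the chart, convert the (small) aggregated result to pandas and plot it with matplotlib, which is typically preinstalled on Databricks clusters. A sketch:

import matplotlib.pyplot as plt

# Bring the small aggregated result down to pandas for custom plotting
city_counts = (df.groupBy("city").count()
                 .orderBy("count", ascending=False)
                 .limit(10)
                 .toPandas())

city_counts.plot(kind="bar", x="city", y="count", legend=False)
plt.ylabel("customers")
plt.title("Top 10 cities by customer count")
plt.show()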


📦 Summary

You now know how to:

  • Create a notebook
  • Load CSV or JSON into a Spark DataFrame
  • Perform core EDA (nulls, summaries, distributions)
  • Save to Delta Lake for transactional power
  • Use Time Travel + SQL on Delta Tables

