📓 Post 3: Creating Your First Notebook, Loading Data & Performing EDA with Spark & Delta

Welcome to Post 3 of our Databricks Tutorial Series. Now that your workspace and cloud storage (S3 or ADLS) are ready, it’s time to jump into real work—building your first notebook, loading raw data, and doing basic EDA (Exploratory Data Analysis) using Spark + Delta Lake.

This hands-on post will:

  • Walk you through the UI
  • Load real data (CSV → Delta format)
  • Run EDA: schema, nulls, value counts, distributions
  • Save data in Delta format with versioning

🎯 Goals for This Post

  • 🔧 Set up a Databricks notebook (Python or SQL)
  • 📥 Load sample CSV/JSON from S3 or ADLS
  • 🔎 Perform EDA with PySpark
  • 💾 Save clean data as a Delta Table
  • 🧬 Track changes using Delta features like Time Travel

🛠️ Step 1: Create a New Notebook in Databricks

  1. Go to your Databricks Workspace
  2. Click on Workspace > Users > your_user
  3. Click Create → Notebook
    • Name: eda_customer_data
    • Language: Python (or SQL)
    • Attach to: Your running cluster

✅ Notebook ready!
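
Before moving on, it can help to run one quick sanity cell to confirm the notebook is attached to a live cluster. A minimal sketch using built-in Databricks objects:

# Confirm the Spark session is available on the attached cluster
print(spark.version)

# List the root of the Databricks File System (DBFS)
display(dbutils.fs.ls("/"))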


📂 Step 2: Upload Sample Data (CSV or JSON)

We’ll use a simple customer dataset. You can either upload manually or mount it from cloud storage.

🔹 Option A: Upload Manually

  1. Go to Data > Create Table > Upload File
  2. Choose a local CSV file (e.g., customers.csv)
  3. Click “Create Table With UI” or “Create Table in Notebook”

✅ File gets stored at /FileStore/tables/customers.csv
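
To confirm the upload landed where expected, list the FileStore directory:

# Verify the uploaded file is present in DBFS
display(dbutils.fs.ls("/FileStore/tables/"))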

🔹 Option B: Load from ADLS/S3

# Read the CSV from the mounted cloud path into a Spark DataFrame
df = spark.read.csv("/mnt/data/customers.csv", header=True, inferSchema=True)
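
If you haven't mounted the storage, Spark can also read straight from a cloud URI. This is only a sketch: the bucket, container, and account names below are placeholders, and it assumes the cluster already has credentials configured for that storage.

# Direct read from S3 (placeholder bucket name)
df = spark.read.csv("s3a://my-bucket/raw/customers.csv", header=True, inferSchema=True)

# Direct read from ADLS Gen2 (placeholder container and storage account)
df = spark.read.csv(
    "abfss://raw@mystorageaccount.dfs.core.windows.net/customers.csv",
    header=True,
    inferSchema=True,
)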

🔍 Step 3: Perform Basic EDA (Exploratory Data Analysis)

Let’s start examining our dataset.

🔹 Read the CSV into a DataFrame

# Read the uploaded CSV, letting Spark infer column types from the data
df = spark.read.option("header", True).option("inferSchema", True).csv("/FileStore/tables/customers.csv")
df.printSchema()   # column names and inferred types
df.show(5)         # preview the first five rows

🔹 Check Row Count

df.count()

🔹 Null Check

from pyspark.sql.functions import col, count, when

# Count the nulls in each column
df.select([count(when(col(c).isNull(), c)).alias(c) for c in df.columns]).show()

🔹 Summary Statistics

df.describe().show()

🔹 Value Distribution (e.g., by Gender)

df.groupBy("gender").count().show()

🔹 Unique Values

df.select("city").distinct().show()

💾 Step 4: Save as Delta Table

Convert your cleaned data to Delta format for reliability, schema enforcement, and time travel.

df.write.format("delta").mode("overwrite").save("/delta/customers")

✅ This creates Delta transaction logs in /delta/customers/_delta_log
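
You can confirm this by listing the log directory, which holds one JSON commit file per table version:

# Inspect the Delta transaction log
display(dbutils.fs.ls("/delta/customers/_delta_log"))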


🔁 Step 5: Register Delta Table in Metastore

Make your table queryable in SQL:

spark.sql("CREATE TABLE IF NOT EXISTS customers USING DELTA LOCATION '/delta/customers'")
spark.sql("SELECT * FROM customers LIMIT 10").show()

Now, it can be used like any SQL table!
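
For example, a SQL cell can aggregate directly on the registered table (the city column comes from our sample dataset):

%sql
-- Top cities by number of customers
SELECT city, COUNT(*) AS customer_count
FROM customers
GROUP BY city
ORDER BY customer_count DESC
LIMIT 10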


🔄 Step 6: Delta Features in Action (Time Travel + Upserts)

🔹 Check Table History

spark.sql("DESCRIBE HISTORY customers").show()

🔹 Make Changes (Overwrite, Update or Delete)

The simplest change is an overwrite. The example below rewrites the table so only male customers remain, which Delta records as a new version:

# Overwrite the Delta table with a filtered subset (logged as a new version)
df.filter("gender = 'Male'").write.format("delta").mode("overwrite").save("/delta/customers")

🔹 Time Travel (Access Older Version)

df_old = spark.read.format("delta").option("versionAsOf", 0).load("/delta/customers")
df_old.show()
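
The step title also promises upserts: Delta's MERGE combines updates and inserts in one operation. A sketch, assuming the sample data has a customer_id key column (swap in whatever uniquely identifies a row in your file):

from delta.tables import DeltaTable
from pyspark.sql.functions import lit

# Build a tiny "updates" DataFrame with the same schema (two existing rows, city changed)
updates_df = df.limit(2).withColumn("city", lit("Austin"))

target = DeltaTable.forPath(spark, "/delta/customers")

(target.alias("t")
    .merge(updates_df.alias("s"), "t.customer_id = s.customer_id")  # customer_id is an assumed key
    .whenMatchedUpdateAll()        # rows whose key already exists get updated
    .whenNotMatchedInsertAll()     # genuinely new rows get inserted
    .execute())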

📊 Bonus: Display EDA Graphs in Databricks

Try this for interactive visuals:

display(df.groupBy("city").count().orderBy("count", ascending=False))

Databricks renders the result as an interactive table; use the chart controls under the output to switch it to a bar chart.
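
If you want full control over the chart, convert the (small) aggregated result to pandas and plot it with matplotlib, which is typically preinstalled on Databricks clusters. A sketch:

import matplotlib.pyplot as plt

# Bring the small aggregated result down to pandas for custom plotting
city_counts = (df.groupBy("city").count()
                 .orderBy("count", ascending=False)
                 .limit(10)
                 .toPandas())

city_counts.plot(kind="bar", x="city", y="count", legend=False)
plt.ylabel("customers")
plt.title("Top 10 cities by customer count")
plt.show()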


📦 Summary

You now know how to:

  • Create a notebook
  • Load CSV or JSON into a Spark DataFrame
  • Perform core EDA (nulls, summaries, distributions)
  • Save to Delta Lake for transactional power
  • Use Time Travel + SQL on Delta Tables

