Absolutely! Let’s break it down clearly by comparing:
Traditional RDBMS (like Oracle, Postgres, MySQL) vs. Vanilla PySpark (with Parquet/ORC) vs. PySpark with Delta Lake
We'll explain each feature, its purpose in traditional SQL databases, whether it's supported in vanilla PySpark, and how Delta Lake fills the gap.
1. ACID Transactions
RDBMS | ✅ Supported via transaction log |
---|---|
PySpark + Parquet | ❌ Not supported |
Delta Lake | ✅ Supported |
What It Means: ACID = Atomicity, Consistency, Isolation, Durability. Ensures data integrity (e.g., all-or-nothing writes, rollback on error).
Why PySpark Can't Do It: Parquet/ORC are just files on disk (HDFS/S3); there is no transaction coordination.
Delta Lake adds a _delta_log directory that tracks atomic operations.
Use Case: Writing multiple updates where you want either all to succeed or none (e.g., financial transactions, rollback on ETL job failure).
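For intuition, a minimal sketch (the path and column name are illustrative): every successful Delta write becomes exactly one numbered commit file in _delta_log/, and a failed write leaves no commit behind, so readers never see a half-written version.
# Illustrative sketch of an atomic Delta append
df = spark.range(100).withColumnRenamed("id", "user_id")
# On success, _delta_log/ gains one new JSON commit (e.g., 00000000000000000001.json).
# If the job dies mid-write, the orphaned data files are never referenced by a commit,
# so the table still reads as the previous version.
df.write.format("delta").mode("append").save("s3://delta/users")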
2. Time Travel
RDBMS | ✅ Via transaction/version logs or flashback |
---|---|
Vanilla PySpark | ❌ Not supported |
Delta Lake | ✅ Supported |
What It Means: You can query old versions of a table, which is useful for debugging, audits, and rollback.
Delta Lake uses versioned snapshots stored in _delta_log.
-- Delta Lake: read a previous version
SELECT * FROM my_table VERSION AS OF 5
Use Case: Accidentally deleted data? Just roll back to version 4.
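The same time travel is available from the DataFrame API; a minimal sketch, assuming a Delta table at an illustrative path:
# Read the table as it looked at version 4, or at a point in time
old_v4 = spark.read.format("delta").option("versionAsOf", 4).load("s3://delta/my_table")
old_ts = spark.read.format("delta").option("timestampAsOf", "2024-01-01").load("s3://delta/my_table")

# Or restore the live table back to that version (DeltaTable API, Delta Lake 1.2+)
from delta.tables import DeltaTable
DeltaTable.forPath(spark, "s3://delta/my_table").restoreToVersion(4)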
3. Concurrent Writes
RDBMS | ✅ Supported with row-level locks |
---|---|
Vanilla PySpark | ❌ Dangerous |
Delta Lake | ✅ Optimistic concurrency control via the Delta log |
What It Means: Multiple users or jobs can safely update the same table.
Why PySpark Fails: If two Spark jobs write to the same file path, one may overwrite the other; there is no locking mechanism.
Delta uses optimistic concurrency control (OCC): it detects conflicting changes and raises an error so the losing writer can retry, as sketched below.
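A minimal sketch of that retry pattern, assuming a Delta table at an illustrative path and the delta.exceptions module shipped with the delta-spark Python package:
from delta.tables import DeltaTable
from delta.exceptions import ConcurrentAppendException, ConcurrentDeleteReadException

dt = DeltaTable.forPath(spark, "s3://delta/events")  # illustrative path
try:
    dt.update(condition="status = 'pending'", set={"status": "'processed'"})
except (ConcurrentAppendException, ConcurrentDeleteReadException):
    # Another job committed a conflicting change first; re-run against the new snapshot
    dt.update(condition="status = 'pending'", set={"status": "'processed'"})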
4. Schema Evolution
RDBMS | ✅ Supports ALTER TABLE |
---|---|
Vanilla PySpark | ⚠️ Limited |
Delta Lake | ✅ Supports automatic schema merge |
What It Means: You can evolve a schema over time, adding or removing columns dynamically.
PySpark Issue: Schemas must be reconciled manually when files written with different columns are read together.
# Delta Lake schema evolution
df.write.option("mergeSchema", "true").format("delta").mode("append").save(...)
Use Case: JSON or streaming data where new fields get introduced over time.
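A minimal sketch of that scenario, with an illustrative table path and columns: the second batch carries a new email column, which Delta rejects on append unless mergeSchema is enabled.
base = spark.createDataFrame([(1, "a")], ["id", "name"])
base.write.format("delta").mode("overwrite").save("/tmp/delta/customers_demo")

extra = spark.createDataFrame([(2, "b", "b@example.com")], ["id", "name", "email"])
# Without mergeSchema this append fails with a schema-mismatch error;
# with it, Delta adds the new column and back-fills nulls for the old rows.
extra.write.option("mergeSchema", "true") \
     .format("delta").mode("append").save("/tmp/delta/customers_demo")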
5. MERGE / UPSERT (SCD2 Support)
RDBMS | ✅ MERGE INTO, UPDATE, DELETE |
---|---|
Vanilla PySpark | ❌ Manual logic |
Delta Lake | ✅ Native MERGE INTO support |
What It Means: Efficiently update or insert based on conditions (Slowly Changing Dimensions, SCD2).
Vanilla Way:
# Manual upsert: anti-join keeps existing rows that are not being updated, then append the new data
existing = spark.read.parquet("path")
kept = existing.join(new_data, "id", "left_anti")
updated = kept.unionByName(new_data)
# In practice, write to a staging path or cache first: Spark cannot overwrite a path it is still reading
updated.write.mode("overwrite").parquet("path")
Delta Way:
MERGE INTO customers USING updates ON customers.id = updates.id
WHEN MATCHED THEN UPDATE SET ...
WHEN NOT MATCHED THEN INSERT ...
6. Deletes / Updates
RDBMS | ✅ Efficient |
---|---|
Vanilla PySpark | ❌ Full overwrite needed |
Delta Lake | ✅ Native support |
Problem: You want to delete all records where status = 'inactive'.
PySpark:
- Read table
- Filter out records
- Overwrite entire dataset
Delta Lake:
DELETE FROM my_table WHERE status = 'inactive'
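The same delete is also available through the Python API; a minimal sketch with an illustrative table path:
from delta.tables import DeltaTable

dt = DeltaTable.forPath(spark, "s3://delta/my_table")
# Rewrites only the files that contain matching rows, committed as one transaction
dt.delete("status = 'inactive'")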
7. Change Tracking / Audit Logs
RDBMS | ✅ System-managed or via triggers |
---|---|
Vanilla PySpark | ❌ Not available |
Delta Lake | ✅ _delta_log keeps change history |
Use Case: Who changed what and when?
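A minimal sketch of inspecting that history from Python (illustrative path); the SQL equivalent is DESCRIBE HISTORY my_table. Each committed operation appears as one row with its version, timestamp, user, and operation:
from delta.tables import DeltaTable

history = DeltaTable.forPath(spark, "s3://delta/my_table").history()
history.select("version", "timestamp", "userName", "operation", "operationParameters") \
       .show(truncate=False)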
8. Indexing
RDBMS | ✅ Indexes on columns |
---|---|
Vanilla PySpark | ❌ Not supported |
Delta Lake | ⚠️ Z-Ordering (helps, but not a real index) |
Problem: No B-Trees or hash indexes like in Oracle.
Delta Lake:
- Supports Z-Ordering, which physically clusters data by column to improve query performance.
OPTIMIZE table_name ZORDER BY (user_id)
9. Stored Procedures / Triggers
RDBMS | ✅ Supports procedural logic |
---|---|
Vanilla PySpark | ❌ Not available |
Delta Lake | ❌ Also not supported natively |
Spark doesn’t support SQL procedural logic (loops, if/else). You have to do this in PySpark/Python/Scala code.
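A rough sketch of the workaround, assuming an illustrative Delta table named my_table: the branching lives in Python, while the set-based work stays in Spark SQL.
# "Procedural" cleanup logic written in Python instead of PL/SQL
inactive_count = spark.sql(
    "SELECT count(*) AS n FROM my_table WHERE status = 'inactive'"
).first()["n"]

if inactive_count > 0:
    # DELETE FROM works here because my_table is assumed to be a Delta table
    spark.sql("DELETE FROM my_table WHERE status = 'inactive'")
else:
    print("Nothing to clean up")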
10. Access Control
RDBMS | ✅ Role-based, column-level security |
---|---|
Vanilla PySpark | ❌ Controlled by storage layer (e.g., HDFS/S3) |
Delta Lake + Unity Catalog (Databricks) | ✅ Fine-grained access control |
Why This Matters: If you store data on S3, access control happens at the bucket or object level, not at the row or column level as in databases.
Databricks Unity Catalog:
- Table/Column/Row level access
- RBAC, Data Lineage
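On Databricks with Unity Catalog enabled, grants are expressed in SQL against three-level names; a minimal sketch with illustrative catalog, schema, table, and group names (this will not run on plain open-source Spark):
# Grant read access on one table to a group (catalog.schema.table naming)
spark.sql("GRANT SELECT ON TABLE main.sales.orders TO `data_analysts`")
# Revoke it again
spark.sql("REVOKE SELECT ON TABLE main.sales.orders FROM `data_analysts`")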
Summary Table
Feature | Traditional RDBMS | Vanilla PySpark | Delta Lake |
---|---|---|---|
ACID Transactions | ✅ | ❌ | ✅ |
Time Travel | ✅ | ❌ | ✅ |
Concurrent Writes | ✅ | ❌ | ✅ |
Schema Evolution | ✅ | ⚠️ | ✅ |
Merge / Upsert (SCD2) | ✅ | ❌ | ✅ |
Deletes / Updates | ✅ | ❌ | ✅ |
Change Tracking | ✅ | ❌ | ✅ |
Indexing | ✅ | ❌ | ⚠️ (Z-Ordering) |
Stored Procedures | ✅ | ❌ | ❌ |
Access Control | ✅ | ❌ | ✅ (via Unity Catalog) |
Conclusion
- Vanilla PySpark (Parquet, ORC) is not a database. It's file-based: no transaction log, no versioning, no indexing.
- Delta Lake brings the power of databases (ACID, merge, delete, time travel) to data lakes using Parquet under the hood.
- Databricks with Unity Catalog adds governance, RBAC, and lineage, making it a true enterprise Lakehouse.
Excellent, let's break this down in depth. We'll explain key database features like Atomicity (ACID) and SCD1/SCD2, then compare Traditional RDBMS vs. Vanilla PySpark vs. Delta Lake/Databricks for each of them, with examples.
1. Atomicity (from ACID)
Atomicity = All or Nothing
✅ Traditional RDBMS (e.g., Oracle, Postgres)
- Uses transactions:
Example:
BEGIN;
UPDATE accounts SET balance = balance - 500 WHERE id = 1;
UPDATE accounts SET balance = balance + 500 WHERE id = 2;
COMMIT;
- If any step fails, entire block rolls back.
- Supported by built-in transaction logs + locks.
❌ Vanilla PySpark (Parquet/CSV/ORC)
- Filesystem-based (HDFS, S3): No concept of transaction logs or rollback.
- Example:
df1.write.mode('overwrite').parquet("s3://data/step1/")
df2.write.mode('overwrite').parquet("s3://data/step2/")  # If this fails, step1 is still written
- If one write fails, the others remain in place: not atomic.
✅ Delta Lake (on Databricks or OSS)
- Maintains _delta_log/ directory for transaction logs.
- Atomic actions:
# A single MERGE is committed as one atomic transaction in _delta_log;
# Spark SQL has no BEGIN/COMMIT, and none is needed here.
spark.sql("""
MERGE INTO target USING updates ON target.id = updates.id
WHEN MATCHED THEN UPDATE SET ...
WHEN NOT MATCHED THEN INSERT ...
""")
- If the operation fails partway, no commit is written to the transaction log, so readers never see a partial result.
2. Slowly Changing Dimensions (SCD)
SCD1 (overwrite old data with new)
Use Case: Keep only the latest customer address.
✅ Traditional SQL:
UPDATE customers SET address = 'New York' WHERE customer_id = 101;
❌ Vanilla PySpark:
Manual steps:
from pyspark.sql import functions as F

df_existing = spark.read.parquet("s3://customers/")
df_updates = spark.read.parquet("s3://new_updates/")
# Full outer join keeps brand-new customers; prefer the updated address when one exists
df_merged = (df_existing.alias("e")
    .join(df_updates.alias("u"), "customer_id", "full_outer")
    .select("customer_id", F.coalesce(F.col("u.address"), F.col("e.address")).alias("address")))
df_merged.write.mode('overwrite').parquet("s3://customers/")
- No native MERGE: you must read, join, and overwrite the whole dataset.
✅ Delta Lake (Databricks):
MERGE INTO customers USING updates
ON customers.customer_id = updates.customer_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
OR (Python API):
from delta.tables import DeltaTable

DeltaTable.forPath(spark, "s3://delta/customers").alias("t") \
    .merge(df_updates.alias("u"), "t.customer_id = u.customer_id") \
    .whenMatchedUpdateAll() \
    .whenNotMatchedInsertAll() \
    .execute()
SCD2 (history tracking)
Use Case: Track every change with start_date, end_date, and is_active.
✅ Traditional SQL:
-- Close old record
UPDATE customer_dim SET end_date = CURRENT_DATE, is_active = 0
WHERE customer_id = 101 AND is_active = 1;
-- Insert new version
INSERT INTO customer_dim (id, address, start_date, is_active)
VALUES (101, 'New York', CURRENT_DATE, 1);
❌ Vanilla PySpark:
Manual logic:
from pyspark.sql import functions as F

# Read old dimension and incoming changes
existing = spark.read.parquet("s3://customer_dim/")
incoming = spark.read.parquet("s3://incoming/")
# Split existing rows into those affected by an update and those left untouched
changed_ids = incoming.select("id")
affected = existing.join(changed_ids, "id", "left_semi")
untouched = existing.join(changed_ids, "id", "left_anti")
# Close the affected rows
closed = affected.withColumn("end_date", F.current_date())\
    .withColumn("is_active", F.lit(0))
# Build the new active versions from the incoming data
new_rows = incoming.withColumn("start_date", F.current_date())\
    .withColumn("end_date", F.lit(None).cast("date"))\
    .withColumn("is_active", F.lit(1))
# Combine and write (in practice via a staging path, since the input is being overwritten)
final = untouched.unionByName(closed).unionByName(new_rows)
final.write.mode('overwrite').parquet("s3://customer_dim/")
✅ Delta Lake:
from delta.tables import DeltaTable
from pyspark.sql import functions as F

# `updates` is a DataFrame of incoming customer changes
delta = DeltaTable.forPath(spark, "s3://delta/customer_dim")

# Step 1: Mark the currently active versions of changed customers as inactive
delta.alias("target").merge(
    updates.alias("source"),
    "target.id = source.id AND target.is_active = 1"
).whenMatchedUpdate(set={
    "end_date": "current_date()",
    "is_active": "0"
}).execute()

# Step 2: Insert the new versions as active rows
updates.withColumn("start_date", F.current_date())\
    .withColumn("is_active", F.lit(1))\
    .write.format("delta").mode("append").save("s3://delta/customer_dim")
Feature-by-Feature Comparison Table
Feature | RDBMS | PySpark (Vanilla) | PySpark + Delta Lake |
---|---|---|---|
ACID Transactions | ✅ Yes | ❌ No | ✅ Yes |
SCD1 | ✅ Easy | ⚠️ Manual Logic | ✅ Native MERGE |
SCD2 | ✅ Easy | ⚠️ Manual + Union | ✅ Built-in support |
MERGE/UPSERT | ✅ Yes | ❌ No | ✅ Yes |
DELETE/UPDATE | ✅ Yes | ❌ No (overwrite only) | ✅ Yes |
Time Travel | ⚠️ Point-in-time recovery | ❌ No | ✅ Versioned reads |
Indexing / Stats | ✅ B-tree, Bitmap | ❌ Not supported | ⚠️ Z-Order |
Schema Evolution | ✅ ALTER TABLE | ⚠️ Manual handling | ✅ Auto evolution |
Stored Procedures | ✅ PL/SQL | ❌ None | ⚠️ Not supported |
Access Control | ✅ Roles & Privs | ❌ Depends on HDFS/S3 | ✅ Unity Catalog |
Audit Logs | ✅ Triggers | ❌ None | ✅ _delta_log |
Summary: Key Takeaways
System | Strengths | Weaknesses |
---|---|---|
Traditional RDBMS | Strong transactional guarantees, mature SQL support, easy SCD1/SCD2 | Not distributed, not suited for Big Data scale |
Vanilla PySpark (Parquet) | Scales massively, good for batch processing | No ACID, no native SCD support, no in-place updates/deletes, no concurrency guarantees |
Databricks (Delta Lake) | Distributed + ACID, Upserts, Time Travel, Audit | Slight learning curve, needs Delta format |