Here's a clear breakdown of PIVOT and UNPIVOT in SQL and Spark SQL, along with use cases and examples.
🔄 What is PIVOT?
PIVOT transforms rows into columns.
It is useful for summarizing or grouping data to make it more readable.
✅ Use Case:
Show total sales by region across different quarters as columns.
🧱 Sample Input Table: sales_data
| region | quarter | sales |
|--------|---------|-------|
| East   | Q1      | 100   |
| East   | Q2      | 120   |
| West   | Q1      | 90    |
| West   | Q2      | 130   |
✅ SQL PIVOT (SQL Server syntax; Oracle's PIVOT is similar but quotes the IN values, e.g. 'Q1' AS Q1):
```sql
SELECT *
FROM (
    SELECT region, quarter, sales
    FROM sales_data
) AS src
PIVOT (
    SUM(sales)
    FOR quarter IN ([Q1], [Q2])
) AS pvt;
```
➤ Output:
| region | Q1  | Q2  |
|--------|-----|-----|
| East   | 100 | 120 |
| West   | 90  | 130 |
✅ Spark SQL (two ways)
🅰️ SQL-style (since Spark 2.4):
```sql
-- register sales_data as a temp view first
SELECT *
FROM (
    SELECT region, quarter, sales
    FROM sales_data
)
PIVOT (
    SUM(sales) FOR quarter IN ('Q1', 'Q2')
)
```
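The query assumes `sales_data` is already registered as a view. A minimal sketch of doing that from PySpark, assuming an existing SparkSession `spark` and a DataFrame `df` with the columns above:

```python
# Register the DataFrame so Spark SQL can see it, then run the pivot query
df.createOrReplaceTempView("sales_data")

pivoted = spark.sql("""
    SELECT *
    FROM (SELECT region, quarter, sales FROM sales_data)
    PIVOT (SUM(sales) FOR quarter IN ('Q1', 'Q2'))
""")
pivoted.show()
```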
🅱️ DataFrame API:

```python
from pyspark.sql import functions as F

df.groupBy("region") \
  .pivot("quarter", ["Q1", "Q2"]) \
  .agg(F.sum("sales"))
```
🔄 What is UNPIVOT?
UNPIVOT transforms columns into rows.
It's helpful when you want to normalize wide tables into a long format.
✅ Use Case:
Convert a wide report (with Q1, Q2, Q3…) into a long format for easier analysis or plotting.
🧱 Input Table (wide format):
| region | Q1  | Q2  |
|--------|-----|-----|
| East   | 100 | 120 |
| West   | 90  | 130 |
✅ SQL UNPIVOT (SQL Server):
```sql
SELECT region, quarter, sales
FROM sales_data
UNPIVOT (
    sales FOR quarter IN (Q1, Q2)
) AS unpvt;
```
➤ Output:
| region | quarter | sales |
|--------|---------|-------|
| East   | Q1      | 100   |
| East   | Q2      | 120   |
| West   | Q1      | 90    |
| West   | Q2      | 130   |
❌ Spark SQL has no UNPIVOT clause before version 3.4.
✅ Alternative: the stack() function or explode(array(...)).
```sql
-- Spark SQL using `stack`
SELECT region,
       stack(2, 'Q1', Q1, 'Q2', Q2) AS (quarter, sales)
FROM sales_data
```
Or using the DataFrame API:

```python
df.selectExpr("region", "stack(2, 'Q1', Q1, 'Q2', Q2) as (quarter, sales)")
```
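Since Spark 3.4, there is also a native DataFrame.unpivot (aliased as melt), so the workaround above is only needed on older versions. A minimal sketch, assuming the wide `df` with columns region, Q1, Q2:

```python
# Spark 3.4+ only: built-in unpivot (also available as melt)
long_df = df.unpivot(
    ids=["region"],           # identifier columns to keep as-is
    values=["Q1", "Q2"],      # columns to melt into rows
    variableColumnName="quarter",
    valueColumnName="sales",
)
long_df.show()
```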
✅ Summary Table
| Operation | SQL Syntax     | Spark SQL                              | PySpark API                                     |
|-----------|----------------|----------------------------------------|-------------------------------------------------|
| PIVOT     | PIVOT(...)     | PIVOT(...)                             | groupBy().pivot().agg()                         |
| UNPIVOT   | UNPIVOT(...)   | ❌ Not supported before 3.4            | ✅ Use stack() or explode(); unpivot() in 3.4+  |
💡 Real-World Use Cases
| Use Case                               | PIVOT or UNPIVOT |
|----------------------------------------|------------------|
| Monthly sales summary by region        | PIVOT            |
| Reshaping wide Excel files             | UNPIVOT          |
| Preparing input for ML algorithms      | UNPIVOT          |
| Showing KPI trends across departments  | PIVOT            |
| Dashboard reports (time-series)        | PIVOT            |
Yes, both PIVOT and UNPIVOT operations can be done in the PySpark DataFrame API, although:
- ✅ PIVOT is directly supported.
- ⚠️ UNPIVOT has no built-in function before Spark 3.4, but can be mimicked using selectExpr, stack(), or explode().
✅ 1. PIVOT in PySpark DataFrame API
Example:
```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

data = [
    ("East", "Q1", 100),
    ("East", "Q2", 120),
    ("West", "Q1", 90),
    ("West", "Q2", 130),
]
df = spark.createDataFrame(data, ["region", "quarter", "sales"])

# Pivot the quarters into columns
pivot_df = df.groupBy("region").pivot("quarter", ["Q1", "Q2"]).agg(F.sum("sales"))
pivot_df.show()
```
Output:

```
+------+---+---+
|region| Q1| Q2|
+------+---+---+
|  East|100|120|
|  West| 90|130|
+------+---+---+
```
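Note that the value list in pivot("quarter", ["Q1", "Q2"]) is optional; if you omit it, Spark first runs an extra job to collect the distinct quarter values, so pinning the list is faster and keeps the output column order deterministic:

```python
# Also works, but triggers an extra pass to discover the distinct quarters
df.groupBy("region").pivot("quarter").agg(F.sum("sales"))
```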
🔁 2. UNPIVOT in PySpark (workaround using selectExpr + stack())
PySpark has no .unpivot() before 3.4, but we can simulate it.
Example:
```python
data = [
    ("East", 100, 120),
    ("West", 90, 130),
]
df = spark.createDataFrame(data, ["region", "Q1", "Q2"])

# Use stack to unpivot
unpivot_df = df.selectExpr(
    "region",
    "stack(2, 'Q1', Q1, 'Q2', Q2) as (quarter, sales)"
)
unpivot_df.show()
```
Output:

```
+------+-------+-----+
|region|quarter|sales|
+------+-------+-----+
|  East|     Q1|  100|
|  East|     Q2|  120|
|  West|     Q1|   90|
|  West|     Q2|  130|
+------+-------+-----+
```
Explanation of stack(2, 'Q1', Q1, 'Q2', Q2):
- 2 → number of columns being unpivoted (rows produced per input row)
- 'Q1', Q1 → label + column
- 'Q2', Q2 → label + column
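To generalize this to many columns, you can build the stack() expression dynamically. A sketch with a hypothetical helper (unpivot_via_stack is our own name, not a PySpark API):

```python
from pyspark.sql import DataFrame

def unpivot_via_stack(df: DataFrame, id_cols, value_cols,
                      var_name="quarter", val_name="sales") -> DataFrame:
    # Build "stack(n, 'c1', `c1`, 'c2', `c2`, ...) as (var, val)"
    pairs = ", ".join(f"'{c}', `{c}`" for c in value_cols)
    stack_expr = f"stack({len(value_cols)}, {pairs}) as ({var_name}, {val_name})"
    return df.selectExpr(*id_cols, stack_expr)

# Usage: unpivot_via_stack(df, ["region"], ["Q1", "Q2"]).show()
```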
🔧 Alternative UNPIVOT with explode()
```python
from pyspark.sql.functions import array, struct, explode

df2 = df.select("region", explode(array(
    struct(F.lit("Q1").alias("quarter"), F.col("Q1").alias("sales")),
    struct(F.lit("Q2").alias("quarter"), F.col("Q2").alias("sales"))
)).alias("pivoted"))
df2 = df2.select("region", "pivoted.quarter", "pivoted.sales")
df2.show()
```
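This yields the same long-format result as the stack() version; the explode(array(struct(...))) pattern is mainly useful when each struct needs to carry more than one value field.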
📌 Summary
| Operation  | PySpark DataFrame API                                          |
|------------|----------------------------------------------------------------|
| ✅ PIVOT   | .groupBy().pivot().agg()                                       |
| ⚠️ UNPIVOT | Use selectExpr + stack() or explode() (native unpivot() in 3.4+) |
Let me know if you want a reusable function for pivot/unpivot or want to generalize it to many columns dynamically.