Here's a clear breakdown of PIVOT and UNPIVOT in SQL and Spark SQL, along with use cases and examples.
🔄 What is PIVOT?
PIVOT transforms rows into columns.
It is useful for summarizing or grouping data to make it more readable.
✅ Use Case:
Show total sales by region across different quarters as columns.
🧱 Sample Input Table: sales_data
| region | quarter | sales |
|--------|---------|-------|
| East   | Q1      | 100   |
| East   | Q2      | 120   |
| West   | Q1      | 90    |
| West   | Q2      | 130   |
✅ SQL PIVOT (SQL Server syntax; Oracle's PIVOT is similar but quotes the IN values, e.g. 'Q1' AS Q1):
```sql
SELECT *
FROM (
    SELECT region, quarter, sales
    FROM sales_data
) AS src
PIVOT (
    SUM(sales)
    FOR quarter IN ([Q1], [Q2])
) AS pvt;
```
➤ Output:
| region | Q1  | Q2  |
|--------|-----|-----|
| East   | 100 | 120 |
| West   | 90  | 130 |
✅ Spark SQL (two ways)
🅰️ SQL-style (since Spark 2.4):
```sql
-- register sales_data as a temp view first
SELECT *
FROM (
    SELECT region, quarter, sales
    FROM sales_data
)
PIVOT (
    SUM(sales) FOR quarter IN ('Q1', 'Q2')
)
```
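The query assumes `sales_data` is already registered as a view. A minimal sketch of doing that from PySpark, assuming an existing SparkSession `spark` and a DataFrame `df` with the columns above:

```python
# Register the DataFrame so Spark SQL can see it, then run the pivot query
df.createOrReplaceTempView("sales_data")

pivoted = spark.sql("""
    SELECT *
    FROM (SELECT region, quarter, sales FROM sales_data)
    PIVOT (SUM(sales) FOR quarter IN ('Q1', 'Q2'))
""")
pivoted.show()
```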
🅱️ DataFrame API:

```python
from pyspark.sql import functions as F

df.groupBy("region") \
  .pivot("quarter", ["Q1", "Q2"]) \
  .agg(F.sum("sales"))
```
🔄 What is UNPIVOT?
UNPIVOT transforms columns into rows.
It's helpful when you want to normalize wide tables into a long format.
✅ Use Case:
Convert a wide report (with Q1, Q2, Q3…) into a long format for easier analysis or plotting.
🧱 Input Table (wide format):
| region | Q1  | Q2  |
|--------|-----|-----|
| East   | 100 | 120 |
| West   | 90  | 130 |
✅ SQL UNPIVOT (SQL Server):
```sql
SELECT region, quarter, sales
FROM sales_data
UNPIVOT (
    sales FOR quarter IN (Q1, Q2)
) AS unpvt;
```
➤ Output:
| region | quarter | sales |
|--------|---------|-------|
| East   | Q1      | 100   |
| East   | Q2      | 120   |
| West   | Q1      | 90    |
| West   | Q2      | 130   |
❌ Spark SQL has no UNPIVOT clause before version 3.4.
✅ Alternative: the stack() function or explode(array(...)).
```sql
-- Spark SQL using `stack`
SELECT region,
       stack(2, 'Q1', Q1, 'Q2', Q2) AS (quarter, sales)
FROM sales_data
```
Or using the DataFrame API:

```python
df.selectExpr("region", "stack(2, 'Q1', Q1, 'Q2', Q2) as (quarter, sales)")
```
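Since Spark 3.4, there is also a native DataFrame.unpivot (aliased as melt), so the workaround above is only needed on older versions. A minimal sketch, assuming the wide `df` with columns region, Q1, Q2:

```python
# Spark 3.4+ only: built-in unpivot (also available as melt)
long_df = df.unpivot(
    ids=["region"],           # identifier columns to keep as-is
    values=["Q1", "Q2"],      # columns to melt into rows
    variableColumnName="quarter",
    valueColumnName="sales",
)
long_df.show()
```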
✅ Summary Table
| Operation | SQL Syntax     | Spark SQL                              | PySpark API                                     |
|-----------|----------------|----------------------------------------|-------------------------------------------------|
| PIVOT     | PIVOT(...)     | PIVOT(...)                             | groupBy().pivot().agg()                         |
| UNPIVOT   | UNPIVOT(...)   | ❌ Not supported before 3.4            | ✅ Use stack() or explode(); unpivot() in 3.4+  |
💡 Real-World Use Cases
| Use Case                               | PIVOT or UNPIVOT |
|----------------------------------------|------------------|
| Monthly sales summary by region        | PIVOT            |
| Reshaping wide Excel files             | UNPIVOT          |
| Preparing input for ML algorithms      | UNPIVOT          |
| Showing KPI trends across departments  | PIVOT            |
| Dashboard reports (time-series)        | PIVOT            |
Yes, both PIVOT and UNPIVOT operations can be done in the PySpark DataFrame API, although:
- ✅ PIVOT is directly supported.
- ⚠️ UNPIVOT has no built-in function before Spark 3.4, but can be mimicked using selectExpr, stack(), or explode().
✅ 1. PIVOT in PySpark DataFrame API
Example:
```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

data = [
    ("East", "Q1", 100),
    ("East", "Q2", 120),
    ("West", "Q1", 90),
    ("West", "Q2", 130),
]
df = spark.createDataFrame(data, ["region", "quarter", "sales"])

# Pivot the quarters into columns
pivot_df = df.groupBy("region").pivot("quarter", ["Q1", "Q2"]).agg(F.sum("sales"))
pivot_df.show()
```
Output:

```
+------+---+---+
|region| Q1| Q2|
+------+---+---+
|  East|100|120|
|  West| 90|130|
+------+---+---+
```
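Note that the value list in pivot("quarter", ["Q1", "Q2"]) is optional; if you omit it, Spark first runs an extra job to collect the distinct quarter values, so pinning the list is faster and keeps the output column order deterministic:

```python
# Also works, but triggers an extra pass to discover the distinct quarters
df.groupBy("region").pivot("quarter").agg(F.sum("sales"))
```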
🔁 2. UNPIVOT in PySpark (workaround using selectExpr + stack())
PySpark has no .unpivot() before 3.4, but we can simulate it.
Example:
```python
data = [
    ("East", 100, 120),
    ("West", 90, 130),
]
df = spark.createDataFrame(data, ["region", "Q1", "Q2"])

# Use stack to unpivot
unpivot_df = df.selectExpr(
    "region",
    "stack(2, 'Q1', Q1, 'Q2', Q2) as (quarter, sales)"
)
unpivot_df.show()
```
Output:

```
+------+-------+-----+
|region|quarter|sales|
+------+-------+-----+
|  East|     Q1|  100|
|  East|     Q2|  120|
|  West|     Q1|   90|
|  West|     Q2|  130|
+------+-------+-----+
```
Explanation of stack(2, 'Q1', Q1, 'Q2', Q2):
- 2 → number of columns being unpivoted (rows produced per input row)
- 'Q1', Q1 → label + column
- 'Q2', Q2 → label + column
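To generalize this to many columns, you can build the stack() expression dynamically. A sketch with a hypothetical helper (unpivot_via_stack is our own name, not a PySpark API):

```python
from pyspark.sql import DataFrame

def unpivot_via_stack(df: DataFrame, id_cols, value_cols,
                      var_name="quarter", val_name="sales") -> DataFrame:
    # Build "stack(n, 'c1', `c1`, 'c2', `c2`, ...) as (var, val)"
    pairs = ", ".join(f"'{c}', `{c}`" for c in value_cols)
    stack_expr = f"stack({len(value_cols)}, {pairs}) as ({var_name}, {val_name})"
    return df.selectExpr(*id_cols, stack_expr)

# Usage: unpivot_via_stack(df, ["region"], ["Q1", "Q2"]).show()
```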
🔧 Alternative UNPIVOT with explode()
```python
from pyspark.sql.functions import array, struct, explode

df2 = df.select("region", explode(array(
    struct(F.lit("Q1").alias("quarter"), F.col("Q1").alias("sales")),
    struct(F.lit("Q2").alias("quarter"), F.col("Q2").alias("sales"))
)).alias("pivoted"))
df2 = df2.select("region", "pivoted.quarter", "pivoted.sales")
df2.show()
```
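This yields the same long-format result as the stack() version; the explode(array(struct(...))) pattern is mainly useful when each struct needs to carry more than one value field.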
📌 Summary
| Operation  | PySpark DataFrame API                                          |
|------------|----------------------------------------------------------------|
| ✅ PIVOT   | .groupBy().pivot().agg()                                       |
| ⚠️ UNPIVOT | Use selectExpr + stack() or explode() (native unpivot() in 3.4+) |
Let me know if you want a reusable function for pivot/unpivot or want to generalize it to many columns dynamically.