Here’s a clear breakdown of PIVOT and UNPIVOT in SQL and Spark SQL, along with use cases and examples.


🔄 What is PIVOT?

PIVOT transforms rows into columns.
It is useful for summarizing or grouping data to make it more readable.
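To make the transformation concrete, here is a minimal pure-Python sketch of what a pivot does conceptually (the rows and the SUM aggregation mirror the sales example below; this is plain Python, not Spark):

```python
from collections import defaultdict

rows = [
    ("East", "Q1", 100),
    ("East", "Q2", 120),
    ("West", "Q1", 90),
    ("West", "Q2", 130),
]

# Pivot: group by region, turn each distinct quarter into a "column",
# aggregating sales with SUM (summing handles duplicate quarter rows).
pivoted = defaultdict(lambda: defaultdict(int))
for region, quarter, sales in rows:
    pivoted[region][quarter] += sales

print(dict(pivoted["East"]))  # {'Q1': 100, 'Q2': 120}
```

Each inner dict is one output row, with the former `quarter` values now acting as column names.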

✅ Use Case:

Show total sales by region across different quarters as columns.


🧱 Sample Input Table: sales_data

| region | quarter | sales |
|--------|---------|-------|
| East   | Q1      | 100   |
| East   | Q2      | 120   |
| West   | Q1      | 90    |
| West   | Q2      | 130   |

✅ SQL PIVOT (SQL Server syntax; Oracle's PIVOT is similar but quotes and optionally aliases the IN values):

SELECT *
FROM (
  SELECT region, quarter, sales
  FROM sales_data
) AS src
PIVOT (
  SUM(sales)
  FOR quarter IN ([Q1], [Q2])
) AS pvt;

➀ Output:

| region | Q1  | Q2  |
|--------|-----|-----|
| East   | 100 | 120 |
| West   | 90  | 130 |

✅ Spark SQL (2 ways)

πŸ…°οΈ SQL-style (since Spark 2.4):

-- register sales_data as temp view first
SELECT *
FROM (
  SELECT region, quarter, sales
  FROM sales_data
)
PIVOT (
  SUM(sales) FOR quarter IN ('Q1', 'Q2')
)

πŸ…±οΈ DataFrame API:

# assumes: from pyspark.sql import functions as F
df.groupBy("region") \
  .pivot("quarter", ["Q1", "Q2"]) \
  .agg(F.sum("sales"))

πŸ” What is UNPIVOT?

UNPIVOT transforms columns into rows.
It’s helpful when you want to normalize a wide table into a long format.
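Conceptually, unpivoting is just a nested loop over rows and column names. A minimal pure-Python sketch (the data mirrors the wide table below; this is plain Python, not Spark):

```python
wide = [
    {"region": "East", "Q1": 100, "Q2": 120},
    {"region": "West", "Q1": 90, "Q2": 130},
]

# Unpivot: each (column name, value) pair becomes its own output row.
long_rows = [
    (row["region"], quarter, row[quarter])
    for row in wide
    for quarter in ("Q1", "Q2")
]

print(long_rows[0])  # ('East', 'Q1', 100)
```

Two input rows with two value columns each yield four long-format rows.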


✅ Use Case:

Convert a wide report (with Q1, Q2, Q3…) into a long format for easier analysis or plotting.


🧱 Input Table (Wide format):

| region | Q1  | Q2  |
|--------|-----|-----|
| East   | 100 | 120 |
| West   | 90  | 130 |

✅ SQL UNPIVOT (SQL Server):

SELECT region, quarter, sales
FROM sales_data
UNPIVOT (
  sales FOR quarter IN (Q1, Q2)
) AS unpvt;

➀ Output:

| region | quarter | sales |
|--------|---------|-------|
| East   | Q1      | 100   |
| East   | Q2      | 120   |
| West   | Q1      | 90    |
| West   | Q2      | 130   |

❌ Spark SQL has no native UNPIVOT clause before Spark 3.4 (which added one).

✅ Alternative for earlier versions: the stack() function or explode(array(...))

-- Spark SQL using `stack`
SELECT region,
       stack(2, 'Q1', Q1, 'Q2', Q2) AS (quarter, sales)
FROM sales_data

Or using DataFrame API:

df.selectExpr("region", "stack(2, 'Q1', Q1, 'Q2', Q2) as (quarter, sales)")

✅ Summary Table

| Operation | SQL Syntax     | Spark SQL                              | PySpark API                    |
|-----------|----------------|----------------------------------------|--------------------------------|
| PIVOT     | `PIVOT(...)`   | `PIVOT(...)` (since 2.4)               | `groupBy().pivot().agg()`      |
| UNPIVOT   | `UNPIVOT(...)` | ❌ before 3.4; use `stack()`           | ✅ Use `stack()` or `explode()` |

💡 Real-World Use Cases

| Use Case                              | PIVOT or UNPIVOT |
|---------------------------------------|------------------|
| Monthly sales summary by region       | PIVOT            |
| Reshaping wide Excel files            | UNPIVOT          |
| Preparing input for ML algorithms     | UNPIVOT          |
| Showing KPI trends across departments | PIVOT            |
| Dashboard reports (time-series)       | PIVOT            |

Yes, both PIVOT and UNPIVOT operations can be done in the PySpark DataFrame API, although:

  • ✅ PIVOT is directly supported.
  • ⚠️ UNPIVOT has no dedicated function before Spark 3.4 (which adds DataFrame.unpivot, aliased melt); on earlier versions it can be mimicked using selectExpr with stack(), or explode().

✅ 1. PIVOT in PySpark DataFrame API

Example:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

data = [
    ("East", "Q1", 100),
    ("East", "Q2", 120),
    ("West", "Q1", 90),
    ("West", "Q2", 130),
]

df = spark.createDataFrame(data, ["region", "quarter", "sales"])

# Pivot the quarters into columns
pivot_df = df.groupBy("region").pivot("quarter", ["Q1", "Q2"]).agg(F.sum("sales"))

pivot_df.show()

Output:

+------+---+---+
|region| Q1| Q2|
+------+---+---+
|  East|100|120|
|  West| 90|130|
+------+---+---+

πŸ” 2. UNPIVOT in PySpark (workaround using selectExpr + stack())

PySpark only gained a native .unpivot() (alias melt) in Spark 3.4; on earlier versions we can simulate it.

Example:

data = [
    ("East", 100, 120),
    ("West", 90, 130),
]
df = spark.createDataFrame(data, ["region", "Q1", "Q2"])

# Use stack to unpivot
unpivot_df = df.selectExpr(
    "region",
    "stack(2, 'Q1', Q1, 'Q2', Q2) as (quarter, sales)"
)

unpivot_df.show()

Output:

+------+-------+-----+
|region|quarter|sales|
+------+-------+-----+
|  East|     Q1|  100|
|  East|     Q2|  120|
|  West|     Q1|   90|
|  West|     Q2|  130|
+------+-------+-----+

Explanation:

stack(2, 'Q1', Q1, 'Q2', Q2)
  • 2 → number of rows to produce (one per unpivoted column here)
  • 'Q1', Q1 → label literal paired with the Q1 column’s value
  • 'Q2', Q2 → label literal paired with the Q2 column’s value
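To see exactly what stack is doing, here is a pure-Python model of its semantics (a sketch, not Spark code): stack(n, e1, e2, …) distributes the expressions across n rows.

```python
def stack(n, *exprs):
    """Pure-Python model of Spark SQL's stack(): split the supplied
    expressions into n rows of equal width. (Spark pads a ragged last
    row with NULLs; this sketch assumes the count divides evenly.)"""
    width = len(exprs) // n
    return [tuple(exprs[i * width:(i + 1) * width]) for i in range(n)]

# Mirrors stack(2, 'Q1', Q1, 'Q2', Q2) for the East row (Q1=100, Q2=120):
print(stack(2, "Q1", 100, "Q2", 120))  # [('Q1', 100), ('Q2', 120)]
```

With n equal to the number of label/value pairs, each pair becomes one two-field row, which is why the example produces (quarter, sales) tuples.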

🧠 Alternative UNPIVOT with explode()

from pyspark.sql import functions as F
from pyspark.sql.functions import array, explode, struct

# Each struct pairs a quarter label with its value; explode turns the
# array of structs into one row per quarter.
df2 = df.select("region", explode(array(
    struct(F.lit("Q1").alias("quarter"), F.col("Q1").alias("sales")),
    struct(F.lit("Q2").alias("quarter"), F.col("Q2").alias("sales"))
)).alias("pivoted"))

df2 = df2.select("region", "pivoted.quarter", "pivoted.sales")
df2.show()

🔚 Summary

| Operation   | PySpark DataFrame API                                            |
|-------------|------------------------------------------------------------------|
| ✅ PIVOT    | `.groupBy().pivot().agg()`                                       |
| ⚠️ UNPIVOT  | `selectExpr` + `stack()` or `explode()` (native `.unpivot()` in 3.4+) |
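One way to generalize the stack() workaround to many columns is to build the expression string from a list of column names. A hedged sketch (the helper name build_stack_expr is our own, and it assumes simple column names that need no backtick quoting):

```python
def build_stack_expr(cols, key="quarter", value="sales"):
    """Build a Spark SQL stack() expression that unpivots `cols` into
    (key, value) rows. Assumes plain, unquoted column names."""
    pairs = ", ".join(f"'{c}', {c}" for c in cols)
    return f"stack({len(cols)}, {pairs}) as ({key}, {value})"

print(build_stack_expr(["Q1", "Q2"]))
# stack(2, 'Q1', Q1, 'Q2', Q2) as (quarter, sales)
```

Usage would then be something like `df.selectExpr("region", build_stack_expr(["Q1", "Q2", "Q3", "Q4"]))`, so adding a quarter means changing only the list.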

Let me know if you want a reusable function for pivot/unpivot or want to generalize it to many columns dynamically.
