This is how `PIVOT` and `UNPIVOT` work in PySpark with Spark SQL, using the official syntax.
1. PIVOT in Spark SQL
The `PIVOT` operation in Spark SQL converts rows into columns based on the values in a specific column. It is useful for summarizing and reshaping data into a wide, tabular format.
Syntax:
```sql
SELECT * FROM table_name
PIVOT (
    <aggregate_function>(<column_name>)
    FOR <pivot_column> IN (<pivot_values>)
)
```
Example: Pivoting Sales Data
Consider the following sales table:
| region | product | sales_amount | year |
|---|---|---|---|
| North | Product A | 100 | 2023 |
| North | Product B | 150 | 2023 |
| South | Product A | 200 | 2023 |
| South | Product B | 250 | 2023 |
You can pivot this data by `product` to create columns for `Product A` and `Product B`. Because every remaining non-aggregate column becomes an implicit grouping column, we first select only `region`, `product`, and `sales_amount` so that `year` does not split the groups:
```sql
SELECT * FROM (
    SELECT region, product, sales_amount FROM sales_data
)
PIVOT (
    SUM(sales_amount)
    FOR product IN ('Product A', 'Product B')
)
```
Result:
| region | Product A | Product B |
|---|---|---|
| North | 100 | 150 |
| South | 200 | 250 |
In this query:
- We are pivoting on the `product` column, creating two new columns: `Product A` and `Product B`.
- The aggregation function used is `SUM()`, which summarizes `sales_amount` for each product.
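For comparison, the same pivot can be written with the PySpark DataFrame API. This is a minimal sketch, assuming an active `SparkSession` named `spark`; the sample data and variable names are illustrative:
```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("pivot-demo").getOrCreate()

# Recreate the sales_data table from above (illustrative setup).
sales_data = spark.createDataFrame(
    [
        ("North", "Product A", 100, 2023),
        ("North", "Product B", 150, 2023),
        ("South", "Product A", 200, 2023),
        ("South", "Product B", 250, 2023),
    ],
    ["region", "product", "sales_amount", "year"],
)

# groupBy + pivot + agg mirrors the SQL PIVOT clause.
pivoted = (
    sales_data
    .groupBy("region")
    .pivot("product", ["Product A", "Product B"])
    .agg(F.sum("sales_amount"))
)
pivoted.show()
```
Passing the pivot values explicitly to `pivot()` is optional, but it saves Spark an extra pass over the data to discover the distinct values of `product`.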
2. UNPIVOT in Spark SQL
The `UNPIVOT` operation in Spark SQL (available since Spark 3.4) converts columns into rows. It is typically used to transform a wide table back into a tall one.
Syntax:
```sql
SELECT * FROM table_name
UNPIVOT (
    <value_column>
    FOR <name_column> IN (<columns_to_unpivot>)
)
```
Example: Unpivoting the Pivoted Sales Data
Let’s take the pivoted table from the previous example and unpivot it back to its original format.
| region | Product A | Product B |
|---|---|---|
| North | 100 | 150 |
| South | 200 | 250 |
We want to unpivot the `Product A` and `Product B` columns back into rows.
```sql
SELECT * FROM pivoted_sales
UNPIVOT (
    sales_amount
    FOR product IN (`Product A`, `Product B`)
)
```
Result:
| region | product | sales_amount |
|---|---|---|
| North | Product A | 100 |
| North | Product B | 150 |
| South | Product A | 200 |
| South | Product B | 250 |
In this query:
- We're converting the `Product A` and `Product B` columns back into rows, with `product` as a new column.
- `sales_amount` holds the value for each unpivoted row.
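The DataFrame API equivalent is `DataFrame.unpivot` (also aliased as `melt`), added in Spark 3.4. A minimal sketch, reusing the `pivoted` DataFrame from the earlier snippet:
```python
# unpivot(ids, values, variableColumnName, valueColumnName) reverses the pivot:
# each listed value column becomes one row per id, labeled in the name column.
unpivoted = pivoted.unpivot(
    ids=["region"],
    values=["Product A", "Product B"],
    variableColumnName="product",
    valueColumnName="sales_amount",
)
unpivoted.show()
```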
SAS PROC TRANSPOSE and Its Equivalent in PySpark
SAS PROC TRANSPOSE is commonly used for pivoting and unpivoting in SAS. As we now know, we can achieve the same functionality in Spark SQL with the `PIVOT` and `UNPIVOT` operations.
SAS PROC TRANSPOSE Example:
```sas
PROC TRANSPOSE DATA=sales_data OUT=pivoted_sales;
    BY region;
    ID product;
    VAR sales_amount;
RUN;
```
Equivalent Spark SQL PIVOT:
```sql
SELECT * FROM (
    SELECT region, product, sales_amount FROM sales_data
)
PIVOT (
    SUM(sales_amount)
    FOR product IN ('Product A', 'Product B')
)
```
SAS PROC TRANSPOSE for Unpivot:
To unpivot in SAS, you reverse the process with additional steps. In Spark SQL, you can use the `UNPIVOT` operation directly:
```sql
SELECT * FROM pivoted_sales
UNPIVOT (
    sales_amount
    FOR product IN (`Product A`, `Product B`)
)
```
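On Spark versions earlier than 3.4, which lack the `UNPIVOT` clause, the same reshape can be expressed with the `stack` generator function. A minimal sketch in PySpark, applied to the illustrative `pivoted` DataFrame from earlier:
```python
# stack(n, label1, col1, label2, col2, ...) emits n rows per input row,
# pairing each label with the value of its column.
unpivoted_legacy = pivoted.selectExpr(
    "region",
    "stack(2, 'Product A', `Product A`, 'Product B', `Product B`) "
    "AS (product, sales_amount)",
)
unpivoted_legacy.show()
```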
- PySpark and Spark SQL indeed have `PIVOT` and `UNPIVOT` functionality, and they can be used much like the equivalents in other databases or tools such as SAS PROC TRANSPOSE.
- The `PIVOT` operation reshapes data by turning rows into columns, while `UNPIVOT` reverses this process by converting columns back into rows.
We can refer to the official Spark documentation for more details.