This is how `PIVOT` and `UNPIVOT` work in PySpark with Spark SQL, using the official syntax.
1. PIVOT in Spark SQL
The `PIVOT` operation in Spark SQL converts rows into columns based on the values in a specific column. It is useful for summarizing and reshaping data into a wide, tabular format.
Syntax:
```sql
SELECT * FROM table_name
PIVOT (
    <aggregate_function>(<column_name>)
    FOR <pivot_column> IN (<pivot_values>)
)
```
Example: Pivoting Sales Data
Consider the following sales table:
| region | product | sales_amount | year |
|---|---|---|---|
| North | Product A | 100 | 2023 |
| North | Product B | 150 | 2023 |
| South | Product A | 200 | 2023 |
| South | Product B | 250 | 2023 |
You can pivot this data by `product` to create columns for `Product A` and `Product B`. Because every remaining non-aggregate column becomes an implicit grouping column, we first select only `region`, `product`, and `sales_amount` so that `year` does not split the groups:
```sql
SELECT * FROM (
    SELECT region, product, sales_amount FROM sales_data
)
PIVOT (
    SUM(sales_amount)
    FOR product IN ('Product A', 'Product B')
)
```
Result:
| region | Product A | Product B |
|---|---|---|
| North | 100 | 150 |
| South | 200 | 250 |
In this query:
- We are pivoting on the `product` column, creating two new columns: `Product A` and `Product B`.
- The aggregation function used is `SUM()`, which summarizes `sales_amount` for each product.
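For comparison, the same pivot can be written with the PySpark DataFrame API. This is a minimal sketch, assuming an active `SparkSession` named `spark`; the sample data and variable names are illustrative:
```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("pivot-demo").getOrCreate()

# Recreate the sales_data table from above (illustrative setup).
sales_data = spark.createDataFrame(
    [
        ("North", "Product A", 100, 2023),
        ("North", "Product B", 150, 2023),
        ("South", "Product A", 200, 2023),
        ("South", "Product B", 250, 2023),
    ],
    ["region", "product", "sales_amount", "year"],
)

# groupBy + pivot + agg mirrors the SQL PIVOT clause.
pivoted = (
    sales_data
    .groupBy("region")
    .pivot("product", ["Product A", "Product B"])
    .agg(F.sum("sales_amount"))
)
pivoted.show()
```
Passing the pivot values explicitly to `pivot()` is optional, but it saves Spark an extra pass over the data to discover the distinct values of `product`.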
2. UNPIVOT in Spark SQL
The `UNPIVOT` operation in Spark SQL (available since Spark 3.4) converts columns into rows. It is typically used to transform a wide table back into a tall one.
Syntax:
```sql
SELECT * FROM table_name
UNPIVOT (
    <value_column>
    FOR <name_column> IN (<columns_to_unpivot>)
)
```
Example: Unpivoting the Pivoted Sales Data
Let’s take the pivoted table from the previous example and unpivot it back to its original format.
| region | Product A | Product B |
|---|---|---|
| North | 100 | 150 |
| South | 200 | 250 |
We want to unpivot the `Product A` and `Product B` columns back into rows.
```sql
SELECT * FROM pivoted_sales
UNPIVOT (
    sales_amount
    FOR product IN (`Product A`, `Product B`)
)
```
Result:
| region | product | sales_amount |
|---|---|---|
| North | Product A | 100 |
| North | Product B | 150 |
| South | Product A | 200 |
| South | Product B | 250 |
In this query:
- We're converting the `Product A` and `Product B` columns back into rows, with `product` as a new column.
- `sales_amount` holds the value for each unpivoted row.
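The DataFrame API equivalent is `DataFrame.unpivot` (also aliased as `melt`), added in Spark 3.4. A minimal sketch, reusing the `pivoted` DataFrame from the earlier snippet:
```python
# unpivot(ids, values, variableColumnName, valueColumnName) reverses the pivot:
# each listed value column becomes one row per id, labeled in the name column.
unpivoted = pivoted.unpivot(
    ids=["region"],
    values=["Product A", "Product B"],
    variableColumnName="product",
    valueColumnName="sales_amount",
)
unpivoted.show()
```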
SAS PROC TRANSPOSE and Its Equivalent in PySpark
SAS PROC TRANSPOSE is commonly used for pivoting and unpivoting in SAS. As we now know, we can achieve the same functionality in Spark SQL with the `PIVOT` and `UNPIVOT` operations.
SAS PROC TRANSPOSE Example:
```sas
PROC TRANSPOSE DATA=sales_data OUT=pivoted_sales;
    BY region;
    ID product;
    VAR sales_amount;
RUN;
```
Equivalent Spark SQL PIVOT:
```sql
SELECT * FROM (
    SELECT region, product, sales_amount FROM sales_data
)
PIVOT (
    SUM(sales_amount)
    FOR product IN ('Product A', 'Product B')
)
```
SAS PROC TRANSPOSE for Unpivot:
To unpivot in SAS, you reverse the process with additional steps. In Spark SQL, you can use the `UNPIVOT` operation directly:
```sql
SELECT * FROM pivoted_sales
UNPIVOT (
    sales_amount
    FOR product IN (`Product A`, `Product B`)
)
```
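On Spark versions earlier than 3.4, which lack the `UNPIVOT` clause, the same reshape can be expressed with the `stack` generator function. A minimal sketch in PySpark, applied to the illustrative `pivoted` DataFrame from earlier:
```python
# stack(n, label1, col1, label2, col2, ...) emits n rows per input row,
# pairing each label with the value of its column.
unpivoted_legacy = pivoted.selectExpr(
    "region",
    "stack(2, 'Product A', `Product A`, 'Product B', `Product B`) "
    "AS (product, sales_amount)",
)
unpivoted_legacy.show()
```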
- PySpark and Spark SQL indeed have `PIVOT` and `UNPIVOT` functionality, and they can be used much like the equivalents in other databases or tools such as SAS PROC TRANSPOSE.
- The `PIVOT` operation reshapes data by turning rows into columns, while `UNPIVOT` reverses this process by converting columns back into rows.
We can refer to the official Spark documentation for more details.