Window functions in PySpark allow you to perform operations on a subset of your data using a “window” that defines a range of rows. These functions are similar to SQL window functions and are useful for tasks like ranking, cumulative sums, and moving averages. Let’s go through various PySpark DataFrame window functions, compare them with Spark SQL window functions, and provide examples with a large sample dataset.
Setting Up the Environment
First, let’s set up the environment and create a sample dataset.
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
# Note: importing sum this way shadows Python's built-in sum() for the rest of the script
from pyspark.sql.functions import col, row_number, rank, dense_rank, percent_rank, ntile, lag, lead, sum, avg
# Initialize Spark session
spark = SparkSession.builder \
    .appName("PySpark Window Functions") \
    .getOrCreate()
# Create a sample dataset
data = [(1, "Alice", 1000),
        (2, "Bob", 1200),
        (3, "Catherine", 1200),
        (4, "David", 800),
        (5, "Eve", 950),
        (6, "Frank", 800),
        (7, "George", 1200),
        (8, "Hannah", 1000),
        (9, "Ivy", 950),
        (10, "Jack", 1200)]
columns = ["id", "name", "salary"]
df = spark.createDataFrame(data, schema=columns)
df.show()
PySpark Window Functions
1. Row Number
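The row_number function assigns a sequential number, starting at 1, to each row within a window partition, following the window's ordering. The window below partitions the rows by salary and orders each partition by id.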
windowSpec = Window.partitionBy("salary").orderBy("id")
df.withColumn("row_number", row_number().over(windowSpec)).show()
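To make the numbering concrete, here is a minimal plain-Python sketch of what row_number computes for this window (partition by salary, order by id). It is an illustration of the semantics only, not how Spark evaluates the query, and the helper name row_number_by_salary is made up for this example.

```python
from itertools import groupby

# Same sample rows as the setup above: (id, name, salary)
rows = [(1, "Alice", 1000), (2, "Bob", 1200), (3, "Catherine", 1200),
        (4, "David", 800), (5, "Eve", 950), (6, "Frank", 800),
        (7, "George", 1200), (8, "Hannah", 1000), (9, "Ivy", 950),
        (10, "Jack", 1200)]


def row_number_by_salary(rows):
    """Number rows 1, 2, 3, ... within each salary group, ordered by id."""
    # Sort by the partition key (salary) first, then by the ordering key (id)
    ordered = sorted(rows, key=lambda r: (r[2], r[0]))
    result = []
    for _, group in groupby(ordered, key=lambda r: r[2]):
        # Numbering restarts at 1 for every partition
        for n, (id_, name, salary) in enumerate(group, start=1):
            result.append((id_, name, salary, n))
    return result


numbered = row_number_by_salary(rows)
for row in numbered:
    print(row)
```

Each salary group gets its own 1..n sequence, so Jack (the fourth row with salary 1200) ends up with row number 4.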
2. Rank
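The rank function ranks rows within a window partition and leaves gaps after ties: rows that tie on the ordering columns share a rank, and the next rank skips ahead by the number of tied rows. Because id is unique, the window above produces no ties, so rank returns the same values as row_number here.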
df.withColumn("rank", rank().over(windowSpec)).show()
3. Dense Rank
The dense_rank function ranks rows within a window partition without gaps: tied rows share a rank, and the next distinct value gets the next consecutive rank.
df.withColumn("dense_rank", dense_rank().over(windowSpec)).show()
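The difference between rank and dense_rank only shows up when the ordering column has ties, which the window above (ordered by the unique id) never produces. The plain-Python sketch below ranks the ten sample salaries in descending order instead, roughly what a window ordered by salary descending would see. The function name rank_and_dense_rank is hypothetical and the code illustrates the semantics, not Spark's implementation.

```python
def rank_and_dense_rank(values):
    """Return (value, rank, dense_rank) tuples for values sorted descending."""
    ordered = sorted(values, reverse=True)
    out = []
    rank = dense = 0
    prev = object()  # sentinel that never equals a real value
    for position, v in enumerate(ordered, start=1):
        if v != prev:
            rank = position  # rank jumps to the row position, leaving gaps after ties
            dense += 1       # dense_rank advances by one per distinct value
            prev = v
        out.append((v, rank, dense))
    return out


salaries = [1000, 1200, 1200, 800, 950, 800, 1200, 1000, 950, 1200]
ranked = rank_and_dense_rank(salaries)
for row in ranked:
    print(row)
```

The four rows with salary 1200 all get rank 1; the next salary (1000) gets rank 5 but dense rank 2.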
4. Percent Rank
The percent_rank function calculates the relative rank of each row within a window partition as (rank - 1) / (rows in partition - 1), giving a value between 0 and 1.
df.withColumn("percent_rank", percent_rank().over(windowSpec)).show()
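As a quick check of that formula, the sketch below applies it to one partition by hand. It assumes the input list is already sorted by the ordering column and treats a single-row partition as 0.0; the function name percent_rank here is a plain-Python stand-in, not the PySpark function.

```python
def percent_rank(sorted_values):
    """percent_rank for each value of an already-sorted partition."""
    n = len(sorted_values)
    if n == 1:
        return [0.0]  # a single-row partition has nothing to compare against
    ranks = []
    for i, v in enumerate(sorted_values):
        if i > 0 and v == sorted_values[i - 1]:
            ranks.append(ranks[-1])  # tied values share the same rank
        else:
            ranks.append(i + 1)      # otherwise rank equals the 1-based position
    return [(r - 1) / (n - 1) for r in ranks]


# ids of the four rows whose salary is 1200, in window order
ids_1200 = [2, 3, 7, 10]
pr = percent_rank(ids_1200)
print(pr)
```

The first row of a partition always gets 0.0 and, when there are no ties at the end, the last row gets 1.0.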
5. NTile
The ntile function divides the ordered rows within a window partition into n buckets of as equal size as possible and returns each row's bucket number (1 to n).
df.withColumn("ntile", ntile(4).over(windowSpec)).show()
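When the partition size is not a multiple of n, the earlier buckets receive one extra row each. The following sketch shows that distribution in plain Python; the helper name ntile_buckets is made up for illustration and the code only mirrors the bucket arithmetic.

```python
def ntile_buckets(count, n):
    """Bucket number (1..n) for each of `count` ordered rows."""
    base, extra = divmod(count, n)
    buckets = []
    for b in range(1, n + 1):
        # The first `extra` buckets take one additional row
        size = base + (1 if b <= extra else 0)
        buckets.extend([b] * size)
    return buckets


print(ntile_buckets(10, 4))  # ten rows split into four buckets
print(ntile_buckets(2, 4))   # fewer rows than buckets
```

With the sample data, the salary partitions hold only two or four rows each, so ntile(4) simply numbers the rows of each partition 1, 2, and so on.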
6. Lag
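The lag function returns the value from the row a given number of rows before the current row within the window partition, or null when no such row exists. Note that this window partitions by salary, so the lagged salary is either null or equal to the current row's salary.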
df.withColumn("lag", lag("salary", 1).over(windowSpec)).show()
7. Lead
The lead function returns the value from the row a given number of rows after the current row within the window partition, or null when no such row exists.
df.withColumn("lead", lead("salary", 1).over(windowSpec)).show()
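Since lagging or leading the salary inside a salary partition only ever returns the same salary or null, the offsets are easier to see on a column that varies. Here is a small plain-Python sketch using the names of the salary-1200 partition in id order; the functions are simplified stand-ins that assume a non-negative offset.

```python
def lag(values, offset=1, default=None):
    """Value `offset` rows before each position, or `default` if out of range."""
    return [values[i - offset] if i - offset >= 0 else default
            for i in range(len(values))]


def lead(values, offset=1, default=None):
    """Value `offset` rows after each position, or `default` if out of range."""
    return [values[i + offset] if i + offset < len(values) else default
            for i in range(len(values))]


# Names in the salary-1200 partition, ordered by id
names = ["Bob", "Catherine", "George", "Jack"]
print(lag(names))
print(lead(names))
```

The first row has no predecessor and the last row has no successor, which is where Spark returns null.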
8. Cumulative Sum
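The sum function, applied over an ordered window, calculates a cumulative (running) sum within each partition. When a window has an ordering but no explicit frame, the frame defaults to everything from the start of the partition through the current row, including any rows that tie with the current row on the ordering columns.
df.withColumn("cumulative_sum", sum("salary").over(windowSpec)).show()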
9. Moving Average
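The avg function calculates an average over the window frame. With the window above, the frame runs from the start of the partition to the current row, so the result is a running (cumulative) average. For a fixed-width moving average, restrict the frame with rowsBetween, for example windowSpec.rowsBetween(-2, 0) for the current row and the two rows before it.
df.withColumn("moving_avg", avg("salary").over(windowSpec)).show()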
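The frame is what separates a running aggregate from a moving one. The sketch below contrasts the two in plain Python, using made-up values (the list daily is hypothetical, not part of the sample dataset) and a three-row frame comparable to rowsBetween(-2, 0).

```python
def running_sum(values):
    """Sum from the first row through the current row."""
    total, out = 0, []
    for v in values:
        total += v
        out.append(total)
    return out


def running_average(values):
    """Average from the first row through the current row."""
    sums = running_sum(values)
    return [s / (i + 1) for i, s in enumerate(sums)]


def moving_average(values, window=3):
    """Average over the current row and up to `window - 1` rows before it."""
    out = []
    for i in range(len(values)):
        frame = values[max(0, i - window + 1): i + 1]
        out.append(sum(frame) / len(frame))
    return out


daily = [10, 20, 30, 40, 50]  # hypothetical values for illustration
print(running_sum(daily))
print(running_average(daily))
print(moving_average(daily, window=3))
```

The running average keeps every earlier row in the frame, while the moving average drops rows that fall more than two positions behind the current one.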
Comparison with Spark SQL Window Functions
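All the above operations can also be performed using Spark SQL. Register the DataFrame as a temporary view first, for example with df.createOrReplaceTempView("df"), and pass each query to spark.sql(). Here are the equivalent SQL queries: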
1. Row Number
SELECT id, name, salary,
ROW_NUMBER() OVER (PARTITION BY salary ORDER BY id) AS row_number
FROM df
2. Rank
SELECT id, name, salary,
RANK() OVER (PARTITION BY salary ORDER BY id) AS rank
FROM df
3. Dense Rank
SELECT id, name, salary,
DENSE_RANK() OVER (PARTITION BY salary ORDER BY id) AS dense_rank
FROM df
4. Percent Rank
SELECT id, name, salary,
PERCENT_RANK() OVER (PARTITION BY salary ORDER BY id) AS percent_rank
FROM df
5. NTile
SELECT id, name, salary,
NTILE(4) OVER (PARTITION BY salary ORDER BY id) AS ntile
FROM df
6. Lag
SELECT id, name, salary,
LAG(salary, 1) OVER (PARTITION BY salary ORDER BY id) AS lag
FROM df
7. Lead
SELECT id, name, salary,
LEAD(salary, 1) OVER (PARTITION BY salary ORDER BY id) AS lead
FROM df
8. Cumulative Sum
SELECT id, name, salary,
SUM(salary) OVER (PARTITION BY salary ORDER BY id ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cumulative_sum
FROM df
9. Moving Average
SELECT id, name, salary,
AVG(salary) OVER (PARTITION BY salary ORDER BY id ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS moving_avg
FROM df
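If you want to sanity-check the shape of these queries without a Spark cluster, the same window syntax can be tried against Python's built-in sqlite3 module. This is only a stand-in: it assumes the bundled SQLite is version 3.25 or newer (the first release with window functions), and SQLite is a different engine from Spark SQL, so treat it as a rough check of the SQL rather than of Spark's behavior. The aliases rn and running_total are chosen for this sketch.

```python
import sqlite3

# In-memory table named df, holding the same ten sample rows
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE df (id INTEGER, name TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO df VALUES (?, ?, ?)",
    [(1, "Alice", 1000), (2, "Bob", 1200), (3, "Catherine", 1200),
     (4, "David", 800), (5, "Eve", 950), (6, "Frank", 800),
     (7, "George", 1200), (8, "Hannah", 1000), (9, "Ivy", 950),
     (10, "Jack", 1200)],
)

query = """
SELECT id, name, salary,
       ROW_NUMBER() OVER (PARTITION BY salary ORDER BY id) AS rn,
       SUM(salary) OVER (PARTITION BY salary ORDER BY id
                         ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total
FROM df
ORDER BY salary, id
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
conn.close()
```

In Spark itself, the equivalent step is to register the DataFrame as a temporary view and run the query through spark.sql().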
Large Sample Dataset Example
Let’s create a larger dataset and apply window functions.
import random
# Create a larger dataset
large_data = [(i, f"Name_{i}", random.choice([1000, 1200, 950, 800])) for i in range(1, 101)]
large_df = spark.createDataFrame(large_data, schema=columns)
large_df.show(10)
# Apply window functions
large_windowSpec = Window.partitionBy("salary").orderBy("id")
large_df.withColumn("row_number", row_number().over(large_windowSpec)).show(10)
large_df.withColumn("rank", rank().over(large_windowSpec)).show(10)
large_df.withColumn("dense_rank", dense_rank().over(large_windowSpec)).show(10)
large_df.withColumn("percent_rank", percent_rank().over(large_windowSpec)).show(10)
large_df.withColumn("ntile", ntile(4).over(large_windowSpec)).show(10)
large_df.withColumn("lag", lag("salary", 1).over(large_windowSpec)).show(10)
large_df.withColumn("lead", lead("salary", 1).over(large_windowSpec)).show(10)
large_df.withColumn("cumulative_sum", sum("salary").over(large_windowSpec)).show(10)
large_df.withColumn("moving_avg", avg("salary").over(large_windowSpec)).show(10)
Window functions in PySpark and Spark SQL are powerful tools for data analysis. They let you compute a value for each row from a related set of rows (its window) without collapsing those rows into one, which makes ranking, cumulative sums, and moving averages straightforward to implement in your PySpark applications.
Examples
1. Remove duplicates based on specific columns, then order by different columns
To remove duplicates from a PySpark DataFrame based on specific columns and then order the remaining rows by different columns, combine the dropDuplicates() function with orderBy() (or its alias sort()).
Here is an example that demonstrates this process:
- Remove duplicates based on specific columns.
- Order the resulting DataFrame by different columns.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
# Initialize SparkSession
spark = SparkSession.builder.appName("RemoveDuplicatesAndOrder").getOrCreate()
# Sample data
data = [
(1, "Alice", 29),
(2, "Bob", 30),
(3, "Alice", 29),
(4, "David", 35),
(5, "Alice", 25)
]
# Create DataFrame
columns = ["id", "name", "age"]
df = spark.createDataFrame(data, columns)
# Show the original DataFrame
print("Original DataFrame:")
df.show()
# Step 1: Remove duplicates based on specific columns (e.g., "name", "age")
df_no_duplicates = df.dropDuplicates(["name", "age"])
# Step 2: Order the resulting DataFrame by different columns (e.g., "age" in descending order)
df_ordered = df_no_duplicates.orderBy(col("age").desc())
# Show the resulting DataFrame
print("DataFrame after removing duplicates and ordering:")
df_ordered.show()
# Stop SparkSession
spark.stop()
Explanation:
- Initialization and Data Preparation:
  - A SparkSession is created.
  - Sample data is provided, and a DataFrame is created from this data.
- Removing Duplicates:
  - The dropDuplicates() function is used to remove rows that have the same values in the specified columns ("name" and "age" in this case).
- Ordering Data:
  - The orderBy() function is used to sort the DataFrame by the specified columns. In this case, the DataFrame is ordered by "age" in descending order.
- Displaying Results:
  - The original and resulting DataFrames are displayed using the show() function.
Example Output:
Original DataFrame:
+---+-----+---+
| id| name|age|
+---+-----+---+
| 1|Alice| 29|
| 2| Bob| 30|
| 3|Alice| 29|
| 4|David| 35|
| 5|Alice| 25|
+---+-----+---+
DataFrame after removing duplicates and ordering:
+---+-----+---+
| id| name|age|
+---+-----+---+
| 4|David| 35|
| 2| Bob| 30|
| 1|Alice| 29|
| 5|Alice| 25|
+---+-----+---+
Additional Notes:
- dropDuplicates(): This function removes duplicate rows based on the specified columns. If no columns are specified, it removes rows that are identical in all columns. Which row of each duplicate group is kept is not guaranteed, so the surviving id for ("Alice", 29) could be 1 or 3.
- orderBy()/sort(): These functions are used to order the DataFrame. You can specify multiple columns and the sorting order (ascending or descending) for each column.
- You can chain multiple DataFrame operations together. For example, you can combine dropDuplicates() and orderBy() in a single statement: df.dropDuplicates(["name", "age"]).orderBy(col("age").desc())
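The two steps can be mirrored in plain Python to see what they do to the sample rows. This sketch keeps the first row it encounters for each key, which is an assumption made for the illustration; Spark gives no such guarantee about which duplicate survives. The helper name drop_duplicates is a stand-in, not the PySpark method.

```python
# Same sample rows as the example above: (id, name, age)
rows = [(1, "Alice", 29), (2, "Bob", 30), (3, "Alice", 29),
        (4, "David", 35), (5, "Alice", 25)]


def drop_duplicates(rows, key):
    """Keep the first row seen for each key (an assumption of this sketch)."""
    seen, kept = set(), []
    for row in rows:
        k = key(row)
        if k not in seen:
            seen.add(k)
            kept.append(row)
    return kept


# Step 1: de-duplicate on (name, age)
deduped = drop_duplicates(rows, key=lambda r: (r[1], r[2]))
# Step 2: order by age, descending
ordered = sorted(deduped, key=lambda r: r[2], reverse=True)
for row in ordered:
    print(row)
```

Because the key and the sort column are chosen independently, the de-duplication and the final ordering do not have to use the same columns.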
2. Remove duplicates based on specific columns while ordering by different columns
To remove duplicates from a PySpark DataFrame based on specific columns while controlling which row of each duplicate group is kept, you can use window functions in PySpark. Ordering the window by other columns lets you choose the surviving row deterministically.
Below is an example that removes duplicates based on specific columns (name, age) while ordering the rows by different columns (age descending and id ascending):
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import col, row_number
# Initialize SparkSession
spark = SparkSession.builder.appName("RemoveDuplicatesWithOrdering").getOrCreate()
# Sample data
data = [
(1, "Alice", 29),
(2, "Bob", 30),
(3, "Alice", 29),
(4, "David", 35),
(5, "Alice", 25)
]
# Create DataFrame
columns = ["id", "name", "age"]
df = spark.createDataFrame(data, columns)
# Show the original DataFrame
print("Original DataFrame:")
df.show()
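# Define a window: one partition per (name, age) pair, lowest id first
# (age is constant within a partition, so ordering by it there has no effect)
windowSpec = Window.partitionBy("name", "age").orderBy(col("age").desc(), col("id").asc())
# Add a row number to each partition
df_with_row_number = df.withColumn("row_number", row_number().over(windowSpec))
# Keep the first row of each (name, age) group, then sort the result explicitly:
# the window ordering decides which row survives, not the order of the output
df_no_duplicates = (
    df_with_row_number.filter(col("row_number") == 1)
    .drop("row_number")
    .orderBy(col("age").desc(), col("id").asc())
)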
# Show the resulting DataFrame
print("DataFrame after removing duplicates and ordering:")
df_no_duplicates.show()
# Stop SparkSession
spark.stop()
Explanation:
- Initialization and Data Preparation:
  - A SparkSession is created.
  - Sample data is provided, and a DataFrame is created from this data.
- Define Window Specification:
  - A window specification is defined using Window.partitionBy("name", "age").orderBy(col("age").desc(), col("id").asc()). This specifies that the data should be partitioned by the columns name and age and ordered within each partition by age in descending order and id in ascending order. Since age is constant within each partition, id is what actually decides the order.
- Add Row Number:
  - The row_number() function is used to add a row number to each row within the specified window. This row number identifies the first row of each partition.
- Filter Rows:
  - The DataFrame is filtered to keep only the rows where row_number is 1. This removes duplicates and keeps the row with the lowest id in each group. The window ordering does not sort the final DataFrame, so apply orderBy() to the result if the output order matters.
- Display Results:
  - The original and resulting DataFrames are displayed using the show() function.
Example Output:
Original DataFrame:
+---+-----+---+
| id| name|age|
+---+-----+---+
| 1|Alice| 29|
| 2| Bob| 30|
| 3|Alice| 29|
| 4|David| 35|
| 5|Alice| 25|
+---+-----+---+
DataFrame after removing duplicates and ordering:
+---+-----+---+
| id| name|age|
+---+-----+---+
| 4|David| 35|
| 2| Bob| 30|
| 1|Alice| 29|
| 5|Alice| 25|
+---+-----+---+
Notes:
- Window Specification: The Window specification defines how the data should be partitioned and ordered.
- Row Number: The row_number() function assigns a unique row number within each window partition.
- Filter and Drop: The resulting DataFrame is filtered to keep only the rows where row_number is 1, and the temporary row_number column is dropped.
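This approach ensures that duplicates are removed based on the specified columns and that the row kept from each group is chosen by the window ordering rather than left to chance. The order of the final DataFrame still has to be set with orderBy(). You can adjust the partitioning and ordering columns according to your specific requirements.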
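The partition, number, and filter steps above can be sketched in plain Python as follows. The input rows are deliberately listed out of id order to show that the surviving row depends on the ordering key rather than on input order; the helper name keep_first_per_group is hypothetical, and the code illustrates the logic only.

```python
from itertools import groupby

# Sample rows (id, name, age), with id 3 listed before id 1 on purpose
rows = [(3, "Alice", 29), (2, "Bob", 30), (1, "Alice", 29),
        (4, "David", 35), (5, "Alice", 25)]


def keep_first_per_group(rows):
    """Keep the lowest-id row of each (name, age) group."""
    # Sort by the partition key (name, age), then by the ordering key (id)
    ordered = sorted(rows, key=lambda r: (r[1], r[2], r[0]))
    kept = []
    for _, group in groupby(ordered, key=lambda r: (r[1], r[2])):
        kept.append(next(group))  # the row that would get row_number == 1
    return kept


unique_rows = keep_first_per_group(rows)
# The final display order is a separate, explicit sort: age desc, then id asc
final = sorted(unique_rows, key=lambda r: (-r[2], r[0]))
for row in final:
    print(row)
```

The row with id 1 is kept for ("Alice", 29) even though id 3 appears first in the input, which is the determinism that dropDuplicates() alone does not provide.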