How to filter sales data for last month using PySpark date functions on a date_sales column
To filter your sales data in PySpark for records from last month based on a date_sales column, you can use the current_date(), add_months(), and last_day() functions together with comparison logic. Here's how:
from pyspark.sql import functions as F
# Assuming df_sales has a column `date_sales` of DateType (or castable to it)
# 1. Compute the first and last day of the previous month:
prev_month_last = F.last_day(F.add_months(F.current_date(), -1))
prev_month_first = F.date_sub(prev_month_last, F.dayofmonth(prev_month_last) - 1)
# 2. Filter the DataFrame:
df_last_month = df_sales.filter(
    (F.col("date_sales") >= prev_month_first) &
    (F.col("date_sales") <= prev_month_last)
)
# Optional: verify bounds
df_last_month.select(
    prev_month_first.alias("start_date"),
    prev_month_last.alias("end_date")
).distinct().show()
🧠 Why this works:
- add_months(current_date(), -1) shifts to the same day last month.
- last_day(...) gives the last day of that month.
- Subtracting (day-of-month - 1) gives the 1st of that same month.
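Equivalently, you can compute the month start in one step with trunc() (covered in more detail below); a minimal sketch, assuming Spark 3.x:
prev_month_first = F.trunc(F.add_months(F.current_date(), -1), "month")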
Handling timestamp or string input
If date_sales is a string, cast it first:
df_sales = df_sales.withColumn("date_sales", F.to_date("date_sales", "yyyy-MM-dd"))
If it's a timestamp, be careful: comparing a timestamp directly against a date bound casts the date to midnight, which can drop rows from the last day of the month. It's safer to extract the date first:
df_sales = df_sales.withColumn("date_sales_dt", F.to_date("date_sales"))
Then filter on date_sales_dt instead.
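For example, reusing the prev_month_first and prev_month_last bounds from above:
df_last_month = df_sales.filter(
    (F.col("date_sales_dt") >= prev_month_first) &
    (F.col("date_sales_dt") <= prev_month_last)
)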
Full example with schema
from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.getOrCreate()
df_sales = spark.read.parquet("sales_data.parquet") # or any source
# cast if needed
df_sales = df_sales.withColumn("date_sales", F.to_date("date_sales"))
prev_month_last = F.last_day(F.add_months(F.current_date(), -1))
prev_month_first = F.date_sub(prev_month_last, F.dayofmonth(prev_month_last) - 1)
df_last_month = df_sales.filter(
    (F.col("date_sales") >= prev_month_first) &
    (F.col("date_sales") <= prev_month_last)
)
df_last_month.show()
✅ Quick check
- Today's date (June 14, 2025):
  - add_months(..., -1) → May 14, 2025
  - last_day(...) → May 31, 2025
  - prev_month_first → May 1, 2025
- Final filter = dates between 2025-05-01 and 2025-05-31
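If you want to reproduce this check for a fixed date rather than the actual run date, you can pin "today" with a literal; a minimal, self-contained sketch using the example date above:
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Pin "today" to the example date instead of F.current_date()
today = F.to_date(F.lit("2025-06-14"))
last = F.last_day(F.add_months(today, -1))
first = F.date_sub(last, F.dayofmonth(last) - 1)

# One-row DataFrame just to evaluate the expressions
spark.range(1).select(first.alias("start"), last.alias("end")).show()
# Expect start = 2025-05-01, end = 2025-05-31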
To get the first day of the month in PySpark from a date column (e.g., date_sales), you can use the date_sub() and dayofmonth() functions with a little trick, or trunc() directly:
✅ Method 1: First day of the same month
from pyspark.sql import functions as F
df = df.withColumn(
    "first_day_of_month",
    F.date_sub(F.col("date_sales"), F.dayofmonth(F.col("date_sales")) - 1)
)
This subtracts (day - 1) days from the date, leaving the first day of that month.
✅ Method 2: Using trunc() for the same result
df = df.withColumn(
    "first_day_of_month",
    F.trunc("date_sales", "month")
)
- F.trunc(date_col, "month") returns the first day of the month.
- This is cleaner and more readable.
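The same building blocks give you both ends of the month per row; a short sketch (the month_start and month_end column names are just illustrative):
df = (
    df.withColumn("month_start", F.trunc("date_sales", "month"))  # first day of the row's month
      .withColumn("month_end", F.last_day("date_sales"))          # last day of the row's month
)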
🧪 Example:
# Sample input
+------------+
| date_sales |
+------------+
| 2025-06-14 |
| 2025-05-27 |
+------------+
# Output after applying F.trunc(...)
+------------+--------------------+
| date_sales | first_day_of_month |
+------------+--------------------+
| 2025-06-14 | 2025-06-01         |
| 2025-05-27 | 2025-05-01         |
+------------+--------------------+
Bonus: relative to today's date, you can also get neighboring month starts:
F.add_months(F.trunc(F.current_date(), "MM"), 1)   # → first day of NEXT month
F.add_months(F.trunc(F.current_date(), "MM"), -1)  # → first day of PREVIOUS month
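For instance, to materialize both as columns (next_month_start and prev_month_start are illustrative names):
df = (
    df.withColumn("next_month_start", F.add_months(F.trunc(F.current_date(), "MM"), 1))
      .withColumn("prev_month_start", F.add_months(F.trunc(F.current_date(), "MM"), -1))
)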
Here's a breakdown of whether PySpark date functions accept string inputs and what happens behind the scenes.
✅ Summary Table: PySpark Date Functions Input Compatibility

| Function | Accepts string date? | Behavior |
|---|---|---|
| to_date() | ✅ Yes | Converts a string to DateType |
| trunc() | ❌ No (must be Date) | Will fail or auto-cast only if safe; best to use to_date() explicitly |
| add_months() | ❌ No (needs Date) | Requires DateType or TimestampType, not a string |
| last_day() | ❌ No (needs Date) | Same as above |
| date_sub() / date_add() | ❌ No (needs Date) | Same as above |
| date_format() | ✅ Yes | Accepts Date/Timestamp; strings are auto-cast only when safe |
🧠 Best Practice: Always Convert String to Date First
Before using any of these functions, convert your string column:
from pyspark.sql import functions as F
df = df.withColumn("date_sales", F.to_date("date_sales", "yyyy-MM-dd"))
✅ Safe Usage Example:
df = df.withColumn("date_sales", F.to_date("date_sales", "yyyy-MM-dd"))
df = df.withColumn("month_start", F.trunc("date_sales", "MM"))
df = df.withColumn("prev_month_start", F.add_months(F.col("month_start"), -1))
df = df.withColumn("last_day_prev_month", F.last_day("prev_month_start"))
❌ Unsafe (Error-Prone) Example:
# This will throw an error
df = df.withColumn("month_start", F.trunc("2025-06-14", "MM"))
This fails because "2025-06-14" is passed as a bare string, which Spark interprets as a column name (one that doesn't exist), not as a date literal.
🧪 If you want to use a literal date string:
from pyspark.sql.functions import to_date, lit, trunc
df = df.withColumn("truncated", trunc(to_date(lit("2025-06-14")), "MM"))
✅ Rule of Thumb:
Always do this before applying date functions:
F.to_date("date_col", "yyyy-MM-dd")