How to filter sales data for last month using PySpark date functions on a date_sales column

To filter your sales data in PySpark for records from last month, based on a date_sales column, combine current_date(), add_months(), and last_day() with simple comparison logic. Here’s how:

from pyspark.sql import functions as F

# Assuming df_sales has a column `date_sales` of DateType (or castable to it)

# 1. Compute the first and last day of the previous month:
prev_month_last = F.last_day(F.add_months(F.current_date(), -1))
prev_month_first = F.date_sub(prev_month_last, F.dayofmonth(prev_month_last) - 1)

# 2. Filter the DataFrame:
df_last_month = df_sales.filter(
    (F.col("date_sales") >= prev_month_first) &
    (F.col("date_sales") <= prev_month_last)
)

# Optional: verify bounds
# Optional: verify bounds (select from df_sales so this shows even when the filter matches nothing)
df_sales.select(prev_month_first.alias("start_date"),
                prev_month_last.alias("end_date")).limit(1).show()

🧠 Why this works:

  • add_months(current_date(), -1) shifts to the same day last month.
  • last_day(...) gives the last day of that month.
  • Subtracting (day-of-month − 1) days gives the 1st of that same month.

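As a sanity check outside Spark, the same arithmetic can be mirrored with Python’s datetime module (a plain-Python sketch of the logic, not Spark code — prev_month_bounds is a hypothetical helper):

```python
from datetime import date, timedelta

def prev_month_bounds(today: date) -> tuple[date, date]:
    """Mirror last_day(add_months(today, -1)) and the date_sub trick."""
    first_of_this_month = today.replace(day=1)
    prev_last = first_of_this_month - timedelta(days=1)  # last day of previous month
    prev_first = prev_last.replace(day=1)                # back to the 1st
    return prev_first, prev_last

print(prev_month_bounds(date(2025, 6, 14)))
# (datetime.date(2025, 5, 1), datetime.date(2025, 5, 31))
```

Note that the logic handles month-length differences and year boundaries for free, which is exactly why the last_day()-then-subtract pattern is safer than hard-coding day counts.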
Handling timestamp or string input

If date_sales is a string, cast it first:

df_sales = df_sales.withColumn("date_sales", F.to_date("date_sales", "yyyy-MM-dd"))

If it’s a timestamp, avoid comparing against the date bounds directly: Spark promotes the date upper bound to midnight (2025-05-31 00:00:00), silently dropping rows from later in that day. Convert to a date column first:

df_sales = df_sales.withColumn("date_sales_dt", F.to_date("date_sales"))

Then filter on date_sales_dt instead.


Full example

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df_sales = spark.read.parquet("sales_data.parquet")  # or any source

# cast if needed
df_sales = df_sales.withColumn("date_sales", F.to_date("date_sales"))

prev_month_last = F.last_day(F.add_months(F.current_date(), -1))
prev_month_first = F.date_sub(prev_month_last, F.dayofmonth(prev_month_last) - 1)

df_last_month = df_sales.filter(
    (F.col("date_sales") >= prev_month_first) &
    (F.col("date_sales") <= prev_month_last)
)

df_last_month.show()

✅ Quick check

  • Today’s date (June 14, 2025):
    • add_months(..., -1) → May 14, 2025
    • last_day(...) → May 31, 2025
    • prev_month_first → May 1, 2025
  • Final filter = dates between 2025-05-01 and 2025-05-31

To get the first day of the month for a date column (e.g., date_sales), you can use date arithmetic with date_sub() and dayofmonth(), or simply trunc():


✅ Method 1: First day of the same month

from pyspark.sql import functions as F

df = df.withColumn(
    "first_day_of_month",
    F.date_sub(F.col("date_sales"), F.dayofmonth(F.col("date_sales")) - 1)
)

This subtracts (day - 1) days from the date, leaving the first day of that month.
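The same subtraction can be checked in plain Python (illustrative only; the Spark version operates column-wise):

```python
from datetime import date, timedelta

d = date(2025, 6, 14)
first_day = d - timedelta(days=d.day - 1)  # subtract (day - 1) = 13 days
print(first_day)  # 2025-06-01
```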


✅ Method 2: Using trunc() for the same result

df = df.withColumn(
    "first_day_of_month",
    F.trunc("date_sales", "month")
)
  • F.trunc(date_col, 'month') returns the first day of the month.
  • This is cleaner and more readable.

🧪 Example:

# Sample input
+------------+
| date_sales |
+------------+
| 2025-06-14 |
| 2025-05-27 |
+------------+

# Output after applying F.trunc(...)
+------------+--------------------+
| date_sales | first_day_of_month |
+------------+--------------------+
| 2025-06-14 | 2025-06-01         |
| 2025-05-27 | 2025-05-01         |
+------------+--------------------+

F.add_months(F.trunc(F.current_date(), "MM"), 1)   # → first day of NEXT month
F.add_months(F.trunc(F.current_date(), "MM"), -1)  # → first day of PREVIOUS month
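Mirroring add_months(trunc(...), ±1) in plain Python makes the month math explicit (month_start_shift is a hypothetical helper for illustration, not part of PySpark):

```python
from datetime import date

def month_start_shift(today: date, months: int) -> date:
    """First day of the month `months` away from today's month."""
    total = today.year * 12 + (today.month - 1) + months
    return date(total // 12, total % 12 + 1, 1)

print(month_start_shift(date(2025, 6, 14), 1))   # 2025-07-01 (next month)
print(month_start_shift(date(2025, 6, 14), -1))  # 2025-05-01 (previous month)
```

Counting whole months since year 0 and dividing back out is what lets the helper cross year boundaries (December → January) without special cases.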

Here’s a breakdown of whether PySpark date functions accept string inputs, and what happens behind the scenes.


✅ Summary: which PySpark date functions accept string input?

  • to_date() — ✅ yes: parses a string column (or literal) into DateType.
  • trunc(), add_months(), last_day(), date_sub() / date_add() — expect DateType or TimestampType. A string column in yyyy-MM-dd form is usually implicitly cast (behavior depends on Spark version and ANSI settings), but relying on that is fragile; convert explicitly with to_date().
  • date_format() — accepts Date/Timestamp columns; valid date strings are implicitly cast.

🔧 Best Practice: Always Convert String to Date First

Before using any of these functions, convert your string column:

from pyspark.sql import functions as F

df = df.withColumn("date_sales", F.to_date("date_sales", "yyyy-MM-dd"))

✅ Safe Usage Example:

df = df.withColumn("date_sales", F.to_date("date_sales", "yyyy-MM-dd"))

df = df.withColumn("month_start", F.trunc("date_sales", "MM"))
df = df.withColumn("prev_month_start", F.add_months(F.col("month_start"), -1))
df = df.withColumn("last_day_prev_month", F.last_day("prev_month_start"))

❌ Unsafe (Error-Prone) Example:

# This fails: PySpark interprets a bare string argument as a COLUMN NAME
df = df.withColumn("month_start", F.trunc("2025-06-14", "MM"))

Because "2025-06-14" is treated as the name of a column (which doesn’t exist), not as a date literal.


🧪 If you want to use a literal date string:

from pyspark.sql.functions import to_date, lit, trunc

df = df.withColumn("truncated", trunc(to_date(lit("2025-06-14")), "MM"))

✅ Rule of Thumb:

Always do this before applying date functions:

F.to_date("date_col", "yyyy-MM-dd")
