How to filter sales data for last month using PySpark date functions on a date_sales column
To filter your sales data in PySpark for records from last month based on a date_sales column, you can use the current_date(), add_months(), and last_day() functions together with comparison logic. Here's how:
from pyspark.sql import functions as F
# Assuming df_sales has a column `date_sales` of DateType (or castable to it)
# 1. Compute the first and last day of the previous month:
prev_month_last = F.last_day(F.add_months(F.current_date(), -1))
prev_month_first = F.date_sub(prev_month_last, F.dayofmonth(prev_month_last) - 1)
# 2. Filter the DataFrame:
df_last_month = df_sales.filter(
    (F.col("date_sales") >= prev_month_first) &
    (F.col("date_sales") <= prev_month_last)
)
# Optional: verify bounds
df_last_month.select(
    prev_month_first.alias("start_date"),
    prev_month_last.alias("end_date")
).distinct().show()
🧠 Why this works:
- add_months(current_date(), -1) shifts to the same day last month.
- last_day(...) gives the last day of that month.
- Subtracting (day-of-month - 1) gives the 1st of that same month.
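Equivalently, you can compute the month start in one step with trunc() (covered in more detail below); a minimal sketch, assuming Spark 3.x:
prev_month_first = F.trunc(F.add_months(F.current_date(), -1), "month")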
Handling timestamp or string input
If date_sales is a string, cast it first:
df_sales = df_sales.withColumn("date_sales", F.to_date("date_sales", "yyyy-MM-dd"))
If it's a timestamp, be careful: comparing a timestamp directly against a date bound casts the date to midnight, which can drop rows from the last day of the month. It's safer to extract the date first:
df_sales = df_sales.withColumn("date_sales_dt", F.to_date("date_sales"))
Then filter on date_sales_dt instead.
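For example, reusing the prev_month_first and prev_month_last bounds from above:
df_last_month = df_sales.filter(
    (F.col("date_sales_dt") >= prev_month_first) &
    (F.col("date_sales_dt") <= prev_month_last)
)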
Full example with schema
from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.getOrCreate()
df_sales = spark.read.parquet("sales_data.parquet") # or any source
# cast if needed
df_sales = df_sales.withColumn("date_sales", F.to_date("date_sales"))
prev_month_last = F.last_day(F.add_months(F.current_date(), -1))
prev_month_first = F.date_sub(prev_month_last, F.dayofmonth(prev_month_last) - 1)
df_last_month = df_sales.filter(
    (F.col("date_sales") >= prev_month_first) &
    (F.col("date_sales") <= prev_month_last)
)
df_last_month.show()
✅ Quick check
- Today's date (June 14, 2025):
  - add_months(..., -1) → May 14, 2025
  - last_day(...) → May 31, 2025
  - prev_month_first → May 1, 2025
- Final filter = dates between 2025-05-01 and 2025-05-31
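If you want to reproduce this check for a fixed date rather than the actual run date, you can pin "today" with a literal; a minimal, self-contained sketch using the example date above:
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Pin "today" to the example date instead of F.current_date()
today = F.to_date(F.lit("2025-06-14"))
last = F.last_day(F.add_months(today, -1))
first = F.date_sub(last, F.dayofmonth(last) - 1)

# One-row DataFrame just to evaluate the expressions
spark.range(1).select(first.alias("start"), last.alias("end")).show()
# Expect start = 2025-05-01, end = 2025-05-31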
To get the first day of the month in PySpark from a date column (e.g., date_sales), you can use the date_sub() and dayofmonth() functions with a little trick, or trunc() directly:
✅ Method 1: First day of the same month
from pyspark.sql import functions as F
df = df.withColumn(
    "first_day_of_month",
    F.date_sub(F.col("date_sales"), F.dayofmonth(F.col("date_sales")) - 1)
)
This subtracts (day - 1) days from the date, leaving the first day of that month.
✅ Method 2: Using trunc() for the same result
df = df.withColumn(
    "first_day_of_month",
    F.trunc("date_sales", "month")
)
- F.trunc(date_col, "month") returns the first day of the month.
- This is cleaner and more readable.
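The same building blocks give you both ends of the month per row; a short sketch (the month_start and month_end column names are just illustrative):
df = (
    df.withColumn("month_start", F.trunc("date_sales", "month"))  # first day of the row's month
      .withColumn("month_end", F.last_day("date_sales"))          # last day of the row's month
)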
🧪 Example:
# Sample input
+------------+
| date_sales |
+------------+
| 2025-06-14 |
| 2025-05-27 |
+------------+
# Output after applying F.trunc(...)
+------------+--------------------+
| date_sales | first_day_of_month |
+------------+--------------------+
| 2025-06-14 | 2025-06-01         |
| 2025-05-27 | 2025-05-01         |
+------------+--------------------+
Bonus: relative to today's date, you can also get neighboring month starts:
F.add_months(F.trunc(F.current_date(), "MM"), 1)   # → first day of NEXT month
F.add_months(F.trunc(F.current_date(), "MM"), -1)  # → first day of PREVIOUS month
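For instance, to materialize both as columns (next_month_start and prev_month_start are illustrative names):
df = (
    df.withColumn("next_month_start", F.add_months(F.trunc(F.current_date(), "MM"), 1))
      .withColumn("prev_month_start", F.add_months(F.trunc(F.current_date(), "MM"), -1))
)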
Here's a breakdown of whether PySpark date functions accept string inputs and what happens behind the scenes.
✅ Summary Table: PySpark Date Functions Input Compatibility

| Function | Accepts string date? | Behavior |
|---|---|---|
| to_date() | ✅ Yes | Converts a string to DateType |
| trunc() | ❌ No (must be Date) | Will fail or auto-cast only if safe; best to use to_date() explicitly |
| add_months() | ❌ No (needs Date) | Requires DateType or TimestampType, not a string |
| last_day() | ❌ No (needs Date) | Same as above |
| date_sub() / date_add() | ❌ No (needs Date) | Same as above |
| date_format() | ✅ Yes | Accepts Date/Timestamp; strings are auto-cast only when safe |
🧠 Best Practice: Always Convert String to Date First
Before using any of these functions, convert your string column:
from pyspark.sql import functions as F
df = df.withColumn("date_sales", F.to_date("date_sales", "yyyy-MM-dd"))
✅ Safe Usage Example:
df = df.withColumn("date_sales", F.to_date("date_sales", "yyyy-MM-dd"))
df = df.withColumn("month_start", F.trunc("date_sales", "MM"))
df = df.withColumn("prev_month_start", F.add_months(F.col("month_start"), -1))
df = df.withColumn("last_day_prev_month", F.last_day("prev_month_start"))
❌ Unsafe (Error-Prone) Example:
# This will throw an error
df = df.withColumn("month_start", F.trunc("2025-06-14", "MM"))
This fails because "2025-06-14" is passed as a bare string, which Spark interprets as a column name (one that doesn't exist), not as a date literal.
🧪 If you want to use a literal date string:
from pyspark.sql.functions import to_date, lit, trunc
df = df.withColumn("truncated", trunc(to_date(lit("2025-06-14")), "MM"))
✅ Rule of Thumb:
Always do this before applying date functions:
F.to_date("date_col", "yyyy-MM-dd")