Q. 41 .How we can filter rows for the last 12 months or 6 months using PySpark?
Here’s how you can filter rows for the last 12 months or 6 months using PySpark — fully vectorized and Spark-native (no Python if/else
on rows).
✅ Step-by-Step: Filter for Recent Months
Assume you have a DataFrame with a date
column in DateType
or TimestampType
.
📦 Example Setup
from pyspark.sql import functions as F
from datetime import datetime, timedelta
# Sample DataFrame
data = [
("A", "2023-09-15"), ("B", "2024-03-10"), ("C", "2024-06-01"),
("D", "2023-07-01"), ("E", "2024-01-25")
]
df = spark.createDataFrame(data, ["id", "date_str"]) \
.withColumn("date", F.to_date("date_str")) \
.drop("date_str")
df.show()
🔁 Filter for Last 12 Months
today = datetime.today()
last_12_months = today - timedelta(days=365)
df_last_12 = df.filter(F.col("date") >= F.lit(last_12_months))
df_last_12.show()
🔁 Filter for Last 6 Months
last_6_months = today - timedelta(days=183) # approx half year
df_last_6 = df.filter(F.col("date") >= F.lit(last_6_months))
df_last_6.show()
🧠 Tip: Use current_date()
for Runtime-Safe Spark Jobs
If you want fully dynamic jobs (e.g., in scheduled pipelines), use:
df.filter(F.col("date") >= F.add_months(F.current_date(), -12)).show()
df.filter(F.col("date") >= F.add_months(F.current_date(), -6)).show()
This avoids using datetime.today()
from Python and keeps everything Spark-native.
✅ Final Best Practice
Want… | Use… |
---|---|
Fixed 12 months | F.col("date") >= F.lit(datetime - 365d) |
Rolling 12 months | F.add_months(F.current_date(), -12) |
Precise month matching | F.trunc("date", "MM") , F.month() , etc. |
reusable functions :
def filter_last_n_months(df, date_col, months=6):
return df.filter(F.col(date_col) >= F.add_months(F.current_date(), -months))
Q.42. to dynamically filter data where the
date
falls between: End of last month & Start of 13 months agoPerfect! You want to dynamically filter data where the
date
falls between:
- Start of 13 months ago
- End of last month
This is commonly used in rolling 12-month reporting windows.
✅ Desired Date Range:
Bound Description Lower Start of 13th month ago (e.g., if today is July 11, 2025 → June 1, 2024) Upper End of last month (e.g., June 30, 2025) 🧠 Spark Code – Fully Dynamic
from pyspark.sql import functions as F # Define start of 13th month ago start_date = F.trunc(F.add_months(F.current_date(), -13), "MM") # Define end of last month end_date = F.last_day(F.add_months(F.current_date(), -1)) # Apply filter filtered_df = df.filter((F.col("date") >= start_date) & (F.col("date") <= end_date)) filtered_df.show()
🧪 Example for July 11, 2025:
start_date
→2024-06-01
end_date
→2025-06-30
So it includes full months: June 2024 to June 2025
✅ Wrap It in a Reusable Function
def filter_last_12_full_months(df, date_col="date"): start_date = F.trunc(F.add_months(F.current_date(), -13), "MM") end_date = F.last_day(F.add_months(F.current_date(), -1)) return df.filter((F.col(date_col) >= start_date) & (F.col(date_col) <= end_date))
Use it as:
result = filter_last_12_full_months(df) result.show()
# PySpark Advanced DataFrame Tricks & Interview Scenarios
# ---------------------------------------------------------
# Covers Row-level Tricks, Summary Aggregations, Column Manipulation,
# Cross-row Comparisons, Schema Tricks, Join + Group + Window Combos,
# Dynamic Queries, Null Handling, Map Types, Error Handling,
# Date & String Manipulations, and Advanced Window Functions
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, ArrayType, MapType, TimestampType
spark = SparkSession.builder.appName("AdvancedPySpark").getOrCreate()
# Sample DataFrame for date, amount, and category
from datetime import datetime
data = [
("A", "2023-09-15"), ("B", "2024-03-10"), ("C", "2024-06-01"),
("D", "2023-07-01"), ("E", "2024-01-25")
]
df = spark.createDataFrame(data, ["id", "date_str"]) \
.withColumn("date", F.to_date("date_str")) \
.drop("date_str")
df.show()
# ----------------------------------
# 43. Dynamic Date Filtering: Last N Full Months (default = 12)
# ----------------------------------
def filter_last_n_full_months(df, date_col="date", months_back=12):
start_date = F.trunc(F.add_months(F.current_date(), -months_back - 1), "MM")
end_date = F.last_day(F.add_months(F.current_date(), -1))
return df.filter((F.col(date_col) >= start_date) & (F.col(date_col) <= end_date))
# Apply the filter
filtered_df = filter_last_n_full_months(df)
filtered_df.show()
# ----------------------------------
# 44. Return Formatted Month-Year Labels (MMM-YYYY)
# ----------------------------------
filtered_df.withColumn("month_label", F.date_format("date", "MMM-yyyy")).show()