PySpark Coding Practice Questions

Q. 41 .How we can filter rows for the last 12 months or 6 months using PySpark?

Here’s how you can filter rows for the last 12 months or 6 months using PySpark — fully vectorized and Spark-native (no Python if/else on rows).


✅ Step-by-Step: Filter for Recent Months

Assume you have a DataFrame with a date column in DateType or TimestampType.


📦 Example Setup

from pyspark.sql import functions as F
from datetime import datetime, timedelta

# Sample DataFrame
data = [
    ("A", "2023-09-15"), ("B", "2024-03-10"), ("C", "2024-06-01"),
    ("D", "2023-07-01"), ("E", "2024-01-25")
]
df = spark.createDataFrame(data, ["id", "date_str"]) \
          .withColumn("date", F.to_date("date_str")) \
          .drop("date_str")

df.show()

🔁 Filter for Last 12 Months

today = datetime.today()
last_12_months = today - timedelta(days=365)

df_last_12 = df.filter(F.col("date") >= F.lit(last_12_months))
df_last_12.show()

🔁 Filter for Last 6 Months

last_6_months = today - timedelta(days=183)  # approx half year

df_last_6 = df.filter(F.col("date") >= F.lit(last_6_months))
df_last_6.show()

🧠 Tip: Use current_date() for Runtime-Safe Spark Jobs

If you want fully dynamic jobs (e.g., in scheduled pipelines), use:

df.filter(F.col("date") >= F.add_months(F.current_date(), -12)).show()
df.filter(F.col("date") >= F.add_months(F.current_date(), -6)).show()

This avoids using datetime.today() from Python and keeps everything Spark-native.


✅ Final Best Practice

Want…Use…
Fixed 12 monthsF.col("date") >= F.lit(datetime - 365d)
Rolling 12 monthsF.add_months(F.current_date(), -12)
Precise month matchingF.trunc("date", "MM"), F.month(), etc.

reusable functions :

def filter_last_n_months(df, date_col, months=6):
    return df.filter(F.col(date_col) >= F.add_months(F.current_date(), -months))

Q.42. to dynamically filter data where the date falls between: End of last month & Start of 13 months ago

Perfect! You want to dynamically filter data where the date falls between:

  • Start of 13 months ago
  • End of last month

This is commonly used in rolling 12-month reporting windows.


✅ Desired Date Range:

BoundDescription
LowerStart of 13th month ago (e.g., if today is July 11, 2025 → June 1, 2024)
UpperEnd of last month (e.g., June 30, 2025)

🧠 Spark Code – Fully Dynamic

from pyspark.sql import functions as F

# Define start of 13th month ago
start_date = F.trunc(F.add_months(F.current_date(), -13), "MM")

# Define end of last month
end_date = F.last_day(F.add_months(F.current_date(), -1))

# Apply filter
filtered_df = df.filter((F.col("date") >= start_date) & (F.col("date") <= end_date))

filtered_df.show()

🧪 Example for July 11, 2025:

  • start_date2024-06-01
  • end_date2025-06-30

So it includes full months: June 2024 to June 2025


✅ Wrap It in a Reusable Function

def filter_last_12_full_months(df, date_col="date"):
    start_date = F.trunc(F.add_months(F.current_date(), -13), "MM")
    end_date = F.last_day(F.add_months(F.current_date(), -1))
    return df.filter((F.col(date_col) >= start_date) & (F.col(date_col) <= end_date))

Use it as:

result = filter_last_12_full_months(df)
result.show()

# PySpark Advanced DataFrame Tricks & Interview Scenarios
# ---------------------------------------------------------
# Covers Row-level Tricks, Summary Aggregations, Column Manipulation,
# Cross-row Comparisons, Schema Tricks, Join + Group + Window Combos,
# Dynamic Queries, Null Handling, Map Types, Error Handling,
# Date & String Manipulations, and Advanced Window Functions

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, ArrayType, MapType, TimestampType

spark = SparkSession.builder.appName("AdvancedPySpark").getOrCreate()

# Sample DataFrame for date, amount, and category
from datetime import datetime

data = [
    ("A", "2023-09-15"), ("B", "2024-03-10"), ("C", "2024-06-01"),
    ("D", "2023-07-01"), ("E", "2024-01-25")
]
df = spark.createDataFrame(data, ["id", "date_str"]) \
          .withColumn("date", F.to_date("date_str")) \
          .drop("date_str")
df.show()


# ----------------------------------
# 43. Dynamic Date Filtering: Last N Full Months (default = 12)
# ----------------------------------
def filter_last_n_full_months(df, date_col="date", months_back=12):
    start_date = F.trunc(F.add_months(F.current_date(), -months_back - 1), "MM")
    end_date = F.last_day(F.add_months(F.current_date(), -1))
    return df.filter((F.col(date_col) >= start_date) & (F.col(date_col) <= end_date))

# Apply the filter
filtered_df = filter_last_n_full_months(df)
filtered_df.show()

# ----------------------------------
# 44. Return Formatted Month-Year Labels (MMM-YYYY)
# ----------------------------------
filtered_df.withColumn("month_label", F.date_format("date", "MMM-yyyy")).show()

Pages: 1 2 3 4 5 6 7