How to handle string Datetime variables in pyspark?
Handling string-formatted datetime variables in PySpark requires converting them into PySpark-compatible datetime types (DateType or TimestampType) for effective processing. Below is a consolidated guide to handling string datetime variables in PySpark efficiently:
1. Convert String Datetime to PySpark Date/Time Types
Steps:
- Use to_date to convert a string to DateType (yyyy-MM-dd format).
- Use to_timestamp to convert a string to TimestampType (yyyy-MM-dd HH:mm:ss format).
from pyspark.sql.functions import to_date, to_timestamp, col
# Sample DataFrame
data = [("2024-01-15", "2024-01-15 12:34:56")]
columns = ["string_date", "string_datetime"]
df = spark.createDataFrame(data, columns)
# Convert string to DateType
df = df.withColumn("date_col", to_date(col("string_date"), "yyyy-MM-dd"))
# Convert string to TimestampType
df = df.withColumn("timestamp_col", to_timestamp(col("string_datetime"), "yyyy-MM-dd HH:mm:ss"))
df.show()
Output:
+-----------+-------------------+----------+-------------------+
|string_date|string_datetime |date_col |timestamp_col |
+-----------+-------------------+----------+-------------------+
|2024-01-15 |2024-01-15 12:34:56|2024-01-15|2024-01-15 12:34:56|
+-----------+-------------------+----------+-------------------+
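To confirm that the conversion produced real date and timestamp types rather than strings, you can inspect the schema (the commented output below assumes the DataFrame built above):
df.printSchema()
# root
#  |-- string_date: string (nullable = true)
#  |-- string_datetime: string (nullable = true)
#  |-- date_col: date (nullable = true)
#  |-- timestamp_col: timestamp (nullable = true)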
=========================================
'12-Feb-2024' is a string. How do we convert it into a date value in PySpark?
We can use the to_date() function in PySpark to convert a string to a date. Here’s an example:
from pyspark.sql import SparkSession
from pyspark.sql.functions import to_date, col
# create a SparkSession
spark = SparkSession.builder.appName("Date Conversion").getOrCreate()
# create a sample DataFrame
data = [("12-Feb-2024",)]
df = spark.createDataFrame(data, ["date_str"])
# convert the string to a date
df = df.withColumn("date", to_date(col("date_str"), "dd-MMM-yyyy"))
# show the result
df.show()
In this example, the to_date() function takes two arguments: the column to convert (date_str) and the format of the date string ("dd-MMM-yyyy"). The format string tells PySpark how to parse the date string.
The output of this code will be:
+-------------+----------+
| date_str| date|
+-------------+----------+
|12-Feb-2024|2024-02-12|
+-------------+----------+
Note that the resulting date column is displayed in the format yyyy-MM-dd, which is the default format for dates in PySpark.
=============================================
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date, to_timestamp
# Initialize Spark session
spark = SparkSession.builder.master("local").appName("Dynamic Formatting").getOrCreate()
# Sample data
data = [("01/15/2024", "15-Jan-2024 12:34:56")]
columns = ["date_format1", "datetime_format2"]
df = spark.createDataFrame(data, columns)
# Define formats in a dictionary
date_formats = {
"date_format1": "MM/dd/yyyy",
"datetime_format2": "dd-MMM-yyyy HH:mm:ss"
}
# Use the formats dynamically with f-strings
df = df.withColumn("parsed_date", to_date(col("date_format1"), f"{date_formats['date_format1']}"))
.withColumn("parsed_timestamp", to_timestamp(col("datetime_format2"), f"{date_formats['datetime_format2']}"))
# Show the resulting DataFrame
df.show(truncate=False)
2. Extract Components from Datetime
Extract useful components like year, month, day, hour, etc.
from pyspark.sql.functions import year, month, dayofmonth, hour, minute, second
df = df.withColumn("year", year(col("timestamp_col")))
.withColumn("month", month(col("timestamp_col")))
.withColumn("day", dayofmonth(col("timestamp_col")))
.withColumn("hour", hour(col("timestamp_col")))
.withColumn("minute", minute(col("timestamp_col")))
.withColumn("second", second(col("timestamp_col")))
df.show()
3. Filter Data Based on String DateTime
For string-based filtering, PySpark allows direct comparison without converting the strings; this works for ISO-formatted strings (yyyy-MM-dd) because they sort lexicographically in date order. For better accuracy, convert to date/time first.
Filter Examples:
# Filter by exact date
filtered_df = df.filter(col("string_date") == "2024-01-15")
# Filter by range (convert to date first for accuracy)
filtered_df = df.filter((col("date_col") >= "2024-01-01") & (col("date_col") <= "2024-12-31"))
filtered_df.show()
4. Format Dates into Custom Strings
Convert DateType or TimestampType columns into custom string formats.
from pyspark.sql.functions import date_format
# Format Timestamp into a custom string
df = df.withColumn("custom_format", date_format(col("timestamp_col"), "dd-MM-yyyy HH:mm:ss"))
df.show()
5. Handle Different String Formats
If datetime strings come in various formats, specify the exact format during conversion.
Example:
data = [("01/15/2024", "15-Jan-2024 12:34:56")]
columns = ["date_format1", "datetime_format2"]
df = spark.createDataFrame(data, columns)
df = df.withColumn("parsed_date", to_date(col("date_format1"), "MM/dd/yyyy"))
.withColumn("parsed_timestamp", to_timestamp(col("datetime_format2"), "dd-MMM-yyyy HH:mm:ss"))
df.show()
6. Handle Null or Invalid Datetime Strings
PySpark turns invalid datetime strings into null during conversion. Use na.drop or na.fill to manage the resulting nulls.
Example:
data = [("2024-01-15", "invalid_date")]
columns = ["valid_date", "invalid_date"]
df = spark.createDataFrame(data, columns)
# Attempt to convert invalid strings (unparseable values become null)
df = df.withColumn("converted_date", to_date(col("invalid_date"), "yyyy-MM-dd"))
# Handle nulls. Note: na.fill with a string value only applies to string-typed columns,
# so fill the DateType column with a default date via coalesce instead.
from pyspark.sql.functions import coalesce, lit
df = df.withColumn("converted_date", coalesce(col("converted_date"), to_date(lit("1970-01-01"))))
df.show()
7. Aggregate Data Based on Datetime
Group data by date/time components or custom time intervals.
Example: Group by Year-Month:
df = df.withColumn("year_month", date_format(col("date_col"), "yyyy-MM"))
aggregated_df = df.groupBy("year_month").count()
aggregated_df.show()
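For finer-grained buckets than year-month (the "custom time intervals" mentioned above), one option is the window() function. A minimal sketch, assuming a DataFrame with the timestamp_col column from the earlier examples:
from pyspark.sql.functions import window, count, col
# Bucket rows into fixed 1-hour windows based on timestamp_col and count rows per bucket
hourly_df = (df.groupBy(window(col("timestamp_col"), "1 hour"))
               .agg(count("*").alias("records")))
hourly_df.show(truncate=False)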
8. Compare Dates Across Different Formats
If comparing string-formatted Oracle datetime with PySpark date:
- Convert Oracle string to PySpark date.
- Compare the two dates.
Example:
data = [("2024-01-15 12:34:56", "2024-01-15")]
columns = ["oracle_datetime", "hive_date"]
df = spark.createDataFrame(data, columns)
df = df.withColumn("parsed_oracle_date", to_date(col("oracle_datetime"), "yyyy-MM-dd HH:mm:ss"))
.withColumn("comparison", col("parsed_oracle_date") == col("hive_date"))
df.show()
Best Practices:
- Use Native Functions: Avoid Python UDFs to ensure Catalyst optimization.
- Validate Formats: Ensure all string datetime formats are pre-validated.
- Handle Nulls Gracefully: Use na.fill or na.drop where applicable.
- Leverage Partitioning: For ETL pipelines, partition large datasets by year/month/day for better performance (see the sketch below).
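A minimal sketch of the partitioning idea, assuming a DataFrame df with a timestamp_col column and a hypothetical output path:
from pyspark.sql.functions import year, month, dayofmonth, col
# Derive partition columns from the timestamp and write partitioned Parquet files
(df.withColumn("year", year(col("timestamp_col")))
   .withColumn("month", month(col("timestamp_col")))
   .withColumn("day", dayofmonth(col("timestamp_col")))
   .write.partitionBy("year", "month", "day")
   .mode("overwrite")
   .parquet("/tmp/events_partitioned"))  # hypothetical path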
These methods ensure robust and efficient handling of string datetime variables in PySpark for ETL processes.
Question: A table has a column program_date with datetime values stored as strings. It has to be compared with another table whose column month_year holds values either as an integer (yyyyMM, e.g. 201402) or as a string (yyyyMMM, e.g. 2014Feb). How can this be done with the PySpark DataFrame API or Spark SQL?
We can achieve this by performing the following steps:
Step 1: Convert the program_date column to a date format
Use the to_date function to convert the program_date column to a date format.
from pyspark.sql.functions import to_date
df1 = df1.withColumn("program_date", to_date("program_date", "dd-MMM-yyyy"))
Step 2: Extract the year and month from the program_date column
Use the year and month functions to extract the year and month from the program_date column.
from pyspark.sql.functions import year, month
df1 = df1.withColumn("year", year("program_date"))
.withColumn("month", month("program_date"))
Step 3: Convert the month_year column in the second table to a consistent format
Use the when and otherwise functions to check whether the month_year column is in integer or string format, and convert it accordingly.
from pyspark.sql.functions import when, substring, length, col

# Map the three-letter month abbreviation (positions 5-7) to its month number
month_num = (when(substring("month_year", 5, 3) == "Jan", 1)
             .when(substring("month_year", 5, 3) == "Feb", 2)
             .when(substring("month_year", 5, 3) == "Mar", 3)
             .when(substring("month_year", 5, 3) == "Apr", 4)
             .when(substring("month_year", 5, 3) == "May", 5)
             .when(substring("month_year", 5, 3) == "Jun", 6)
             .when(substring("month_year", 5, 3) == "Jul", 7)
             .when(substring("month_year", 5, 3) == "Aug", 8)
             .when(substring("month_year", 5, 3) == "Sep", 9)
             .when(substring("month_year", 5, 3) == "Oct", 10)
             .when(substring("month_year", 5, 3) == "Nov", 11)
             .when(substring("month_year", 5, 3) == "Dec", 12)
             .otherwise(0))

# If month_year is already numeric (yyyyMM), keep it as an integer;
# otherwise build yyyyMM from the leading year and the mapped month number
df2 = df2.withColumn(
    "month_year",
    when(length("month_year") == 6, col("month_year").cast("int"))
    .otherwise(substring("month_year", 1, 4).cast("int") * 100 + month_num)
)
Step 4: Join the two tables based on the year and month columns
Use the join function to join the two tables based on the year and month columns.
df_joined = df1.join(df2, (df1.year * 100 + df1.month) == df2.month_year)
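Alternatively, the yyyyMM key for df1 can be built in a single step with date_format, avoiding the separate year and month columns (a sketch, assuming program_date is already a DateType column from Step 1):
from pyspark.sql.functions import date_format, col
# Build an integer yyyyMM key directly from the date column
df1 = df1.withColumn("yyyymm", date_format(col("program_date"), "yyyyMM").cast("int"))
df_joined = df1.join(df2, df1.yyyymm == df2.month_year)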
We can achieve this in Spark SQL by using a combination of the unix_timestamp function to convert the string datetime to a timestamp, and the year and month functions to extract the year and month from it. Then, the month_year column is normalized to match the same yyyyMM value.
Here’s an example Spark SQL query:
SELECT *
FROM table1
JOIN table2
ON (
(YEAR(FROM_UNIXTIME(UNIX_TIMESTAMP(table1.program_date, 'dd-MMM-yyyy'), 'yyyy-MM-dd')) * 100) +
MONTH(FROM_UNIXTIME(UNIX_TIMESTAMP(table1.program_date, 'dd-MMM-yyyy'), 'yyyy-MM-dd'))
) =
CASE
WHEN table2.month_year RLIKE '[^0-9]' THEN
CAST(SUBSTR(table2.month_year, 1, 4) AS INT) * 100 +
CASE
WHEN LOWER(SUBSTR(table2.month_year, 5, 3)) = 'jan' THEN 1
WHEN LOWER(SUBSTR(table2.month_year, 5, 3)) = 'feb' THEN 2
WHEN LOWER(SUBSTR(table2.month_year, 5, 3)) = 'mar' THEN 3
WHEN LOWER(SUBSTR(table2.month_year, 5, 3)) = 'apr' THEN 4
WHEN LOWER(SUBSTR(table2.month_year, 5, 3)) = 'may' THEN 5
WHEN LOWER(SUBSTR(table2.month_year, 5, 3)) = 'jun' THEN 6
WHEN LOWER(SUBSTR(table2.month_year, 5, 3)) = 'jul' THEN 7
WHEN LOWER(SUBSTR(table2.month_year, 5, 3)) = 'aug' THEN 8
WHEN LOWER(SUBSTR(table2.month_year, 5, 3)) = 'sep' THEN 9
WHEN LOWER(SUBSTR(table2.month_year, 5, 3)) = 'oct' THEN 10
WHEN LOWER(SUBSTR(table2.month_year, 5, 3)) = 'nov' THEN 11
WHEN LOWER(SUBSTR(table2.month_year, 5, 3)) = 'dec' THEN 12
END
ELSE CAST(table2.month_year AS INT)
END
To generate a dynamic list of the next and previous 12 months in yyyyMM format
You can generate a dynamic list of the next and previous 12 months in yyyyMM format using PySpark SQL functions like date_format() and date_add(). PySpark's built-in functions make it easy to manipulate dates directly in SQL expressions without relying on Python's datetime module or manual date calculations.
Here’s how you can do it purely using PySpark SQL functions:
Example Code
from pyspark.sql import SparkSession
from pyspark.sql.functions import current_date, expr
# Initialize Spark session
spark = (SparkSession.builder
         .appName("Generate YearMonth List using SQL with .format()")
         .getOrCreate())
# Create a DataFrame with a single row and the current date
df = spark.createDataFrame([(1,)], ["id"]).withColumn("current_date", current_date())
# Generate SQL expressions for the next 12 months and the previous 12 months using .format()
# We are using a range from -12 to +12 (last 12 months, current month, and next 12 months)
exprs = []
for i in range(-12, 13):
    # Use .format() to build each SQL expression; the alias is backquoted because
    # negative offsets would otherwise put a "-" in an unquoted column name
    exprs.append("date_format(date_add(current_date, {0}), 'yyyyMM') as `month_{1}`".format(i * 30, i))
# Select these dynamically generated columns in the PySpark DataFrame
months_df = df.selectExpr(*exprs)
# Show the DataFrame with all the generated months
months_df.show(truncate=False)
Explanation:
- current_date(): generates the current date. This is a PySpark function that automatically picks up the current system date.
- date_add(current_date, i * 30): adds or subtracts multiples of 30 days (approximating one month) to the current date. We use a range from -12 to 12 to get the past and future months.
- date_format(): converts the resulting date to the yyyyMM format.
- selectExpr(): dynamically selects columns using SQL expressions for each offset from -12 to +12, creating column names like month_-12, month_0, month_12.
Output Example:
+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+
|month_-12|month_-11|month_-10|month_-9 |month_-8 |month_-7 |month_-6 |month_-5 |month_-4 |month_-3 |month_-2 |month_-1 |month_0 |month_1 |month_2 |month_3 |month_4 |month_5 |month_6 |month_7 |month_8 |month_9 |month_10 |month_11 |month_12 |
+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+
|202309 |202310 |202311 |202312 |202401 |202402 |202403 |202404 |202405 |202406 |202407 |202408 |202409 |202410 |202411 |202412 |202501 |202502 |202503 |202504 |202505 |202506 |202507 |202508 |202509 |
+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+
Explanation of Output:
- month_-12 to month_12: these columns represent the past 12 months, the current month, and the next 12 months in yyyyMM format.
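One caveat: the 30-day approximation drifts because months have 28 to 31 days, so over a 25-month range a month can occasionally be skipped or duplicated. A minimal alternative sketch, assuming the same Spark session, that steps by exact calendar months with add_months():
# Exact calendar-month steps instead of 30-day approximations
exprs = ["date_format(add_months(current_date(), {0}), 'yyyyMM') as `month_{0}`".format(i)
         for i in range(-12, 13)]
months_df = spark.range(1).selectExpr(*exprs)
months_df.show(truncate=False)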
How to Access Individual Months:
You can select specific months or access them programmatically by collecting the results:
# Collect results into a Python list
months_list = months_df.collect()[0]
# Example: Accessing the current month and next month
current_month = months_list["month_0"]
next_month = months_list["month_1"]
print("Current Month:", current_month)
print("Next Month:", next_month)
Storing Results in Variables or DataFrame:
If you want to store the results as separate variables or use them for filtering, you can extract the values from the DataFrame as shown above and use them as needed.
Filtering Example:
If you want to filter your data using these month values, you can convert the months into a filter condition:
# Example: Filter another DataFrame by one of the generated months
data_df.filter(data_df["year_month_column"] == months_list["month_0"]).show()
Advantages of Using PySpark SQL Functions:
- Efficiency: PySpark’s built-in functions like date_add() and date_format() are optimized and can handle large datasets efficiently in a distributed environment.
- Compatibility: this method works on both older and newer versions of PySpark, since SQL expressions and functions like date_format() have been available for a long time.
The same task can also be done by combining Python's date functionality with PySpark:
from datetime import datetime, timedelta
from pyspark.sql import SparkSession
from pyspark.sql import Row
# Initialize Spark session
spark = (SparkSession.builder
         .appName("Generate YearMonth List")
         .getOrCreate())
# Get the current date
current_date = datetime.now()
# Function to generate year-month (yyyyMM) format for a given date
def generate_year_month(date):
    return date.strftime('%Y%m')
# Create a list to store the year-month values
months_list = []
# Generate the next 12 months (including the current month)
for i in range(0, 13):
    future_date = current_date + timedelta(days=30 * i)  # Approximate to 30 days per month
    months_list.append(generate_year_month(future_date))
# Generate the last 12 months
for i in range(1, 13):
    past_date = current_date - timedelta(days=30 * i)  # Approximate to 30 days per month
    months_list.append(generate_year_month(past_date))
# Convert the list into a DataFrame for further operations in PySpark
months_rdd = spark.sparkContext.parallelize(months_list)
months_df = spark.createDataFrame(months_rdd.map(lambda x: Row(month=x)))
# Show the DataFrame
months_df.show()
# Example of accessing specific months in the list
print("Next month: ", months_list[1])
print("Previous month: ", months_list[-1])
Python’s built-in date functions, such as those in the datetime module, are not designed for distributed execution in PySpark.
When you use them on column data, they either run only on the driver (as in the list-building example above) or have to be wrapped in a Python UDF that executes row by row in separate Python worker processes; either way they bypass Catalyst optimization and add serialization overhead between the JVM and Python.
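To make the difference concrete, here is a small hedged sketch (assuming an existing SparkSession named spark) that computes the same yyyyMM value once through a Python UDF and once with native functions; only the native column benefits from Catalyst optimization:
from datetime import datetime
from pyspark.sql import functions as F
from pyspark.sql.types import StringType
# Python UDF: executed row by row in Python worker processes
to_yyyymm_udf = F.udf(lambda s: datetime.strptime(s, "%Y-%m-%d").strftime("%Y%m"), StringType())
df = spark.createDataFrame([("2024-01-15",)], ["d"])
df.select(
    to_yyyymm_udf("d").alias("udf_yyyymm"),                          # Python path
    F.date_format(F.to_date("d"), "yyyyMM").alias("native_yyyymm")   # native JVM path
).show()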
To perform date-related operations in PySpark, you should use PySpark’s built-in date functions, which are designed to work in a distributed environment. These functions include:
pyspark.sql.functions.current_date()
pyspark.sql.functions.current_timestamp()
pyspark.sql.functions.date_format()
pyspark.sql.functions.date_trunc()
pyspark.sql.functions.dayofmonth()
pyspark.sql.functions.dayofyear()
pyspark.sql.functions.from_unixtime()
pyspark.sql.functions.hour()
pyspark.sql.functions.minute()
pyspark.sql.functions.month()
pyspark.sql.functions.quarter()
pyspark.sql.functions.second()
pyspark.sql.functions.to_date()
pyspark.sql.functions.to_timestamp()
pyspark.sql.functions.unix_timestamp()
pyspark.sql.functions.weekofyear()
pyspark.sql.functions.year()
Here are the most commonly used PySpark date functions:
- add_months(date, months): returns the date that is months months after date.
- current_date(): returns the current date.
- current_timestamp(): returns the current timestamp.
- date_format(date, format): converts a date to a string using the specified format.
- date_trunc(format, timestamp): truncates a timestamp to the specified unit.
- dayofmonth(date): returns the day of the month from a date.
- dayofyear(date): returns the day of the year from a date.
- from_unixtime(unix_timestamp): converts a Unix timestamp to a timestamp string.
- from_utc_timestamp(timestamp, tz): converts a timestamp from UTC to the specified timezone.
- hour(timestamp): returns the hour from a timestamp.
- last_day(date): returns the last day of the month from a date.
- minute(timestamp): returns the minute from a timestamp.
- month(date): returns the month from a date.
- next_day(date, dayOfWeek): returns the next occurrence of the given day of the week after a date.
- quarter(date): returns the quarter from a date.
- second(timestamp): returns the second from a timestamp.
- to_date(timestamp): converts a timestamp or string to a date.
- to_timestamp(date): converts a string or date to a timestamp.
- to_utc_timestamp(timestamp, tz): converts a timestamp to UTC from the specified timezone.
- trunc(date, format): truncates a date to the specified unit.
- unix_timestamp(timestamp): converts a timestamp to a Unix timestamp.
- weekofyear(date): returns the week of the year from a date.
- year(date): returns the year from a date.
And here are some examples of using these functions:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
# Create a SparkSession
spark = SparkSession.builder.getOrCreate()
# Create a sample DataFrame
data = [("2022-01-01",), ("2022-01-02",), ("2022-01-03",)]
df = spark.createDataFrame(data, ["date"])
# Use PySpark date functions
df = df.withColumn("day_of_month", dayofmonth("date"))
df = df.withColumn("month", month("date"))
df = df.withColumn("year", year("date"))
df = df.withColumn("next_day", next_day("date", "Sunday"))
df = df.withColumn("last_day", last_day("date"))
# Show the results
df.show()
To filter sales data for last month using PySpark date functions on a date_sales column
To filter your sales data in PySpark for records from last month based on a date_sales column, you can use current_date(), add_months(), last_day(), and comparison logic. Here’s how:
from pyspark.sql import functions as F
# Assuming df_sales has a column `date_sales` of DateType (or castable to it)
# 1. Compute the first and last day of the previous month:
prev_month_last = F.last_day(F.add_months(F.current_date(), -1))
prev_month_first = F.date_sub(prev_month_last, F.dayofmonth(prev_month_last) - 1)
# 2. Filter the DataFrame:
df_last_month = df_sales.filter(
(F.col("date_sales") >= prev_month_first) &
(F.col("date_sales") <= prev_month_last)
)
# Optional: verify bounds
df_last_month.select(prev_month_first.alias("start_date"),
prev_month_last.alias("end_date")).distinct().show()
🧠 Why this works:
- add_months(current_date(), -1) shifts to the same day last month.
- last_day(...) gives the last day of that month.
- Subtracting (day-of-month − 1) days gives the 1st of that same month.
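As a side note, an equivalent and arguably simpler way to get both bounds is trunc(), which returns the first day of a month directly (a small sketch under the same assumptions as above):
from pyspark.sql import functions as F
# First and last day of the previous month via trunc()/last_day()
prev_month_first = F.trunc(F.add_months(F.current_date(), -1), "month")
prev_month_last = F.last_day(F.add_months(F.current_date(), -1))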
Handling timestamp or string input
If date_sales is a string, cast it first:
df_sales = df_sales.withColumn("date_sales", F.to_date("date_sales", "yyyy-MM-dd"))
If it’s a timestamp, you can still compare dates directly (Spark auto-casts):
df_sales = df_sales.withColumn("date_sales_dt", F.to_date("date_sales"))
Then filter on date_sales_dt instead.
Full example with schema
from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.getOrCreate()
df_sales = spark.read.parquet("sales_data.parquet") # or any source
# cast if needed
df_sales = df_sales.withColumn("date_sales", F.to_date("date_sales"))
prev_month_last = F.last_day(F.add_months(F.current_date(), -1))
prev_month_first = F.date_sub(prev_month_last, F.dayofmonth(prev_month_last) - 1)
df_last_month = df_sales.filter(
(F.col("date_sales") >= prev_month_first) &
(F.col("date_sales") <= prev_month_last)
)
df_last_month.show()
✅ Quick check
- Today’s date (assume June 14, 2025): add_months(..., -1) → May 14, 2025; last_day(...) → May 31, 2025; prev_month_first → May 1, 2025
- Final filter = dates between 2025-05-01 and 2025-05-31
To get the first day of the month in PySpark from a date column (e.g., date_sales), you can either subtract (day-of-month − 1) days with date_sub() and dayofmonth(), or simply use trunc():
✅ Method 1: First day of the same month
from pyspark.sql import functions as F
df = df.withColumn(
"first_day_of_month",
F.date_sub(F.col("date_sales"), F.dayofmonth(F.col("date_sales")) - 1)
)
This subtracts (day − 1) days from the date, leaving the first day of that month.
✅ Method 2: Using trunc() for same result
df = df.withColumn(
"first_day_of_month",
F.trunc("date_sales", "month")
)
- F.trunc(date_col, 'month') returns the first day of the month.
- This is cleaner and more readable.
🧪 Example:
# Sample input
+------------+
| date_sales |
+------------+
| 2025-06-14 |
| 2025-05-27 |
+------------+
# Output after applying F.trunc(...)
+------------+--------------------+
| date_sales | first_day_of_month |
+------------+--------------------+
| 2025-06-14 | 2025-06-01         |
| 2025-05-27 | 2025-05-01         |
+------------+--------------------+
F.add_months(F.trunc(F.current_date(), "MM"), 1)   # → First day of NEXT month
F.add_months(F.trunc(F.current_date(), "MM"), -1)  # → First day of PREVIOUS month
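A quick hedged sketch (assuming an active SparkSession named spark) to materialize those two expressions and inspect the result:
from pyspark.sql import functions as F
bounds = spark.range(1).select(
    F.add_months(F.trunc(F.current_date(), "MM"), 1).alias("first_day_next_month"),
    F.add_months(F.trunc(F.current_date(), "MM"), -1).alias("first_day_prev_month"),
)
bounds.show()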
Here is a breakdown of whether PySpark date functions accept string inputs and what happens behind the scenes.
✅ Summary Table: PySpark Date Functions Input Compatibility
Function | Accepts String date? | Behavior |
---|---|---|
to_date() | ✅ Yes | Converts a String to DateType |
trunc() | ❌ No (must be Date) | Will fail or auto-cast only if safe; best to use to_date() explicitly |
add_months() | ❌ No (needs Date) | Requires DateType or TimestampType , not a string |
last_day() | ❌ No (needs Date) | Same as above |
date_sub() / date_add() | ❌ No (needs Date) | Same as above |
date_format() | ✅ Yes | Accepts Date/Timestamp, not pure string; auto-cast if safe |
🔧 Best Practice: Always Convert String to Date First
Before using any of these functions, convert your string column:
from pyspark.sql import functions as F
df = df.withColumn("date_sales", F.to_date("date_sales", "yyyy-MM-dd"))
✅ Safe Usage Example:
df = df.withColumn("date_sales", F.to_date("date_sales", "yyyy-MM-dd"))
df = df.withColumn("month_start", F.trunc("date_sales", "MM"))
df = df.withColumn("prev_month_start", F.add_months(F.col("month_start"), -1))
df = df.withColumn("last_day_prev_month", F.last_day("prev_month_start"))
❌ Unsafe (Error-Prone) Example:
# This will throw an error or behave unexpectedly
df = df.withColumn("month_start", F.trunc("2025-06-14", "MM"))
Because "2025-06-14" is a literal string, not a DateType column: trunc() treats its first argument as a column name, and no such column exists.
🧪 If you want to use a literal date string:
from pyspark.sql.functions import to_date, lit, trunc
df = df.withColumn("truncated", trunc(to_date(lit("2025-06-14")), "MM"))
✅ Rule of Thumb:
Always do this before applying date functions:
F.to_date("date_col", "yyyy-MM-dd")
Q.Users Who Booked in Last 7 Days but Not in Previous 7
Table:
bookings(user_id, booking_date)
Goal: Identify users active this week but not in the previous week.
WITH last_week AS (
SELECT DISTINCT user_id FROM bookings
WHERE booking_date BETWEEN current_date - INTERVAL 14 DAY AND current_date - INTERVAL 8 DAY
),
this_week AS (
SELECT DISTINCT user_id FROM bookings
WHERE booking_date BETWEEN current_date - INTERVAL 7 DAY AND current_date
)
SELECT user_id FROM this_week
WHERE user_id NOT IN (SELECT user_id FROM last_week);
Will this work in Spark SQL?
The query above is logically sound in standard SQL, but it may need minor changes to work correctly in Spark SQL, due to the way date intervals and current_date arithmetic are handled in Spark.
❌ Issues with your Spark SQL:
- Depending on the Spark version, current_date - INTERVAL n DAY arithmetic may not be supported or may behave differently from traditional SQL dialects.
- Using date_sub() instead works consistently across versions.
✅ Correct Spark SQL version:
WITH last_week AS (
SELECT DISTINCT user_id FROM bookings
WHERE booking_date BETWEEN date_sub(current_date(), 14) AND date_sub(current_date(), 8)
),
this_week AS (
SELECT DISTINCT user_id FROM bookings
WHERE booking_date BETWEEN date_sub(current_date(), 7) AND current_date()
)
SELECT user_id FROM this_week
WHERE user_id NOT IN (SELECT user_id FROM last_week);
🔍 Explanation:
- date_sub(current_date(), n) subtracts n days from today.
- This query gives you: last_week, users who booked 8 to 14 days ago; this_week, users who booked in the last 7 days.
- It then selects users in this_week who did not appear in last_week.
✅ Optional optimization (avoid NOT IN with NULL issues):
To handle possible NULLs safely, use a LEFT ANTI JOIN instead:
WITH last_week AS (
SELECT DISTINCT user_id FROM bookings
WHERE booking_date BETWEEN date_sub(current_date(), 14) AND date_sub(current_date(), 8)
),
this_week AS (
SELECT DISTINCT user_id FROM bookings
WHERE booking_date BETWEEN date_sub(current_date(), 7) AND current_date()
)
SELECT tw.user_id
FROM this_week tw
LEFT ANTI JOIN last_week lw
ON tw.user_id = lw.user_id;
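For completeness, here is a hedged DataFrame API equivalent of the anti-join query above, assuming a bookings DataFrame with user_id and booking_date columns:
from pyspark.sql import functions as F
last_week = (bookings
             .filter(F.col("booking_date").between(F.date_sub(F.current_date(), 14),
                                                    F.date_sub(F.current_date(), 8)))
             .select("user_id").distinct())
this_week = (bookings
             .filter(F.col("booking_date").between(F.date_sub(F.current_date(), 7),
                                                    F.current_date()))
             .select("user_id").distinct())
# Keep users present this week but absent last week
result = this_week.join(last_week, on="user_id", how="left_anti")
result.show()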
🔁 Summary:
Issue | Fix |
---|---|
current_date - INTERVAL | Use date_sub(current_date(), n) |
NOT IN | Prefer LEFT ANTI JOIN in Spark SQL for NULL-safe filtering |