A quick reference for date manipulation in PySpark:
Function | Description | Works On | Example (Spark SQL) | Example (DataFrame API) |
---|---|---|---|---|
to_date | Converts string to date. | String | TO_DATE('2024-01-15', 'yyyy-MM-dd') | to_date(col("date_str"), "yyyy-MM-dd") |
to_timestamp | Converts string to timestamp. | String | TO_TIMESTAMP('2024-01-15 12:34:56', 'yyyy-MM-dd HH:mm:ss') | to_timestamp(col("timestamp_str"), "yyyy-MM-dd HH:mm:ss") |
date_format | Formats date or timestamp as a string. | Date, Timestamp | DATE_FORMAT(CURRENT_DATE, 'dd-MM-yyyy') | date_format(col("date_col"), "dd-MM-yyyy") |
current_date | Returns the current date. | – | CURRENT_DATE | current_date() |
current_timestamp | Returns the current timestamp. | – | CURRENT_TIMESTAMP | current_timestamp() |
date_add | Adds days to a date. | Date, Timestamp | DATE_ADD('2024-01-15', 10) | date_add(col("date_col"), 10) |
date_sub | Subtracts days from a date. | Date, Timestamp | DATE_SUB('2024-01-15', 10) | date_sub(col("date_col"), 10) |
months_between | Returns months between two dates or timestamps. | Date, Timestamp | MONTHS_BETWEEN('2024-01-15', '2023-12-15') | months_between(col("date1"), col("date2")) |
datediff | Returns difference in days between two dates/timestamps. | Date, Timestamp | DATEDIFF('2024-01-15', '2024-01-10') | datediff(col("date1"), col("date2")) |
year | Extracts year from a date or timestamp. | Date, Timestamp | YEAR('2024-01-15') | year(col("date_col")) |
month | Extracts month from a date or timestamp. | Date, Timestamp | MONTH('2024-01-15') | month(col("date_col")) |
day | Extracts day from a date or timestamp. | Date, Timestamp | DAY('2024-01-15') | day(col("date_col")) |
dayofweek | Returns the day of the week (1 = Sunday, 7 = Saturday). | Date, Timestamp | DAYOFWEEK('2024-01-15') | dayofweek(col("date_col")) |
dayofmonth | Returns the day of the month (1-31). | Date, Timestamp | DAYOFMONTH('2024-01-15') | dayofmonth(col("date_col")) |
dayofyear | Returns the day of the year (1-366). | Date, Timestamp | DAYOFYEAR('2024-01-15') | dayofyear(col("date_col")) |
weekofyear | Returns the week number of the year. | Date, Timestamp | WEEKOFYEAR('2024-01-15') | weekofyear(col("date_col")) |
quarter | Extracts quarter of the year. | Date, Timestamp | QUARTER('2024-01-15') | quarter(col("date_col")) |
hour | Extracts hour from a timestamp. | Timestamp | HOUR('2024-01-15 12:34:56') | hour(col("timestamp_col")) |
minute | Extracts minute from a timestamp. | Timestamp | MINUTE('2024-01-15 12:34:56') | minute(col("timestamp_col")) |
second | Extracts second from a timestamp. | Timestamp | SECOND('2024-01-15 12:34:56') | second(col("timestamp_col")) |
last_day | Returns the last day of the month for a date. | Date, Timestamp | LAST_DAY('2024-01-15') | last_day(col("date_col")) |
next_day | Returns the next day of the week after a date. | Date, Timestamp | NEXT_DAY('2024-01-15', 'Sunday') | next_day(col("date_col"), "Sunday") |
trunc | Truncates a date to the specified part (e.g., month). | Date, Timestamp | TRUNC('2024-01-15', 'MM') | trunc(col("date_col"), "MM") |
add_months | Adds months to a date. | Date, Timestamp | ADD_MONTHS('2024-01-15', 2) | add_months(col("date_col"), 2) |
unix_timestamp | Converts a string to UNIX timestamp. | String | UNIX_TIMESTAMP('2024-01-15', 'yyyy-MM-dd') | unix_timestamp(col("date_str"), "yyyy-MM-dd") |
from_unixtime | Converts UNIX timestamp to a formatted string. | Integer (UNIX timestamp) | FROM_UNIXTIME(1673827200, 'yyyy-MM-dd') | from_unixtime(col("unix_col"), "yyyy-MM-dd") |
Now on to the post itself:
PySpark date functions with detailed explanations and examples on DataFrames
Here's a comprehensive list of common PySpark date functions, with detailed explanations and DataFrame examples (we will revisit these later using PySpark SQL queries):
1. current_date()
Returns the current date.
from pyspark.sql.functions import current_date
df = spark.createDataFrame([(1,)], ["dummy"])
df.select(current_date().alias("current_date")).show()
2. current_timestamp()
Returns the current timestamp.
from pyspark.sql.functions import current_timestamp
df.select(current_timestamp().alias("current_timestamp")).show()
3. date_format()
Formats a date using the specified format.
from pyspark.sql.functions import date_format
df = spark.createDataFrame([("2010-02-05",)], ["date"])
df.select(date_format("date", "MM/dd/yyyy").alias("formatted_date")).show()
4. year(), month(), dayofmonth()
Extracts the year, month, and day from a date.
from pyspark.sql.functions import year, month, dayofmonth
df.select(year("date").alias("year"),
month("date").alias("month"),
dayofmonth("date").alias("day")).show()
5. date_add(), date_sub()
Adds or subtracts a specified number of days to/from a date.
from pyspark.sql.functions import date_add, date_sub
df.select(date_add("date", 10).alias("date_add"), date_sub("date", 10).alias("date_sub")).show()
6. datediff()
Returns the difference in days between two dates.
from pyspark.sql.functions import datediff
df2 = spark.createDataFrame([("2010-02-15",)], ["date2"])
df.crossJoin(df2).select(datediff("date2", "date").alias("datediff")).show()
7. add_months()
Adds a specified number of months to a date.
from pyspark.sql.functions import add_months
df.select(add_months("date", 1).alias("add_months")).show()
8. months_between()
Returns the number of months between two dates.
from pyspark.sql.functions import months_between
df.crossJoin(df2).select(months_between("date2", "date").alias("months_between")).show()
9. next_day()
Returns the first date which is the specified day of the week after the given date.
from pyspark.sql.functions import next_day
df.select(next_day("date", "Sunday").alias("next_day")).show()
10. last_day()
Returns the last day of the month which the given date belongs to.
from pyspark.sql.functions import last_day
df.select(last_day("date").alias("last_day")).show()
11. trunc()
Truncates a date to the specified unit (‘year’, ‘month’).
from pyspark.sql.functions import trunc
df.select(trunc("date", "MM").alias("trunc_month"),
trunc("date", "YY").alias("trunc_year")).show()
12. date_trunc()
Truncates a timestamp to the specified unit (‘year’, ‘month’, ‘day’, ‘hour’, ‘minute’).
from pyspark.sql.functions import date_trunc
df = spark.createDataFrame([("2010-02-05 12:34:56",)], ["timestamp"])
df.select(date_trunc("hour", "timestamp").alias("date_trunc_hour")).show()
13. from_unixtime()
Converts the number of seconds from the Unix epoch to a string representing the timestamp.
from pyspark.sql.functions import from_unixtime
df = spark.createDataFrame([(1234567890,)], ["unix_time"])
df.select(from_unixtime("unix_time").alias("timestamp")).show()
14. unix_timestamp()
Converts a string timestamp to the number of seconds from the Unix epoch.
from pyspark.sql.functions import unix_timestamp
df = spark.createDataFrame([("2010-02-05 12:34:56",)], ["timestamp"])
df.select(unix_timestamp("timestamp").alias("unix_timestamp")).show()
15. to_date()
Converts a string to a date.
from pyspark.sql.functions import to_date
df = spark.createDataFrame([("2010-02-05",)], ["date_str"])
df.select(to_date("date_str").alias("date")).show()
16. to_timestamp()
Converts a string to a timestamp.
from pyspark.sql.functions import to_timestamp
df = spark.createDataFrame([("2010-02-05 12:34:56",)], ["timestamp_str"])
df.select(to_timestamp("timestamp_str").alias("timestamp")).show()
17. weekofyear()
Returns the week of the year for a date.
from pyspark.sql.functions import weekofyear
df = spark.createDataFrame([("2010-02-05",)], ["date"])
df.select(weekofyear("date").alias("weekofyear")).show()
18. quarter()
Returns the quarter of the year for a date.
from pyspark.sql.functions import quarter
df.select(quarter("date").alias("quarter")).show()
19. dayofweek()
Returns the day of the week for a date.
from pyspark.sql.functions import dayofweek
df.select(dayofweek("date").alias("dayofweek")).show()
Example Project: Combining Multiple Date Functions
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, date_format, year, month, dayofmonth, date_add, date_sub, datediff, add_months, months_between, next_day, last_day, trunc, date_trunc, from_unixtime, unix_timestamp, to_date, to_timestamp, weekofyear, quarter, dayofweek
# Initialize Spark session
spark = SparkSession.builder.appName("date_functions_example").getOrCreate()
# Create a DataFrame with a sample date
data = [("2010-02-05", "2010-02-15 12:34:56", 1234567890)]
columns = ["date", "timestamp", "unix_time"]
df = spark.createDataFrame(data, columns)
# Applying various date functions
result = df.select(
col("date"),
col("timestamp"),
col("unix_time"),
date_format("date", "MM/dd/yyyy").alias("formatted_date"),
year("date").alias("year"),
month("date").alias("month"),
dayofmonth("date").alias("day"),
date_add("date", 10).alias("date_add"),
date_sub("date", 10).alias("date_sub"),
datediff("timestamp", "date").alias("datediff"),
add_months("date", 1).alias("add_months"),
months_between("timestamp", "date").alias("months_between"),
next_day("date", "Sunday").alias("next_day"),
last_day("date").alias("last_day"),
trunc("date", "MM").alias("trunc_month"),
trunc("date", "YY").alias("trunc_year"),
date_trunc("hour", "timestamp").alias("date_trunc_hour"),
from_unixtime("unix_time").alias("from_unixtime"),
unix_timestamp("timestamp").alias("unix_timestamp"),
to_date("timestamp").alias("to_date"),
to_timestamp("timestamp").alias("to_timestamp"),
weekofyear("date").alias("weekofyear"),
quarter("date").alias("quarter"),
dayofweek("date").alias("dayofweek")
)
# Show the results
result.show(truncate=False)
Function | Description | Example |
---|---|---|
current_date() | Returns the current date as a DateType object | current_date() |
current_timestamp() | Returns the current timestamp (with microseconds) as a TimestampType object | current_timestamp() |
date_add(date, days) | Adds a specified number of days to a date | date_add('2023-07-05', 10) |
datediff(end_date, start_date) | Calculates the number of days between two dates | datediff('2024-01-01', '2023-12-31') |
months_between(date1, date2) | Calculates the number of months between two dates | months_between('2024-02-01', '2023-07-05') |
to_date(string) | Converts a string to a DateType object (format-dependent) | to_date('2023-07-04', 'yyyy-MM-dd') |
to_timestamp(string) | Converts a string to a TimestampType object (format-dependent) | to_timestamp('2023-07-04 10:20:30', 'yyyy-MM-dd HH:mm:ss') |
trunc(date, format) | Truncates a date to a specified unit (year, month, day) | trunc('2023-07-05', 'year') |
year(date) | Extracts the year from a date | year('2023-07-05') |
quarter(date) | Extracts the quarter (1-4) from a date | quarter('2023-07-05') |
month(date) | Extracts the month (1-12) from a date | month('2023-07-05') |
dayofmonth(date) | Extracts the day of the month (1-31) from a date | dayofmonth('2023-07-05') |
dayofweek(date) | Extracts the day of the week (1-7, Sunday = 1) from a date | dayofweek('2023-07-05') |
dayofyear(date) | Extracts the day of the year (1-366) from a date | dayofyear('2023-07-05') |
last_day(date) | Returns the last day of the month for the given date | last_day('2023-07-05') |
unix_timestamp() | Returns the number of seconds since epoch (1970-01-01 UTC) for a timestamp | unix_timestamp() |
from_unixtime(timestamp) | Converts a Unix timestamp (seconds since epoch) to a formatted timestamp string | from_unixtime(1656816000) |
date_format(date, format) | Formats a date or timestamp according to a specified format | date_format('2023-07-05', 'dd-MM-yyyy') |
PySpark date functions along with detailed explanations and examples on Spark SQL
In PySpark, you can use various built-in date functions directly within SQL queries to manipulate and extract data from timestamp, date, and string columns. PySpark's SQL API provides a wide range of date functions similar to SQL, which allows you to perform operations like formatting, adding/subtracting time intervals, and extracting specific components from dates.
Here are some of the most commonly used PySpark date functions in SQL queries:
1. Current Date and Time Functions
- current_date() – Returns the current date.
- current_timestamp() – Returns the current timestamp (date and time).
- now() – Alias for current_timestamp().
Example:
SELECT current_date() AS today_date,
current_timestamp() AS current_time
FROM your_table
2. Date and Time Arithmetic Functions
- date_add(date, num_days) – Adds num_days days to the given date.
- date_sub(date, num_days) – Subtracts num_days days from the given date.
- add_months(date, num_months) – Adds num_months months to the given date.
- datediff(end_date, start_date) – Returns the difference in days between two dates.
- months_between(end_date, start_date) – Returns the number of months between two dates.
Example:
SELECT date_add(current_date(), 5) AS five_days_from_now,
date_sub(current_date(), 10) AS ten_days_ago,
add_months(current_date(), 3) AS three_months_later,
datediff('2024-12-31', '2024-01-01') AS days_diff
FROM your_table
3. Date Formatting and Parsing Functions
- date_format(date, format) – Formats a date according to the given format string.
- to_date(string) – Converts a string to a date.
- to_timestamp(string) – Converts a string to a timestamp.
- unix_timestamp() – Converts a date or timestamp to the number of seconds since the Unix epoch (1970-01-01).
- from_unixtime(unix_time, format) – Converts Unix time to a string in the specified format.
Example:
SELECT date_format(current_timestamp(), 'yyyy-MM-dd') AS formatted_date,
to_date('2024-09-01', 'yyyy-MM-dd') AS converted_date,
unix_timestamp('2024-09-01', 'yyyy-MM-dd') AS unix_time,
from_unixtime(1693564800, 'yyyy-MM-dd') AS from_unix_time
FROM your_table
4. Extracting Components from Dates
- year(date) – Extracts the year from a date.
- month(date) – Extracts the month from a date.
- day(date) or dayofmonth(date) – Extracts the day of the month from a date.
- hour(timestamp) – Extracts the hour from a timestamp.
- minute(timestamp) – Extracts the minute from a timestamp.
- second(timestamp) – Extracts the second from a timestamp.
- dayofweek(date) – Returns the day of the week (1 = Sunday, 7 = Saturday).
- weekofyear(date) – Returns the week of the year for a given date.
Example:
SELECT year(current_date()) AS year,
month(current_date()) AS month,
day(current_date()) AS day,
dayofweek(current_date()) AS day_of_week,
weekofyear(current_date()) AS week_of_year
FROM your_table
5. Date Truncation Functions
- trunc(date, format) – Truncates a date to the specified unit (year, month, etc.).
- date_trunc(format, timestamp) – Truncates a timestamp to the specified unit.
Example:
SELECT trunc(current_date(), 'MM') AS truncated_to_month,
date_trunc('MM', current_timestamp()) AS timestamp_truncated_to_month
FROM your_table
6. EXTRACT Function in PySpark SQL
In PySpark SQL (or standard SQL), extract() is a built-in function for extracting parts of a date or timestamp, such as the year, month, day, hour, or minute, from a date or timestamp column.
Syntax:
EXTRACT(field FROM source)
- field: Specifies the part of the date/time to extract (e.g., YEAR, MONTH, DAY, HOUR, MINUTE).
- source: The column containing a date or timestamp.
Example Usage of extract() in PySpark SQL Queries:
If you have a PySpark DataFrame with a date column and you want to extract specific components (such as year or month), you can use extract() in a SQL query.
Example 1: Extracting Year, Month, and Day from a Date:
Assume you have a PySpark DataFrame with a date column in YYYY-MM-DD format, and you want to extract the year, month, and day.
from pyspark.sql import SparkSession
# Create a SparkSession
spark = SparkSession.builder.appName("ExtractExample").getOrCreate()
# Sample data
data = [("2023-09-01",), ("2024-12-31",), ("2020-01-20",)]
df = spark.createDataFrame(data, ["DateColumn"])
# Create a temporary SQL table
df.createOrReplaceTempView("date_table")
# SQL query to extract year, month, and day
result = spark.sql("""
SELECT
DateColumn,
EXTRACT(YEAR FROM DateColumn) AS Year,
EXTRACT(MONTH FROM DateColumn) AS Month,
EXTRACT(DAY FROM DateColumn) AS Day
FROM date_table
""")
result.show()
Output:
+----------+----+-----+---+
|DateColumn|Year|Month|Day|
+----------+----+-----+---+
|2023-09-01|2023| 9| 1|
|2024-12-31|2024| 12| 31|
|2020-01-20|2020| 1| 20|
+----------+----+-----+---+
Example 2: Extracting Hour and Minute from a Timestamp:
If you have a timestamp column and you want to extract specific time components such as hour and minute, you can use EXTRACT() in a SQL query.
data = [("2023-09-01 12:30:45",), ("2024-12-31 18:45:00",), ("2020-01-20 07:15:25",)]
df = spark.createDataFrame(data, ["TimestampColumn"])
# Create a temporary SQL table
df.createOrReplaceTempView("timestamp_table")
# SQL query to extract hour and minute from the timestamp
result = spark.sql("""
SELECT
TimestampColumn,
EXTRACT(HOUR FROM TimestampColumn) AS Hour,
EXTRACT(MINUTE FROM TimestampColumn) AS Minute
FROM timestamp_table
""")
result.show()
Output:
+-------------------+----+------+
| TimestampColumn |Hour|Minute|
+-------------------+----+------+
|2023-09-01 12:30:45| 12| 30|
|2024-12-31 18:45:00| 18| 45|
|2020-01-20 07:15:25| 7| 15|
+-------------------+----+------+
Fields Supported by EXTRACT():
You can extract the following parts from a date or timestamp:
- YEAR: Extracts the year.
- MONTH: Extracts the month.
- DAY: Extracts the day.
- HOUR: Extracts the hour from a timestamp.
- MINUTE: Extracts the minute from a timestamp.
- SECOND: Extracts the second from a timestamp.
Using EXTRACT() in PySpark's SQL Queries:
In PySpark, you typically define a temporary view using createOrReplaceTempView() to run SQL queries with the EXTRACT() function.
Example Scenario:
If you have a dataset with timestamps and you want to run a SQL query to extract the hour and minute, you can write a PySpark SQL query like:
df.createOrReplaceTempView("log_table")
spark.sql("""
SELECT
EXTRACT(HOUR FROM timestamp_column) AS Hour,
EXTRACT(MINUTE FROM timestamp_column) AS Minute
FROM log_table
""").show()
This query will return the hour and minute extracted from the timestamp column.
- The EXTRACT() function in Spark SQL allows you to extract specific parts (like year, month, day, hour, minute) from date or timestamp columns.
- It is used in SQL-based queries within PySpark, similar to how you'd use SQL functions for querying structured data.
So, if you saw extract() in a PySpark SQL query, it was likely extracting parts of a date or timestamp column.
7. Other Useful Date Functions
- last_day(date) – Returns the last day of the month for a given date.
- next_day(date, day_of_week) – Returns the next date after the given date for the specified day of the week.
- quarter(date) – Returns the quarter of the year (1, 2, 3, or 4).
Example:
SELECT last_day(current_date()) AS last_day_of_month,
next_day(current_date(), 'Monday') AS next_monday,
quarter(current_date()) AS current_quarter
FROM your_table
Complete Example of a PySpark SQL Query:
Assume you have a table events with a timestamp column named event_time. Here's a query that selects events from the current year, formats the event date, and extracts various components from the timestamp:
SELECT event_time,
date_format(event_time, 'yyyy-MM-dd') AS formatted_date,
year(event_time) AS event_year,
month(event_time) AS event_month,
dayofmonth(event_time) AS event_day,
hour(event_time) AS event_hour,
minute(event_time) AS event_minute,
second(event_time) AS event_second,
quarter(event_time) AS event_quarter,
weekofyear(event_time) AS event_week
FROM events
WHERE year(event_time) = year(current_date())
Notes:
Integration in PySpark: You can execute SQL queries with date functions using spark.sql().
For example:
result = spark.sql("""
    SELECT date_add(current_date(), 5) AS five_days_from_now,
           year(current_timestamp()) AS current_year
    FROM your_table
""")
result.show()
Date Functions for Filtering: You can use these functions in the WHERE clause to filter data based on dates:
SELECT * FROM your_table WHERE event_time >= '2023-01-01' AND event_time <= '2023-12-31'
These built-in PySpark SQL date functions provide extensive capabilities for handling, formatting, and manipulating date and time data at scale in Spark applications.