A quick reference for date manipulation in PySpark:–

FunctionDescriptionWorks OnExample (Spark SQL)Example (DataFrame API)
to_dateConverts string to date.StringTO_DATE('2024-01-15', 'yyyy-MM-dd')to_date(col("date_str"), "yyyy-MM-dd")
to_timestampConverts string to timestamp.StringTO_TIMESTAMP('2024-01-15 12:34:56', 'yyyy-MM-dd HH:mm:ss')to_timestamp(col("timestamp_str"), "yyyy-MM-dd HH:mm:ss")
date_formatFormats date or timestamp as a string.Date, TimestampDATE_FORMAT(CURRENT_DATE, 'dd-MM-yyyy')date_format(col("date_col"), "dd-MM-yyyy")
current_dateReturns the current date.CURRENT_DATEcurrent_date()
current_timestampReturns the current timestamp.CURRENT_TIMESTAMPcurrent_timestamp()
date_addAdds days to a date.Date, TimestampDATE_ADD('2024-01-15', 10)date_add(col("date_col"), 10)
date_subSubtracts days from a date.Date, TimestampDATE_SUB('2024-01-15', 10)date_sub(col("date_col"), 10)
months_betweenReturns months between two dates or timestamps.Date, TimestampMONTHS_BETWEEN('2024-01-15', '2023-12-15')months_between(col("date1"), col("date2"))
datediffReturns difference in days between two dates/timestamps.Date, TimestampDATEDIFF('2024-01-15', '2024-01-10')datediff(col("date1"), col("date2"))
yearExtracts year from a date or timestamp.Date, TimestampYEAR('2024-01-15')year(col("date_col"))
monthExtracts month from a date or timestamp.Date, TimestampMONTH('2024-01-15')month(col("date_col"))
dayExtracts day from a date or timestamp.Date, TimestampDAY('2024-01-15')day(col("date_col"))
dayofweekReturns the day of the week (1 = Sunday, 7 = Saturday).Date, TimestampDAYOFWEEK('2024-01-15')dayofweek(col("date_col"))
dayofmonthReturns the day of the month (1-31).Date, TimestampDAYOFMONTH('2024-01-15')dayofmonth(col("date_col"))
dayofyearReturns the day of the year (1-366).Date, TimestampDAYOFYEAR('2024-01-15')dayofyear(col("date_col"))
weekofyearReturns the week number of the year.Date, TimestampWEEKOFYEAR('2024-01-15')weekofyear(col("date_col"))
quarterExtracts quarter of the year.Date, TimestampQUARTER('2024-01-15')quarter(col("date_col"))
hourExtracts hour from a timestamp.TimestampHOUR('2024-01-15 12:34:56')hour(col("timestamp_col"))
minuteExtracts minute from a timestamp.TimestampMINUTE('2024-01-15 12:34:56')minute(col("timestamp_col"))
secondExtracts second from a timestamp.TimestampSECOND('2024-01-15 12:34:56')second(col("timestamp_col"))
last_dayReturns the last day of the month for a date.Date, TimestampLAST_DAY('2024-01-15')last_day(col("date_col"))
next_dayReturns the next day of the week after a date.Date, TimestampNEXT_DAY('2024-01-15', 'Sunday')next_day(col("date_col"), "Sunday")
truncTruncates a date to the specified part (e.g., month).Date, TimestampTRUNC('2024-01-15', 'MM')trunc(col("date_col"), "MM")
add_monthsAdds months to a date.Date, TimestampADD_MONTHS('2024-01-15', 2)add_months(col("date_col"), 2)
unix_timestampConverts a string to UNIX timestamp.StringUNIX_TIMESTAMP('2024-01-15', 'yyyy-MM-dd')unix_timestamp(col("date_str"), "yyyy-MM-dd")
from_unixtimeConverts UNIX timestamp to a formatted string.Integer (UNIX timestamp)FROM_UNIXTIME(1673827200, 'yyyy-MM-dd')from_unixtime(col("unix_col"), "yyyy-MM-dd")

Now starts our Post:-


PySpark date functions along with detailed explanations and examples on Dataframes

Here’s a comprehensive list of some common PySpark date functions along with detailed explanations and examples on Dataframes(We will again discuss these basis Pyspark sql Queries):

1. current_date()

Returns the current date.

from pyspark.sql.functions import current_date

df = spark.createDataFrame([(1,)], ["dummy"])
df.select(current_date().alias("current_date")).show()

2. current_timestamp()

Returns the current timestamp.

from pyspark.sql.functions import current_timestamp

df.select(current_timestamp().alias("current_timestamp")).show()

3. date_format()

Formats a date using the specified format.

from pyspark.sql.functions import date_format

df = spark.createDataFrame([("2010-02-05",)], ["date"])
df.select(date_format("date", "MM/dd/yyyy").alias("formatted_date")).show()

4. year()month()dayofmonth()

Extracts the year, month, and day from a date.

from pyspark.sql.functions import year, month, dayofmonth

df.select(year("date").alias("year"),
month("date").alias("month"),
dayofmonth("date").alias("day")).show()

5. date_add()date_sub()

Adds or subtracts a specified number of days to/from a date.

from pyspark.sql.functions import date_add, date_sub
df.select(date_add("date", 10).alias("date_add"),date_sub("date", 10).alias("date_sub")).show()

6. datediff()

Returns the difference in days between two dates.

from pyspark.sql.functions import datediff

df2 = spark.createDataFrame([("2010-02-15",)], ["date2"])
df.join(df2).select(datediff("date2", "date").alias("datediff")).show()

7. add_months()

Adds a specified number of months to a date.

from pyspark.sql.functions import add_months

df.select(add_months("date", 1).alias("add_months")).show()

8. months_between()

Returns the number of months between two dates.

from pyspark.sql.functions import months_between

df.join(df2).select(months_between("date2", "date").alias("months_between")).show()

9. next_day()

Returns the first date which is the specified day of the week after the given date.

from pyspark.sql.functions import next_day

df.select(next_day("date", "Sunday").alias("next_day")).show()

10. last_day()

Returns the last day of the month which the given date belongs to.

from pyspark.sql.functions import last_day

df.select(last_day("date").alias("last_day")).show()

11. trunc()

Truncates a date to the specified unit (‘year’, ‘month’).

from pyspark.sql.functions import trunc

df.select(trunc("date", "MM").alias("trunc_month"),
trunc("date", "YY").alias("trunc_year")).show()

12. date_trunc()

Truncates a timestamp to the specified unit (‘year’, ‘month’, ‘day’, ‘hour’, ‘minute’).

from pyspark.sql.functions import date_trunc

df = spark.createDataFrame([("2010-02-05 12:34:56",)], ["timestamp"])
df.select(date_trunc("hour", "timestamp").alias("date_trunc_hour")).show()

13. from_unixtime()

Converts the number of seconds from the Unix epoch to a string representing the timestamp.

from pyspark.sql.functions import from_unixtime

df = spark.createDataFrame([(1234567890,)], ["unix_time"])
df.select(from_unixtime("unix_time").alias("timestamp")).show()

14. unix_timestamp()

Converts a string timestamp to the number of seconds from the Unix epoch.

from pyspark.sql.functions import unix_timestamp

df = spark.createDataFrame([("2010-02-05 12:34:56",)], ["timestamp"])
df.select(unix_timestamp("timestamp").alias("unix_timestamp")).show()

15. to_date()

Converts a string to a date.

from pyspark.sql.functions import to_date

df = spark.createDataFrame([("2010-02-05",)], ["date_str"])
df.select(to_date("date_str").alias("date")).show()

16. to_timestamp()

Converts a string to a timestamp.

from pyspark.sql.functions import to_timestamp

df = spark.createDataFrame([("2010-02-05 12:34:56",)], ["timestamp_str"])
df.select(to_timestamp("timestamp_str").alias("timestamp")).show()

17. weekofyear()

Returns the week of the year for a date.

from pyspark.sql.functions import weekofyear

df = spark.createDataFrame([("2010-02-05",)], ["date"])
df.select(weekofyear("date").alias("weekofyear")).show()

18. quarter()

Returns the quarter of the year for a date.

from pyspark.sql.functions import quarter

df.select(quarter("date").alias("quarter")).show()

19. dayofweek()

Returns the day of the week for a date.

from pyspark.sql.functions import dayofweek

df.select(dayofweek("date").alias("dayofweek")).show()

Example Project: Combining Multiple Date Functions

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, date_format, year, month, dayofmonth, date_add, date_sub, datediff, add_months, months_between, next_day, last_day, trunc, date_trunc, from_unixtime, unix_timestamp, to_date, to_timestamp, weekofyear, quarter, dayofweek

# Initialize Spark session
spark = SparkSession.builder.appName("date_functions_example").getOrCreate()

# Create a DataFrame with a sample date
data = [("2010-02-05", "2010-02-15 12:34:56", 1234567890)]
columns = ["date", "timestamp", "unix_time"]
df = spark.createDataFrame(data, columns)

# Applying various date functions
result = df.select(
col("date"),
col("timestamp"),
col("unix_time"),
date_format("date", "MM/dd/yyyy").alias("formatted_date"),
year("date").alias("year"),
month("date").alias("month"),
dayofmonth("date").alias("day"),
date_add("date", 10).alias("date_add"),
date_sub("date", 10).alias("date_sub"),
datediff("timestamp", "date").alias("datediff"),
add_months("date", 1).alias("add_months"),
months_between("timestamp", "date").alias("months_between"),
next_day("date", "Sunday").alias("next_day"),
last_day("date").alias("last_day"),
trunc("date", "MM").alias("trunc_month"),
trunc("date", "YY").alias("trunc_year"),
date_trunc("hour", "timestamp").alias("date_trunc_hour"),
from_unixtime("unix_time").alias("from_unixtime"),
unix_timestamp("timestamp").alias("unix_timestamp"),
to_date("timestamp").alias("to_date"),
to_timestamp("timestamp").alias("to_timestamp"),
weekofyear("date").alias("weekofyear"),
quarter("date").alias("quarter"),
dayofweek("date").alias("dayofweek")
)

# Show the results
result.show(truncate=False)
FunctionDescriptionExample
current_date()Returns the current date as a DateType objectcurrent_date()
current_timestamp()Returns the current timestamp (with microseconds) as a TimestampType objectcurrent_timestamp()
date_add(date, days)Adds a specified number of days to a datedate_add('2023-07-05', 10)
datediff(end_date, start_date)Calculates the number of days between two datesdatediff('2024-01-01', '2023-12-31')
months_between(date1, date2)Calculates the number of months between two datesmonths_between('2024-02-01', '2023-07-05')
to_date(string)Converts a string to a DateType object (format-dependent)to_date('2023-07-04', 'yyyy-MM-dd')
to_timestamp(string)Converts a string to a TimestampType object (format-dependent)to_timestamp('2023-07-04 10:20:30', 'yyyy-MM-dd HH:mm:ss')
trunc(date, format)Truncates a date to a specified unit (year, month, day)trunc('2023-07-05', 'year')
year(date)Extracts the year from a dateyear('2023-07-05')
quarter(date)Extracts the quarter (1-4) from a datequarter('2023-07-05')
month(date)Extracts the month (1-12) from a datemonth('2023-07-05')
dayofmonth(date)Extracts the day of the month (1-31) from a datedayofmonth('2023-07-05')
dayofweek(date)Extracts the day of the week (1-7, Monday=1) from a datedayofweek('2023-07-05')
dayofyear(date)Extracts the day of the year (1-365) from a datedayofyear('2023-07-05')
last_day(date)Returns the last day of the month for the given datelast_day('2023-07-05')
unix_timestamp()Returns the number of seconds since epoch (1970-01-01 UTC) for a timestampunix_timestamp()
from_unixtime(timestamp)Converts a unix timestamp to a TimestampType objectfrom_unixtime(1656816000)
date_format(date, format)Formats a date or timestamp according to a specified formatdate_format('2023-07-05', 'dd-MM-yyyy')

PySpark date functions along with detailed explanations and examples on Spark SQL

In PySpark, you can use various built-in date functions directly within SQL queries to manipulate and extract data from timestampdate, and string columns. PySpark’s SQL API provides a wide range of date functions similar to SQL, which allows you to perform operations like formatting, adding/subtracting time intervals, and extracting specific components from dates.

Here are some of the most commonly used PySpark date functions in SQL queries:

1. Current Date and Time Functions

  • current_date() – Returns the current date.
  • current_timestamp() – Returns the current timestamp (date and time).
  • now() – Alias for current_timestamp().

Example:

SELECT current_date() AS today_date,
       current_timestamp() AS current_time
FROM your_table

2. Date and Time Arithmetic Functions

  • date_add(date, num_days) – Adds num_days days to the given date.
  • date_sub(date, num_days) – Subtracts num_days days from the given date.
  • add_months(date, num_months) – Adds num_months months to the given date.
  • datediff(end_date, start_date) – Returns the difference between two dates.
  • months_between(end_date, start_date) – Returns the number of months between two dates.

Example:

SELECT date_add(current_date(), 5) AS five_days_from_now,
       date_sub(current_date(), 10) AS ten_days_ago,
       add_months(current_date(), 3) AS three_months_later,
       datediff('2024-12-31', '2024-01-01') AS days_diff
FROM your_table

3. Date Formatting and Parsing Functions

  • date_format(date, format) – Formats a date according to the given format string.
  • to_date(string) – Converts a string to a date.
  • to_timestamp(string) – Converts a string to a timestamp.
  • unix_timestamp() – Converts a date or timestamp to the number of seconds since the Unix epoch (1970-01-01).
  • from_unixtime(unix_time, format) – Converts Unix time to a string in the specified format.

Example:

SELECT date_format(current_timestamp(), 'yyyy-MM-dd') AS formatted_date,
       to_date('2024-09-01', 'yyyy-MM-dd') AS converted_date,
       unix_timestamp('2024-09-01', 'yyyy-MM-dd') AS unix_time,
       from_unixtime(1693564800, 'yyyy-MM-dd') AS from_unix_time
FROM your_table

4. Extracting Components from Dates

  • year(date) – Extracts the year from a date.
  • month(date) – Extracts the month from a date.
  • day(date) or dayofmonth(date) – Extracts the day of the month from a date.
  • hour(timestamp) – Extracts the hour from a timestamp.
  • minute(timestamp) – Extracts the minute from a timestamp.
  • second(timestamp) – Extracts the second from a timestamp.
  • dayofweek(date) – Returns the day of the week (1 = Sunday, 7 = Saturday).
  • weekofyear(date) – Returns the week of the year for a given date.

Example:

SELECT year(current_date()) AS year,
       month(current_date()) AS month,
       day(current_date()) AS day,
       dayofweek(current_date()) AS day_of_week,
       weekofyear(current_date()) AS week_of_year
FROM your_table

5. Date Truncation Functions

  • trunc(date, format) – Truncates a date to the specified unit (year, month, etc.).
  • date_trunc(format, timestamp) – Truncates a timestamp to the specified unit.

Example:

SELECT trunc(current_date(), 'MM') AS truncated_to_month,
       date_trunc('MM', current_timestamp()) AS timestamp_truncated_to_month
FROM your_table

6.EXTRACT Function in PySpark SQL:-

In PySpark SQL (or standard SQL), extract() is a built-in function for extracting parts of a date or timestamp, such as extracting the year, month, day, hour, minute, etc., from a date or timestamp column.

Syntax:

EXTRACT(field FROM source)
  • field: Specifies the part of the date/time to extract (e.g., YEARMONTHDAYHOURMINUTE, etc.).
  • source: The column containing a date or timestamp.

Example Usage of extract() in PySpark SQL Queries:

If you have a PySpark DataFrame with a date column and you want to extract specific components (such as year or month), you can use extract() in a SQL query.

Example 1: Extracting Year, Month, and Day from a Date:

Assume you have a PySpark DataFrame with a date column in YYYY-MM-DD format, and you want to extract the yearmonth, and day.


from pyspark.sql import SparkSession
# Create a SparkSession
spark = SparkSession.builder.appName("ExtractExample").getOrCreate()
# Sample data
data = [("2023-09-01",), ("2024-12-31",), ("2020-01-20",)]
df = spark.createDataFrame(data, ["DateColumn"])
# Create a temporary SQL table
df.createOrReplaceTempView("date_table")
# SQL query to extract year, month, and day
result = spark.sql("""
    SELECT
        DateColumn,
        EXTRACT(YEAR FROM DateColumn) AS Year,
        EXTRACT(MONTH FROM DateColumn) AS Month,
        EXTRACT(DAY FROM DateColumn) AS Day
    FROM date_table
""")
result.show()

Output:

+----------+----+-----+---+
|DateColumn|Year|Month|Day|
+----------+----+-----+---+
|2023-09-01|2023| 9| 1|
|2024-12-31|2024| 12| 31|
|2020-01-20|2020| 1| 20|
+----------+----+-----+---+

Example 2: Extracting Hour and Minute from a Timestamp:

If you have a timestamp column and you want to extract specific time components such as hour and minute, you can use EXTRACT() in a SQL query.

data = [("2023-09-01 12:30:45",), ("2024-12-31 18:45:00",), ("2020-01-20 07:15:25",)]
df = spark.createDataFrame(data, ["TimestampColumn"])
# Create a temporary SQL table
df.createOrReplaceTempView("timestamp_table")
# SQL query to extract hour and minute from the timestamp
result = spark.sql("""
    SELECT
        TimestampColumn,
        EXTRACT(HOUR FROM TimestampColumn) AS Hour,
        EXTRACT(MINUTE FROM TimestampColumn) AS Minute
    FROM timestamp_table
""")
result.show()

Output:

+-------------------+----+------+
| TimestampColumn |Hour|Minute|
+-------------------+----+------+
|2023-09-01 12:30:45| 12| 30|
|2024-12-31 18:45:00| 18| 45|
|2020-01-20 07:15:25| 7| 15|
+-------------------+----+------+

Fields Supported by EXTRACT():

You can extract the following parts from a date or timestamp:

  • YEAR: Extracts the year.
  • MONTH: Extracts the month.
  • DAY: Extracts the day.
  • HOUR: Extracts the hour from a timestamp.
  • MINUTE: Extracts the minute from a timestamp.
  • SECOND: Extracts the second from a timestamp.

Using EXTRACT() in PySpark’s SQL Queries:

In PySpark, you typically define a temporary view using createOrReplaceTempView() to run SQL queries with the EXTRACT() function.

Example Scenario:

If you have a dataset with timestamps and you want to run a SQL query to extract the hour and minute, you can write a PySpark SQL query like:

df.createOrReplaceTempView("log_table")
spark.sql("""
    SELECT
        EXTRACT(HOUR FROM timestamp_column) AS Hour,
        EXTRACT(MINUTE FROM timestamp_column) AS Minute
    FROM log_table
""").show()

This query will return the hour and minute extracted from the timestamp column.

  • The EXTRACT() function in Spark SQL allows you to extract specific parts (like year, month, day, hour, minute) from date or timestamp columns.
  • It is used in SQL-based queries within PySpark, similar to how you’d use SQL functions for querying structured data.

So, if you saw extract() in a PySpark SQL query, it was likely extracting parts of a date or timestamp column.

7. Other Useful Date Functions

  • last_day(date) – Returns the last day of the month for a given date.
  • next_day(date, day_of_week) – Returns the next date after the given date for the specified day of the week.
  • quarter(date) – Returns the quarter of the year (1, 2, 3, or 4).

Example:

SELECT last_day(current_date()) AS last_day_of_month,
       next_day(current_date(), 'Monday') AS next_monday,
       quarter(current_date()) AS current_quarter
FROM your_table

Complete Example of a PySpark SQL Query:

Assume you have a table events with a timestamp column named event_time. Here’s a query that selects events from the current year, formats the event date, and extracts various components from the timestamp:

SELECT event_time,
       date_format(event_time, 'yyyy-MM-dd') AS formatted_date,
       year(event_time) AS event_year,
       month(event_time) AS event_month,
       dayofmonth(event_time) AS event_day,
       hour(event_time) AS event_hour,
       minute(event_time) AS event_minute,
       second(event_time) AS event_second,
       quarter(event_time) AS event_quarter,
       weekofyear(event_time) AS event_week
FROM events
WHERE year(event_time) = year(current_date())

Notes:

Integration in PySpark: You can execute SQL queries with date functions using spark.sql().

For example:

result = spark.sql(""" SELECT date_add(current_date(), 5) AS five_days_from_now, year(current_timestamp()) AS current_year FROM your_table """) result.show()

Date Functions for Filtering: You can use these functions in the WHERE clause to filter data based on dates:

SELECT * FROM your_table WHERE event_time >= '2023-01-01' AND event_time <= '2023-12-31'

These built-in PySpark SQL date functions provide extensive capabilities for handling, formatting, and manipulating date and time data at scale in Spark applications.


Pages: 1 2 3 4 5

Posted in ,

Leave a Reply

Your email address will not be published. Required fields are marked *