Here’s a comprehensive list of some common PySpark date functions along with detailed explanations and examples on Dataframes(We will again discuss thess basis Pyspark sql Queries):

1. current_date()

Returns the current date.

2. current_timestamp()

Returns the current timestamp.

3. date_format()

Formats a date using the specified format.

4. year(), month(), dayofmonth()

Extracts the year, month, and day from a date.

5. date_add(), date_sub()

Adds or subtracts a specified number of days to/from a date.

6. datediff()

Returns the difference in days between two dates.

7. add_months()

Adds a specified number of months to a date.

8. months_between()

Returns the number of months between two dates.

9. next_day()

Returns the first date which is the specified day of the week after the given date.

10. last_day()

Returns the last day of the month which the given date belongs to.

11. trunc()

Truncates a date to the specified unit (‘year’, ‘month’).

12. date_trunc()

Truncates a timestamp to the specified unit (‘year’, ‘month’, ‘day’, ‘hour’, ‘minute’).

13. from_unixtime()

Converts the number of seconds from the Unix epoch to a string representing the timestamp.

14. unix_timestamp()

Converts a string timestamp to the number of seconds from the Unix epoch.

15. to_date()

Converts a string to a date.

16. to_timestamp()

Converts a string to a timestamp.

17. weekofyear()

Returns the week of the year for a date.

18. quarter()

Returns the quarter of the year for a date.

19. dayofweek()

Returns the day of the week for a date.

Example Project: Combining Multiple Date Functions


In PySpark, you can use various built-in date functions directly within SQL queries to manipulate and extract data from timestamp, date, and string columns. PySpark’s SQL API provides a wide range of date functions similar to SQL, which allows you to perform operations like formatting, adding/subtracting time intervals, and extracting specific components from dates.

Here are some of the most commonly used PySpark date functions in SQL queries:

1. Current Date and Time Functions

  • current_date() – Returns the current date.
  • current_timestamp() – Returns the current timestamp (date and time).
  • now() – Alias for current_timestamp().

Example:

SELECT current_date() AS today_date,
       current_timestamp() AS current_time
FROM your_table

2. Date and Time Arithmetic Functions

  • date_add(date, num_days) – Adds num_days days to the given date.
  • date_sub(date, num_days) – Subtracts num_days days from the given date.
  • add_months(date, num_months) – Adds num_months months to the given date.
  • datediff(end_date, start_date) – Returns the difference between two dates.
  • months_between(end_date, start_date) – Returns the number of months between two dates.

Example:

SELECT date_add(current_date(), 5) AS five_days_from_now,
       date_sub(current_date(), 10) AS ten_days_ago,
       add_months(current_date(), 3) AS three_months_later,
       datediff('2024-12-31', '2024-01-01') AS days_diff
FROM your_table

3. Date Formatting and Parsing Functions

  • date_format(date, format) – Formats a date according to the given format string.
  • to_date(string) – Converts a string to a date.
  • to_timestamp(string) – Converts a string to a timestamp.
  • unix_timestamp() – Converts a date or timestamp to the number of seconds since the Unix epoch (1970-01-01).
  • from_unixtime(unix_time, format) – Converts Unix time to a string in the specified format.

Example:

SELECT date_format(current_timestamp(), 'yyyy-MM-dd') AS formatted_date,
       to_date('2024-09-01', 'yyyy-MM-dd') AS converted_date,
       unix_timestamp('2024-09-01', 'yyyy-MM-dd') AS unix_time,
       from_unixtime(1693564800, 'yyyy-MM-dd') AS from_unix_time
FROM your_table

4. Extracting Components from Dates

  • year(date) – Extracts the year from a date.
  • month(date) – Extracts the month from a date.
  • day(date) or dayofmonth(date) – Extracts the day of the month from a date.
  • hour(timestamp) – Extracts the hour from a timestamp.
  • minute(timestamp) – Extracts the minute from a timestamp.
  • second(timestamp) – Extracts the second from a timestamp.
  • dayofweek(date) – Returns the day of the week (1 = Sunday, 7 = Saturday).
  • weekofyear(date) – Returns the week of the year for a given date.

Example:

SELECT year(current_date()) AS year,
       month(current_date()) AS month,
       day(current_date()) AS day,
       dayofweek(current_date()) AS day_of_week,
       weekofyear(current_date()) AS week_of_year
FROM your_table

5. Date Truncation Functions

  • trunc(date, format) – Truncates a date to the specified unit (year, month, etc.).
  • date_trunc(format, timestamp) – Truncates a timestamp to the specified unit.

Example:

SELECT trunc(current_date(), 'MM') AS truncated_to_month, 
       date_trunc('MM', current_timestamp()) AS timestamp_truncated_to_month
FROM your_table

6.EXTRACT Function in PySpark SQL:-

In PySpark SQL (or standard SQL), extract() is a built-in function for extracting parts of a date or timestamp, such as extracting the year, month, day, hour, minute, etc., from a date or timestamp column.

Syntax:

EXTRACT(field FROM source)
  • field: Specifies the part of the date/time to extract (e.g., YEAR, MONTH, DAY, HOUR, MINUTE, etc.).
  • source: The column containing a date or timestamp.

Example Usage of extract() in PySpark SQL Queries:

If you have a PySpark DataFrame with a date column and you want to extract specific components (such as year or month), you can use extract() in a SQL query.

Example 1: Extracting Year, Month, and Day from a Date:

Assume you have a PySpark DataFrame with a date column in YYYY-MM-DD format, and you want to extract the year, month, and day.


from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.appName("ExtractExample").getOrCreate()

# Sample data
data = [("2023-09-01",), ("2024-12-31",), ("2020-01-20",)]
df = spark.createDataFrame(data, ["DateColumn"])

# Create a temporary SQL table
df.createOrReplaceTempView("date_table")

# SQL query to extract year, month, and day
result = spark.sql("""
    SELECT 
        DateColumn,
        EXTRACT(YEAR FROM DateColumn) AS Year,
        EXTRACT(MONTH FROM DateColumn) AS Month,
        EXTRACT(DAY FROM DateColumn) AS Day
    FROM date_table
""")

result.show()

Output:

+----------+----+-----+---+
|DateColumn|Year|Month|Day|
+----------+----+-----+---+
|2023-09-01|2023| 9| 1|
|2024-12-31|2024| 12| 31|
|2020-01-20|2020| 1| 20|
+----------+----+-----+---+

Example 2: Extracting Hour and Minute from a Timestamp:

If you have a timestamp column and you want to extract specific time components such as hour and minute, you can use EXTRACT() in a SQL query.

data = [("2023-09-01 12:30:45",), ("2024-12-31 18:45:00",), ("2020-01-20 07:15:25",)]
df = spark.createDataFrame(data, ["TimestampColumn"])

# Create a temporary SQL table
df.createOrReplaceTempView("timestamp_table")

# SQL query to extract hour and minute from the timestamp
result = spark.sql("""
    SELECT 
        TimestampColumn,
        EXTRACT(HOUR FROM TimestampColumn) AS Hour,
        EXTRACT(MINUTE FROM TimestampColumn) AS Minute
    FROM timestamp_table
""")

result.show()

Output:

+-------------------+----+------+
| TimestampColumn |Hour|Minute|
+-------------------+----+------+
|2023-09-01 12:30:45| 12| 30|
|2024-12-31 18:45:00| 18| 45|
|2020-01-20 07:15:25| 7| 15|
+-------------------+----+------+

Fields Supported by EXTRACT():

You can extract the following parts from a date or timestamp:

  • YEAR: Extracts the year.
  • MONTH: Extracts the month.
  • DAY: Extracts the day.
  • HOUR: Extracts the hour from a timestamp.
  • MINUTE: Extracts the minute from a timestamp.
  • SECOND: Extracts the second from a timestamp.

Using EXTRACT() in PySpark’s SQL Queries:

In PySpark, you typically define a temporary view using createOrReplaceTempView() to run SQL queries with the EXTRACT() function.

Example Scenario:

If you have a dataset with timestamps and you want to run a SQL query to extract the hour and minute, you can write a PySpark SQL query like:

df.createOrReplaceTempView("log_table")

spark.sql("""
    SELECT 
        EXTRACT(HOUR FROM timestamp_column) AS Hour,
        EXTRACT(MINUTE FROM timestamp_column) AS Minute
    FROM log_table
""").show()

This query will return the hour and minute extracted from the timestamp column.

  • The EXTRACT() function in Spark SQL allows you to extract specific parts (like year, month, day, hour, minute) from date or timestamp columns.
  • It is used in SQL-based queries within PySpark, similar to how you’d use SQL functions for querying structured data.

So, if you saw extract() in a PySpark SQL query, it was likely extracting parts of a date or timestamp column.

7. Other Useful Date Functions

  • last_day(date) – Returns the last day of the month for a given date.
  • next_day(date, day_of_week) – Returns the next date after the given date for the specified day of the week.
  • quarter(date) – Returns the quarter of the year (1, 2, 3, or 4).

Example:

SELECT last_day(current_date()) AS last_day_of_month,
       next_day(current_date(), 'Monday') AS next_monday,
       quarter(current_date()) AS current_quarter
FROM your_table

Complete Example of a PySpark SQL Query:

Assume you have a table events with a timestamp column named event_time. Here’s a query that selects events from the current year, formats the event date, and extracts various components from the timestamp:

SELECT event_time,
       date_format(event_time, 'yyyy-MM-dd') AS formatted_date,
       year(event_time) AS event_year,
       month(event_time) AS event_month,
       dayofmonth(event_time) AS event_day,
       hour(event_time) AS event_hour,
       minute(event_time) AS event_minute,
       second(event_time) AS event_second,
       quarter(event_time) AS event_quarter,
       weekofyear(event_time) AS event_week
FROM events
WHERE year(event_time) = year(current_date())

Notes:

Integration in PySpark: You can execute SQL queries with date functions using spark.sql().

For example: result = spark.sql(""" SELECT date_add(current_date(), 5) AS five_days_from_now, year(current_timestamp()) AS current_year FROM your_table """) result.show()

Date Functions for Filtering: You can use these functions in the WHERE clause to filter data based on dates:

SELECT * FROM your_table WHERE event_time >= '2023-01-01' AND event_time <= '2023-12-31'

These built-in PySpark SQL date functions provide extensive capabilities for handling, formatting, and manipulating date and time data at scale in Spark applications.

To generate a dynamic list of the next and previous 12 months in yyyyMM format

you can generate a dynamic list of the next and previous 12 months in yyyyMM format using PySpark SQL functions like date_format() and date_add(). PySpark provides built-in functions that make it easier to manipulate dates directly in SQL queries without needing to rely on Python’s datetime module or manually handling date calculations.

Here’s how you can do it purely using PySpark SQL functions:

Example Code

from pyspark.sql import SparkSession
from pyspark.sql.functions import current_date, expr

# Initialize Spark session
spark = SparkSession.builder \
    .appName("Generate YearMonth List using SQL with .format()") \
    .getOrCreate()

# Create a DataFrame with a single row and the current date
df = spark.createDataFrame([(1,)], ["id"]).withColumn("current_date", current_date())

# Generate SQL expressions for the next 12 months and the previous 12 months using .format()
# We are using a range from -12 to +12 (last 12 months, current month, and next 12 months)
exprs = []
for i in range(-12, 13):
    # Using .format() to create dynamic SQL expressions
    exprs.append("date_format(date_add(current_date, {}), 'yyyyMM') as month_{}".format(i * 30, i))

# Select these dynamically generated columns in the PySpark DataFrame
months_df = df.selectExpr(*exprs)

# Show the DataFrame with all the generated months
months_df.show(truncate=False)

Explanation:

  1. current_date(): Generates the current date. This is a PySpark function that automatically gets the current system date.
  2. date_add(current_date, {i * 30}): Adds or subtracts 30 days (approximating 1 month) to the current date. We use a range from -12 to 12 to get the past and future months.
  3. date_format(): Converts the resulting date to the yyyyMM format.
  4. selectExpr(): Dynamically selects columns using SQL expressions for each month in the range from -12 to +12, creating column names like month_-12, month_0, month_12.

Output Example:

+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+
|month_-12|month_-11|month_-10|month_-9 |month_-8 |month_-7 |month_-6 |month_-5 |month_-4 |month_-3 |month_-2 |month_-1 |month_0 |month_1 |month_2 |month_3 |month_4 |month_5 |month_6 |month_7 |month_8 |month_9 |month_10 |month_11 |month_12 |
+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+
|202309 |202310 |202311 |202312 |202401 |202402 |202403 |202404 |202405 |202406 |202407 |202408 |202409 |202410 |202411 |202412 |202501 |202502 |202503 |202504 |202505 |202506 |202507 |202508 |202509 |
+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+

Explanation of Output:

  • month_-12 to month_12: These columns represent the past 12 months, the current month, and the next 12 months in yyyyMM format.

How to Access Individual Months:

You can select specific months or access them programmatically by collecting the results:

# Collect results into a Python list
months_list = months_df.collect()[0]

# Example: Accessing the current month and next month
current_month = months_list["month_0"]
next_month = months_list["month_1"]

print("Current Month:", current_month)
print("Next Month:", next_month)

Storing Results in Variables or DataFrame:

If you want to store the results as separate variables or use them for filtering, you can extract the values from the DataFrame as shown above and use them as needed.

Filtering Example:

If you want to filter your data using these month values, you can convert the months into a filter condition:

# Example: Filter another DataFrame by one of the generated months
data_df.filter(data_df["year_month_column"] == months_list["month_0"]).show()

Advantages of Using PySpark SQL Functions:

  • Efficiency: PySpark’s built-in functions like date_add() and date_format() are optimized and can handle large datasets efficiently in a distributed environment.
  • Compatibility: This method ensures compatibility with older and newer versions of PySpark since SQL expressions and functions like date_format() have been available for a long time.

Above Task can also be done by combining python Date Functionality.

from datetime import datetime, timedelta
from pyspark.sql import SparkSession
from pyspark.sql import Row

# Initialize Spark session
spark = SparkSession.builder \
    .appName("Generate YearMonth List") \
    .getOrCreate()

# Get the current date
current_date = datetime.now()

# Function to generate year-month (yyyyMM) format for a given date
def generate_year_month(date):
    return date.strftime('%Y%m')

# Create a list to store the year-month values
months_list = []

# Generate the next 12 months (including the current month)
for i in range(0, 13):
    future_date = current_date + timedelta(days=30 * i)  # Approximate to 30 days per month
    months_list.append(generate_year_month(future_date))

# Generate the last 12 months
for i in range(1, 13):
    past_date = current_date - timedelta(days=30 * i)  # Approximate to 30 days per month
    months_list.append(generate_year_month(past_date))

# Convert the list into a DataFrame for further operations in PySpark
months_rdd = spark.sparkContext.parallelize(months_list)
months_df = spark.createDataFrame(months_rdd.map(lambda x: Row(month=x)))

# Show the DataFrame
months_df.show()

# Example of accessing specific months in the list
print("Next month: ", months_list[1])
print("Previous month: ", months_list[-1])

Python date functionality vs Pyspark date functionality

Python and PySpark both provide extensive date and time manipulation functionalities, but they serve different use cases and are part of distinct ecosystems. Here’s a comparison of Python date functionality (using the standard datetime module) and PySpark date functionality (using PySpark SQL functions like date_add, date_format, etc.).

1. Python Date Functionality

Python’s standard datetime module provides a comprehensive suite of tools for handling dates and times. It is widely used in standalone Python scripts, applications, and non-distributed environments.

Key Functions and Methods (Python’s datetime):

from datetime import datetime, timedelta, date

# Get current date and time
current_datetime = datetime.now()

# Get current date
current_date = date.today()

# Add or subtract days
future_date = current_date + timedelta(days=30)
past_date = current_date - timedelta(days=30)

# Format date to a string (custom format)
formatted_date = current_date.strftime('%Y%m%d')

# Parse a string to a date
parsed_date = datetime.strptime('2024-09-10', '%Y-%m-%d')

# Difference between two dates
date_diff = future_date - past_date

Key Features:

  • datetime.now(): Gets the current date and time.
  • timedelta(): Adds or subtracts a specific time period (days, weeks, seconds, etc.) to/from a datetime object.
  • strftime(): Formats a datetime object into a string (customizable formatting).
  • strptime(): Converts a string into a datetime object based on a specified format.
  • Arithmetic: You can perform arithmetic between two datetime objects (e.g., subtracting dates).

Use Cases:

  • Suitable for local or single-machine date/time manipulations.
  • Great for date/time parsing and formatting when working with smaller datasets.
  • Commonly used for non-distributed data processing.

2. PySpark Date Functionality

PySpark provides SQL-based date/time manipulation functions, optimized for distributed processing across large datasets. These are essential for working with big data environments and are used inside DataFrame queries.

Key PySpark Date Functions:

from pyspark.sql.functions import current_date, date_add, date_sub, date_format, to_date, col

# Get current date
df = spark.createDataFrame([(1,)], ["id"]).withColumn("current_date", current_date())

# Add or subtract days
df = df.withColumn("future_date", date_add(col("current_date"), 30))
df = df.withColumn("past_date", date_sub(col("current_date"), 30))

# Format date to a string (yyyyMM format)
df = df.withColumn("formatted_date", date_format(col("current_date"), 'yyyyMM'))

# Parse a string to date (convert '2024-09-10' to a date type)
df = df.withColumn("parsed_date", to_date(lit("2024-09-10"), 'yyyy-MM-dd'))

# Show results
df.show()

Key Features:

  • current_date(): Returns the current date (no time part).
  • date_add() and date_sub(): Adds or subtracts days from a date column.
  • date_format(): Formats a date column into a string (customizable like yyyyMM or yyyyMMdd).
  • to_date(): Converts a string into a date object within a DataFrame.
  • Date Arithmetic: You can perform arithmetic with dates directly within PySpark DataFrames.

Additional PySpark Date Functions:

  • months_between(): Calculates the difference between two dates in months.
  • year(), month(), dayofmonth(): Extract year, month, or day from a date column.
  • datediff(): Computes the difference between two dates in days.

Use Cases:

  • Suitable for distributed data processing (i.e., processing large datasets using clusters).
  • Can handle complex date manipulations directly within SQL-like DataFrame queries.
  • Ideal for big data workflows where data is stored and processed in distributed databases like Hive, HDFS, or cloud environments.

Key Differences Between Python and PySpark Date Functionalities:

AspectPython (datetime)PySpark (pyspark.sql.functions)
ScopeLocal machine; small-scale operationsDistributed computing; large datasets
IntegrationWorks well with standard Python scripts and librariesWorks seamlessly within DataFrames and Spark SQL
PerformanceEfficient for small datasets; single-threadedOptimized for big data; multi-threaded and distributed
Common Functionsdatetime.now(), timedelta(), strftime(), strptime()current_date(), date_add(), date_format(), to_date()
Date ArithmeticDirect arithmetic with datetime objectsDate functions within DataFrame operations
Output FormatsCustomizable formats via strftime()Customizable formats via date_format()
Use CaseLocal Python applications or small-scale jobsBig data applications, ETL jobs, large datasets
Date Differencestimedelta, manual calculations for months, yearsFunctions like months_between(), datediff()
UsagePython lists, dicts, DataFrames (Pandas)PySpark DataFrames, SQL-like queries

Summary:

  • Python datetime: Best for small-scale, single-machine data processing and parsing tasks where date formatting or arithmetic is required.
  • PySpark pyspark.sql.functions: Ideal for large-scale, distributed data processing environments where dates need to be handled inside DataFrames in a scalable way.

If your use case involves big data or distributed data processing, PySpark’s date functions are more suited to the task. For local, lightweight date manipulations, Python’s datetime module is more appropriate.


Discover more from AI HintsToday

Subscribe to get the latest posts sent to your email.

Discover more from AI HintsToday

Subscribe now to keep reading and get access to the full archive.

Continue reading