PySpark Date & Timestamp Operations - Convert, Format, Calculate

Master PySpark date and timestamp operations. Convert strings to dates/timestamps, extract date parts, add/subtract days/months, and calculate date differences.

PySpark Date & Timestamp Operations Explained

This section demonstrates essential PySpark SQL functions for handling date and timestamp data. Efficiently manipulate and analyze temporal data within your Spark DataFrames.

1. Adding Current Date

Add a column containing the current date to your DataFrame.

# Import PySpark SQL functions (used as F throughout)
from pyspark.sql import functions as F

# Add a column with the current date
df = df.withColumn('current_date', F.current_date())
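
For intuition, the plain-Python counterpart of this behavior is today's date with no time-of-day component (a stdlib sketch, not Spark code):

```python
from datetime import date

# Plain-Python analogue of F.current_date(): today's date, date-only (no time part)
today = date.today()
print(today)  # e.g. 2024-01-15
```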

2. Converting Strings to Dates and Timestamps

Convert string columns to proper date or timestamp types, specifying the exact format.

# Convert a string of known format to a date (excludes time information)
df = df.withColumn('date_of_birth', F.to_date('date_of_birth', 'yyyy-MM-dd'))

# Convert a string of known format to a timestamp (includes time information)
df = df.withColumn('time_of_birth', F.to_timestamp('time_of_birth', 'yyyy-MM-dd HH:mm:ss'))
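
Note that Spark uses Java-style datetime patterns ('yyyy-MM-dd'), not Python's strptime directives. A plain-Python sketch of the same conversions, assuming sample input strings, shows the pattern correspondence:

```python
from datetime import datetime

# Spark's 'yyyy-MM-dd' pattern corresponds to strptime's '%Y-%m-%d',
# and 'HH:mm:ss' to '%H:%M:%S'.
date_of_birth = datetime.strptime('2017-05-10', '%Y-%m-%d').date()
time_of_birth = datetime.strptime('2017-05-10 14:30:00', '%Y-%m-%d %H:%M:%S')

print(date_of_birth)  # 2017-05-10
print(time_of_birth)  # 2017-05-10 14:30:00
```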

3. Extracting Date Components

Easily extract specific parts of a date or timestamp column.

# Get year from date:            F.year(col)
# Get month from date:           F.month(col)
# Get day from date:             F.dayofmonth(col)
# Get hour from timestamp:       F.hour(col)
# Get minute from timestamp:     F.minute(col)
# Get second from timestamp:     F.second(col)
df = df.filter(F.year('date_of_birth') == 2017)
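
Each extractor returns an integer component, mirroring the attributes of a plain-Python datetime. A stdlib sketch with a sample timestamp:

```python
from datetime import datetime

# F.year / F.month / F.dayofmonth / F.hour / F.minute / F.second map onto
# the corresponding datetime attributes.
ts = datetime(2017, 5, 10, 14, 30, 45)
parts = (ts.year, ts.month, ts.day, ts.hour, ts.minute, ts.second)
print(parts)  # (2017, 5, 10, 14, 30, 45)
```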

4. Adding and Subtracting Days and Months

Perform arithmetic operations on dates to find future or past dates.

# Add & subtract days
df = df.withColumn('three_days_after', F.date_add('date_of_birth', 3))
df = df.withColumn('three_days_before', F.date_sub('date_of_birth', 3))

# Add & subtract months (use a negative count to subtract)
df = df.withColumn('next_month', F.add_months('date_of_birth', 1))
df = df.withColumn('previous_month', F.add_months('date_of_birth', -1))
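
Day arithmetic is a simple offset, but month arithmetic clamps the day to the end of the target month (January 31 plus one month yields February 28). A stdlib sketch, where add_months is a hypothetical helper illustrating that clamping behavior:

```python
import calendar
from datetime import date, timedelta

def add_months(d, n):
    """Hypothetical plain-Python analogue of F.add_months: shift by n months,
    clamping the day to the last day of the target month."""
    month_index = d.month - 1 + n
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)

dob = date(2017, 1, 31)
print(dob + timedelta(days=3))  # 2017-02-03  (analogue of F.date_add)
print(dob - timedelta(days=3))  # 2017-01-28  (analogue of F.date_sub)
print(add_months(dob, 1))       # 2017-02-28  (day clamped to month end)
```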

5. Calculating Date Differences

Determine the number of days or months between two date columns.

# Get number of days between two dates (end minus start)
df = df.withColumn('days_between', F.datediff('end', 'start'))

# Get number of (possibly fractional) months between two dates (end minus start)
df = df.withColumn('months_between', F.months_between('end', 'start'))
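
The day count is simply end minus start, which a plain-Python date subtraction makes concrete (a stdlib sketch with assumed sample dates):

```python
from datetime import date

# Analogue of F.datediff('end', 'start'): end minus start, in whole days
start = date(2017, 5, 10)
end = date(2017, 5, 25)
days_between = (end - start).days
print(days_between)  # 15
```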

6. Filtering by Date Range

Filter your DataFrame to include only records within a specified date range.

# Keep only rows where date_of_birth is between 2017-05-10 and 2018-07-21
df = df.filter(
    (F.col('date_of_birth') >= F.lit('2017-05-10')) &
    (F.col('date_of_birth') <= F.lit('2018-07-21'))
)
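
This works because dates compare chronologically, and ISO-format 'yyyy-MM-dd' strings sort the same way, which is why the column can be compared against string literals. A plain-Python sketch of the same range test, using assumed sample dates:

```python
from datetime import date

# Keep only dates between the two bounds, inclusive on both ends
dates = [date(2016, 12, 1), date(2017, 5, 10), date(2018, 1, 1), date(2019, 3, 3)]
lo, hi = date(2017, 5, 10), date(2018, 7, 21)
in_range = [d for d in dates if lo <= d <= hi]
print(in_range)  # [datetime.date(2017, 5, 10), datetime.date(2018, 1, 1)]
```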

These PySpark functions are crucial for any data analysis involving time-series data, enabling precise control over date and timestamp manipulations.

For more details on date and timestamp functions in Spark SQL, refer to the official PySpark SQL documentation.