PySpark Date & Timestamp Operations Explained
This section demonstrates essential PySpark SQL functions for handling date and timestamp data, so you can manipulate and analyze temporal data in your Spark DataFrames efficiently.
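The snippets that follow assume a SparkSession and the conventional F alias for pyspark.sql.functions. Here is a minimal setup sketch; the sample DataFrame and its columns (date_of_birth, time_of_birth, start, end) are assumptions chosen to match the examples below.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName('date-examples').getOrCreate()

# Toy DataFrame with the string columns used throughout this section
df = spark.createDataFrame(
    [('2017-05-12', '2017-05-12 08:30:00', '2017-01-01', '2017-06-01')],
    ['date_of_birth', 'time_of_birth', 'start', 'end'],
)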
1. Adding Current Date
Add a column containing the current date to your DataFrame.
# Add a column with the current date
df = df.withColumn('current_date', F.current_date())
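If you need the current moment rather than just the date, current_timestamp() is the natural companion (a small addition, not in the original example):
# Add a column with the current timestamp (date plus time of day)
df = df.withColumn('current_timestamp', F.current_timestamp())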
2. Converting Strings to Dates and Timestamps
Convert string columns to proper date or timestamp types, specifying the exact format.
# Convert a string of known format to a date (excludes time information)
df = df.withColumn('date_of_birth', F.to_date('date_of_birth', 'yyyy-MM-dd'))
# Convert a string of known format to a timestamp (includes time information)
df = df.withColumn('time_of_birth', F.to_timestamp('time_of_birth', 'yyyy-MM-dd HH:mm:ss'))
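The format strings follow Spark's datetime pattern syntax, so other layouts just need a different pattern. For instance, a day-first string would be parsed like this (dob_str is a hypothetical column holding values such as '12/05/2017'):
# Hypothetical: parse a day-first 'dd/MM/yyyy' string into a date
df = df.withColumn('dob_parsed', F.to_date('dob_str', 'dd/MM/yyyy'))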
3. Extracting Date Components
Easily extract specific parts of a date or timestamp column.
# Get year from date: F.year(col)
# Get month from date: F.month(col)
# Get day from date: F.dayofmonth(col)
# Get hour from timestamp: F.hour(col)
# Get minute from timestamp: F.minute(col)
# Get second from timestamp: F.second(col)
# Keep only rows where the birth year is 2017
df = df.filter(F.year('date_of_birth') == 2017)
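Putting a few of these together (a short illustration; the new column names are arbitrary):
df = df.withColumn('birth_year', F.year('date_of_birth'))
df = df.withColumn('birth_month', F.month('date_of_birth'))
df = df.withColumn('birth_hour', F.hour('time_of_birth'))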
4. Adding and Subtracting Days and Months
Perform arithmetic operations on dates to find future or past dates.
# Add & subtract days
df = df.withColumn('three_days_after', F.date_add('date_of_birth', 3))
df = df.withColumn('three_days_before', F.date_sub('date_of_birth', 3))
# Add & Subtract months
df = df.withColumn('next_month', F.add_months('date_of_birth', 1))
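There is no month-level counterpart to date_sub, but add_months accepts a negative offset, which covers subtraction:
# Subtract a month by adding a negative offset
df = df.withColumn('previous_month', F.add_months('date_of_birth', -1))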
5. Calculating Date Differences
Determine the number of days or months between two date columns.
# Get number of days between two dates (the later date goes first)
df = df.withColumn('days_between', F.datediff('end', 'start'))
# Get number of months between two dates (later date first; the result can be fractional)
df = df.withColumn('months_between', F.months_between('end', 'start'))
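Since months_between can return a fraction, one way to get whole months (a sketch using the standard floor function) is:
# Round down to complete months
df = df.withColumn('whole_months', F.floor(F.months_between('end', 'start')))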
6. Filtering by Date Range
Filter your DataFrame to include only records within a specified date range.
# Keep only rows where date_of_birth is between 2017-05-10 and 2018-07-21
df = df.filter(
    (F.col('date_of_birth') >= F.lit('2017-05-10')) &
    (F.col('date_of_birth') <= F.lit('2018-07-21'))
)
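The same filter reads more cleanly with Column.between, which is inclusive on both ends:
df = df.filter(F.col('date_of_birth').between('2017-05-10', '2018-07-21'))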
These PySpark functions are essential for any time-series analysis, giving you precise control over date and timestamp manipulation.
For more details on date and timestamp functions in Spark SQL, refer to the official PySpark SQL documentation.