PySpark String Operations
PySpark String Filtering
PySpark provides a robust set of functions to filter DataFrames based on string patterns and conditions. These operations are crucial for cleaning and selecting relevant text data.
Common String Filters
Below are examples of how to use various string filters in PySpark:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.functions import col
spark = SparkSession.builder.appName("StringOperations").getOrCreate()
data = [("Alice", 1), ("Bob", 2), ("Charlie", 3), ("David", 4), ("Eve", 5)]
columns = ["name", "id"]
df = spark.createDataFrame(data, columns)
# Contains - col.contains(string)
# Select rows where the 'name' column contains the letter 'o'
df_contains = df.filter(df.name.contains('o'))
# df_contains.show()
# Starts With - col.startswith(string)
# Select rows where the 'name' column starts with 'Al'
df_starts_with = df.filter(df.name.startswith('Al'))
# df_starts_with.show()
# Ends With - col.endswith(string)
# Select rows where the 'name' column ends with 'ice'
df_ends_with = df.filter(df.name.endswith('ice'))
# df_ends_with.show()
# Is Null - col.isNull()
# Example assuming a nullable column 'age'
# df_null_example = spark.createDataFrame([("Alice", None), ("Bob", 30)], ["name", "age"])
# df_is_null = df_null_example.filter(df_null_example.age.isNull())
# df_is_null.show()
# Is Not Null - col.isNotNull()
# Example assuming a nullable column 'age'
# df_is_not_null = df_null_example.filter(df_null_example.age.isNotNull())
# df_is_not_null.show()
# Like - col.like(string_with_sql_wildcards)
# Select rows where the 'name' column matches the pattern 'Al%' (starts with Al)
df_like = df.filter(df.name.like('Al%'))
# df_like.show()
# Regex Like - col.rlike(regex)
# Select rows where the 'name' column matches the regex '[A-Z]*ice$' (zero or more uppercase letters followed by 'ice' at the end of the string)
df_rlike = df.filter(df.name.rlike('[A-Z]*ice$'))
# df_rlike.show()
# Is In List - col.isin(*cols)
# Select rows where the 'name' column is either 'Bob' or 'Mike'
df_is_in = df.filter(df.name.isin('Bob', 'Mike'))
# df_is_in.show()
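All of the predicates above return Column expressions, so they can be combined with & (and), | (or), and ~ (not). Below is a minimal illustrative sketch using the same df created above (the df_combined name is arbitrary):
# Combine filters: names that start with 'Al' or contain 'o', excluding names that end in 'e'
df_combined = df.filter(
    (df.name.startswith('Al') | df.name.contains('o')) & ~df.name.endswith('e')
)
# df_combined.show()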
PySpark String Functions
Beyond filtering, PySpark offers a rich set of functions to transform and manipulate string data within DataFrame columns. These functions are essential for data cleaning, feature engineering, and text processing.
Common String Manipulation Functions
Here are examples demonstrating the usage of various string functions:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.functions import col
spark = SparkSession.builder.appName("StringOperations").getOrCreate()
data = [("12345", " Alice "), ("67890", "Bob"), ("11223", "Charlie"), ("44556", "David"), ("77889", "Eve")]
columns = ["id", "name"]
df = spark.createDataFrame(data, columns)
# Substring - col.substr(startPos, length); startPos is 1-based
# Extract the first 3 characters from the 'id' column
df_substr = df.withColumn('short_id', df.id.substr(1, 3))
# df_substr.show()
# Trim - F.trim(col)
# Remove leading and trailing whitespace from the 'name' column
df_trim = df.withColumn('name', F.trim(df.name))
# df_trim.show()
# Left Pad - F.lpad(col, len, pad)
# Right Pad - F.rpad(col, len, pad)
# Pad the 'id' column to a length of 8 with leading zeros
df_lpad = df.withColumn('id', F.lpad('id', 8, '0'))
# df_lpad.show()
# Left Trim - F.ltrim(col)
# Right Trim - F.rtrim(col)
# Remove leading whitespace from the 'name' column
df_ltrim = df.withColumn('name', F.ltrim('name'))
# df_ltrim.show()
# Concatenate - F.concat(*cols)
# Create a 'full_name' column by concatenating 'fname' and 'lname' with a space
# Assuming 'fname' and 'lname' columns exist for this example
# data_names = [("John", "Doe"), ("Jane", "Smith")]
# df_names = spark.createDataFrame(data_names, ["fname", "lname"])
# df_concat = df_names.withColumn('full_name', F.concat('fname', F.lit(' '), 'lname'))
# df_concat.show()
# Concatenate with Separator/Delimiter - F.concat_ws(delimiter, *cols)
# Create a 'full_name' column by concatenating 'fname' and 'lname' with a hyphen
# df_concat_ws = df_names.withColumn('full_name', F.concat_ws('-', 'fname', 'lname'))
# df_concat_ws.show()
# Regex Replace - F.regexp_replace(str, pattern, replacement)
# Replace a pattern in the 'id' column
# Example: replace '0F1' followed by any characters with '1F1-' plus the captured characters ($1)
# data_regex = [("0F1abc",), ("XYZ",)]
# df_regex = spark.createDataFrame(data_regex, ["id"])
# df_regex_replace = df_regex.withColumn('id', F.regexp_replace(col('id'), '0F1(.*)', '1F1-$1'))
# df_regex_replace.show()
# Regex Extract - F.regexp_extract(str, pattern, idx)
# Extract the leading digits from the 'id' column (group index 0 returns the whole match)
# df_regex_extract = df.withColumn('extracted_id', F.regexp_extract(col('id'), '[0-9]*', 0))
# df_regex_extract.show()
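As a brief illustrative sketch tying the functions above together (the df_clean and id_digits names are arbitrary), the same df can be cleaned in one chained pipeline:
# Trim the 'name' column, left-pad the 'id' column, and extract its leading digits
df_clean = (
    df.withColumn('name', F.trim(col('name')))
      .withColumn('id', F.lpad(col('id'), 8, '0'))
      .withColumn('id_digits', F.regexp_extract(col('id'), '[0-9]+', 0))
)
# df_clean.show()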
These PySpark string operations are fundamental for any data scientist or engineer working with textual data in big data environments. By mastering these functions, you can effectively clean, transform, and analyze your datasets.
For more advanced string manipulations and regular expression usage in PySpark, refer to the official PySpark SQL String Functions documentation.