array_operations

Perform common PySpark array operations like creating, accessing, transforming, and exploding arrays. Learn to manipulate array columns efficiently.

PySpark Array Operations

PySpark Array Manipulation Functions

This section demonstrates essential PySpark array operations for efficient data manipulation within DataFrames. Learn how to create, access, transform, and flatten array columns using PySpark's built-in functions.

Creating Arrays

Combine existing columns into a single array column using F.array(); called with no arguments, it creates an empty array.

# Column Array – F.array(*cols)
df = df.withColumn('full_name', F.array('fname', 'lname'))

# Empty Array – F.array()
# Note: the element type is null; cast (e.g. .cast('array<string>')) if a typed array is needed
df = df.withColumn('empty_array_column', F.array())
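
As a quick end-to-end illustration, here is a minimal self-contained sketch; the spark session setup and the sample name data are assumptions for demonstration, not part of the original snippet.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data with separate name columns
df = spark.createDataFrame([('Ada', 'Lovelace'), ('Alan', 'Turing')], ['fname', 'lname'])
df = df.withColumn('full_name', F.array('fname', 'lname'))
df.show(truncate=False)
# Representative output:
# +-----+--------+---------------+
# |fname|lname   |full_name      |
# +-----+--------+---------------+
# |Ada  |Lovelace|[Ada, Lovelace]|
# |Alan |Turing  |[Alan, Turing] |
# +-----+--------+---------------+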

Accessing Array Elements

Retrieve an element at a specific zero-based index from an array column. Under default (non-ANSI) settings, an out-of-range index returns null rather than raising an error.

# Get element at index – col.getItem(n)
df = df.withColumn('first_element', F.col("my_array").getItem(0))
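
A short sketch of the indexing behavior, reusing the spark session from the example above; the sample rows are hypothetical.

# Hypothetical data: one array column, including an empty array
df = spark.createDataFrame([([10, 20, 30],), ([],)], ['my_array'])

df = df.withColumn('first_element', F.col('my_array').getItem(0))
# Bracket syntax is equivalent: F.col('my_array')[0]
# Row 1 -> 10; row 2 -> null (index 0 is out of range for an empty array)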

Array Size and Flattening

Determine the number of elements in an array with F.size() and collapse an array of arrays into a single array with F.flatten(). F.flatten() removes only one level of nesting.

# Array Size/Length – F.size(col)
df = df.withColumn('array_length', F.size('my_array'))

# Flatten a nested array (array of arrays) – F.flatten(col)
df = df.withColumn('flattened', F.flatten('my_array'))
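
The sketch below shows both functions on an array of arrays, under the same assumptions as before (an existing spark session, hypothetical data).

# Hypothetical nested data: each row holds an array of arrays
df = spark.createDataFrame([([[1, 2], [3]],)], ['my_array'])

df = df.withColumn('array_length', F.size('my_array'))  # 2 (two inner arrays)
df = df.withColumn('flattened', F.flatten('my_array'))   # [1, 2, 3]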

Unique Elements and Transformations

Extract distinct elements from an array and apply transformations to each element.

# Unique/Distinct Elements – F.array_distinct(col)
df = df.withColumn('unique_elements', F.array_distinct('my_array'))

# Map over & transform array elements – F.transform(col, func: col -> col)
df = df.withColumn('elem_ids', F.transform(F.col('my_array'), lambda x: x.getField('id')))
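
A combined sketch, again with hypothetical data and an assumed existing spark session; note that the Python API for F.transform() requires Spark 3.1+. The struct field 'id' here is chosen to match the getField('id') call above.

# Hypothetical data: duplicated numbers plus an array of structs with an 'id' field
df = spark.createDataFrame(
    [([1, 2, 2, 3], [(1,), (2,)])],
    'nums array<int>, my_array array<struct<id:int>>',
)

df = df.withColumn('unique_elements', F.array_distinct('nums'))  # [1, 2, 3]
df = df.withColumn('elem_ids', F.transform('my_array', lambda x: x.getField('id')))  # [1, 2]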

Exploding Arrays

Transform an array column into multiple rows, with each row containing one element from the array. F.explode() drops rows whose array is null or empty; use F.explode_outer() to keep them with a null element.

# Return a row per array element – F.explode(col)
df = df.select(F.explode('my_array'))
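
A final sketch with hypothetical data and the spark session from earlier; aliasing the exploded column avoids the default column name 'col'.

# Hypothetical data: one array per id, including an empty array
df = spark.createDataFrame([(1, ['a', 'b']), (2, [])], ['id', 'my_array'])

df.select('id', F.explode('my_array').alias('element'))
# Yields rows (1, 'a') and (1, 'b'); id 2 is dropped because its array is empty
# F.explode_outer() would instead keep id 2 with a null element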

Further Resources