String Manipulation on PySpark DataFrames

String manipulation is a common task in data processing. PySpark provides a variety of built-in functions for manipulating string columns in DataFrames. Below, we explore some of the most useful string manipulation functions and demonstrate how to use them with examples.

Common String Manipulation Functions

  1. concat: Concatenates multiple string columns into one.
  2. substring: Extracts a substring from a string column.
  3. length: Computes the length of a string column.
  4. trim, ltrim, rtrim: Trims whitespace from strings.
  5. upper, lower: Converts strings to upper or lower case.
  6. regexp_replace: Replaces substrings matching a regex pattern.
  7. regexp_extract: Extracts substrings matching a regex pattern.
  8. split: Splits a string column based on a delimiter.
  9. replace: Replaces all occurrences of a substring with another substring.
  10. translate: Replaces characters in a string based on a mapping.

Example Usage

1. Concatenation

Syntax:

from pyspark.sql import SparkSession
from pyspark.sql.functions import concat, lit

# Initialize Spark session
spark = SparkSession.builder.appName("string_manipulation").getOrCreate()

# Sample data
data = [("John", "Doe"), ("Jane", "Smith")]
columns = ["first_name", "last_name"]

# Create DataFrame
df = spark.createDataFrame(data, columns)

# Concatenate first and last names
df = df.withColumn("full_name", concat(df.first_name, lit(" "), df.last_name))

# Show result
df.show()

2. Substring Extraction

Syntax:

from pyspark.sql.functions import substring

# Extract first three characters of first_name
df = df.withColumn("initials", substring(df.first_name, 1, 3))

# Show result
df.show()

3. Length Calculation

Syntax:

from pyspark.sql.functions import length

# Calculate length of first_name
df = df.withColumn("name_length", length(df.first_name))

# Show result
df.show()

4. Trimming

Syntax:

from pyspark.sql.functions import trim, ltrim, rtrim

# Trim whitespace from first_name
df = df.withColumn("trimmed_name", trim(df.first_name))

# Show result
df.show()

5. Case Conversion

Syntax:

from pyspark.sql.functions import upper, lower

# Convert first_name to upper case
df = df.withColumn("upper_name", upper(df.first_name))

# Convert last_name to lower case
df = df.withColumn("lower_name", lower(df.last_name))

# Show result
df.show()

6. Regex Replace

Syntax:

from pyspark.sql.functions import regexp_replace

# Replace all digits with asterisks
df = df.withColumn("masked_name", regexp_replace(df.first_name, "\\d", "*"))

# Show result
df.show()

7. Regex Extract

Syntax:

from pyspark.sql.functions import regexp_extract

# Extract digits from first_name
df = df.withColumn("extracted_digits", regexp_extract(df.first_name, "(\\d+)", 1))

# Show result
df.show()

8. Split

from pyspark.sql.functions import split

# Split full_name into first and last name
df = df.withColumn("split_name", split(df.full_name, " "))

# Show result
df.select("split_name").show(truncate=False)

9. Replace

from pyspark.sql.functions import expr

# Replace 'Doe' with 'Smith' in last_name
df = df.withColumn("updated_last_name", expr("replace(last_name, 'Doe', 'Smith')"))

# Show result
df.show()

10. Translate

Syntax:

from pyspark.sql.functions import translate

# Translate 'o' to '0' in last_name
df = df.withColumn("translated_last_name", translate(df.last_name, "o", "0"))

# Show result
df.show()

Comprehensive Example: Combining Multiple String Manipulations

Let’s create a project that combines multiple string manipulation operations on a DataFrame.

from pyspark.sql import SparkSession
from pyspark.sql.functions import concat, lit, substring, length, trim, upper, lower, regexp_replace, regexp_extract, split, expr, translate

# Initialize Spark session
spark = SparkSession.builder.appName("string_manipulation").getOrCreate()

# Sample data
data = [("John123", "Doe456"), ("Jane789", "Smith012")]
columns = ["first_name", "last_name"]

# Create DataFrame
df = spark.createDataFrame(data, columns)

# Perform various string manipulations
df = df.withColumn("full_name", concat(df.first_name, lit(" "), df.last_name))
df = df.withColumn("initials", substring(df.first_name, 1, 3))
df = df.withColumn("name_length", length(df.first_name))
df = df.withColumn("trimmed_name", trim(df.first_name))
df = df.withColumn("upper_name", upper(df.first_name))
df = df.withColumn("lower_name", lower(df.last_name))
df = df.withColumn("masked_name", regexp_replace(df.first_name, "\\d", "*"))
df = df.withColumn("extracted_digits", regexp_extract(df.first_name, "(\\d+)", 1))
df = df.withColumn("split_name", split(df.full_name, " "))
df = df.withColumn("updated_last_name", expr("replace(last_name, 'Doe456', 'Smith')"))
df = df.withColumn("translated_last_name", translate(df.last_name, "456", "789"))

# Show result
df.show(truncate=False)

Explanation

  1. Concatenation: Combines the first_name and last_name with a space.
  2. Substring Extraction: Extracts the first three characters of first_name.
  3. Length Calculation: Computes the length of first_name.
  4. Trimming: Trims any whitespace around first_name.
  5. Case Conversion: Converts first_name to upper case and last_name to lower case.
  6. Regex Replace: Replaces all digits in first_name with asterisks.
  7. Regex Extract: Extracts digits from first_name.
  8. Split: Splits the full_name into a list of first_name and last_name.
  9. Replace: Replaces ‘Doe456’ with ‘Smith’ in last_name.
  10. Translate: Translates ‘456’ to ‘789’ in last_name.

These examples demonstrate the versatility of string manipulation functions in PySpark, allowing for complex transformations and processing of text data.

Here’s a table summarizing common string manipulation functions in PySpark SQL, including regular expressions (regex):

FunctionDescriptionExample (Input: “Hello World”)
initcap(str)Converts first letter of each word to uppercase“Hello World” -> “Hello World” (already capitalized words remain unchanged)
lower(str)Converts all characters to lowercase“Hello World” -> “hello world”
upper(str)Converts all characters to uppercase“Hello World” -> “HELLO WORLD”
lpad(str, length, pad_string)Pads a string to the left with a specified string“Hello World”, 15, “-” -> “—Hello World”
rpad(str, length, pad_string)Pads a string to the right with a specified string“Hello World”, 15, “-” -> “Hello World—“
ltrim(str)Removes leading whitespace characters” Hello World” -> “Hello World”
rtrim(str)Removes trailing whitespace characters“Hello World ” -> “Hello World”
trim(str)Removes leading and trailing whitespace characters” Hello World ” -> “Hello World”
length(str)Returns the length (number of characters) of a string“Hello World” -> 11
substr(str, pos, len)Extracts a substring from a string“Hello World”, 7, 5 -> “World”
instr(str, substr)Returns the starting position of a substring within a string (0 if not found)“Hello World”, “World” -> 7
concat(str1, str2…)Concatenates (joins) multiple strings“Hello”, ” “, “World” -> “Hello World”
concat_ws(sep, str1, str2…)Concatenates strings with a specified separator“Hello”, “,”, “World” -> “Hello,World”
regexp_extract(str, pattern)Extracts a pattern (regular expression) from a string“Hello_World”, r”[^]+” -> “_World”
regexp_replace(str, pattern, replacement)Replaces a pattern (regular expression) in a string with a replacement string“Hello_World”, r”_”, ” ” -> “Hello World”
split(str, delimiter)Splits a string into an array based on a delimiter“Hello,World,How”, “,” -> [“Hello”, “World”, “How”]
array_join(arr, sep)Joins the elements of an array into a string with a specified separator[“Hello”, “World”, “How”], “,” -> “Hello,World,How”
soundex(str)Returns the Soundex code of a string (phonetic equivalent)“Hello” -> “H400”
translate(str, remove, replace)Removes characters from a string and replaces them with“Hello World”, “e”, “” -> “Hllo World”
overlay(str, overlay_str, pos, len)Replaces a substring within a string with another“Hello World”, “there”, 6, 5 -> “Hello thered”
reverse(str)Reverses the order of characters in a string“Hello World” -> “dlroW olleH”
instrc(str, substr)Returns the starting position of a substring within a string (case-insensitive, 0 if not found)“Hello World”, “world” -> 7
levenshtein(str1, str2)Calculates the Levenshtein distance (minimum number of edits) to transform one string to another“Hello”, “Jello” -> 1

Discover more from HintsToday

Subscribe to get the latest posts sent to your email.