In PySpark, string manipulation and data cleaning are essential steps in preparing data for analysis. PySpark provides several built-in functions for handling string operations efficiently on large datasets. Here is a guide to common string manipulation tasks in PySpark:
concat: Concatenates two or more strings.
Syntax: concat(col1, col2, ..., colN)
Example:
from pyspark.sql.functions import concat, lit, col
df = df.withColumn("Full Name", concat(col("First Name"), lit(" "), col("Last Name")))
substring: Extracts a substring from a string (positions are 1-based). The standard function is substring; substr is also available as a Column method, e.g. col("Name").substr(1, 4).
Syntax: substring(col, pos, len)
Example:
from pyspark.sql.functions import substring, col
df = df.withColumn("First Name", substring(col("Name"), 1, 4))
split: Splits a string into an array of substrings. Note that the pattern is treated as a regular expression, so regex metacharacters must be escaped.
Syntax: split(col, pattern)
Example:
from pyspark.sql.functions import split, col
df = df.withColumn("Address Parts", split(col("Address"), " "))
regexp_extract: Extracts a substring using a regular expression (group 0 is the whole match; if nothing matches, an empty string is returned).
Syntax: regexp_extract(col, pattern, group)
Example:
from pyspark.sql.functions import regexp_extract, col
df = df.withColumn("Phone Number", regexp_extract(col("Contact Info"), r"\d{3}-\d{3}-\d{4}", 0))
translate: Replaces characters in a string via a character-for-character mapping: each character in matching is replaced by the character at the same position in replace.
Syntax: translate(col, matching, replace)
Example:
from pyspark.sql.functions import translate, col
df = df.withColumn("Clean Name", translate(col("Name"), "aeiou", "AEIOU"))
trim: Removes leading and trailing whitespace from a string.
Syntax: trim(col)
Example:
from pyspark.sql.functions import trim, col
df = df.withColumn("Clean Address", trim(col("Address")))
lower: Converts a string to lowercase.
Syntax: lower(col)
Example:
from pyspark.sql.functions import lower, col
df = df.withColumn("Lower Name", lower(col("Name")))
upper: Converts a string to uppercase.
Syntax: upper(col)
Example:
from pyspark.sql.functions import upper, col
df = df.withColumn("Upper Name", upper(col("Name")))
String Data Cleaning in PySpark
Here are some common string data cleaning functions in PySpark, along with their syntax and examples:
regexp_replace: Replaces substrings matching a regular expression.
Syntax: regexp_replace(col, pattern, replacement)
Example:
from pyspark.sql.functions import regexp_replace, col
df = df.withColumn("Clean Name", regexp_replace(col("Name"), "[^a-zA-Z]", ""))
replace: Replaces all occurrences of a literal substring (available in PySpark 3.5+; on earlier versions use regexp_replace instead). Literal arguments must be wrapped in lit(), since plain strings are interpreted as column names.
Syntax: replace(src, search, replace)
Example:
from pyspark.sql.functions import replace, lit, col
df = df.withColumn("Clean Address", replace(col("Address"), lit(" "), lit("")))
Removing accents: PySpark has no built-in remove_accents function. A common workaround is translate with an explicit character mapping (or a UDF based on Python's unicodedata module for full coverage).
Example:
from pyspark.sql.functions import translate, col
df = df.withColumn("Clean Name", translate(col("Name"), "áéíóúñ", "aeioun"))
Standardizing text: PySpark has no built-in standardize function, but the same effect (removing punctuation and converting to lowercase) can be achieved by combining regexp_replace and lower.
Example:
from pyspark.sql.functions import lower, regexp_replace, col
df = df.withColumn("Standardized Name", lower(regexp_replace(col("Name"), "[^a-zA-Z0-9 ]", "")))