In PySpark, regex functions let you extract, match, or replace parts of strings using powerful regular expression patterns. These are essential for parsing messy data, extracting tokens, or cleaning text.
🧠 Common PySpark Regex Functions (from pyspark.sql.functions)
Function | Description |
---|---|
regexp_extract() | Extracts first matching group using a regex |
regexp_replace() | Replaces all substrings matching regex |
rlike() | Returns True if the regex pattern matches (a Column method) |
split() | Splits string using regex pattern |
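All snippets below assume an active SparkSession bound to the name spark. If you're running them standalone, a minimal setup (the app name is arbitrary) looks like this:

```python
from pyspark.sql import SparkSession

# Local session; every example below reuses this `spark` handle
spark = SparkSession.builder.appName("regex-demo").getOrCreate()
```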
---
1. regexp_extract(col, pattern, groupIndex)
Extracts a substring using a regex group.
```python
from pyspark.sql.functions import regexp_extract, col

df = spark.createDataFrame([("user_123_name",)], ["raw"])

# Extract the digits between the underscores
df.withColumn("extracted", regexp_extract("raw", "user_(\\d+)_", 1)).show()
```
Pattern explained:
- `user_` → fixed prefix
- `(\d+)` → group 1: one or more digits
- `_` → fixed suffix
- Group index `1` → we extract only the digits
Output:
```
+-------------+---------+
|          raw|extracted|
+-------------+---------+
|user_123_name|      123|
+-------------+---------+
```
---
2. regexp_replace(col, pattern, replacement)
Replaces all matches of a regex pattern.
```python
from pyspark.sql.functions import regexp_replace

df.withColumn("cleaned", regexp_replace("raw", "_", "-")).show()
```
Replaces every `_` with `-`, so `user_123_name` becomes `user-123-name`.
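Output:
```
+-------------+-------------+
|          raw|      cleaned|
+-------------+-------------+
|user_123_name|user-123-name|
+-------------+-------------+
```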
---
3. rlike(pattern)
Boolean regex match. Note that rlike is a method on a Column, not a standalone function.
df.withColumn("has_digits", col("raw").rlike("\\d+")).show()
Returns True if the string contains at least one digit.
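Output:
```
+-------------+----------+
|          raw|has_digits|
+-------------+----------+
|user_123_name|      true|
+-------------+----------+
```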
---
4. split(col, pattern)
Splits string by regex pattern into an array.
```python
from pyspark.sql.functions import split

df.withColumn("tokens", split("raw", "_")).show(truncate=False)
```
🎯 Notable Regex Patterns and Rules
Pattern | Matches |
---|---|
\d | Any digit (0-9) |
\D | Any non-digit |
\w | Word character (letters, digits, _) |
\W | Non-word character |
\s | Whitespace (space, tab, newline) |
. | Any character except newline |
^ / $ | Start / end of line |
.* | Zero or more of anything |
+ , ? , {n} | Quantifiers |
() | Capture group |
[] | Character class |
\| | Alternation (OR) |
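A quick sketch exercising a few of these rules with rlike (the sample strings and column names are made up for illustration):

```python
from pyspark.sql.functions import col

df_pat = spark.createDataFrame([("abc123",), ("   ",), ("2024-01-15",)], ["s"])

df_pat.select(
    col("s"),
    col("s").rlike(r"^\w+$").alias("all_word_chars"),  # anchors + \w with a quantifier
    col("s").rlike(r"\d{4}").alias("has_4_digits"),    # {n} quantifier
    col("s").rlike(r"^\s*$").alias("blank"),           # whitespace character class
).show()
```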
🧠 Example Use Cases
Use Case | Function | Example Regex |
---|---|---|
Extract domain from email | regexp_extract | @(\w+\.\w+) |
Remove all non-digits | regexp_replace | \D → "" |
Validate phone numbers | rlike | ^\d{10}$ |
Split CSV-like field | split | ,\s* |
✅ Realistic Example
```python
emails = [("raj@example.com",), ("test123@gmail.com",)]
df_emails = spark.createDataFrame(emails, ["email"])

df_emails.withColumn("domain", regexp_extract("email", "@(\\w+\\.\\w+)", 1)).show()
```
Output:
```
+-----------------+-----------+
|            email|     domain|
+-----------------+-----------+
|  raj@example.com|example.com|
|test123@gmail.com|  gmail.com|
+-----------------+-----------+
```
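The other rows of the use-case table follow the same shape. A sketch combining regexp_replace and rlike for the phone-number cases (the sample numbers are invented):

```python
from pyspark.sql.functions import regexp_replace, col

phones = [("(555) 123-4567",), ("12345",)]
df_phones = spark.createDataFrame(phones, ["phone"])

# Strip everything that isn't a digit, then validate the 10-digit result
df_phones = df_phones.withColumn("digits", regexp_replace("phone", r"\D", ""))
df_phones.withColumn("valid", col("digits").rlike(r"^\d{10}$")).show()
```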
Here's a practical, PySpark-friendly regex pattern cheatsheet tailored for data cleaning and extraction tasks using Spark functions like regexp_extract, regexp_replace, rlike, and split.
✅ Regex Cheatsheet for PySpark Data Cleaning
🧹 Use Case | 🐍 Regex Pattern | ✨ Description |
---|---|---|
Extract digits | \d+ | One or more digits |
Remove non-digits | \D | Matches any non-digit (use with regexp_replace) |
Extract letters only | [A-Za-z]+ | Alphabetic characters only |
Remove special characters | [^A-Za-z0-9 ] | Keep only letters, digits, and spaces |
Extract words | \w+ | Alphanumeric + underscore (word characters) |
Remove whitespace | \s+ | All whitespace (tabs, spaces, newlines) |
Keep only alphanumerics | [^A-Za-z0-9] | Remove everything else |
Match email | ^[\w\.-]+@[\w\.-]+\.\w{2,}$ | Email format validation |
Extract domain from email | @(\w+\.\w+) | Pull domain name (e.g., gmail.com) |
Match phone number (10 digits) | ^\d{10}$ | Exactly 10 digits |
Split on commas or semicolons | [;,] | For lists in CSV-style fields |
Find words starting with capital | \b[A-Z][a-z]*\b | Proper names, etc. |
Extract first number in string | (\d+) | Use with regexp_extract(..., 1) |
Extract content inside brackets | \[(.*?)\] | Extracts [content] |
Match date (YYYY-MM-DD) | \d{4}-\d{2}-\d{2} | ISO date pattern |
Extract URL domain | https?://(?:www\.)?([^/]+) | Extracts domain from full URL |
Remove trailing punctuation | [.,!?;:]+$ | Strip ending punctuation |
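A sketch applying a few rows from the cheatsheet to one invented sample string:

```python
from pyspark.sql.functions import regexp_extract, regexp_replace

df_cs = spark.createDataFrame(
    [("See [ref 42] at https://www.example.com/page, 2024-05-01!",)], ["text"]
)

df_cs.select(
    regexp_extract("text", r"\[(.*?)\]", 1).alias("bracket_content"),
    regexp_extract("text", r"https?://(?:www\.)?([^/]+)", 1).alias("url_domain"),
    regexp_extract("text", r"\d{4}-\d{2}-\d{2}", 0).alias("iso_date"),  # group 0 = whole match
    regexp_replace("text", r"[.,!?;:]+$", "").alias("no_trailing_punct"),
).show(truncate=False)
```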
🧠 PySpark Usage Patterns
1. regexp_extract()
```python
from pyspark.sql.functions import regexp_extract

df = df.withColumn("domain", regexp_extract("email", "@(\\w+\\.\\w+)", 1))
```
2. regexp_replace()
```python
from pyspark.sql.functions import regexp_replace

df = df.withColumn("clean_text", regexp_replace("text", "[^A-Za-z0-9 ]", ""))
```
3. rlike() (Boolean filter)
```python
df.filter(df.phone.rlike(r"^\d{10}$")).show()
```
4. split()
```python
from pyspark.sql.functions import split

df = df.withColumn("tags", split("csv_column", "[;,]"))
```
🐍 Tip: Escaping in Regex
- In PySpark pattern strings, double-escape backslashes: "\\d", "\\s", etc.
- Or use raw Python strings: r"\d+"