In PySpark, regex functions let you extract, match, or replace parts of strings using powerful regular expression patterns. These are essential for parsing messy data, extracting tokens, or cleaning text.


🔧 Common PySpark Regex Functions (from pyspark.sql.functions)

Function            Description
regexp_extract()    Extracts the first matching group using a regex
regexp_replace()    Replaces all substrings that match a regex
rlike()             Returns a Boolean indicating whether the pattern matches (Column method)
split()             Splits a string on a regex pattern into an array

✅ 1. regexp_extract(col, pattern, groupIndex)

Extracts a substring using a regex group.

from pyspark.sql.functions import regexp_extract, col

df = spark.createDataFrame([("user_123_name",)], ["raw"])

# Extract the digits
df.withColumn("extracted", regexp_extract("raw", "user_(\\d+)_", 1)).show()

Pattern Explained:

  • user_ → fixed prefix
  • (\d+) → group 1: one or more digits
  • _ → the underscore that follows the digits
  • Group index 1 → extract only the digits

Output:

+-------------+---------+
|          raw|extracted|
+-------------+---------+
|user_123_name|      123|
+-------------+---------+

✅ 2. regexp_replace(col, pattern, replacement)

Replaces all matches of a regex pattern.

from pyspark.sql.functions import regexp_replace

df.withColumn("cleaned", regexp_replace("raw", "_", "-")).show()

Replaces all _ with - → user-123-name
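Output:

+-------------+-------------+
|          raw|      cleaned|
+-------------+-------------+
|user_123_name|user-123-name|
+-------------+-------------+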


✅ 3. rlike(pattern)

Boolean regex match. Note that rlike is a method on Column (called as col("raw").rlike(pattern)), not a standalone function.

df.withColumn("has_digits", col("raw").rlike("\\d+")).show()

Returns True if the string contains any digits.
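Output:

+-------------+----------+
|          raw|has_digits|
+-------------+----------+
|user_123_name|      true|
+-------------+----------+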


✅ 4. split(col, pattern)

Splits string by regex pattern into an array.

from pyspark.sql.functions import split

df.withColumn("tokens", split("raw", "_")).show(truncate=False)

🎯 Notable Regex Patterns and Rules

Pattern      Matches
\d           Any digit (0-9)
\D           Any non-digit
\w           Word character (letters, digits, _)
\W           Non-word character
\s           Whitespace (space, tab, newline)
.            Any character except newline
^ / $        Start / end of line
.*           Zero or more of anything
+, ?, {n}    Quantifiers
()           Capture group
[]           Character class

🧠 Example Use Cases

Use Case                    Function          Example Regex
Extract domain from email   regexp_extract    @(\w+\.\w+)
Remove all non-digits       regexp_replace    \D → ""
Validate phone numbers      rlike             ^\d{10}$
Split CSV-like fields       split             ,\s*
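For instance, the "remove all non-digits" row looks like this in practice; the phone data below is invented for illustration:

from pyspark.sql.functions import regexp_replace

# Hypothetical phone-number data
df_phone = spark.createDataFrame([("(555) 123-4567",)], ["phone"])

# \D matches every non-digit, so replacing it with "" leaves digits only
df_phone.withColumn("digits_only", regexp_replace("phone", "\\D", "")).show()
# digits_only -> 5551234567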

✅ Realistic Example

emails = [("raj@example.com",), ("test123@gmail.com",)]
df_emails = spark.createDataFrame(emails, ["email"])

df_emails.withColumn("domain", regexp_extract("email", "@(\\w+\\.\\w+)", 1)).show()

Output:

+-----------------+-----------+
|            email|     domain|
+-----------------+-----------+
|  raj@example.com|example.com|
|test123@gmail.com|  gmail.com|
+-----------------+-----------+
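The same data can be validated rather than parsed; here is a short sketch that filters with rlike using the email pattern from the cheatsheet below:

# Keep only rows whose email matches a basic email shape
df_emails.filter(df_emails.email.rlike(r"^[\w\.-]+@[\w\.-]+\.\w{2,}$")).show()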

Below is a practical, PySpark-friendly regex pattern cheatsheet, tailored for data cleaning and extraction tasks using Spark functions like regexp_extract, regexp_replace, rlike, and split.


✅ Regex Cheatsheet for PySpark Data Cleaning

🧹 Use Case                        🔍 Regex Pattern               ✨ Description
Extract digits                     \d+                            One or more digits
Remove non-digits                  \D                             Matches any non-digit (use with regexp_replace)
Extract letters only               [A-Za-z]+                      Alphabetic characters only
Remove special characters          [^A-Za-z0-9 ]                  Keep only letters, digits, and spaces
Extract words                      \w+                            Alphanumeric + underscore (word characters)
Remove whitespace                  \s+                            All whitespace (tabs, spaces, newlines)
Keep only alphanumerics            [^A-Za-z0-9]                   Remove everything else
Match email                        ^[\w\.-]+@[\w\.-]+\.\w{2,}$    Email format validation
Extract domain from email          @(\w+\.\w+)                    Pull domain name (e.g., gmail.com)
Match phone number (10 digits)     ^\d{10}$                       Exactly 10 digits
Split on commas or semicolons      [;,]                           For lists in CSV-style fields
Find words starting with capital   \b[A-Z][a-z]*\b                Proper names, etc.
Extract first number in string     (\d+)                          Use with regexp_extract(..., 1)
Extract content inside brackets    \[(.*?)\]                      Extracts [content]
Match date (YYYY-MM-DD)            \d{4}-\d{2}-\d{2}              ISO date pattern
Extract URL domain                 https?://(?:www\.)?([^/]+)     Extracts domain from full URL
Remove trailing punctuation        [.,!?;:]+$                     Strip ending punctuation
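As a quick sketch of two of these rows in action (the log line is made up for illustration):

from pyspark.sql.functions import regexp_extract, regexp_replace

df_logs = spark.createDataFrame([("Run completed on 2024-05-01!!",)], ["line"])

# Group index 0 returns the whole match, so no capture group is needed
df_logs \
    .withColumn("run_date", regexp_extract("line", "\\d{4}-\\d{2}-\\d{2}", 0)) \
    .withColumn("clean", regexp_replace("line", "[.,!?;:]+$", "")) \
    .show(truncate=False)
# run_date -> 2024-05-01, clean -> Run completed on 2024-05-01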

🔧 PySpark Usage Patterns

1. regexp_extract()

from pyspark.sql.functions import regexp_extract

df = df.withColumn("domain", regexp_extract("email", "@(\\w+\\.\\w+)", 1))

2. regexp_replace()

from pyspark.sql.functions import regexp_replace

df = df.withColumn("clean_text", regexp_replace("text", "[^A-Za-z0-9 ]", ""))

3. rlike() (Boolean filter)

df.filter(df.phone.rlike("^\\d{10}$")).show()

4. split()

from pyspark.sql.functions import split

df = df.withColumn("tags", split("csv_column", "[;,]"))

πŸ” Tip: Escaping in Regex

  • In PySpark strings, double escape backslashes: \\d, \\s, etc.
  • Or use raw Python strings: r"\d+"
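A minimal sanity check that both forms behave identically, reusing the df from the first example:

from pyspark.sql.functions import regexp_extract

# Both patterns mean "one or more digits"; the two columns come out identical
df.withColumn("escaped", regexp_extract("raw", "\\d+", 0)) \
  .withColumn("raw_str", regexp_extract("raw", r"\d+", 0)) \
  .show()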
