In PySpark, regex functions let you extract, match, or replace parts of strings using powerful regular expression patterns. These are essential for parsing messy data, extracting tokens, or cleaning text.
🧠 Common PySpark Regex Functions (from pyspark.sql.functions)
Function | Description |
---|---|
regexp_extract() | Extracts first matching group using a regex |
regexp_replace() | Replaces all substrings matching regex |
rlike() | Returns True if the regex pattern matches (a Column method) |
split() | Splits string using regex pattern |
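All snippets below assume an active SparkSession bound to the name spark. If you're running them standalone, a minimal setup (the app name is arbitrary) looks like this:

```python
from pyspark.sql import SparkSession

# Local session; every example below reuses this `spark` handle
spark = SparkSession.builder.appName("regex-demo").getOrCreate()
```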
---
1. regexp_extract(col, pattern, groupIndex)
Extracts a substring using a regex group.
```python
from pyspark.sql.functions import regexp_extract, col

df = spark.createDataFrame([("user_123_name",)], ["raw"])

# Extract the digits between the underscores
df.withColumn("extracted", regexp_extract("raw", "user_(\\d+)_", 1)).show()
```
Pattern explained:
- `user_` → fixed prefix
- `(\d+)` → group 1: one or more digits
- `_` → fixed suffix
- Group index `1` → we extract only the digits
Output:
```
+-------------+---------+
|          raw|extracted|
+-------------+---------+
|user_123_name|      123|
+-------------+---------+
```
---
2. regexp_replace(col, pattern, replacement)
Replaces all matches of a regex pattern.
```python
from pyspark.sql.functions import regexp_replace

df.withColumn("cleaned", regexp_replace("raw", "_", "-")).show()
```
Replaces every `_` with `-`, so `user_123_name` becomes `user-123-name`.
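Output:
```
+-------------+-------------+
|          raw|      cleaned|
+-------------+-------------+
|user_123_name|user-123-name|
+-------------+-------------+
```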
---
3. rlike(pattern)
Boolean regex match. Note that rlike is a method on a Column, not a standalone function.
df.withColumn("has_digits", col("raw").rlike("\\d+")).show()
Returns True if the string contains at least one digit.
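Output:
```
+-------------+----------+
|          raw|has_digits|
+-------------+----------+
|user_123_name|      true|
+-------------+----------+
```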
---
4. split(col, pattern)
Splits string by regex pattern into an array.
```python
from pyspark.sql.functions import split

df.withColumn("tokens", split("raw", "_")).show(truncate=False)
```
🎯 Notable Regex Patterns and Rules
Pattern | Matches |
---|---|
\d | Any digit (0-9) |
\D | Any non-digit |
\w | Word character (letters, digits, _) |
\W | Non-word character |
\s | Whitespace (space, tab, newline) |
. | Any character except newline |
^ / $ | Start / end of line |
.* | Zero or more of anything |
+ , ? , {n} | Quantifiers |
() | Capture group |
[] | Character class |
\| | Alternation (OR) |
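A quick sketch exercising a few of these rules with rlike (the sample strings and column names are made up for illustration):

```python
from pyspark.sql.functions import col

df_pat = spark.createDataFrame([("abc123",), ("   ",), ("2024-01-15",)], ["s"])

df_pat.select(
    col("s"),
    col("s").rlike(r"^\w+$").alias("all_word_chars"),  # anchors + \w with a quantifier
    col("s").rlike(r"\d{4}").alias("has_4_digits"),    # {n} quantifier
    col("s").rlike(r"^\s*$").alias("blank"),           # whitespace character class
).show()
```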
🧠 Example Use Cases
Use Case | Function | Example Regex |
---|---|---|
Extract domain from email | regexp_extract | @(\w+\.\w+) |
Remove all non-digits | regexp_replace | \D → "" |
Validate phone numbers | rlike | ^\d{10}$ |
Split CSV-like field | split | ,\s* |
✅ Realistic Example
```python
emails = [("raj@example.com",), ("test123@gmail.com",)]
df_emails = spark.createDataFrame(emails, ["email"])

df_emails.withColumn("domain", regexp_extract("email", "@(\\w+\\.\\w+)", 1)).show()
```
Output:
```
+-----------------+-----------+
|            email|     domain|
+-----------------+-----------+
|  raj@example.com|example.com|
|test123@gmail.com|  gmail.com|
+-----------------+-----------+
```
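The other rows of the use-case table follow the same shape. A sketch combining regexp_replace and rlike for the phone-number cases (the sample numbers are invented):

```python
from pyspark.sql.functions import regexp_replace, col

phones = [("(555) 123-4567",), ("12345",)]
df_phones = spark.createDataFrame(phones, ["phone"])

# Strip everything that isn't a digit, then validate the 10-digit result
df_phones = df_phones.withColumn("digits", regexp_replace("phone", r"\D", ""))
df_phones.withColumn("valid", col("digits").rlike(r"^\d{10}$")).show()
```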
Here's a practical, PySpark-friendly regex pattern cheatsheet tailored for data cleaning and extraction tasks using Spark functions like regexp_extract, regexp_replace, rlike, and split.
✅ Regex Cheatsheet for PySpark Data Cleaning
🧹 Use Case | 🐍 Regex Pattern | ✨ Description |
---|---|---|
Extract digits | \d+ | One or more digits |
Remove non-digits | \D | Matches any non-digit (use with regexp_replace) |
Extract letters only | [A-Za-z]+ | Alphabetic characters only |
Remove special characters | [^A-Za-z0-9 ] | Keep only letters, digits, and spaces |
Extract words | \w+ | Alphanumeric + underscore (word characters) |
Remove whitespace | \s+ | All whitespace (tabs, spaces, newlines) |
Keep only alphanumerics | [^A-Za-z0-9] | Remove everything else |
Match email | ^[\w\.-]+@[\w\.-]+\.\w{2,}$ | Email format validation |
Extract domain from email | @(\w+\.\w+) | Pull domain name (e.g., gmail.com) |
Match phone number (10 digits) | ^\d{10}$ | Exactly 10 digits |
Split on commas or semicolons | [;,] | For lists in CSV-style fields |
Find words starting with capital | \b[A-Z][a-z]*\b | Proper names, etc. |
Extract first number in string | (\d+) | Use with regexp_extract(..., 1) |
Extract content inside brackets | \[(.*?)\] | Extracts [content] |
Match date (YYYY-MM-DD) | \d{4}-\d{2}-\d{2} | ISO date pattern |
Extract URL domain | https?://(?:www\.)?([^/]+) | Extracts domain from full URL |
Remove trailing punctuation | [.,!?;:]+$ | Strip ending punctuation |
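A sketch applying a few rows from the cheatsheet to one invented sample string:

```python
from pyspark.sql.functions import regexp_extract, regexp_replace

df_cs = spark.createDataFrame(
    [("See [ref 42] at https://www.example.com/page, 2024-05-01!",)], ["text"]
)

df_cs.select(
    regexp_extract("text", r"\[(.*?)\]", 1).alias("bracket_content"),
    regexp_extract("text", r"https?://(?:www\.)?([^/]+)", 1).alias("url_domain"),
    regexp_extract("text", r"\d{4}-\d{2}-\d{2}", 0).alias("iso_date"),  # group 0 = whole match
    regexp_replace("text", r"[.,!?;:]+$", "").alias("no_trailing_punct"),
).show(truncate=False)
```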
🧠 PySpark Usage Patterns
1. regexp_extract()
```python
from pyspark.sql.functions import regexp_extract

df = df.withColumn("domain", regexp_extract("email", "@(\\w+\\.\\w+)", 1))
```
2. regexp_replace()
```python
from pyspark.sql.functions import regexp_replace

df = df.withColumn("clean_text", regexp_replace("text", "[^A-Za-z0-9 ]", ""))
```
3. rlike() (Boolean filter)
```python
df.filter(df.phone.rlike(r"^\d{10}$")).show()
```
4. split()
```python
from pyspark.sql.functions import split

df = df.withColumn("tags", split("csv_column", "[;,]"))
```
🐍 Tip: Escaping in Regex
- In PySpark pattern strings, double-escape backslashes: "\\d", "\\s", etc.
- Or use raw Python strings: r"\d+"