In PySpark, regex functions let you extract, match, or replace parts of strings using powerful regular expression patterns. These are essential for parsing messy data, extracting tokens, or cleaning text.
🔧 Common PySpark Regex Functions (from `pyspark.sql.functions`)
| Function | Description |
|---|---|
| `regexp_extract()` | Extracts the first matching group using a regex |
| `regexp_replace()` | Replaces all substrings matching a regex |
| `rlike()` | Returns a Boolean indicating whether the regex matches |
| `split()` | Splits a string using a regex pattern |
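The examples below assume an active SparkSession bound to the name `spark`. A minimal setup sketch (the app name here is arbitrary):

```python
from pyspark.sql import SparkSession

# Create (or reuse) a local SparkSession for the examples below
spark = SparkSession.builder.appName("regex-demo").getOrCreate()
```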
✅ 1. regexp_extract(col, pattern, groupIndex)
Extracts a substring using a regex group.
```python
from pyspark.sql.functions import regexp_extract, col

df = spark.createDataFrame([("user_123_name",)], ["raw"])

# Extract the digits between the fixed prefix and suffix
df.withColumn("extracted", regexp_extract("raw", "user_(\\d+)_", 1)).show()
```
Pattern Explained:
- `user_` → fixed prefix
- `(\d+)` → group 1: one or more digits
- `_` → fixed suffix
- Group index `1` → we extract only the digits
Output:

```
+-------------+---------+
|          raw|extracted|
+-------------+---------+
|user_123_name|      123|
+-------------+---------+
```
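For comparison, group index `0` returns the entire match rather than a single capture group:

```python
# Group index 0 yields the whole match, including the fixed parts
df.withColumn("full_match", regexp_extract("raw", "user_(\\d+)_", 0)).show()
# -> full_match = "user_123_"
```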
✅ 2. regexp_replace(col, pattern, replacement)
Replaces all matches of a regex pattern.
```python
from pyspark.sql.functions import regexp_replace

df.withColumn("cleaned", regexp_replace("raw", "_", "-")).show()
```
Replaces all `_` with `-` → `user-123-name`.
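The same function covers the common "strip non-digits" cleanup, using the `\D` pattern from the cheatsheet further below:

```python
# Remove every non-digit character: "user_123_name" -> "123"
df.withColumn("digits_only", regexp_replace("raw", "\\D", "")).show()
```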
✅ 3. rlike(pattern)
Boolean match for a regex; called as a method on a Column.
df.withColumn("has_digits", col("raw").rlike("\\d+")).show()
Returns `True` if the string contains any digits.
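Because it returns a Boolean column, `rlike()` is most often used inside `filter()`:

```python
# Keep only rows whose raw value contains at least one digit
df.filter(col("raw").rlike("\\d+")).show()
```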
✅ 4. split(col, pattern)
Splits string by regex pattern into an array.
```python
from pyspark.sql.functions import split

df.withColumn("tokens", split("raw", "_")).show(truncate=False)
```
🎯 Notable Regex Patterns and Rules
| Pattern | Matches |
|---|---|
| `\d` | Any digit (0-9) |
| `\D` | Any non-digit |
| `\w` | Word character (letters, digits, `_`) |
| `\W` | Non-word character |
| `\s` | Whitespace (space, tab) |
| `.` | Any character except newline |
| `^` / `$` | Start / end of line |
| `.*` | Zero or more of anything |
| `+`, `?`, `{n}` | Quantifiers |
| `()` | Capture group |
| `[]` | Character class |
| `\|` | Alternation (OR) |
🧠 Example Use Cases
| Use Case | Function | Example Regex |
|---|---|---|
| Extract domain from email | `regexp_extract` | `@(\w+\.\w+)` |
| Remove all non-digits | `regexp_replace` | `\D` → `""` |
| Validate phone numbers | `rlike` | `^\d{10}$` |
| Split CSV-like field | `split` | `,\s*` |
✅ Realistic Example
```python
emails = [("raj@example.com",), ("test123@gmail.com",)]
df_emails = spark.createDataFrame(emails, ["email"])

df_emails.withColumn("domain", regexp_extract("email", "@(\\w+\\.\\w+)", 1)).show()
```
Output:

```
+-----------------+-----------+
|            email|     domain|
+-----------------+-----------+
|  raj@example.com|example.com|
|test123@gmail.com|  gmail.com|
+-----------------+-----------+
```
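A quick validation pass on the same DataFrame, using the email pattern from the cheatsheet below:

```python
# Keep only rows that match a basic email shape
df_emails.filter(df_emails.email.rlike(r"^[\w\.-]+@[\w\.-]+\.\w{2,}$")).show()
```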
Here's a practical, PySpark-friendly regex pattern cheatsheet, tailored for data cleaning and extraction tasks using Spark functions like `regexp_extract`, `regexp_replace`, `rlike`, and `split`.
✅ Regex Cheatsheet for PySpark Data Cleaning
| 🧹 Use Case | 🔍 Regex Pattern | ✨ Description |
|---|---|---|
| Extract digits | `\d+` | One or more digits |
| Remove non-digits | `\D` | Matches any non-digit (use with `regexp_replace`) |
| Extract letters only | `[A-Za-z]+` | Alphabetic characters only |
| Remove special characters | `[^A-Za-z0-9 ]` | Keep only letters, digits, and spaces |
| Extract words | `\w+` | Alphanumeric + underscore (word characters) |
| Remove whitespace | `\s+` | All whitespace (tabs, spaces, newlines) |
| Keep only alphanumerics | `[^A-Za-z0-9]` | Remove everything else |
| Match email | `^[\w\.-]+@[\w\.-]+\.\w{2,}$` | Email format validation |
| Extract domain from email | `@(\w+\.\w+)` | Pull domain name (e.g., gmail.com) |
| Match phone number (10 digits) | `^\d{10}$` | Exactly 10 digits |
| Split on commas or semicolons | `[;,]` | For lists in CSV-style fields |
| Find words starting with a capital | `\b[A-Z][a-z]*\b` | Proper names, etc. |
| Extract first number in string | `(\d+)` | Use with `regexp_extract(..., 1)` |
| Extract content inside brackets | `\[(.*?)\]` | Extracts `[content]` |
| Match date (YYYY-MM-DD) | `\d{4}-\d{2}-\d{2}` | ISO date pattern |
| Extract URL domain | `https?://(?:www\.)?([^/]+)` | Extracts domain from full URL |
| Remove trailing punctuation | `[.,!?;:]+$` | Strip ending punctuation |
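A small sketch applying two of these rows, extracting an ISO date and stripping trailing punctuation; the `notes` sample data here is made up for illustration:

```python
from pyspark.sql.functions import regexp_extract, regexp_replace

notes = [("Meeting on 2024-05-01!!",), ("Call back 2024-06-15.",)]
df_notes = spark.createDataFrame(notes, ["note"])

# Group index 0 returns the whole date match; the replace strips the ending punctuation
df_notes \
    .withColumn("date", regexp_extract("note", r"\d{4}-\d{2}-\d{2}", 0)) \
    .withColumn("clean", regexp_replace("note", r"[.,!?;:]+$", "")) \
    .show(truncate=False)
```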
🔧 PySpark Usage Patterns
1. regexp_extract()

```python
from pyspark.sql.functions import regexp_extract

df = df.withColumn("domain", regexp_extract("email", "@(\\w+\\.\\w+)", 1))
```
2. regexp_replace()

```python
from pyspark.sql.functions import regexp_replace

df = df.withColumn("clean_text", regexp_replace("text", "[^A-Za-z0-9 ]", ""))
```
3. rlike() (Boolean filter)

```python
# Raw string avoids double-escaping the \d
df.filter(df.phone.rlike(r"^\d{10}$")).show()
```
4. split()

```python
from pyspark.sql.functions import split

df = df.withColumn("tags", split("csv_column", "[;,]"))
```
🔍 Tip: Escaping in Regex
- In regular Python strings, double-escape backslashes: `"\\d"`, `"\\s"`, etc.
- Or use raw Python strings: `r"\d+"`