Regular expressions (regex) are a compact notation for describing patterns in text. Python’s built-in re module provides functions for searching, extracting, and replacing text with them. This tutorial walks through the core functions, the pattern syntax, and two worked examples.

1. Importing the re Module

To use regular expressions in Python, you need to import the re module:

import re

2. Basic Functions

re.search()

Scans the string for the first location where the pattern matches and returns a match object, or None if there is no match.

pattern = r"\d+"  # Matches one or more digits
text = "The year is 2024"
match = re.search(pattern, text)
if match:
    print(f"Match found: {match.group()}")  # Output: Match found: 2024

re.match()

Matches a pattern only at the beginning of a string and returns None if the string does not start with it.

pattern = r"\d+"  # Matches one or more digits
text = "2024 is the year"
match = re.match(pattern, text)
if match:
    print(f"Match found: {match.group()}")  # Output: Match found: 2024
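
The difference between re.match(), re.search(), and re.fullmatch() is easiest to see side by side. The following is a small sketch; the sample strings are made up for illustration.

```python
import re

text = "Order 66 shipped"  # hypothetical sample text

# re.match() only tries the pattern at position 0, so it fails here
at_start = re.match(r"\d+", text)

# re.search() scans the whole string and returns the first occurrence
anywhere = re.search(r"\d+", text)

# re.fullmatch() requires the entire string to match the pattern
whole_ok = re.fullmatch(r"\d+", "2024")
whole_bad = re.fullmatch(r"\d+", "2024 AD")

print(at_start)          # None
print(anywhere.group())  # 66
print(whole_ok.group())  # 2024
print(whole_bad)         # None
```

Because all three return None on failure, it is safest to check the result before calling .group() on it.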

re.findall()

Finds all non-overlapping matches of a pattern in a string and returns them as a list.

pattern = r"\d+"  # Matches one or more digits
text = "The years are 2024, 2025, and 2026"
matches = re.findall(pattern, text)
print(f"Matches found: {matches}")  # Output: Matches found: ['2024', '2025', '2026']

re.finditer()

Returns an iterator yielding match objects for all non-overlapping matches.

pattern = r"\d+"  # Matches one or more digits
text = "The years are 2024, 2025, and 2026"
matches = re.finditer(pattern, text)
for match in matches:
    print(f"Match found: {match.group()}")  # Prints one line per match: 2024, then 2025, then 2026
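
Since re.finditer() yields full match objects, it can also report where each match sits in the string. A minimal sketch using the same sample text:

```python
import re

text = "The years are 2024, 2025, and 2026"

# Each match object knows its text and its start/end offsets in the string
positions = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\d+", text)]

print(positions)  # [('2024', 14, 18), ('2025', 20, 24), ('2026', 30, 34)]
```

The end offset is exclusive, so text[m.start():m.end()] gives back the matched text.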

re.sub()

Replaces occurrences of a pattern with a specified string.

pattern = r"\d+"  # Matches one or more digits
text = "The years are 2024, 2025, and 2026"
new_text = re.sub(pattern, "YEAR", text)
print(new_text)  # Output: The years are YEAR, YEAR, and YEAR
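
The replacement passed to re.sub() does not have to be a fixed string. It can refer back to captured groups, or it can be a function that computes the replacement for each match. The sketch below illustrates both, plus the count argument; the sample sentences and the helper name next_year are made up for illustration.

```python
import re

text = "Due 2024-06-27, renewed 2025-07-28"

# \1, \2, \3 refer to the captured groups, reordered here to DD/MM/YYYY
reordered = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", text)
print(reordered)  # Due 27/06/2024, renewed 28/07/2025

# A function receives each match object and returns the replacement string
def next_year(match):
    return str(int(match.group()) + 1)

bumped = re.sub(r"\d{4}", next_year, "The years are 2024 and 2025")
print(bumped)  # The years are 2025 and 2026

# count limits how many replacements are made
first_only = re.sub(r"\d{4}", "YEAR", "The years are 2024 and 2025", count=1)
print(first_only)  # The years are YEAR and 2025
```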

3. Special Characters

  • .: Matches any character except a newline.
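  • ^: Matches the start of the string (and, with re.MULTILINE, the start of each line).
  • $: Matches the end of the string or just before a trailing newline (and, with re.MULTILINE, the end of each line).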
  • *: Matches 0 or more repetitions of the preceding element.
  • +: Matches 1 or more repetitions of the preceding element.
  • ?: Matches 0 or 1 repetition of the preceding element.
  • {m,n}: Matches from m to n repetitions of the preceding element.
  • []: Matches any one of the characters inside the brackets.
  • |: Matches either the pattern before or the pattern after the |.
  • (): Groups patterns and captures matches.
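
A few one-line checks can make the list above concrete. This is a sketch with made-up sample strings, not an exhaustive reference. The last two lines also show that + (like *) is greedy by default and matches as little as possible when followed by ?.

```python
import re

# . matches any single character except a newline
dot = re.findall(r"c.t", "cat cot c\nt cut")
print(dot)  # ['cat', 'cot', 'cut']

# ? makes the preceding element optional
optional = re.findall(r"colou?r", "color colour")
print(optional)  # ['color', 'colour']

# {m,n} bounds the number of repetitions (\b marks a word boundary)
bounded = re.findall(r"\b\d{2,3}\b", "1 22 333 4444")
print(bounded)  # ['22', '333']

# [] matches one character from a set; | chooses between alternatives
char_set = re.findall(r"gr[ae]y", "gray grey groy")
either = re.findall(r"cat|dog", "cat, dog, bird")
print(char_set, either)  # ['gray', 'grey'] ['cat', 'dog']

# Greedy vs. lazy repetition
quoted = 'say "yes" or "no"'
greedy = re.findall(r'".+"', quoted)
lazy = re.findall(r'".+?"', quoted)
print(greedy)  # ['"yes" or "no"']
print(lazy)    # ['"yes"', '"no"']
```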

4. Escaping Special Characters

To match a literal special character, escape it with a backslash (\).

pattern = r"\$100"  # Matches the string "$100"
text = "The price is $100"
match = re.search(pattern, text)
if match:
    print(f"Match found: {match.group()}")  # Output: Match found: $100
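
When the text to match comes from a variable, escaping each character by hand is error-prone. The standard library's re.escape() does it for you. A minimal sketch, with user_input as a hypothetical value:

```python
import re

user_input = "1+1=2 (approx.)"  # hypothetical text containing regex metacharacters

# re.escape() backslash-escapes every character that has a special meaning
safe_pattern = re.escape(user_input)

text = "They wrote 1+1=2 (approx.) on the board"
escaped_match = re.search(safe_pattern, text)
print(escaped_match.group())  # 1+1=2 (approx.)

# Without escaping, + means "one or more", so the literal text is not found
unescaped_match = re.search("1+1=2", text)
print(unescaped_match)  # None
```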

5. Character Classes

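  • \d: Matches any digit ([0-9] when re.ASCII is set; by default, str patterns also match other Unicode decimal digits).
  • \D: Matches any non-digit.
  • \w: Matches any word character: letters, digits, and the underscore ([a-zA-Z0-9_] when re.ASCII is set; by default, str patterns also match Unicode letters and digits).
  • \W: Matches any character that is not a word character.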
  • \s: Matches any whitespace character (spaces, tabs, newlines).
  • \S: Matches any non-whitespace character.
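
In Python 3, these classes are Unicode-aware for str patterns unless the re.ASCII flag is passed. The sketch below shows the difference on a made-up sample string; the non-ASCII characters are written as escapes so the snippet does not depend on the file encoding.

```python
import re

# Hypothetical sample: accented letters plus the Arabic-Indic digit three (U+0663)
text = "caf\u00e9_42 na\u00efve \u0663"

# Default (Unicode) behavior: accented letters and U+0663 count as word characters
unicode_words = re.findall(r"\w+", text)
# With re.ASCII, \w is limited to [a-zA-Z0-9_]
ascii_words = re.findall(r"\w+", text, re.ASCII)

# The same applies to \d
unicode_digits = re.findall(r"\d", text)
ascii_digits = re.findall(r"\d", text, re.ASCII)

# \s covers spaces, tabs, and newlines alike
fields = re.split(r"\s+", "a  b\tc\nd")

print(len(unicode_words))  # 3
print(ascii_words)         # ['caf', '_42', 'na', 've']
print(len(unicode_digits)) # 3
print(ascii_digits)        # ['4', '2']
print(fields)              # ['a', 'b', 'c', 'd']
```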

6. Groups and Capturing

You can use parentheses to create groups and capture parts of the match.

pattern = r"(\d{4})-(\d{2})-(\d{2})"  # Matches dates in the format YYYY-MM-DD
text = "Today's date is 2024-06-27"
match = re.search(pattern, text)
if match:
    year, month, day = match.groups()
    print(f"Year: {year}, Month: {month}, Day: {day}")  # Output: Year: 2024, Month: 06, Day: 27
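
Groups also change what re.findall() returns, which is a common source of surprise. The sketch below, on a made-up sample string, contrasts capturing groups with non-capturing groups, written (?:...).

```python
import re

text = "Dates: 2024-06-27 and 2025-07-28"

# With several capturing groups, findall() returns one tuple per match
pairs = re.findall(r"(\d{4})-(\d{2})-(\d{2})", text)
print(pairs)  # [('2024', '06', '27'), ('2025', '07', '28')]

# (?:...) groups without capturing, so findall() returns the whole match
whole = re.findall(r"\d{4}(?:-\d{2}){2}", text)
print(whole)  # ['2024-06-27', '2025-07-28']

# group(0) is the entire match; group(n) is the n-th capturing group
match = re.search(r"(\d{4})-(\d{2})-(\d{2})", text)
print(match.group(0), match.group(1), match.span(1))  # 2024-06-27 2024 (7, 11)
```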

7. Named Groups

Named groups allow you to assign a name to a capturing group for easier access.

pattern = r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})"  # Matches dates in the format YYYY-MM-DD
text = "Today's date is 2024-06-27"
match = re.search(pattern, text)
if match:
    print(f"Year: {match.group('year')}, Month: {match.group('month')}, Day: {match.group('day')}")  # Output: Year: 2024, Month: 06, Day: 27
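
Named groups are also available as a dictionary through groupdict(), and a replacement string can refer to them with \g<name>. A short sketch, where the variable name date_re and the sample sentences are chosen for illustration:

```python
import re

date_re = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")

# groupdict() maps each group name to the text it captured
match = date_re.search("Today's date is 2024-06-27")
parts = match.groupdict()
print(parts)  # {'year': '2024', 'month': '06', 'day': '27'}

# \g<name> refers to a named group inside a replacement string
us_style = date_re.sub(r"\g<month>/\g<day>/\g<year>", "Released 2024-06-27")
print(us_style)  # Released 06/27/2024
```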

8. Lookahead and Lookbehind

Lookaheads and lookbehinds are zero-width assertions: they let a pattern match only if it is (or isn’t) followed or preceded by another pattern, without including that other pattern in the match.

Positive Lookahead

pattern = r"\d+(?= dollars)"  # Matches digits followed by "dollars"
text = "The price is 100 dollars"
match = re.search(pattern, text)
if match:
    print(f"Match found: {match.group()}")  # Output: Match found: 100

Negative Lookahead

# \b keeps \d+ from backtracking to a partial number such as "10"
pattern = r"\d+\b(?! dollars)"  # Matches whole numbers not followed by " dollars"
text = "The price is 100 dollars or 200 euros"
matches = re.findall(pattern, text)
print(f"Matches found: {matches}")  # Output: Matches found: ['200']

Positive Lookbehind

pattern = r"(?<=\$)\d+"  # Matches digits preceded by a dollar sign
text = "The price is $100"
match = re.search(pattern, text)
if match:
    print(f"Match found: {match.group()}")  # Output: Match found: 100

Negative Lookbehind

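# Including \d in the lookbehind stops a match from starting mid-number (e.g. "00" in "$200")
pattern = r"(?<![\$\d])\d+"  # Matches whole numbers not preceded by a dollar sign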
text = "The price is 100 dollars or $200"
matches = re.findall(pattern, text)
print(f"Matches found: {matches}")  # Output: Matches found: ['100']

9. Compiling Regular Expressions

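When you use the same regex many times, you can compile it once with re.compile() and reuse the resulting pattern object. The re module already caches recently used patterns, so the speed gain is often modest, but a compiled pattern keeps the regex in one named place and exposes the same methods as the module-level functions.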

pattern = re.compile(r"\d{4}-\d{2}-\d{2}")  # Compile the regex pattern
text = "The dates are 2024-06-27 and 2025-07-28"

# Use the compiled pattern
matches = pattern.findall(text)
print(f"Matches found: {matches}")  # Output: Matches found: ['2024-06-27', '2025-07-28']
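
A compiled pattern is convenient when the same regex is applied to many strings. The sketch below reuses one pattern object across a list of lines; the names date_re and lines and the sample data are made up for illustration.

```python
import re

date_re = re.compile(r"\d{4}-\d{2}-\d{2}")

lines = [
    "2024-06-27 started",
    "no date on this line",
    "finished 2025-07-28",
]

# The compiled object offers search(), match(), sub(), and the other operations
dated = [line for line in lines if date_re.search(line)]
starts_with_date = [line for line in lines if date_re.match(line)]
masked = [date_re.sub("<DATE>", line) for line in lines]

print(dated)             # ['2024-06-27 started', 'finished 2025-07-28']
print(starts_with_date)  # ['2024-06-27 started']
print(masked)            # ['<DATE> started', 'no date on this line', 'finished <DATE>']
```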

10. Flags

You can use flags to modify the behavior of regex functions. Common flags include:

  • re.IGNORECASE (re.I): Performs case-insensitive matching.
  • re.MULTILINE (re.M): Makes ^ and $ also match at the start and end of each line.
  • re.DOTALL (re.S): Dot matches all characters, including newline.

pattern = r"^hello"
text = "Hello\nhello"

# Without flags: ^ matches only at the start of the string, and "Hello" differs in case
matches = re.findall(pattern, text)
print(f"Matches without flag: {matches}")  # Output: Matches without flag: []

# With IGNORECASE and MULTILINE flags
matches = re.findall(pattern, text, re.IGNORECASE | re.MULTILINE)
print(f"Matches with flags: {matches}")  # Output: Matches with flags: ['Hello', 'hello']
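
The example above covers re.IGNORECASE and re.MULTILINE. The sketch below does the same for re.DOTALL, and also shows re.VERBOSE, a further standard flag not listed above that allows whitespace and comments inside a pattern. The sample strings are made up for illustration.

```python
import re

text = "BEGIN line one\nline two END"

# By default . stops at a newline, so a span covering two lines is not matched
no_flag = re.search(r"BEGIN.*END", text)
print(no_flag)  # None

# re.DOTALL lets . match the newline as well
with_dotall = re.search(r"BEGIN.*END", text, re.DOTALL)
print(with_dotall.group() == text)  # True

# re.VERBOSE ignores whitespace in the pattern and allows # comments
date_re = re.compile(r"""
    (\d{4})  # year
    -
    (\d{2})  # month
    -
    (\d{2})  # day
""", re.VERBOSE)
date_parts = date_re.search("Due 2024-06-27").groups()
print(date_parts)  # ('2024', '06', '27')
```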

Example: Searching for Emails in a Database

Suppose you have a database with user data, and you want to extract all email addresses.

import re

# Sample data representing rows in a database
data = [
    "Alice, alice@example.com, 123-456-7890",
    "Bob, bob123@gmail.com, 987-654-3210",
    "Charlie, charlie123@company.org, 555-555-5555",
    "Invalid data, no email here, 000-000-0000"
]

# Simplified regex pattern for matching email addresses; it does not cover every valid address form
email_pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'

emails = []

for entry in data:
    found_emails = re.findall(email_pattern, entry)
    emails.extend(found_emails)

print("Extracted Emails:")
for email in emails:
    print(email)

Example: Searching for Code Snippets in a Code Repository

Suppose you have a repository with Python files, and you want to find all instances of function definitions.

import re
import os

# Directory containing Python files
code_directory = 'path_to_code_repository'

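# Regex pattern for matching function definitions in Python. It stops at the opening
# parenthesis so that multi-line signatures and return annotations (-> type) still match.
function_pattern = r'\bdef\s+([a-zA-Z_][a-zA-Z0-9_]*)\s*\('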

functions = []

# Walk through the directory and process each Python file
for root, dirs, files in os.walk(code_directory):
    for file in files:
        if file.endswith('.py'):
            file_path = os.path.join(root, file)
            with open(file_path, 'r', encoding='utf-8') as f:
                file_content = f.read()
                found_functions = re.findall(function_pattern, file_content)
                for func in found_functions:
                    functions.append((file, func))

print("Found Functions:")
for file, func in functions:
    print(f"Function '{func}' found in file '{file}'")
