Python Regular Expressions

Regular expressions (regex or regexp) are sequences of characters that define search patterns. They're extremely useful for finding, matching, and manipulating text. In this tutorial, you'll learn how to use Python's built-in re module to perform pattern matching operations.

💡 Real-World Applications

Regular expressions are used in many real-world scenarios:

  • Form validation (email, phone numbers, passwords)
  • Data extraction and web scraping
  • Search and replace operations
  • Text parsing and processing
  • Log file analysis

Introduction to Regular Expressions

To use regular expressions in Python, we need to import the re module:

Python
import re

# Basic example: finding a pattern in a string
text = "Contact us at info@example.com for more information."
pattern = r"info@example\.com"  # \. matches a literal dot

if re.search(pattern, text):
    print("Email address found!")
else:
    print("Email address not found.")

⚠️ Note on Raw Strings

Notice the r prefix in r"info@example\.com". This creates a "raw string" that treats backslashes literally, which is important in regex patterns where backslashes have special meaning. Here the \. escapes the dot so that it matches a literal period rather than any character.
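
The effect of the r prefix can be checked directly. The short sketch below compares raw and regular string literals; the sample text "Order 66, aisle 7" is made up for illustration.

```python
import re

# "\n" is one newline character; r"\n" is two characters: a backslash and "n".
print(len("\n"))   # 1
print(len(r"\n"))  # 2

# Without the r prefix, every regex backslash has to be doubled.
escaped = "\\d+"
raw = r"\d+"
print(escaped == raw)  # True

# Either spelling finds the same runs of digits.
digits = re.findall(raw, "Order 66, aisle 7")
print(digits)  # ['66', '7']
```

Raw strings do not change how the regex engine works; they only stop Python from interpreting the backslashes before the pattern reaches the re module.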

Basic Pattern Matching

The re module provides several functions for working with regular expressions:

Python
import re

text = "Python programming is fun and powerful!"

# 1. search() - Find the first match
result = re.search(r"fun", text)
if result:
    print(f"Match found at position: {result.start()}")  # Output: Match found at position: 22

# 2. findall() - Find all matches
matches = re.findall(r"p[a-z]*", text, re.IGNORECASE)
print(matches)  # Output: ['Python', 'programming', 'powerful']

# 3. match() - Check if string starts with the pattern
if re.match(r"Python", text):
    print("Text starts with 'Python'")  # This will print

# 4. split() - Split string by pattern (\s+ is one or more whitespace characters)
words = re.split(r"\s+", text)
print(words)  # Output: ['Python', 'programming', 'is', 'fun', 'and', 'powerful!']

# 5. sub() - Replace pattern with another string
new_text = re.sub(r"fun", "enjoyable", text)
print(new_text)  # Output: Python programming is enjoyable and powerful!

Regular Expression Patterns

Regular expression patterns use special characters to match different types of text:

Character  Description                                              Example
.          Matches any character except newline                     a.c matches "abc", "axc", etc.
^          Matches start of string                                  ^hello matches strings starting with "hello"
$          Matches end of string                                    world$ matches strings ending with "world"
*          Matches 0 or more repetitions                            ab*c matches "ac", "abc", "abbc", etc.
+          Matches 1 or more repetitions                            ab+c matches "abc", "abbc", but not "ac"
?          Matches 0 or 1 repetition                                ab?c matches "ac" or "abc"
{n}        Matches exactly n repetitions                            a{3} matches "aaa"
{n,}       Matches n or more repetitions                            a{2,} matches "aa", "aaa", etc.
{n,m}      Matches between n and m repetitions                      a{1,3} matches "a", "aa", or "aaa"
[]         Character set: matches any character in the brackets     [abc] matches "a", "b", or "c"
[^]        Negated set: matches any character not in the brackets   [^abc] matches any character except "a", "b", or "c"
\d         Matches any digit (0-9)                                  \d{3} matches three digits like "123"
\w         Matches any word character (a-z, A-Z, 0-9, _)            \w+ matches words like "Python3"
\s         Matches any whitespace character                         hello\sworld matches "hello world"
|          Alternation (OR)                                         cat|dog matches either "cat" or "dog"
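
Several rows of the table can be verified with a few lines of code. This is a minimal sketch: the checks list is an illustrative helper, not part of the re module, and it uses re.fullmatch(), which succeeds only when the entire string matches the pattern.

```python
import re

# Each tuple is (pattern, text, whether a full match is expected).
checks = [
    (r"a.c", "axc", True),      # . matches any single character
    (r"ab*c", "ac", True),      # * allows zero repetitions of "b"
    (r"ab+c", "ac", False),     # + requires at least one "b"
    (r"a{2,}", "aaa", True),    # {2,} means two or more
    (r"[^abc]", "a", False),    # the negated set rejects "a"
    (r"\d{3}", "123", True),    # exactly three digits
    (r"cat|dog", "dog", True),  # alternation
]

results = []
for pattern, text, expected in checks:
    matched = re.fullmatch(pattern, text) is not None
    results.append(matched)
    print(f"{pattern!r:12} vs {text!r:6} -> {matched} (expected {expected})")
```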

Common Regex Examples

Email Validation

Python
import re

def is_valid_email(email):
    # Simple email pattern: name@domain.tld (not a full RFC-compliant check)
    pattern = r'^[\w.-]+@[\w.-]+\.\w+$'
    return bool(re.match(pattern, email))

# Test the function
emails = [
    "user@example.com",        # Valid
    "john.doe@company.co",     # Valid
    "invalid@email",           # Invalid - missing top-level domain
    "@missing.com",            # Invalid - missing username
    "spaces not@allowed.com"   # Invalid - contains space
]

for email in emails:
    if is_valid_email(email):
        print(f"{email} is a valid email address")
    else:
        print(f"{email} is NOT valid")

Phone Number Extraction

Python
import re

text = """Contact info:
Alice: (123) 456-7890
Bob: 555-123-4567
Charlie: 987.654.3210
"""

# Pattern for different phone formats:
# optional parentheses around the area code, then -, ., or whitespace as separators
pattern = r'[(]?\d{3}[)]?[-.\s]?\d{3}[-.\s]?\d{4}'

# Find all phone numbers
phone_numbers = re.findall(pattern, text)
print("Phone numbers found:")
for number in phone_numbers:
    print(number)

# Output:
# (123) 456-7890
# 555-123-4567
# 987.654.3210

Groups and Capturing

You can use parentheses () to create capture groups in your patterns, which allow you to extract specific parts of the matched text:

Python
import re

# Extracting information from a structured string
log_entry = "2023-05-15 14:32:15 - ERROR - File not found: data.csv"

# Four groups: date, time, log level, message
pattern = r"(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2}) - (\w+) - (.+)"
match = re.search(pattern, log_entry)

if match:
    date = match.group(1)
    time = match.group(2)
    level = match.group(3)
    message = match.group(4)

    print(f"Date: {date}")
    print(f"Time: {time}")
    print(f"Log Level: {level}")
    print(f"Message: {message}")

# Output:
# Date: 2023-05-15
# Time: 14:32:15
# Log Level: ERROR
# Message: File not found: data.csv
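
Captured groups can also be reused in a replacement string. The sketch below assumes the same log line as above and uses backreferences (\1, \2, \3) in re.sub() to reorder the date; it also shows match.groups(), which returns all groups at once as a tuple.

```python
import re

log_entry = "2023-05-15 14:32:15 - ERROR - File not found: data.csv"

# Three groups capture year, month, and day; \3/\2/\1 writes them back in reverse order.
reordered = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", log_entry)
print(reordered)  # 15/05/2023 14:32:15 - ERROR - File not found: data.csv

# groups() returns every captured group as a tuple, convenient for unpacking.
match = re.search(r"(\d{2}):(\d{2}):(\d{2})", log_entry)
hours, minutes, seconds = match.groups()
print(hours, minutes, seconds)  # 14 32 15
```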

Named Groups

For more readable code, you can use named groups with the (?P<name>pattern) syntax:

Python
import re

# Parsing a URL using named groups
url = "https://www.example.com:8080/path/to/page.html?query=value#section"

pattern = r"(?P<protocol>https?://)?(?P<host>[\w.-]+)(:(?P<port>\d+))?(?P<path>/[\w/.-]*)?(\?(?P<query>[\w=&]+))?(?P<fragment>#[\w-]+)?"
match = re.search(pattern, url)

if match:
    # Access groups by name; an optional group that did not match returns None
    protocol = match.group("protocol") or ""
    host = match.group("host") or ""
    port = match.group("port") or "default"
    path = match.group("path") or "/"
    query = match.group("query") or "none"
    fragment = match.group("fragment") or "none"

    print(f"Protocol: {protocol}")
    print(f"Host: {host}")
    print(f"Port: {port}")
    print(f"Path: {path}")
    print(f"Query: {query}")
    print(f"Fragment: {fragment}")
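
When a pattern has many named groups, match.groupdict() collects them into a dictionary in one call. The following is a small sketch with a simplified host/port pattern and a made-up address; the default argument fills in groups that did not take part in the match.

```python
import re

# Simplified pattern for illustration: a host with an optional :port suffix.
pattern = re.compile(r"(?P<host>[\w.-]+)(?::(?P<port>\d+))?")

match = pattern.search("example.com")
parts = match.groupdict()
print(parts)  # {'host': 'example.com', 'port': None}

# Supply a default for groups that did not match.
parts_with_default = match.groupdict(default="default")
print(parts_with_default)  # {'host': 'example.com', 'port': 'default'}
```

This avoids repeating `match.group(...) or fallback` for each field, at the cost of using the same fallback for every unmatched group.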

Flags and Options

Python's re module provides several flags to modify the behavior of regular expressions:

Python
import re

text = """
Python is a programming language.
PYTHON is very popular.
python is easy to learn.
"""

# Case-insensitive matching with re.IGNORECASE
matches = re.findall(r"python", text, re.IGNORECASE)
print(f"Found {len(matches)} occurrences of 'python'")  # Found 3 occurrences of 'python'

# Multi-line mode with re.MULTILINE
# ^ and $ match the start/end of each line
matches = re.findall(r"^python", text, re.MULTILINE | re.IGNORECASE)
print(f"Found {len(matches)} lines starting with 'python'")  # Found 3 lines starting with 'python'

# Dot matches any character including newline with re.DOTALL
pattern_with_dotall = re.compile(r"programming.*popular", re.DOTALL)
match1 = pattern_with_dotall.search(text)
print("With DOTALL:", "Match found" if match1 else "No match")  # With DOTALL: Match found

pattern_without_dotall = re.compile(r"programming.*popular")
match2 = pattern_without_dotall.search(text)
print("Without DOTALL:", "Match found" if match2 else "No match")  # Without DOTALL: No match

Flag                     Description
re.IGNORECASE or re.I    Perform case-insensitive matching
re.MULTILINE or re.M     Make ^ and $ match the beginning/end of each line
re.DOTALL or re.S        Make . match any character including newline
re.VERBOSE or re.X       Allow pattern to contain comments and whitespace

Using Verbose Mode for Complex Patterns

For complex patterns, you can use the re.VERBOSE flag to make your regex more readable:

Python
import re

# Complex pattern for validating a password
# Rules:
# - At least 8 characters
# - Contains at least one uppercase letter
# - Contains at least one lowercase letter
# - Contains at least one digit
# - Contains at least one special character

password_pattern = re.compile(r"""
    ^                     # Start of string
    (?=.*[A-Z])           # At least one uppercase letter
    (?=.*[a-z])           # At least one lowercase letter
    (?=.*\d)              # At least one digit
    (?=.*[!@#$%^&*()])    # At least one special character
    .{8,}                 # At least 8 characters long
    $                     # End of string
""", re.VERBOSE)

def is_valid_password(password):
    return bool(password_pattern.match(password))

# Test the function
passwords = [
    "Abc123!",        # Too short
    "password123",    # No uppercase or special char
    "PASSWORD123!",   # No lowercase
    "Password!",      # No digit
    "P@ssw0rd",       # Valid
    "Str0ng!Pass"     # Valid
]

for password in passwords:
    if is_valid_password(password):
        print(f"'{password}' is a valid password")
    else:
        print(f"'{password}' is NOT valid")

Practical Example: Log Parser

Let's build a simple log parser that extracts information from log entries:

Python
import re
from datetime import datetime

log_data = """
2023-01-15 08:22:03 INFO User login successful: alice@example.com
2023-01-15 08:23:15 WARNING Failed login attempt: bob@example.com (wrong password)
2023-01-15 08:25:42 ERROR Database connection failed: timeout after 30s
2023-01-15 09:05:22 INFO User logout: alice@example.com
2023-01-15 09:10:54 ERROR File not found: /data/reports/january.csv
"""

# Define the pattern with four capture groups: date, time, level, message
log_pattern = re.compile(r"""
    (\d{4}-\d{2}-\d{2})\s+   # Date (YYYY-MM-DD)
    (\d{2}:\d{2}:\d{2})\s+   # Time (HH:MM:SS)
    (\w+)\s+                 # Log level (INFO, WARNING, ERROR)
    (.+)                     # Message
""", re.VERBOSE)

# Parse the log entries
entries = []
for match in log_pattern.finditer(log_data):
    date_str, time_str, level, message = match.groups()

    # Convert to datetime object
    timestamp = datetime.strptime(f"{date_str} {time_str}", "%Y-%m-%d %H:%M:%S")

    entries.append({
        'timestamp': timestamp,
        'level': level,
        'message': message
    })

# Filter for error entries
error_entries = [entry for entry in entries if entry['level'] == 'ERROR']

# Print the results
print(f"Total log entries: {len(entries)}")
print(f"Error entries: {len(error_entries)}")
print("\nError details:")
for entry in error_entries:
    print(f"{entry['timestamp']}: {entry['message']}")

🎯 Try it yourself!

Create a function that uses regular expressions to extract all URLs from a text document. The function should handle URLs starting with http://, https://, or www.

Python
def extract_urls(text):
    # Your code here
    pass

sample_text = """
Check out these websites:
https://www.python.org
http://example.com/page
Visit www.github.com for code repositories
Email me at user@example.com for more info.
"""

urls = extract_urls(sample_text)
print(urls)  # Should print: ['https://www.python.org', 'http://example.com/page', 'www.github.com']

Best Practices for Regular Expressions

  1. Keep it simple - Use the simplest pattern that does the job
  2. Test thoroughly - Test your regex with various inputs, including edge cases
  3. Use raw strings - Always use raw strings (r"pattern") for regex patterns
  4. Use named groups - For complex patterns, named groups make code more readable
  5. Compile patterns - For patterns used multiple times, compile them first
  6. Use verbose mode - For complex patterns, use re.VERBOSE and comments
  7. Be careful with greedy matching - Use non-greedy quantifiers (*?, +?) when appropriate
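
Two of these practices, compiling patterns and watching for greedy matching, can be illustrated briefly. The sketch below uses a made-up HTML snippet; it is meant only to show how the quantifiers behave, not as a way to parse HTML in general.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: .* grabs as much as possible, so one match runs from the first "<" to the last ">".
greedy = re.findall(r"<.*>", html)
print(greedy)  # ['<b>bold</b> and <i>italic</i>']

# Non-greedy: .*? stops at the earliest ">", so each tag is matched separately.
lazy = re.findall(r"<.*?>", html)
print(lazy)  # ['<b>', '</b>', '<i>', '</i>']

# Compile once, then reuse the pattern object as often as needed.
tag = re.compile(r"<.*?>")
stripped = tag.sub("", html)
print(stripped)  # bold and italic
```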

Summary

In this tutorial, you've learned:

  • How to use Python's re module for regular expressions
  • Basic pattern matching and common regex metacharacters
  • How to use capture groups and named groups
  • Working with regex flags like re.IGNORECASE and re.VERBOSE
  • Practical examples like email validation, URL parsing, and log analysis

Regular expressions are incredibly powerful for text manipulation, but they can also be complex. Start with simple patterns and gradually build your understanding. With practice, you'll be able to craft efficient patterns for any text processing need.
