Python Regular Expressions
Regular expressions (regex or regexp) are sequences of characters that define search patterns. They are useful for finding, matching, and manipulating text. In this tutorial, you'll learn how to use Python's built-in `re` module to perform pattern matching operations.
💡 Real-World Applications
Regular expressions are used in many real-world scenarios:
- Form validation (email, phone numbers, passwords)
- Data extraction and web scraping
- Search and replace operations
- Text parsing and processing
- Log file analysis
Introduction to Regular Expressions
To use regular expressions in Python, we need to import the `re` module:
```python
import re

# Basic example: finding a pattern in a string
text = "Contact us at info@example.com for more information."
pattern = r"info@example\.com"  # \. matches a literal dot

if re.search(pattern, text):
    print("Email address found!")
else:
    print("Email address not found.")
```
⚠️ Note on Raw Strings
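Notice the `r` prefix in `r"info@example\.com"`. This creates a "raw string" that treats backslashes literally, so Python passes them to the regex engine unchanged instead of interpreting them as string escapes. This is important in regex patterns, where backslashes have special meaning.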
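As a minimal sketch of why the prefix matters, the snippet below compares the same pattern written with and without `r`. The sample text and variable names are illustrative assumptions, not part of the tutorial's own examples.

```python
import re

# Without the r prefix, Python turns "\b" into a backspace character
# before the regex engine ever sees the pattern.
plain = "\bcat\b"
raw = r"\bcat\b"

print(len(plain))  # 5: backspace, c, a, t, backspace
print(len(raw))    # 7: backslash, b, c, a, t, backslash, b

text = "the cat sat"
print(re.search(plain, text))        # None (looks for literal backspaces)
print(re.search(raw, text).group())  # cat (\b is a word boundary)
```

The raw version reaches the regex engine as `\b`, which it reads as a word boundary.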
Basic Pattern Matching
The `re` module provides several functions for working with regular expressions:
```python
import re

text = "Python programming is fun and powerful!"

# 1. search() - Find the first match
result = re.search(r"fun", text)
if result:
    print(f"Match found at position: {result.start()}")  # Output: Match found at position: 22

# 2. findall() - Find all matches
matches = re.findall(r"p[a-z]*", text, re.IGNORECASE)
print(matches)  # Output: ['Python', 'programming', 'powerful']

# 3. match() - Check if string starts with the pattern
if re.match(r"Python", text):
    print("Text starts with 'Python'")  # This will print

# 4. split() - Split string by pattern (\s+ is one or more whitespace characters)
words = re.split(r"\s+", text)
print(words)  # Output: ['Python', 'programming', 'is', 'fun', 'and', 'powerful!']

# 5. sub() - Replace pattern with another string
new_text = re.sub(r"fun", "enjoyable", text)
print(new_text)  # Output: Python programming is enjoyable and powerful!
```
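The difference between `match()` and `search()` is easy to miss, so here is a short sketch that contrasts them on the same string. It also uses `re.fullmatch()`, which the tutorial does not cover but which is part of the standard `re` module; it succeeds only when the entire string matches.

```python
import re

text = "Python programming is fun and powerful!"

# match() only looks at the start of the string
print(re.match(r"fun", text))          # None

# search() scans the whole string for the first match
print(re.search(r"fun", text).span())  # (22, 25)

# fullmatch() requires the entire string to match the pattern
print(re.fullmatch(r"Python", text))          # None
print(bool(re.fullmatch(r"Python.*", text)))  # True
```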
Regular Expression Patterns
Regular expression patterns use special characters to match different types of text:
Character | Description | Example |
---|---|---|
. | Matches any character except newline | a.c matches "abc", "axc", etc. |
^ | Matches start of string | ^hello matches strings starting with "hello" |
$ | Matches end of string | world$ matches strings ending with "world" |
* | Matches 0 or more repetitions | ab*c matches "ac", "abc", "abbc", etc. |
+ | Matches 1 or more repetitions | ab+c matches "abc", "abbc", but not "ac" |
? | Matches 0 or 1 repetition | ab?c matches "ac" or "abc" |
{n} | Matches exactly n repetitions | a{3} matches "aaa" |
{n,} | Matches n or more repetitions | a{2,} matches "aa", "aaa", etc. |
{n,m} | Matches between n and m repetitions | a{1,3} matches "a", "aa", or "aaa" |
[] | Character set - matches any character in the brackets | [abc] matches "a", "b", or "c" |
[^] | Negated character set - matches any character not in the brackets | [^abc] matches any character except "a", "b", or "c" |
\d | Matches any digit (0-9) | \d{3} matches three digits like "123" |
\w | Matches any alphanumeric character (a-z, A-Z, 0-9, _) | \w+ matches words like "Python3" |
\s | Matches any whitespace character | hello\sworld matches "hello world" |
\| | Alternation (OR); the metacharacter is a single vertical bar | cat\|dog matches either "cat" or "dog" |
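Several rows of the table can be checked directly. The sketch below uses `re.fullmatch()` so that a pattern counts as matching only if it covers the whole test string; the list of test cases is an illustrative selection based on the table's examples.

```python
import re

# (pattern, string, expected result) triples based on the table above
checks = [
    (r"a.c", "abc", True),
    (r"ab*c", "ac", True),
    (r"ab+c", "ac", False),    # + needs at least one "b"
    (r"ab?c", "abbc", False),  # ? allows at most one "b"
    (r"a{3}", "aa", False),
    (r"\d{3}", "123", True),
    (r"hello\sworld", "hello world", True),
    (r"cat|dog", "dog", True),
    (r"[^abc]", "b", False),   # negated set excludes "b"
]

results = [re.fullmatch(p, s) is not None for p, s, _ in checks]

for (pattern, string, expected), matched in zip(checks, results):
    print(f"{pattern!r:16} {string!r:14} -> {matched}")
    assert matched == expected
```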
Common Regex Examples
Email Validation
```python
import re

def is_valid_email(email):
    # Simple email pattern: name@domain.tld
    pattern = r'^[\w.-]+@[\w.-]+\.\w+$'
    return bool(re.match(pattern, email))

# Test the function
emails = [
    "user@example.com",        # Valid
    "john.doe@company.co",     # Valid
    "invalid@email",           # Invalid - missing top-level domain
    "@missing.com",            # Invalid - missing username
    "spaces not@allowed.com"   # Invalid - contains space
]

for email in emails:
    if is_valid_email(email):
        print(f"{email} is a valid email address")
    else:
        print(f"{email} is NOT valid")
```
Phone Number Extraction
```python
import re

text = """Contact info:
Alice: (123) 456-7890
Bob: 555-123-4567
Charlie: 987.654.3210
"""

# Pattern for different phone formats
pattern = r'[(]?\d{3}[)]?[-.\s]?\d{3}[-.\s]?\d{4}'

# Find all phone numbers
phone_numbers = re.findall(pattern, text)

print("Phone numbers found:")
for number in phone_numbers:
    print(number)

# Output:
# (123) 456-7890
# 555-123-4567
# 987.654.3210
```
Groups and Capturing
You can use parentheses `()` to create capture groups in your patterns, which allow you to extract specific parts of the matched text:
```python
import re

# Extracting information from a structured string
log_entry = "2023-05-15 14:32:15 - ERROR - File not found: data.csv"

pattern = r"(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2}) - (\w+) - (.+)"
match = re.search(pattern, log_entry)

if match:
    date = match.group(1)
    time = match.group(2)
    level = match.group(3)
    message = match.group(4)

    print(f"Date: {date}")
    print(f"Time: {time}")
    print(f"Log Level: {level}")
    print(f"Message: {message}")

# Output:
# Date: 2023-05-15
# Time: 14:32:15
# Log Level: ERROR
# Message: File not found: data.csv
```
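Capture groups also change what `findall()` returns, which can be surprising. The following sketch shows that behavior alongside `groups()` and `group(0)`. The second log line is made-up sample data added for illustration.

```python
import re

# Two sample lines; the second one is hypothetical
log_lines = """2023-05-15 14:32:15 - ERROR - File not found: data.csv
2023-05-15 14:33:02 - INFO - Retrying with default file"""

pattern = r"(\d{2}:\d{2}:\d{2}) - (\w+) - "

# With more than one group, findall() returns one tuple per match
pairs = re.findall(pattern, log_lines)
print(pairs)  # [('14:32:15', 'ERROR'), ('14:33:02', 'INFO')]

# groups() returns all captured groups of a single match as a tuple
first = re.search(pattern, log_lines)
print(first.groups())  # ('14:32:15', 'ERROR')

# group(0) is the entire matched text, not just the captured parts
whole = first.group(0)
print(repr(whole))  # '14:32:15 - ERROR - '
```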
Named Groups
For more readable code, you can use named groups with the `(?P<name>pattern)` syntax:
```python
import re

# Parsing a URL using named groups
url = "https://www.example.com:8080/path/to/page.html?query=value#section"

pattern = r"(?P<protocol>https?://)?(?P<host>[\w.-]+)(:(?P<port>\d+))?(?P<path>/[\w/.-]*)?(\?(?P<query>[\w=&]+))?(?P<fragment>#[\w-]+)?"
match = re.search(pattern, url)

if match:
    # Access groups by name; optional groups that did not match return None
    protocol = match.group("protocol") or ""
    host = match.group("host") or ""
    port = match.group("port") or "default"
    path = match.group("path") or "/"
    query = match.group("query") or "none"
    fragment = match.group("fragment") or "none"

    print(f"Protocol: {protocol}")
    print(f"Host: {host}")
    print(f"Port: {port}")
    print(f"Path: {path}")
    print(f"Query: {query}")
    print(f"Fragment: {fragment}")
```
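Named groups offer a few more conveniences beyond `group("name")`. The sketch below uses a simple date pattern (an illustrative assumption, not taken from the URL example) to show `groupdict()`, index-style access, and `\g<name>` references in a replacement string.

```python
import re

pattern = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")
sentence = "Released on 2023-05-15."

match = pattern.search(sentence)
if match:
    # groupdict() returns all named groups as a dictionary
    print(match.groupdict())  # {'year': '2023', 'month': '05', 'day': '15'}
    # Match objects also support index-style access (Python 3.6+)
    print(match["year"])      # 2023

# Named groups can be referenced in a replacement with \g<name>
reordered = pattern.sub(r"\g<day>/\g<month>/\g<year>", sentence)
print(reordered)  # Released on 15/05/2023.
```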
Flags and Options
Python's `re` module provides several flags to modify the behavior of regular expressions:
```python
import re

text = """
Python is a programming language.
PYTHON is very popular.
python is easy to learn.
"""

# Case-insensitive matching with re.IGNORECASE
matches = re.findall(r"python", text, re.IGNORECASE)
print(f"Found {len(matches)} occurrences of 'python'")  # Found 3 occurrences of 'python'

# Multi-line mode with re.MULTILINE
# ^ and $ match the start/end of each line
matches = re.findall(r"^python", text, re.MULTILINE | re.IGNORECASE)
print(f"Found {len(matches)} lines starting with 'python'")  # Found 3 lines starting with 'python'

# Dot matches any character including newline with re.DOTALL
pattern_with_dotall = re.compile(r"programming.*popular", re.DOTALL)
match1 = pattern_with_dotall.search(text)
print("With DOTALL:", "Match found" if match1 else "No match")  # With DOTALL: Match found

pattern_without_dotall = re.compile(r"programming.*popular")
match2 = pattern_without_dotall.search(text)
print("Without DOTALL:", "Match found" if match2 else "No match")  # Without DOTALL: No match
```
Flag | Description |
---|---|
re.IGNORECASE or re.I | Perform case-insensitive matching |
re.MULTILINE or re.M | Make ^ and $ match the beginning/end of each line |
re.DOTALL or re.S | Make . match any character including newline |
re.VERBOSE or re.X | Allow pattern to contain comments and whitespace |
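To make the table concrete, here is a brief sketch of `re.MULTILINE` with and without the flag, the short aliases combined with `|`, and the inline form `(?i)`, which sets the flag inside the pattern itself. The sample text is an assumption chosen for illustration.

```python
import re

text = "first line\nSecond line\nTHIRD line"

# With MULTILINE, ^ matches at the start of every line
starts = re.findall(r"^\w+", text, re.MULTILINE)
print(starts)  # ['first', 'Second', 'THIRD']

# Without it, ^ matches only at the very start of the string
only_first = re.findall(r"^\w+", text)
print(only_first)  # ['first']

# Short aliases can be combined with | just like the long names
ends = re.findall(r"LINE$", text, re.M | re.I)
print(len(ends))  # 3

# An inline flag at the start of the pattern has the same effect as re.I
inline = re.findall(r"(?i)third", text)
print(inline)  # ['THIRD']
```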
Using Verbose Mode for Complex Patterns
For complex patterns, you can use the `re.VERBOSE` flag to make your regex more readable:
```python
import re

# Complex pattern for validating a password
# Rules:
# - At least 8 characters
# - Contains at least one uppercase letter
# - Contains at least one lowercase letter
# - Contains at least one digit
# - Contains at least one special character

password_pattern = re.compile(r"""
    ^                     # Start of string
    (?=.*[A-Z])           # At least one uppercase letter
    (?=.*[a-z])           # At least one lowercase letter
    (?=.*\d)              # At least one digit
    (?=.*[!@#$%^&*()])    # At least one special character
    .{8,}                 # At least 8 characters long
    $                     # End of string
""", re.VERBOSE)

def is_valid_password(password):
    return bool(password_pattern.match(password))

# Test the function
passwords = [
    "Abc123!",        # Too short
    "password123",    # No uppercase or special char
    "PASSWORD123!",   # No lowercase
    "Password!",      # No digit
    "P@ssw0rd",       # Valid
    "Str0ng!Pass"     # Valid
]

for password in passwords:
    if is_valid_password(password):
        print(f"'{password}' is a valid password")
    else:
        print(f"'{password}' is NOT valid")
```
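The password pattern relies on lookahead assertions, written `(?=...)`, which check that something follows the current position without consuming any characters. The sketch below isolates that idea with two small, illustrative patterns; the sample strings are assumptions.

```python
import re

# A lookahead checks what follows without including it in the match
prices = re.findall(r"\d+(?= USD)", "10 USD, 20 EUR, 30 USD")
print(prices)  # ['10', '30']

# Because lookaheads consume nothing, several can be stacked at the
# same position; each one scans the string independently.
has_upper_and_digit = re.compile(r"^(?=.*[A-Z])(?=.*\d).+$")
print(bool(has_upper_and_digit.match("abcD1")))  # True
print(bool(has_upper_and_digit.match("abcd1")))  # False (no uppercase letter)
```

This is why the order of the rules in the password pattern does not matter: every lookahead starts from the beginning of the string.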
Practical Example: Log Parser
Let's build a simple log parser that extracts information from log entries:
```python
import re
from datetime import datetime

log_data = """
2023-01-15 08:22:03 INFO User login successful: alice@example.com
2023-01-15 08:23:15 WARNING Failed login attempt: bob@example.com (wrong password)
2023-01-15 08:25:42 ERROR Database connection failed: timeout after 30s
2023-01-15 09:05:22 INFO User logout: alice@example.com
2023-01-15 09:10:54 ERROR File not found: /data/reports/january.csv
"""

# Define the pattern with capture groups (verbose mode allows inline comments)
log_pattern = re.compile(r"""
    (\d{4}-\d{2}-\d{2})\s+    # Date (YYYY-MM-DD)
    (\d{2}:\d{2}:\d{2})\s+    # Time (HH:MM:SS)
    (\w+)\s+                  # Log level (INFO, WARNING, ERROR)
    (.+)                      # Message
""", re.VERBOSE)

# Parse the log entries
entries = []
for match in log_pattern.finditer(log_data):
    date_str, time_str, level, message = match.groups()

    # Convert to datetime object
    timestamp = datetime.strptime(f"{date_str} {time_str}", "%Y-%m-%d %H:%M:%S")

    entries.append({
        'timestamp': timestamp,
        'level': level,
        'message': message
    })

# Filter for error entries
error_entries = [entry for entry in entries if entry['level'] == 'ERROR']

# Print the results
print(f"Total log entries: {len(entries)}")
print(f"Error entries: {len(error_entries)}")
print("\nError details:")
for entry in error_entries:
    print(f"{entry['timestamp']}: {entry['message']}")
```
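One possible variation of the parser uses named groups, so each entry comes back as a dictionary via `groupdict()`. This is only a sketch: the group names, the added `^`/`$` anchors with `re.MULTILINE`, and the use of `collections.Counter` to tally levels are choices made here for illustration.

```python
import re
from collections import Counter

log_data = """
2023-01-15 08:22:03 INFO User login successful: alice@example.com
2023-01-15 08:23:15 WARNING Failed login attempt: bob@example.com (wrong password)
2023-01-15 08:25:42 ERROR Database connection failed: timeout after 30s
2023-01-15 09:05:22 INFO User logout: alice@example.com
2023-01-15 09:10:54 ERROR File not found: /data/reports/january.csv
"""

# Same structure as the parser above, with a name for each group
named_pattern = re.compile(r"""
    ^(?P<date>\d{4}-\d{2}-\d{2})\s+    # Date (YYYY-MM-DD)
    (?P<time>\d{2}:\d{2}:\d{2})\s+     # Time (HH:MM:SS)
    (?P<level>\w+)\s+                  # Log level
    (?P<message>.+)$                   # Message (rest of the line)
""", re.VERBOSE | re.MULTILINE)

# Each match becomes a dict keyed by group name
records = [m.groupdict() for m in named_pattern.finditer(log_data)]

# Count how many entries there are per log level
level_counts = Counter(record["level"] for record in records)
print(level_counts)  # Counter({'INFO': 2, 'ERROR': 2, 'WARNING': 1})
```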
🎯 Try it yourself!
Create a function that uses regular expressions to extract all URLs from a text document. The function should handle URLs starting with http://, https://, or www.
```python
def extract_urls(text):
    # Your code here
    pass

sample_text = """
Check out these websites:
https://www.python.org
http://example.com/page
Visit www.github.com for code repositories
Email me at user@example.com for more info.
"""

urls = extract_urls(sample_text)
print(urls)  # Should print: ['https://www.python.org', 'http://example.com/page', 'www.github.com']
```
Best Practices for Regular Expressions
- Keep it simple - Use the simplest pattern that does the job
- Test thoroughly - Test your regex with various inputs, including edge cases
- Use raw strings - Always use raw strings (`r"pattern"`) for regex patterns
- Use named groups - For complex patterns, named groups make code more readable
- Compile patterns - For patterns used multiple times, compile them first
- Use verbose mode - For complex patterns, use `re.VERBOSE` and comments
- Be careful with greedy matching - Use non-greedy quantifiers (`*?`, `+?`) when appropriate
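Two of these practices, non-greedy quantifiers and compiled patterns, are illustrated in the short sketch below. The HTML-like sample string is an assumption chosen only to make the greedy behavior visible; regular expressions are generally not a robust way to parse real HTML.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: .* grabs as much as possible, up to the last ">"
greedy = re.findall(r"<.*>", html)
print(greedy)  # ['<b>bold</b> and <i>italic</i>']

# Non-greedy: .*? stops at the first ">" it can
lazy = re.findall(r"<.*?>", html)
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']

# Compile once, then reuse the pattern object's methods
opening_tag = re.compile(r"<(\w+)>")
print(opening_tag.findall(html))           # ['b', 'i']
print(opening_tag.search(html).group(1))   # b
```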
Summary
In this tutorial, you've learned:
- How to use Python's `re` module for regular expressions
- Basic pattern matching and common regex metacharacters
- How to use capture groups and named groups
- Working with regex flags like `re.IGNORECASE` and `re.VERBOSE`
- Practical examples like email validation, URL parsing, and log analysis
Regular expressions are incredibly powerful for text manipulation, but they can also be complex. Start with simple patterns and gradually build your understanding. With practice, you'll be able to craft efficient patterns for any text processing need.