In Python, a regular expression search is typically written with the following syntax:
match = re.search(pattern, string)
The re.search() method takes a regular expression pattern and a string and searches for that pattern within the string. If the search is successful, search() returns a match object; otherwise it returns None. Therefore, the search is usually followed immediately by an if-statement that tests whether it succeeded, as shown in the following example:
>>> import re
>>> text = 'an example word:ftl!!'
>>> match = re.search(r'word:\w\w\w', text)
>>> # If-statement after search() tests if it succeeded
>>> if match:
...     print('found', match.group())  # 'found word:ftl'
... else:
...     print('did not find')
...
found word:ftl
>>>
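The search-then-test idiom above can also be wrapped in a small helper. This is only a minimal sketch; the function name `find_word_tag` and the sample strings are hypothetical and not part of the re module.

```python
import re

def find_word_tag(text):
    """Return the matched 'word:xxx' substring, or None if nothing matches."""
    match = re.search(r'word:\w\w\w', text)
    # search() returns None on failure, so test before calling group()
    if match:
        return match.group()
    return None

print(find_word_tag('an example word:ftl!!'))
print(find_word_tag('no tag here'))
```

Testing the match object before calling group() avoids an AttributeError when the pattern is not found.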
The compilation flags let you modify some aspects of how regular expressions work. Flags are available in the re module under two names, a long name such as IGNORECASE and a short, one-letter form such as I. (If you’re familiar with Perl’s pattern modifiers, the one-letter forms use the same letters; the short form of re.VERBOSE is re.X, for example.) Multiple flags can be specified by bitwise OR-ing them; re.I | re.M sets both the I and M flags, for example.
Flag | Meaning |
---|---|
DOTALL, S | Make . match any character, including newlines |
IGNORECASE, I | Do case-insensitive matches |
LOCALE, L | Do a locale-aware match |
MULTILINE, M | Multi-line matching, affecting ^ and $ |
VERBOSE, X | Enable verbose REs, which can be organized more cleanly and understandably. |
UNICODE, U | Makes several escapes like \w, \b, \s and \d dependent on the Unicode character database. |
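As a rough illustration of the flags in the table, the sketch below combines two flags with bitwise OR and uses DOTALL and VERBOSE on their own. The sample text and variable names are made up for the example.

```python
import re

text = "first line\nSecond LINE"

# re.I | re.M: ignore case, and let ^ match at the start of every line
line_starts = re.findall(r'^\w+', text, re.I | re.M)
second = re.findall(r'^second', text, re.IGNORECASE | re.MULTILINE)

# re.S (DOTALL): '.' also matches the newline between the two lines
spans_lines = re.search(r'line.Second', text, re.S) is not None

# re.X (VERBOSE): whitespace and comments inside the pattern are ignored
number = re.compile(r"""
    \d+      # integer part
    \.       # decimal point
    \d*      # optional fractional digits
""", re.X)

print(line_starts, second, spans_lines)
print(number.search('pi is 3.14').group())
```

The long and short flag names are interchangeable, so re.I | re.M and re.IGNORECASE | re.MULTILINE behave the same way.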
Except for the metacharacters (+ ? . * ^ $ ( ) [ ] { } | \), all characters match themselves. You can escape a metacharacter by preceding it with a backslash.
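A short sketch of escaping, assuming a made-up price string: metacharacters can be escaped by hand with a backslash, or a whole literal string can be escaped with re.escape().

```python
import re

price = "cost: $5.00 (approx.)"

# Escaped by hand: '\$' and '\.' match a literal dollar sign and dot
manual = re.search(r'\$5\.00', price)

# re.escape() backslash-escapes every metacharacter in a literal string
literal = "(approx.)"
auto = re.search(re.escape(literal), price)

# Unescaped, '.' matches any character, so this also matches '5x00'
loose = re.search(r'5.00', '5x00')

print(manual.group(), auto.group(), loose.group())
```

re.escape() is handy when the text to match comes from user input or another variable rather than being written into the pattern directly.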
The following table lists the regular expression syntax that is available in Python:
Pattern | Description |
---|---|
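^ | Matches beginning of string. With the m (MULTILINE) option, also matches at the beginning of each line. |
$ | Matches end of string or just before a trailing newline. With the m (MULTILINE) option, also matches at the end of each line. |
. | Matches any single character except newline. Using the s (DOTALL) option allows it to match newline as well. |
[...] | Matches any single character in brackets. |
[^...] | Matches any single character not in brackets. |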
re* | Matches 0 or more occurrences of preceding expression. |
re+ | Matches 1 or more occurrences of preceding expression. |
re? | Matches 0 or 1 occurrence of preceding expression. |
re{n} | Matches exactly n occurrences of preceding expression. |
re{n,} | Matches n or more occurrences of preceding expression. |
re{n,m} | Matches at least n and at most m occurrences of preceding expression. |
a|b | Matches either a or b. |
(re) | Groups regular expressions and remembers matched text. |
(?imx) | Turns on the i, m, or x options for the entire regular expression. It should appear at the start of the pattern (required since Python 3.11). |
(?-imx) | Turns off the i, m, or x options. Python's re module does not accept this standalone form; use the scoped form (?-imx:re) instead. |
(?:re) | Groups regular expressions without remembering matched text. |
(?imx:re) | Turns on the i, m, or x options only within the parentheses. |
(?-imx:re) | Turns off the i, m, or x options only within the parentheses. |
(?#...) | Comment. |
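(?=re) | Positive lookahead: succeeds if re matches at the current position, without consuming any characters. |
(?!re) | Negative lookahead: succeeds if re does not match at the current position, without consuming any characters. |
(?>re) | Matches independent pattern without backtracking (atomic group; Python 3.11 and later). |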
\w | Matches word characters. |
\W | Matches nonword characters. |
\s | Matches whitespace. Equivalent to [ \t\n\r\f\v] for ASCII-only matching. |
\S | Matches nonwhitespace. |
\d | Matches digits. Equivalent to [0-9] for ASCII-only matching; for Unicode string patterns it matches any Unicode decimal digit. |
\D | Matches nondigits. |
\A | Matches beginning of string. |
\Z | Matches only at the end of the string. |
\z | Matches end of string. Python's re module accepts it only from version 3.14 on, as a synonym for \Z. |
\G | Matches point where last match finished. Not supported by Python's re module. |
\b | Matches word boundaries when outside brackets. Matches backspace (0x08) when inside brackets. |
\B | Matches nonword boundaries. |
\n, \t, etc. | Matches newlines, carriage returns, tabs, etc. |
\1...\9 | Matches nth grouped subexpression. |
\10 | Matches the tenth grouped subexpression. In Python, a numeric escape is read as an octal character code only if it starts with 0 or is three octal digits long. |
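A few rows of the table can be tried out as follows. This is an illustrative sketch only; the sample strings and variable names are invented for the example.

```python
import re

# Character class with counted repetition: exactly three lowercase letters
three = re.search(r'[a-z]{3}', 'AB cde').group()

# Alternation inside a non-capturing group: (?:...) does not get a number,
# so the word after it is group 1
m = re.search(r'(?:cat|dog)s (\w+)', 'two dogs bark')
first_group = m.group(1)

# Positive lookahead: match 'foo' only when it is followed by 'bar'
lookahead = re.findall(r'foo(?=bar)', 'foobar foobaz')

# Negative lookahead: match 'foo' only when it is NOT followed by 'bar'
neg_start = re.search(r'foo(?!bar)', 'foobar foobaz').start()

# Backreference: \1 matches the same text that group 1 matched
doubled = re.search(r'\b(\w+) \1\b', 'it is is fine').group()

print(three, first_group, lookahead, neg_start, doubled)
```

Lookaheads only test what follows; the text they inspect is not part of the match.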
You can use + and * to specify repetition in the pattern.
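The following sketch shows how + and * behave in practice; the input strings are arbitrary examples. The search finds the leftmost match first, and the repetition then consumes as much as it can.

```python
import re

# + means one or more: takes as many i's as possible
m1 = re.search(r'pi+', 'piiig').group()

# The leftmost match wins, even though a longer run of i's comes later
m2 = re.search(r'i+', 'piigiiii').group()

# * means zero or more: \s* allows any amount of whitespace between digits
m3 = re.search(r'\d\s*\d\s*\d', 'xx1 2   3xx').group()
m4 = re.search(r'\d\s*\d\s*\d', 'xx123xx').group()

print(m1, m2, repr(m3), m4)
```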
Groups are marked by the '(', ')' metacharacters. '(' and ')' have much the same meaning as they do in mathematical expressions; they group together the expressions contained inside them, and you can repeat the contents of a group with a repeating qualifier, such as *, +, ?, or {m,n}. For example, (ab)* will match zero or more repetitions of ab.
>>> p = re.compile('(ab)*')
>>> print (p.match('ababababab').span())
(0, 10)
>>>
Subgroups are numbered from left to right, from 1 upward. Groups can be nested; to determine the number, just count the opening parenthesis characters, going from left to right.
>>> p = re.compile('(a(b)c)d')
>>> m = p.match('abcd')
>>> m.group(0)
'abcd'
>>> m.group(1)
'abc'
>>> m.group(2)
'b'
>>>
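As a small complement to the session above, the match object's groups() method returns all subgroups at once, and group() accepts several group numbers. The sketch reuses the same pattern and input.

```python
import re

p = re.compile('(a(b)c)d')
m = p.match('abcd')

# groups() returns every subgroup, from group 1 upward, as a tuple
all_groups = m.groups()

# group() can take several numbers and returns a tuple in that order
picked = m.group(2, 1)

# span() also accepts a group number
inner_span = m.span(2)

print(all_groups, picked, inner_span)
```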
Up to this point, we’ve simply performed searches against a static string. Regular expressions are also commonly used to modify strings in various ways, using the following pattern methods :
Method/Attribute | Purpose |
---|---|
split() | Split the string into a list, splitting it wherever the RE matches |
sub() | Find all substrings where the RE matches, and replace them with a different string |
subn() | Does the same thing as sub(), but returns the new string and the number of replacements |
The split() method of a pattern splits a string apart wherever the RE matches, returning a list of the pieces. It’s similar to the split() method of strings but provides much more generality in the delimiters that you can split by; the string split() only supports splitting by whitespace or by a fixed string. As you’d expect, there’s a module-level re.split() function, too.
.split(string[, maxsplit=0])
Split string by the matches of the regular expression. If capturing parentheses are used in the RE, then their contents will also be returned as part of the resulting list. If maxsplit is nonzero, at most maxsplit splits are performed.
>>> p = re.compile(r'\W+')
>>> p.split('This is a test, short and sweet, of split().')
['This', 'is', 'a', 'test', 'short', 'and', 'sweet', 'of', 'split', '']
>>> p.split('This is a test, short and sweet, of split().', 3)
['This', 'is', 'a', 'test, short and sweet, of split().']
>>>
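The effect of capturing parentheses on split() is easiest to see side by side. In this sketch the variable names and input strings are illustrative; the last call uses the module-level re.split() function.

```python
import re

p = re.compile(r'\W+')       # delimiters are dropped
p2 = re.compile(r'(\W+)')    # capturing group: delimiters are kept

without_delims = p.split('This... is a test.')
with_delims = p2.split('This... is a test.')

# Module-level form, limited to a single split
once = re.split(r'\W+', 'Words, words, words.', maxsplit=1)

print(without_delims)
print(with_delims)
print(once)
```

Keeping the delimiters is useful when the separators themselves carry information that is needed later.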
One of the most important re methods is sub().
.sub(replacement, string[, count=0])
Returns the string obtained by replacing the leftmost non-overlapping occurrences of the RE in string with replacement. If the pattern isn’t found, string is returned unchanged.
The optional argument count is the maximum number of pattern occurrences to be replaced; count must be a non-negative integer. The default value of 0 means to replace all occurrences.
>>> import re
>>> phone = "90100 23210 # This is FTL Mobile Number"
>>> # Delete Python-style comments
>>> num = re.sub(r'#.*$', "", phone)
>>> print("Mobile Number :", num)
Mobile Number : 90100 23210
>>> # Remove anything other than digits
>>> num = re.sub(r'\D', "", phone)
>>> print("Mobile Number :", num)
Mobile Number : 9010023210
>>>
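The count argument and the subn() method can be sketched with a compiled pattern. The pattern and sample sentences are illustrative only.

```python
import re

p = re.compile('(blue|white|red)')

# Default count=0 replaces every occurrence
all_replaced = p.sub('colour', 'blue socks and red shoes')

# count=1 replaces only the leftmost occurrence
first_only = p.sub('colour', 'blue socks and red shoes', count=1)

# subn() returns a (new_string, number_of_replacements) tuple
result = p.subn('colour', 'blue socks and red shoes')
no_match = p.subn('colour', 'no colours at all')

print(all_replaced)
print(first_only)
print(result, no_match)
```

The replacement count from subn() is a simple way to tell whether anything was changed.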