Python Regular Expressions
A regular expression is a special text string that describes a search pattern. It is extremely useful for extracting information from text such as code, files, logs, spreadsheets, or documents. Regular expressions are widely used in the UNIX world.

The module re provides full support for Perl-like regular expressions in Python. The re module raises the exception re.error if an error occurs while compiling or using a regular expression.
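
As a minimal sketch of the error handling described above, the snippet below catches re.error when a pattern fails to compile. The helper name try_compile and the sample patterns are hypothetical, chosen only for illustration.

```python
import re

def try_compile(pattern):
    """Return a compiled pattern, or None if the pattern is invalid."""
    try:
        return re.compile(pattern)
    except re.error as exc:
        # str(exc) describes what went wrong and where in the pattern
        print(f"invalid pattern {pattern!r}: {exc}")
        return None

good = try_compile(r"word:\w+")    # valid pattern
bad = try_compile(r"word:(\w+")    # unbalanced parenthesis, so this returns None
```

Catching the exception at compile time keeps a malformed pattern (for example, one typed by a user) from crashing the rest of the program.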

  • The regular expression (re) module is included with Python and is used primarily for string searching and manipulation
  • It is also frequently used for web page "scraping" (extracting large amounts of data from websites)

In Python, a regular expression search is typically written as follows:

match = re.search(pat, text)

The re.search() method takes a regular expression pattern and a string and searches for that pattern within the string. If the search succeeds, search() returns a match object; otherwise it returns None. Therefore, the search is usually followed immediately by an if-statement that tests whether it succeeded, as shown in the following example:

>>> import re
>>> text = 'an example word:ftl!!'
>>> match = re.search(r'word:\w\w\w', text)
>>> # If-statement after search() tests if it succeeded
>>> if match:
...     print('found', match.group())  # 'found word:ftl'
... else:
...     print('did not find')
...
found word:ftl
>>> 

Regular Expression Compilation Flags

The compilation flags let you modify some aspects of how regular expressions work. Flags are available in the re module under two names, a long name such as IGNORECASE and a short, one-letter form such as I. (If you’re familiar with Perl’s pattern modifiers, the one-letter forms use the same letters; the short form of re.VERBOSE is re.X, for example.) Multiple flags can be specified by bitwise OR-ing them; re.I | re.M sets both the I and M flags, for example.

Flag             Meaning
DOTALL, S        Make . match any character, including newlines
IGNORECASE, I    Do case-insensitive matches
LOCALE, L        Do a locale-aware match
MULTILINE, M     Multi-line matching, affecting ^ and $
VERBOSE, X       Enable verbose REs, which can be organized more cleanly and understandably
UNICODE, U       Make escapes such as \w, \b, \s and \d depend on the Unicode character database (already the default for str patterns in Python 3)
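
The following sketch shows how a few of these flags change matching, including combining two flags with bitwise OR. The sample text and variable names are made up for illustration.

```python
import re

text = "first line\nSecond line\nTHIRD line"

# IGNORECASE: 'second' also matches 'Second'
ci = re.search(r"second", text, re.I)

# MULTILINE: ^ matches at the start of every line, not only at the start of the string
starts = re.findall(r"^\w+", text, re.M)

# Two flags combined with bitwise OR
third = re.findall(r"^third", text, re.I | re.M)

# DOTALL: . also matches the newline, so the match can span lines
without_s = re.search(r"first.*THIRD", text)
with_s = re.search(r"first.*THIRD", text, re.S)

# VERBOSE: whitespace and comments inside the pattern are ignored
number = re.compile(r"""
    \d+      # integer part
    \.       # decimal point
    \d*      # optional fraction digits
""", re.X)

print(ci.group(), starts, third)
print(without_s, with_s is not None, number.match("3.14").group())
```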

Regular Expression Patterns

Except for the metacharacters (+ ? . * ^ $ ( ) [ ] { } | \), all characters match themselves. You can escape a metacharacter by preceding it with a backslash.
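
A short sketch of escaping in practice, assuming a made-up price string. It also uses re.escape(), a standard re function that backslash-escapes an arbitrary literal string for you.

```python
import re

price = "Total: $4.50 (approx.)"

# Unescaped, '.' matches any character, so the pattern 4.50 also matches '4x50'
loose = re.search(r"4.50", "4x50")

# Escaped by hand: \$ and \. match the literal characters
exact = re.search(r"\$4\.50", price)

# re.escape() escapes the metacharacters in a literal string
literal = "(approx.)"
found = re.search(re.escape(literal), price)

print(loose.group(), exact.group(), found.group())
```

re.escape() is most useful when the literal text comes from outside the program and may contain metacharacters.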

The following table lists regular expression syntax and how it behaves in Python:

Pattern         Description
^               Matches the beginning of the string; with the MULTILINE flag, also the beginning of each line.
$               Matches the end of the string or just before a trailing newline; with the MULTILINE flag, also the end of each line.
.               Matches any single character except newline. The DOTALL (s) flag allows it to match newline as well.
[...]           Matches any single character in brackets.
[^...]          Matches any single character not in brackets.
re*             Matches 0 or more occurrences of the preceding expression.
re+             Matches 1 or more occurrences of the preceding expression.
re?             Matches 0 or 1 occurrence of the preceding expression.
re{n}           Matches exactly n occurrences of the preceding expression.
re{n,}          Matches n or more occurrences of the preceding expression.
re{n,m}         Matches at least n and at most m occurrences of the preceding expression.
a|b             Matches either a or b.
(re)            Groups regular expressions and remembers the matched text.
(?imx)          Turns on the i, m, or x flags. In Python this form applies to the whole pattern and should appear at its start.
(?-imx)         Turns off the i, m, or x flags. In Python, flags can only be turned off with the scoped form (?-imx:re).
(?:re)          Groups regular expressions without remembering the matched text.
(?imx:re)       Turns on the i, m, or x flags within the parentheses only (Python 3.6 and later).
(?-imx:re)      Turns off the i, m, or x flags within the parentheses only (Python 3.6 and later).
(?#...)         Comment.
(?=re)          Positive lookahead: matches at a position where re matches next, without consuming any characters.
(?!re)          Negative lookahead: matches at a position where re does not match next, without consuming any characters.
(?>re)          Atomic group: matches re as an independent pattern without backtracking into it (Python 3.11 and later).
\w              Matches word characters.
\W              Matches nonword characters.
\s              Matches whitespace. Equivalent to [ \t\n\r\f\v] in ASCII mode.
\S              Matches nonwhitespace.
\d              Matches digits. Equivalent to [0-9] in ASCII mode.
\D              Matches nondigits.
\A              Matches the beginning of the string.
\Z              Matches the end of the string only. Unlike Perl, Python does not match before a trailing newline here.
\z              Perl's strict end-of-string anchor. It is not accepted by every Python version, so prefer \Z, which already behaves this way in Python.
\G              Perl's anchor for the point where the last match finished. Not supported by Python's re module.
\b              Matches word boundaries when outside brackets. Matches backspace (0x08) when inside brackets.
\B              Matches nonword boundaries.
\n, \t, etc.    Matches newlines, carriage returns, tabs, etc.
\1...\9         Matches the nth grouped subexpression.
\10             Matches the 10th grouped subexpression (group numbers go up to 99). A number that starts with 0, or is three octal digits long, is read as an octal character code instead.
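
A few of the constructs in the table can be tried out as follows. This is only an illustrative sketch; the sample strings and variable names are invented.

```python
import re

# Character class: any single vowel
vowels = re.findall(r"[aeiou]", "regex")

# Alternation inside a non-capturing group, bounded by \b word boundaries
pets = re.findall(r"\b(?:cat|dog)s?\b", "cats, dog, catalog")

# Positive lookahead: digits only when followed by 'px'; 'px' is not consumed
px = re.findall(r"\d+(?=px)", "10px 20em 30px")

# Backreference: \1 repeats whatever group 1 matched (a doubled word)
doubled = re.search(r"\b(\w+) \1\b", "this is is a test")

# \A and \Z anchor the pattern to the whole string
whole = re.search(r"\A\d+\Z", "12345")
partial = re.search(r"\A\d+\Z", "12345\n")

print(vowels, pets, px, doubled.group(), whole.group(), partial)
```

The last line illustrates that \Z in Python does not match before a trailing newline, so the second search returns None.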

Regular Expressions Repetition

Use +, *, ?, and {} to specify repetition in a pattern:

  • + -- 1 or more occurrences of the pattern to its left, e.g. 'i+' = one or more i's
  • * -- 0 or more occurrences of the pattern to its left
  • ? -- Match 0 or 1 occurrences of the pattern to its left
  • \d{3} -- Match exactly 3 digits
  • \d{3,} -- Match 3 or more digits
  • \d{3,5} -- Match 3, 4, or 5 digits
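
The quantifiers above can be compared side by side in a small sketch like this one (sample strings are hypothetical):

```python
import re

# + needs at least one 'i'; * also accepts none
one_or_more = re.search(r"pi+", "piiig").group()
zero_or_more = re.search(r"pi*", "pg").group()

# ? makes the 'u' optional
optional = re.findall(r"colou?r", "color colour")

# {3} exactly three digits (the \b anchors keep it from matching inside longer numbers)
exactly_3 = re.findall(r"\b\d{3}\b", "12 345 6789")

# {3,} three or more digits
at_least_3 = re.findall(r"\d{3,}", "12 345 6789")

# {3,5} is greedy: it takes up to five digits, leaving '12' unmatched
three_to_five = re.findall(r"\d{3,5}", "12 345 6789012")

print(one_or_more, zero_or_more, optional)
print(exactly_3, at_least_3, three_to_five)
```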

Regular Expression Grouping

Groups are marked by the '(', ')' metacharacters. '(' and ')' have much the same meaning as they do in mathematical expressions; they group together the expressions contained inside them, and you can repeat the contents of a group with a quantifier, such as *, +, ?, or {m,n}. For example, (ab)* will match zero or more repetitions of ab.

  • \D\d+ -- No group: + repeats \d
  • (\D\d)+ -- Grouped: + repeats \D\d pair
  • ([Pp]ython(, )?)+ -- Match "Python", "Python, python, python", etc.

>>> p = re.compile('(ab)*')
>>> print(p.match('ababababab').span())
(0, 10)
>>> 
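
The difference between quantifying a single token and quantifying a whole group can be sketched as follows. The input strings are invented for illustration.

```python
import re

# No group: + applies only to \d, so each match is one non-digit followed by digits
ungrouped = re.findall(r"\D\d+", "a1b22")

# Grouped: + repeats the whole \D\d pair
m = re.search(r"(\D\d)+", "a1b2c3!")
grouped = m.group()

# A repeated capturing group remembers only its last repetition
last = m.group(1)

# A group containing an optional inner group, repeated as a unit
langs = re.search(r"([Pp]ython(, )?)+", "Python, python, python rocks").group()

print(ungrouped, grouped, last, langs)
```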

Subgroups are numbered from left to right, from 1 upward. Groups can be nested; to determine the number, just count the opening parenthesis characters, going from left to right.

>>> p = re.compile('(a(b)c)d')
>>> m = p.match('abcd')
>>> m.group(0)
'abcd'
>>> m.group(1)
'abc'
>>> m.group(2)
'b'
>>> 
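
Beyond group(), a match object offers a few other standard ways to read nested groups. A brief sketch, reusing the same pattern:

```python
import re

m = re.match(r"(a(b)c)d", "abcd")

# groups() returns all subgroups, numbered from 1, as a tuple
all_groups = m.groups()

# group() accepts several numbers at once and returns them in that order
pair = m.group(2, 1)

# span(n) gives the (start, end) offsets of group n in the string
inner_span = m.span(2)

print(all_groups, pair, inner_span)
```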

Modifying Strings

Up to this point, we’ve simply performed searches against a static string. Regular expressions are also commonly used to modify strings in various ways, using the following pattern methods:

Method/Attribute    Purpose
split()             Split the string into a list, splitting it wherever the RE matches
sub()               Find all substrings where the RE matches, and replace them with a different string
subn()              Does the same thing as sub(), but returns the new string and the number of replacements

Splitting Strings

The split() method of a pattern splits a string apart wherever the RE matches, returning a list of the pieces. It’s similar to the split() method of strings but provides much more generality in the delimiters that you can split by; the string method only supports splitting by whitespace or by a fixed string. As you’d expect, there’s a module-level re.split() function, too.

.split(string[, maxsplit=0])

Split string by the matches of the regular expression. If capturing parentheses are used in the RE, then their contents will also be returned as part of the resulting list. If maxsplit is nonzero, at most maxsplit splits are performed.

>>> p = re.compile(r'\W+')
>>> p.split('This is a test, short and sweet, of split().')
['This', 'is', 'a', 'test', 'short', 'and', 'sweet', 'of', 'split', '']
>>> p.split('This is a test, short and sweet, of split().', 3)
['This', 'is', 'a', 'test, short and sweet, of split().']
>>> 
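
The effect of capturing parentheses, and the module-level re.split() function, can be sketched as follows. The sample sentences are illustrative; maxsplit is passed by keyword here, which works across Python 3 versions.

```python
import re

sentence = "This... is a test."

# Without a capturing group, the delimiters are discarded
plain = re.compile(r"\W+").split(sentence)

# With a capturing group, the matched delimiters are returned as well
with_delims = re.compile(r"(\W+)").split(sentence)

# Module-level function, limited to a single split
limited = re.split(r"\W+", "Words, words, words.", maxsplit=1)

print(plain)
print(with_delims)
print(limited)
```

Keeping the delimiters is handy when the pieces need to be joined back together later without losing the original punctuation.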

Search and Replace

One of the most commonly used pattern methods is sub().

.sub(replacement, string[, count=0])

Returns the string obtained by replacing the leftmost non-overlapping occurrences of the RE in string with replacement. If the pattern isn’t found, string is returned unchanged.

The optional argument count is the maximum number of pattern occurrences to be replaced; count must be a non-negative integer. The default value of 0 means to replace all occurrences.

>>> import re
>>> phone = "90100 23210 # This is FTL Mobile Number"
>>> # Delete Python-style comments
>>> num = re.sub(r'#.*$', "", phone)
>>> print ("Mobile Number : ", num)
Mobile Number :  90100 23210 
>>> # Remove everything except digits
>>> num = re.sub(r'\D', "", phone)
>>> print ("Mobile Number : ", num)
Mobile Number :  9010023210
>>>
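
The count argument and the subn() method described earlier can be sketched like this. The pattern, the sample text, and the variable names are made up for illustration.

```python
import re

text = "blue socks and red shoes, blue hat"
colour = re.compile(r"blue|red")

# Default count=0 replaces every occurrence
all_replaced = colour.sub("colour", text)

# count=1 replaces only the leftmost occurrence
first_only = colour.sub("colour", text, count=1)

# subn() returns the new string together with the number of replacements made
new_text, n = colour.subn("colour", text)

# No match: the input string comes back unchanged
unchanged = colour.sub("colour", "plain text")

print(all_replaced)
print(first_only)
print(new_text, n)
print(unchanged)
```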