Regular Expressions - User Guide
A regular expression is a special sequence of characters that helps you match or find other strings or sets of strings, using a specialized syntax held in a pattern. Regular expressions are widely used in the UNIX world.
The re module provides full support for Perl-like regular expressions in Python. The re module raises the exception re.error if an error occurs while compiling or using a regular expression.
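As a minimal sketch of how re.error surfaces in practice, the helper below (try_compile is a hypothetical name chosen for illustration) compiles a pattern and reports a failure instead of letting the exception propagate. It assumes Python 3.

```python
import re

def try_compile(pattern):
    """Return a compiled pattern, or None if the pattern is invalid."""
    try:
        return re.compile(pattern)
    except re.error as exc:
        # re.error carries a human-readable description of the problem.
        print("Invalid pattern %r: %s" % (pattern, exc))
        return None

good = try_compile(r"\d+")        # a valid pattern
bad = try_compile(r"(unclosed")   # missing closing parenthesis
```

Here good is a usable pattern object, while bad is None because compilation failed.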
We will cover two important functions used to handle regular expressions. But a small thing first: various characters have a special meaning when they are used in a regular expression. To avoid confusion when backslashes are involved, we will write patterns as raw strings, in the form r'expression'.
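The difference a raw string makes can be sketched as follows. The sample text and variable names are illustrative, and the snippet assumes Python 3.

```python
import re

# In a normal string literal, "\b" is the backspace character (0x08).
# In a raw string, r"\b" stays a backslash followed by "b", which the
# re module interprets as a word boundary.
plain = "\bcat\b"
raw = r"\bcat\b"

text = "the cat sat"
plain_hit = re.search(plain, text)  # looks for backspace + "cat" + backspace
raw_hit = re.search(raw, text)      # looks for the whole word "cat"
```

With this input, plain_hit is None because the text contains no backspace characters, while raw_hit matches "cat".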
The match Function
This function attempts to match the RE pattern at the beginning of string, with optional flags.
Here is the syntax for this function:
re.match(pattern, string, flags=0)
Here is the description of the parameters:
Parameter | Description
pattern | This is the regular expression to be matched.
string | This is the string, which would be searched to match the pattern at the beginning of string.
flags | You can specify different flags using bitwise OR (|). These are modifiers, which are listed in the table below.
The re.match function returns a match object on success, None on failure. We use the group(num) or groups() method of the match object to get the matched expression.
Match Object Method | Description
group(num=0) | This method returns the entire match (or the specific subgroup num).
groups() | This method returns all matching subgroups in a tuple (empty if there weren't any).
Example
#!/usr/bin/python3
import re

line = "Cats are smarter than dogs"

# Greedy (.*) captures "Cats"; non-greedy (.*?) captures the next word.
matchObj = re.match(r'(.*) are (.*?) .*', line, re.M | re.I)

if matchObj:
    print("matchObj.group() :", matchObj.group())
    print("matchObj.group(1) :", matchObj.group(1))
    print("matchObj.group(2) :", matchObj.group(2))
else:
    print("No match!!")
When the above code is executed, it produces the following result:
matchObj.group() : Cats are smarter than dogs
matchObj.group(1) : Cats
matchObj.group(2) : smarter
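The groups() method from the table above is not exercised by the example. A short sketch, assuming Python 3 and a simplified pattern chosen for illustration, shows how it relates to group():

```python
import re

m = re.match(r"(\w+) are (\w+)", "Cats are smarter than dogs")

whole = m.group()    # the entire matched text
first = m.group(1)   # the first parenthesized subgroup
both = m.groups()    # all subgroups, as a tuple
```

In this sketch whole is "Cats are smarter", first is "Cats", and both is ("Cats", "smarter").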
The search Function
This function searches for the first occurrence of the RE pattern within string, with optional flags.
Here is the syntax for this function:
re.search(pattern, string, flags=0)
Here is the description of the parameters:
Parameter | Description
pattern | This is the regular expression to be matched.
string | This is the string, which would be searched to match the pattern anywhere in the string.
flags | You can specify different flags using bitwise OR (|). These are modifiers, which are listed in the table below.
The re.search function returns a match object on success, None on failure. We use the group(num) or groups() method of the match object to get the matched expression.
Match Object Method | Description
group(num=0) | This method returns the entire match (or the specific subgroup num).
groups() | This method returns all matching subgroups in a tuple (empty if there weren't any).
Example
#!/usr/bin/python3
import re

line = "Cats are smarter than dogs"

# search scans the whole string rather than only its beginning.
searchObj = re.search(r'(.*) are (.*?) .*', line, re.M | re.I)

if searchObj:
    print("searchObj.group() :", searchObj.group())
    print("searchObj.group(1) :", searchObj.group(1))
    print("searchObj.group(2) :", searchObj.group(2))
else:
    print("Nothing found!!")
When the above code is executed, it produces the following result:
searchObj.group() : Cats are smarter than dogs
searchObj.group(1) : Cats
searchObj.group(2) : smarter
Matching Versus Searching
Python offers two different primitive operations based on regular expressions: match checks for a match only at the beginning of the string, while search checks for a match anywhere in the string (this is what Perl does by default).
Example
#!/usr/bin/python3
import re

line = "Cats are smarter than dogs"

# match only succeeds if the pattern matches at the start of the string.
matchObj = re.match(r'dogs', line, re.M | re.I)
if matchObj:
    print("match --> matchObj.group() :", matchObj.group())
else:
    print("No match!!")

# search succeeds if the pattern matches anywhere in the string.
searchObj = re.search(r'dogs', line, re.M | re.I)
if searchObj:
    print("search --> searchObj.group() :", searchObj.group())
else:
    print("Nothing found!!")
When the above code is executed, it produces the following result:
No match!!
search --> searchObj.group() : dogs
Search and Replace
One of the most important re functions is sub.
Syntax
re.sub(pattern, repl, string, count=0, flags=0)
This function replaces all occurrences of the RE pattern in string with repl, unless count is provided, in which case at most count occurrences are replaced. It returns the modified string.
Example
#!/usr/bin/python3
import re

phone = "2004-959-559 # This is Phone Number"

# Delete Python-style comments
num = re.sub(r'#.*$', "", phone)
print("Phone Num :", num)

# Remove anything other than digits
num = re.sub(r'\D', "", phone)
print("Phone Num :", num)
When the above code is executed, it produces the following result:
Phone Num : 2004-959-559
Phone Num : 2004959559
Regular Expression Modifiers: Option Flags
Regular expression functions accept optional modifiers that control various aspects of matching. The modifiers are specified as an optional flags argument. You can combine multiple modifiers using bitwise OR (|), as shown previously. The available modifiers are:
Modifier | Description
re.I | Performs case-insensitive matching.
re.L | Interprets words according to the current locale. This interpretation affects the alphabetic group (\w and \W), as well as word boundary behavior (\b and \B). In Python 3 it can be used only with bytes patterns.
re.M | Makes $ match the end of a line (not just the end of the string) and makes ^ match the start of any line (not just the start of the string).
re.S | Makes a period (dot) match any character, including a newline.
re.U | Interprets letters according to the Unicode character set. This flag affects the behavior of \w, \W, \b, \B. In Python 3 this is already the default for str patterns.
re.X | Permits more readable ("verbose") regular expression syntax. It ignores whitespace (except inside a set [] or when escaped by a backslash) and treats unescaped # as a comment marker.
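The flags in the table can be sketched as follows, assuming Python 3. The sample text is invented, and re.findall (a standard re function not otherwise covered here) is used only to make the results easy to inspect.

```python
import re

text = "first line\nSecond LINE"

# re.I: case-insensitive matching.
ci = re.findall(r"line", text, re.I)

# re.M: ^ matches at the start of every line, not just the string.
starts = re.findall(r"^\w+", text, re.M)
no_m = re.findall(r"^\w+", text)

# re.S: the dot also matches a newline.
dot_default = re.search(r"first.*LINE", text)
dot_all = re.search(r"first.*LINE", text, re.S)

# re.X: whitespace and # comments inside the pattern are ignored.
phone = re.compile(r"""
    (\d{3})   # first block of digits
    -         # separator
    (\d{4})   # second block of digits
""", re.X)
parts = phone.match("555-1234").groups()
```

With this input, ci is ['line', 'LINE'], starts is ['first', 'Second'] while no_m is only ['first'], dot_default is None while dot_all matches, and parts is ('555', '1234').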
Regular Expression Patterns
Except for the metacharacters (+ ? . * ^ $ ( ) [ ] { } | \), all characters match themselves. You can match a metacharacter literally by preceding it with a backslash.
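A small sketch of escaping, assuming Python 3 and an invented sample string. It also uses re.escape, a standard helper that backslash-escapes a literal string for you.

```python
import re

price = "cost: $5.00 (approx)"

# Unescaped, $ is an end-of-string anchor, so this cannot match.
unescaped = re.search(r"$5.00", price)

# Escaping $ and . makes them match literally.
escaped = re.search(r"\$5\.00", price)

# re.escape builds the escaped pattern from a plain string.
auto = re.search(re.escape("$5.00"), price)
```

Here unescaped is None, while escaped and auto both match the text "$5.00".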
The following table lists regular expression syntax. Most of it is available in Python; entries that Python's re module handles differently are noted.
Pattern | Description
^ | Matches the beginning of the string, and of each line when re.M is set.
$ | Matches the end of the string (or just before a trailing newline), and the end of each line when re.M is set.
. | Matches any single character except newline. Using the re.S flag allows it to match newline as well.
[...] | Matches any single character in brackets.
[^...] | Matches any single character not in brackets.
re* | Matches 0 or more occurrences of the preceding expression.
re+ | Matches 1 or more occurrences of the preceding expression.
re? | Matches 0 or 1 occurrence of the preceding expression.
re{n} | Matches exactly n occurrences of the preceding expression.
re{n,} | Matches n or more occurrences of the preceding expression.
re{n,m} | Matches at least n and at most m occurrences of the preceding expression.
a|b | Matches either a or b.
(re) | Groups regular expressions and remembers matched text.
(?imx) | Turns on the i, m, or x options. In Python this form applies to the whole pattern and must appear at its start.
(?-imx) | Turns off the i, m, or x options in Perl-style engines. Python accepts only the scoped form (?-imx:re).
(?:re) | Groups regular expressions without remembering matched text.
(?imx:re) | Temporarily turns on the i, m, or x options within the parentheses (Python 3.6 and later).
(?-imx:re) | Temporarily turns off the i, m, or x options within the parentheses (Python 3.6 and later).
(?#...) | Comment.
(?=re) | Positive lookahead: asserts that re matches at this position without consuming any characters.
(?!re) | Negative lookahead: asserts that re does not match at this position, without consuming any characters.
(?>re) | Matches an independent (atomic) pattern without backtracking (Python 3.11 and later).
\w | Matches word characters.
\W | Matches nonword characters.
\s | Matches whitespace. For ASCII text, equivalent to [ \t\n\r\f\v].
\S | Matches nonwhitespace.
\d | Matches digits. For ASCII text, equivalent to [0-9].
\D | Matches nondigits.
\A | Matches the beginning of the string.
\Z | Matches the end of the string. In Python it matches only at the very end, not before a trailing newline.
\z | Matches the end of the string in Perl-style engines. In Python, use \Z; only recent Python versions recognize \z.
\G | Matches the point where the last match finished. Not supported by Python's re module.
\b | Matches word boundaries when outside brackets. Matches backspace (0x08) when inside brackets.
\B | Matches nonword boundaries.
\n, \t, etc. | Matches newlines, carriage returns, tabs, etc.
\1...\9 | Matches the nth grouped subexpression.
\10 | Matches the 10th grouped subexpression. Python treats a numeric escape as an octal character code only when it starts with 0 or is three octal digits long.
Regular Expression Examples
Literal characters
Example | Description
python | Match "python".
Character classes
Example | Description
[Pp]ython | Match "Python" or "python"
rub[ye] | Match "ruby" or "rube"
[aeiou] | Match any one lowercase vowel
[0-9] | Match any digit; same as [0123456789]
[a-z] | Match any lowercase ASCII letter
[A-Z] | Match any uppercase ASCII letter
[a-zA-Z0-9] | Match any of the above
[^aeiou] | Match anything other than a lowercase vowel
[^0-9] | Match anything other than a digit
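A few of the character classes above, tried on invented sample strings (a sketch assuming Python 3):

```python
import re

# [Pp] accepts either case for the first letter only.
found = re.findall(r"[Pp]ython", "Python and python and PYTHON")

# [aeiou] matches one lowercase vowel; here each one is removed.
no_vowels = re.sub(r"[aeiou]", "", "regular")

# [^0-9] matches anything that is not a digit.
digits_only = re.sub(r"[^0-9]", "", "tel: 555-0100")
```

Here found is ['Python', 'python'], no_vowels is "rglr", and digits_only is "5550100".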
Special Character Classes
Example | Description
. | Match any character except newline
\d | Match a digit: [0-9]
\D | Match a nondigit: [^0-9]
\s | Match a whitespace character: [ \t\r\n\f\v]
\S | Match nonwhitespace: [^ \t\r\n\f\v]
\w | Match a single word character: [A-Za-z0-9_]
\W | Match a nonword character: [^A-Za-z0-9_]
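The shorthand classes can be sketched on a small invented string (assuming Python 3; re.findall and re.split are standard re functions used here for convenience):

```python
import re

sample = "Room 42, floor_3\tEast"

digits = re.findall(r"\d+", sample)   # runs of digits
words = re.findall(r"\w+", sample)    # runs of word characters
pieces = re.split(r"\s+", sample)     # split on runs of whitespace
```

Here digits is ['42', '3'], words is ['Room', '42', 'floor_3', 'East'], and pieces is ['Room', '42,', 'floor_3', 'East'].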
Repetition Cases
Example | Description
ruby? | Match "rub" or "ruby": the y is optional
ruby* | Match "rub" plus 0 or more ys
ruby+ | Match "rub" plus 1 or more ys
\d{3} | Match exactly 3 digits
\d{3,} | Match 3 or more digits
\d{3,5} | Match 3, 4, or 5 digits
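The repetition operators can be compared side by side. This sketch assumes Python 3.4 or later for re.fullmatch, which requires the whole string to match; the candidate strings are made up.

```python
import re

candidates = ["rub", "ruby", "rubyyy"]

# For each pattern, record which candidates match in full.
optional = [bool(re.fullmatch(r"ruby?", s)) for s in candidates]
star = [bool(re.fullmatch(r"ruby*", s)) for s in candidates]
plus = [bool(re.fullmatch(r"ruby+", s)) for s in candidates]

# {3,5} takes between 3 and 5 digits at a time.
runs = re.findall(r"\d{3,5}", "12 123 1234567")
```

Here optional is [True, True, False], star is [True, True, True], plus is [False, True, True], and runs is ['123', '12345'].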
Nongreedy repetition
Adding ? after a repetition operator makes it nongreedy, so it matches the smallest possible number of repetitions:
Example | Description
<.*> | Greedy repetition: matches "<python>perl>"
<.*?> | Nongreedy: matches "<python>" in "<python>perl>"
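The two rows above can be checked directly (a sketch assuming Python 3):

```python
import re

text = "<python>perl>"

# Greedy: .* runs to the last ">" in the string.
greedy = re.search(r"<.*>", text).group()

# Nongreedy: .*? stops at the first ">" it can.
lazy = re.search(r"<.*?>", text).group()
```

Here greedy is "<python>perl>" and lazy is "<python>".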
Grouping with Parentheses
Example | Description
\D\d+ | No group: + repeats \d
(\D\d)+ | Grouped: + repeats \D\d pair
([Pp]ython(, )?)+ | Match "Python", "Python, python, python", etc.
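The effect of grouping on repetition can be sketched as follows (assuming Python 3; the input string is invented). Note that a repeated group remembers only its last repetition.

```python
import re

text = "a1b2c3"

# Without a group, + applies only to \d.
no_group = re.match(r"\D\d+", text).group()

# With a group, + repeats the whole \D\d pair.
grouped = re.match(r"(\D\d)+", text)
whole = grouped.group()
last_pair = grouped.group(1)
```

Here no_group is "a1", whole is "a1b2c3", and last_pair is "c3".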
Backreferences
A backreference matches the same text that a previously matched group matched:
Example | Description
([Pp])ython&\1ails | Match python&pails or Python&Pails
(['"]).*?\1 | Single or double-quoted string. \1 matches whatever the 1st group matched. \2 matches whatever the 2nd group matched, etc. (A backreference cannot be used inside brackets, so [^\1] would not mean "anything but the quote".)
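A sketch of backreferences in use, assuming Python 3. The quoted-string pattern here adds a second capturing group for the content, which is an illustrative choice rather than part of the table above.

```python
import re

# Group 1 is the opening quote; \1 requires the same quote to close.
quoted = re.compile(r"""(['"])(.*?)\1""")

double = quoted.search('say "hello" now').group(2)
single = quoted.search("say 'hi' now").group(2)
mixed = quoted.search("'mixed\"")   # opening ' but closing "

# A doubled word: the same word twice, separated by a space.
repeated = re.search(r"\b(\w+) \1\b", "this is is a test").group()
```

Here double is "hello", single is "hi", mixed is None, and repeated is "is is".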
Alternatives
Example | Description
python|perl | Match "python" or "perl"
rub(y|le) | Match "ruby" or "ruble"
Python(!+|\?) | "Python" followed by one or more ! or one ?
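The alternation examples can be tried as follows (a sketch assuming Python 3; finditer and fullmatch are standard re functions, and the test strings are invented):

```python
import re

# The group limits the alternation to the ending after "rub".
pattern = re.compile(r"rub(y|le)")
hits = [m.group() for m in pattern.finditer("ruby ruble rubber")]

# One or more "!" or exactly one "?" after "Python".
excl = re.compile(r"Python(!+|\?)")
samples = ["Python!!!", "Python?", "Python??", "Python"]
results = [bool(excl.fullmatch(s)) for s in samples]
```

Here hits is ['ruby', 'ruble'] and results is [True, True, False, False].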
Anchors
Anchors specify a match position rather than matching characters.
Example | Description
^Python | Match "Python" at the start of a string, or of an internal line when re.M is set
Python$ | Match "Python" at the end of a string, or of a line when re.M is set
\APython | Match "Python" at the start of a string
Python\Z | Match "Python" at the end of a string
\bPython\b | Match "Python" at a word boundary
\brub\B | \B is nonword boundary: match "rub" in "rube" and "ruby" but not alone
Python(?=!) | Match "Python", if followed by an exclamation point.
Python(?!!) | Match "Python", if not followed by an exclamation point.
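A sketch of several anchors at work, assuming Python 3 and invented sample strings:

```python
import re

text = "Python rocks\nI like Python"

# ^ reaches internal line starts only when re.M is set.
no_flag = re.search(r"^I like", text)
with_flag = re.search(r"^I like", text, re.M)

# \B after "rub" requires another word character to follow.
rub = re.findall(r"\brub\B", "rub ruby rube")

# Lookaheads test what follows without including it in the match.
sample = "Python! Python?"
followed = [m.start() for m in re.finditer(r"Python(?=!)", sample)]
not_followed = [m.start() for m in re.finditer(r"Python(?!!)", sample)]
```

Here no_flag is None while with_flag matches, rub is ['rub', 'rub'] (from "ruby" and "rube"), followed is [0], and not_followed is [8].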
Special Syntax with Parentheses
Example | Description
R(?#comment) | Matches "R". All the rest is a comment
(?i)ruby | Case-insensitive for the whole pattern; in Python the flag must come at the start of the pattern
R(?i:uby) | Case-insensitive only while matching "uby"
rub(?:y|le) | Group only without creating \1 backreference
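These forms can be sketched as follows. The scoped flag syntax (?i:...) assumes Python 3.6 or later, and the test strings are invented.

```python
import re

# Only "uby" is matched case-insensitively; "R" stays case-sensitive.
scoped = re.compile(r"R(?i:uby)")
scoped_results = [bool(scoped.fullmatch(s)) for s in ["Ruby", "RUBY", "ruby"]]

# A capturing group shows up in groups(); a non-capturing one does not.
capturing = re.match(r"rub(y|le)", "ruble").groups()
non_capturing = re.match(r"rub(?:y|le)", "ruble").groups()

# (?#...) is ignored by the matcher.
commented = re.match(r"R(?#this is ignored)uby", "Ruby").group()
```

Here scoped_results is [True, True, False], capturing is ('le',), non_capturing is (), and commented is "Ruby".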