Python Regular Expression (RegEx) in 6 minutes

A regular expression (RegEx) is a sequence of characters that defines a pattern, used to search for or extract text from a string. Regular expressions are widely used in UNIX and Linux operating systems and in programming languages such as Perl. Python has full support for regular expressions through the 're' module.

The re module has several functions to find or search for a string or pattern within a given text. It can also search and replace. In its simplest form, it can search for a specific text (a series of characters) in a string.

To do a search or match, the minimum required is the text or pattern to search for and the string in which the search or match will be performed. A successful search returns a match object; an unsuccessful one returns None.

Basic search

>>> import re
>>> myText="What a wonderful day today!"
>>> resultObject = re.search("day", myText)
>>> print(resultObject.group())  
day
>>>

In the example above, we search for "day" within myText, and the result is stored in the match object resultObject. The matched text can be accessed with the group() method of the match object. Please note that it matched the first occurrence of "day". It did not match the "day" inside "today".
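To confirm which occurrence was matched, the match object can also report where the match was found. The short sketch below reuses the text from the example above; start() and span() are standard match object methods, while the variable names are only illustrative.

```python
import re

my_text = "What a wonderful day today!"
result = re.search("day", my_text)

# start() is the index of the first matched character; span() is (start, end).
first_start = result.start()
first_span = result.span()
print(first_start, first_span)

# The "day" inside "today" begins later in the string, so it is not the one matched.
later_start = my_text.index("today") + 2
print(later_start)
```

The match starts before the position of the "day" in "today", which shows that search() reported the earlier, standalone word.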

re.search()

Let us extend our knowledge of the basic search() function of re. Following is the syntax for search():

re.search(pattern, string, flags=0)

To do a search, you need to provide at least two things: the text or pattern to search for, and the string to search in. flags is optional.

pattern can be plain text or a specialized series of characters to search for in a string. Patterns matter when you need more than a simple exact text search. They are explained in more detail in the pattern section below.

search() scans the string from the beginning and stops at the first location where the pattern matches.

Search stops at first match

>>> import re
>>> myText="What a wonderful day today!" 
>>> resultObject = re.search("day|wonder", myText)
>>> print(resultObject.group())
wonder
>>> print (resultObject)
<re.Match object; span=(7, 13), match='wonder'>
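One point worth checking is that "wonder" is returned because it appears earlier in the string, not because of where it sits in the pattern. The following sketch, using the same text, swaps the order of the two alternatives to illustrate this; the variable names are only for illustration.

```python
import re

my_text = "What a wonderful day today!"

# Both patterns contain the same alternatives, listed in a different order.
first = re.search("day|wonder", my_text)
second = re.search("wonder|day", my_text)

# search() reports the leftmost match in the string, so both find "wonder".
print(first.group(), first.span())
print(second.group(), second.span())
```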

If search() does not find a match, it returns None. In the above example, when we print the match object, it shows the details of the match and the position where it was found. In the following example, the search does not yield any result, so resultObject is None, and calling resultObject.group() raises an error. Both scenarios must be handled in a program that uses re.search().

Search returns None if there is no match

>>> import re
>>> myText="What a wonderful day!"
>>> resultObject = re.search("tomorrow", myText) 
>>> print (resultObject) 
None
>>> print(resultObject.group())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'
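One possible way to handle both scenarios is to test the result before calling group(). The helper below, find_first, is a hypothetical name introduced only for this sketch; it is not part of the re module.

```python
import re

def find_first(pattern, text):
    """Return the matched text, or None when the pattern is not found."""
    result = re.search(pattern, text)
    if result is None:
        # No match: avoid calling group() on None.
        return None
    return result.group()

found = find_first("day", "What a wonderful day!")
missing = find_first("tomorrow", "What a wonderful day!")
print(found)
print(missing)
```

With the check in place, the unsuccessful search yields None instead of raising AttributeError.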

pattern

The power of regular expressions lies in the pattern. A pattern can consist of plain characters, special characters, or a combination of both. Plain characters match themselves exactly, including case, as you have seen in the examples above. Special characters either match a class of characters or control how the regular expression is interpreted.

Example:

Match a character class: dot (.) matches any single character other than a newline by default, and "\d" matches any digit 0-9.

Control interpretation: the character "|" acts as an OR operator. In our previous example, "day|wonder" matches either "day" or "wonder".

If you want to search for a special character literally, you have to escape it with "\". The special characters are + ? . * ^ $ ( ) [ ] { } | \ . So, when you want to search for * in a string, the regular expression should be \*
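As a rough illustration of escaping, the sketch below searches for a literal * and then uses re.escape(), a standard function of the re module that is not covered above, to add the backslashes automatically. The sample strings are made up for the example.

```python
import re

text = "2 * 3 = 6"

# A raw string (r"...") passes the backslash through to the regex engine unchanged.
star = re.search(r"\*", text)
print(star.span())

# re.escape() adds the backslashes for you when the search text is meant literally.
literal = "1+1"
escaped_match = re.search(re.escape(literal), "1+1=2")

# Without escaping, + means "one or more of the preceding 1", so nothing matches here.
unescaped_match = re.search(literal, "1+1=2")
print(escaped_match.group())
print(unescaped_match)
```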

Following are the meanings of some special characters (not exhaustive). Please note that some of this behavior can be changed with flags:

. (dot) - In the default mode, matches any single character except a newline.

^ - Matches the start of the string.

$ - Matches the end of the string.

* - Matches 0 or more repetitions of the preceding character or group. xyz* will match 'xy', 'xyz', 'xyzzzz'.

+ - Matches 1 or more repetitions of the preceding character or group. xyz+ will match 'xyz', 'xyzzzz'.

? - Matches 0 or 1 repetitions of the preceding character or group. xyz? will match 'xy', 'xyz'.

{n} - Matches exactly n repetitions of the preceding character or group.

{n,m} - Matches between n and m repetitions of the preceding character or group.

a|b - Matches a or b.

[] - Matches any one of the characters in the class. [abc] will match a or b or c. [a-z] will match any single lowercase letter.

() - Creates a group while matching. Whatever matches the RegEx within () can be retrieved afterwards.
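The entries above can be tried out with a small sketch like the following. It uses re.fullmatch(), a standard re function not introduced above, which succeeds only when the whole string fits the pattern; the sample strings come from the list or are invented for illustration.

```python
import re

samples = ["xy", "xyz", "xyzzzz"]

# fullmatch() succeeds only if the entire string fits the pattern.
star_ok = [s for s in samples if re.fullmatch("xyz*", s)]
plus_ok = [s for s in samples if re.fullmatch("xyz+", s)]
optional_ok = [s for s in samples if re.fullmatch("xyz?", s)]
print(star_ok, plus_ok, optional_ok)

# {n,m} bounds the repetition count; [a-c] is a character class.
bounded = re.search("z{2,3}", "xyzzzz").group()
in_class = re.search("[a-c]", "xyzb").group()
print(bounded, in_class)

# ^ and $ anchor the pattern to the start and end of the string.
starts = re.search("^What", "What a wonderful day") is not None
ends = re.search("day$", "What a wonderful day today!") is not None
print(starts, ends)
```

The last check is False because the string ends with "!", so "day" is not at the very end.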

There are also some sequences of characters which have special meaning for regular expressions

\d - Matches any Unicode decimal digit including [0-9]

\D - This is the opposite of \d. Matches any character which is not a decimal digit.

\s - Matches Unicode whitespace characters

\S - This is the opposite of \s. Matches any character which is not a whitespace character.

\w - Matches Unicode word characters

\W - This is the opposite of \w. Matches any character which is not a word character.
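A brief sketch of these sequences in use follows. The sample text is invented for the example, and the patterns are written as raw strings (r"...") so that Python does not interpret the backslashes itself.

```python
import re

text = "Order 66 shipped on_time"  # sample text, made up for illustration

digits = re.search(r"\d+", text).group()      # first run of digits
word = re.search(r"\w+", text).group()        # first run of word characters
gap = re.search(r"\s", text).span()           # position of the first whitespace
non_word = re.search(r"\W", text).group()     # first character that is not a word character
last_word = re.search(r"\w+$", text).group()  # the underscore counts as a word character
print(digits, word, gap, repr(non_word), last_word)
```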

flags

Flags influence how the regular expression matches. Multiple flags can be combined with the OR operator '|'. A few example flags:

DOTALL, S - . (dot) matches newline

IGNORECASE, I - Do case-insensitive matches.

A single flag is passed as re.S; multiple flags are combined as re.S|re.I

flags example

>>> import re
>>> result = re.search('(wonder).*(DAY)', 'a wonderful day', re.I)
>>> result.groups()
('wonder', 'day')
>>>
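To see the effect of DOTALL and of combining flags, a sketch along these lines may help. The text containing a newline is invented for the example.

```python
import re

text = "Wonderful\nDAY"  # sample text with a newline between the two words

# By default the dot does not match a newline, and the case differs too.
no_flags = re.search("wonderful.day", text)

# re.I ignores case, but the dot still stops at the newline.
only_ignorecase = re.search("wonderful.day", text, re.I)

# Combining both flags with | lets the pattern match across the newline.
both = re.search("wonderful.day", text, re.S | re.I)
print(no_flags, only_ignorecase, repr(both.group()))
```

Only the search with both flags succeeds, since the text differs from the pattern in case and also has a newline where the dot sits.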

Reading search results

You might simply want to know whether there is a match in a string. If there is a match, search() returns a match object; otherwise it returns None. Checking the result against None tells you whether the pattern matched.

checking for a match

>>> import re
>>> result = re.search('day', 'a wonderful day')
>>> print(result)
<re.Match object; span=(12, 15), match='day'>
>>> if result:
...   print("that's a match")
... 
that's a match
>>>

Sometimes you want more than a yes or no answer and are interested in what actually matched. Parts of a regular expression can be grouped with parentheses '(' and ')'. If the pattern matches, these parts can be accessed from the match object through its group() and groups() methods. groups() returns a tuple of all matched subgroups. Individual subgroups can be accessed through group(). group(0) returns the entire match as a string. Individual subgroups are numbered from 1 upwards.

The following example shows all the options.

match group and sub group

>>> result = re.search('(wonder).*(day)', 'a wonderful day') 
>>> print(result)
<re.Match object; span=(2, 15), match='wonderful day'>
>>> result.groups()
('wonder', 'day')
>>> result.group(0)  
'wonderful day'
>>> result.group(1) 
'wonder'
>>> result.group(2) 
'day'
>>>

Please refer to the first half of the following example. Here, we are checking for a pattern in the string 'what a wonderful day', and the pattern we are looking for is 'wonder\w+.*day'. '\w' matches any word character, so 'wonder\w+' means the text wonder followed by one or more letters, digits, or underscores. So, it got a match. group(0) returns the entire matched string, as it is the default group whenever there is a match. But groups() is an empty tuple, as we have not used grouping with '(' ')' in the search, and group(1) does not exist.

In the second half, we search the same string but with the grouping '(wonder\w+)'. The match is the same, but the added advantage is that we can access the full word that matched as group(1). group(0) still contains the entire match, while group(1) contains only the part matched inside the parentheses, e.g. '(wonder\w+)'. If multiple groups are created in the RegEx, they are accessible through subsequent numbers, group(2) and so on.

More group examples

>>> import re
>>> t = re.search(r'wonder\w+.*day', 'what a wonderful day')
>>> t
<re.Match object; span=(7, 20), match='wonderful day'>
>>> t.groups()
()
>>> t.group(0)
'wonderful day'
>>> t.group(1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: no such group


>>> t = re.search(r'(wonder\w+).*day', 'what a wonderful day')
>>> t
<re.Match object; span=(7, 20), match='wonderful day'>
>>> t.groups()
('wonderful',)
>>> t.group(0)
'wonderful day'
>>> t.group(1)
'wonderful'
>>> t.group(2)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: no such group
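As a further illustration of several groups in one pattern, the sketch below pulls the parts of a date out of a sentence. The sample text and the date format are assumptions made for this example only.

```python
import re

text = "Released on 2023-10-02"  # sample text, invented for illustration
result = re.search(r"(\d{4})-(\d{2})-(\d{2})", text)

if result:
    # groups() returns the three subgroups in the order their parentheses open.
    year, month, day = result.groups()
    print(result.group(0))   # the entire match
    print(year, month, day)  # the individual subgroups
```

Checking the result before unpacking keeps the code safe when the text contains no date.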

This was a quick overview of regular expressions in Python. If you are interested in more detail or advanced knowledge of Python regular expressions, I suggest you refer to the following documentation: Regular Expression HOWTO and Regular expression operations.