15. Regular Expressions

Regular expressions (regex for short) are a pattern-matching “sublanguage” that lets you extract content from data or text that follows a specific pattern. For a quick introduction, have a look at Al Sweigart’s PyCon 2017 presentation or his book “Automate the Boring Stuff with Python” [Sweigart2015].

15.1. First Steps

We first import Python’s regular expressions library:

import re

When working with Regex remember that we usually use three commands from the Regex library:

  1. compile

  2. search

  3. group

The first command compiles the regex pattern that we want to use. The second applies the pattern to a string and returns a match object for the first match, or None if the pattern is not found. The last command is called on that match object and returns the matched text as a string for further processing.

In Python this is implemented as follows:

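import re
# Compile the pattern; here the literal word "pattern" stands in for a real regex
myRegex = re.compile(r'pattern')
# search() returns a match object for the first match, or None if there is none
mySearchObject = myRegex.search('Text string that contains the pattern ...')
# group() returns the matched text as a string
myMatch = mySearchObject.group()
print(myMatch)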
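Because search returns None when the pattern is not found, calling group directly on its result can fail. A minimal sketch of a guarded version follows; the helper name first_match and the sample strings are made up for illustration.

```python
import re

def first_match(pattern, text):
    """Return the first substring of text that matches pattern, or None."""
    match = re.compile(pattern).search(text)
    # search() gives None when nothing matches, so check before calling group()
    return match.group() if match is not None else None

print(first_match(r'\d+', 'Room 101 is free'))  # 101
print(first_match(r'\d+', 'No digits here'))    # None
```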

15.2. Example 1: Matching Phone Numbers

Let’s try a first example and extract all the phone numbers from a text-string. The text string is as follows:

We need to call John (412-233-9876), James (312-323-7658) as well as Jimmy (450-123-1234) to make sure it gets all done.

You could try to run a loop over this string and use some of the string manipulation commands from the earlier chapters to try to extract the three phone numbers from this text. However, that would be somewhat cumbersome. Since the phone numbers follow a very specific pattern—i.e., 3 numbers followed by a dash followed by 3 numbers followed by a dash followed by 4 numbers—regular expressions are a perfect vehicle for content extraction.

We first need to define the pattern of what we are looking for in regular expression syntax. You can think of this as a separate sub-language within Python. Regular expressions are not specific to Python, so what you learn here will carry over, with minor differences in syntax, to other programming languages that support regular expressions, such as Java, C, or Ruby.

Note

Regular expressions are not a fully specified programming language, as they lack some of the branching features of real programming languages like Python, C, or Java.

Our task is to match a phone number like 412-233-9876. The pattern of this is ddd-ddd-dddd, which means a digit, followed by another digit, followed by a third digit, followed by a dash, followed by a digit, etc. In regex we have so-called character classes for this, where:

  1. \d Digit character (i.e., one of 0, 1, … , 9)

  2. \w Word characters (i.e., letters, digits, and the underscore _)

  3. \s Whitespace characters (i.e., space, tab, newline \n)

as well as their logical negatives:

  1. \D Non-Digit

  2. \W Non-Word

  3. \S Non-Space
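To see what each of these six classes picks out, here is a small sketch that runs them against one short string. The sample string is hypothetical and chosen only so that every class matches something.

```python
import re

# Hypothetical sample: letters, underscore, digits, a space, a newline, punctuation
sample = 'ab_1 2\n?'

classes = [r'\d', r'\D', r'\w', r'\W', r'\s', r'\S']
# findall returns every single character matched by the class
results = {cls: re.findall(cls, sample) for cls in classes}

for cls, found in results.items():
    print(cls, found)
```

Note that the underscore shows up under \w and that the newline counts as whitespace.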

You can also create your own character classes by putting them inside brackets:

  • [aeiouAEIOU] matches vowels a or e or i etc.

By putting the caret ^ character in front of your class definition you can negate the meaning so that

  • [^aeiouAEIOU] matches all non-vowels b or c or d etc. So this would include all the consonants, numbers, and other symbols.
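A quick check of the vowel class and its negation, using an arbitrary sample string:

```python
import re

sample = 'Regex 101!'

vowels = re.findall(r'[aeiouAEIOU]', sample)
non_vowels = re.findall(r'[^aeiouAEIOU]', sample)

print(vowels)      # ['e', 'e']
print(non_vowels)  # ['R', 'g', 'x', ' ', '1', '0', '1', '!']
```

The negated class matches the space, the digits, and the exclamation mark as well as the consonants.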

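You will often see the class [0-9a-zA-Z], which is close to \w above: it matches a digit, a lowercase letter, or an uppercase letter, whereas \w additionally matches the underscore (and, for Python 3 text strings, non-ASCII letters and digits). The hyphen character - acts as a range indicator, so 0-9 means any of the digits 0, 1, 2, … , 9. Similarly, a-z means any of the lowercase characters a, b, … , z. Take care to write A-Z and not A-z: the range A-z also covers the symbols [, \, ], ^, _ and ` that sit between Z and a in the ASCII table.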
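The following sketch compares an explicit alphanumeric class with \w on a string containing an underscore, and shows what the range A-z (lowercase z) picks up. The sample strings are illustrative only.

```python
import re

sample = 'snake_case'

# The explicit class stops at the underscore ...
ascii_class = re.findall(r'[0-9a-zA-Z]+', sample)
# ... while \w treats the underscore as a word character
word_class = re.findall(r'\w+', sample)

# Pitfall: the range A-z spans the ASCII symbols between 'Z' and 'a'
typo_class = re.findall(r'[A-z]', '_^')

print(ascii_class)  # ['snake', 'case']
print(word_class)   # ['snake_case']
print(typo_class)   # ['_', '^']
```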

Punctuation symbols such as ., *, (, ), ^, $, |, ?, \, {, }, [, ], + have special meaning in the regular expression “language”. If the pattern that you are looking for contains any of these characters, you need to “escape” them with a backslash \ in order to match them literally. If you want to match the parentheses ( and ), for instance, you can define your regex pattern as

  • [\(\)] which matches an open parenthesis ( or a closed parenthesis )
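As a short illustration of escaping, the sketch below matches literal parentheses and a literal dot. It also uses re.escape, a standard-library helper that adds the backslashes for you; the sample strings are made up.

```python
import re

text = 'Call (412) or (312), not 412.'
# Escaped parentheses match the literal characters ( and )
area_codes = re.findall(r'\(\d{3}\)', text)

# An unescaped dot matches any character, an escaped dot only a literal dot
unescaped = re.findall(r'a.b', 'a.b axb')
escaped = re.findall(r'a\.b', 'a.b axb')
# re.escape builds the escaped pattern from a literal string
auto_escaped = re.findall(re.escape('a.b'), 'a.b axb')

print(area_codes)    # ['(412)', '(312)']
print(unescaped)     # ['a.b', 'axb']
print(escaped)       # ['a.b']
print(auto_escaped)  # ['a.b']
```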

The pattern definition for matching the phone number in the example above is therefore:

r'\d\d\d-\d\d\d-\d\d\d\d'

The r'some string' form is Python’s raw string notation, which indicates that all characters within the string literal are taken “literally.” In ordinary Python strings some characters have special meaning, especially the backslash \. If we didn’t mark the string as a raw string using the r prefix, we would have to “escape” every backslash inside the string to tell Python to pass it through literally to the pattern definition. Without the r prefix, this results in a very ugly pattern definition:

'\\d\\d\\d-\\d\\d\\d-\\d\\d\\d\\d'
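A brief check that the two spellings produce the same string, and of how a raw string treats \n:

```python
# Raw string and manually escaped string are the same sequence of characters
raw = r'\d\d\d-\d\d\d-\d\d\d\d'
escaped = '\\d\\d\\d-\\d\\d\\d-\\d\\d\\d\\d'
print(raw == escaped)  # True

# In a normal string '\n' is one newline character;
# in a raw string it is a backslash followed by the letter n
print(len('\n'), len(r'\n'))  # 1 2
```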

If certain character classes get repeated, we can use curly brackets to indicate how often a character gets repeated. For our phone number pattern this would look like:

r'\d{3}-\d{3}-\d{4}'

where the curly brackets indicate how many times the \d character is repeated in the pattern. Notice how we do not put the backslash escape character \ in front of the curly brackets. Curly brackets have meaning in regular expression language. Once you put the escape character in front of it, the regular expression interpreter thinks you are looking for a curly bracket in the text that you are analysing.
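The difference between curly brackets as a repetition count and escaped curly brackets as literal characters can be sketched as follows, using a made-up sample string:

```python
import re

sample = 'code 123 and literal 7{3}'

# Unescaped braces: "exactly three digits"
repeated = re.findall(r'\d{3}', sample)
# Escaped braces: one digit followed by the literal text {3}
literal = re.findall(r'\d\{3\}', sample)

print(repeated)  # ['123']
print(literal)   # ['7{3}']
```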

Now let’s combine everything. The following code block will extract the first instance of a phone number pattern match.

import re

# Assign your text info to a string variable
myTextString = """We need to call
    John (412-233-9876), James (312-323-7658)
    as well as Jimmy (450-123-1234) to make sure it gets all done."""

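# Compile the pattern, then search for the first match
myRegex = re.compile(r'\d{3}-\d{3}-\d{4}')
mySearchObject = myRegex.search(myTextString)
# group() returns the matched text as a string
myMatch = mySearchObject.group()
print(myMatch)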
412-233-9876

If you want to return all the phone numbers, you can use the findall method, which returns a list of all non-overlapping matches.

import re

# Assign your text info to a string variable
myTextString = """We need to call
    John (412-233-9876), James (312-323-7658)
    as well as Jimmy (450-123-1234) to make sure it gets all done."""


myRegex = re.compile(r'\d{3}-\d{3}-\d{4}')
myList = myRegex.findall(myTextString)
print(myList)
['412-233-9876', '312-323-7658', '450-123-1234']
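If you also need to know where each number sits in the text, the compiled pattern offers finditer, which yields one match object per hit instead of plain strings. A minimal sketch, with variable names chosen for illustration:

```python
import re

text = ('We need to call John (412-233-9876), James (312-323-7658) '
        'as well as Jimmy (450-123-1234) to make sure it gets all done.')
phone_regex = re.compile(r'\d{3}-\d{3}-\d{4}')

# finditer yields match objects, so group(), start() and end() are available
matches = list(phone_regex.finditer(text))
for m in matches:
    # text[m.start():m.end()] is exactly the matched substring
    print(m.group(), 'found at positions', m.start(), 'to', m.end())
```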

Note

There are additional ways to specify how often to repeat a character. Here’s a brief summary for the quantity of character classes:

Regex expression    Explanation
\d                  one digit
\d?                 zero or one digit
\d*                 zero or more digits
\d+                 one or more digits
\d{3}               exactly three digits
\d{3,5}             between 3 and 5 digits
\d{3,}              3 or more digits

Similarly you can do this for your own defined classes such as the vowel class above [aeiou], which results in:

Regex expression    Explanation
[aeiou]             one vowel
[aeiou]?            zero or one vowel
[aeiou]*            zero or more vowels
[aeiou]+            one or more vowels
[aeiou]{3}          exactly three vowels
[aeiou]{3,5}        between 3 and 5 vowels
[aeiou]{3,}         3 or more vowels
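One way to try out these repetition forms is to test which whole strings each pattern accepts. The sketch below uses re.fullmatch, which succeeds only if the entire string fits the pattern; the candidate strings are arbitrary.

```python
import re

candidates = ['', '7', '42', '123', '12345', '1234567']
patterns = [r'\d', r'\d?', r'\d*', r'\d+', r'\d{3}', r'\d{3,5}', r'\d{3,}']

# For each pattern, keep the candidates that it matches from start to end
accepted = {p: [c for c in candidates if re.fullmatch(p, c)] for p in patterns}

for p, strings in accepted.items():
    print(p.ljust(8), strings)
```

For example, \d{3,5} accepts '123' and '12345' but rejects '42' (too short) and '1234567' (too long).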

15.3. Example 2: Matching Email Addresses

We next want to extract email addresses from a text string. Typically the pattern of an email address is: something@somethingelse.extension. We can try to come up with a pattern for this. A good way to “build” regular expressions like these is to use tools like Regexr which is a website-tool where you see how your regular expression matches certain patterns in real time. In addition, if you hover over the regular expression a pop-out appears that describes what the regular expression does.

import re

myTextString = """We need to call
    John (foo@demo.net), James (bar.ba@test.co.au)
    as well as Jimmy (jjing@towson.edu) and Charles
    (ch.ch.43-1_20@towson.students.edu) to make sure it gets all done."""

myRegex = re.compile(r'[\w_\-\.]+@[\w_\-\.]+\.[a-zA-Z]{2,5}')
myList = myRegex.findall(myTextString)
print(myList)
['foo@demo.net', 'bar.ba@test.co.au', 'jjing@towson.edu',
'ch.ch.43-1_20@towson.students.edu']

Or, printed one address per line:

for i,email in enumerate(myList):
    print('{} Email: {}'.format(i+1, email))
1 Email: foo@demo.net
2 Email: bar.ba@test.co.au
3 Email: jjing@towson.edu
4 Email: ch.ch.43-1_20@towson.students.edu
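Two remarks on the email pattern, sketched below under the assumption that the same sample text is used. First, \w already includes the underscore, and a dot or a trailing hyphen inside brackets needs no backslash, so a shorter pattern gives the same result here. Second, the pattern is a simplification: the {2,5} limit cuts off longer endings, as the made-up address info@example.museum shows.

```python
import re

text = """We need to call
    John (foo@demo.net), James (bar.ba@test.co.au)
    as well as Jimmy (jjing@towson.edu) and Charles
    (ch.ch.43-1_20@towson.students.edu) to make sure it gets all done."""

original = re.compile(r'[\w_\-\.]+@[\w_\-\.]+\.[a-zA-Z]{2,5}')
# Same classes without the redundant underscore and escapes
simplified = re.compile(r'[\w.-]+@[\w.-]+\.[a-zA-Z]{2,5}')

same_result = original.findall(text) == simplified.findall(text)
print(same_result)  # True

# Limitation: an ending longer than five letters is truncated
truncated = simplified.findall('Write to info@example.museum today')
print(truncated)  # ['info@example.museu']
```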

15.4. Tutorials

Some of the notes above are summaries of these tutorials:

15.5. References

Sweigart2015

Sweigart, Al. “Automate the Boring Stuff with Python.” No Starch Press, 2015.