16 Regular Expressions
Regular expressions (regex for short) are a pattern-matching "sublanguage" that allows you to filter content from data or text that follows a specific pattern. For a quick introduction, have a look at Al Sweigart's PyCon 2017 presentation or his book "Automate the Boring Stuff with Python" [Sweigart2015].
16.1 First Steps
We first import the powerful Python regular expressions library:
import re
When working with Regex remember that we usually use three commands from the Regex library:
compile
search
group
The first command defines the regex pattern that we want to use. The second applies the regex pattern and searches a string for it. The last command returns the text that was matched, as a string, for further processing.
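In Python this is implemented as follows:
import re
# 'regex pattern' and the text below are placeholders for your own pattern and data
myRegex = re.compile('regex pattern')
mySearchObject = myRegex.search('Textstring that contains the pattern ...')
# group() returns the matched text as a string
myList = mySearchObject.group()
print(myList)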
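One caveat with the three-step recipe: search returns None when the pattern is not found, and calling group() on None raises an AttributeError. The following is a minimal sketch of a guarded version. The helper name first_match and the sample strings are hypothetical, chosen only for illustration.

```python
import re

def first_match(pattern, text):
    """Return the first substring of text that matches pattern, or None."""
    myRegex = re.compile(pattern)           # step 1: define the pattern
    mySearchObject = myRegex.search(text)   # step 2: search the string
    if mySearchObject is None:              # search found nothing
        return None
    return mySearchObject.group()           # step 3: the matched text (a string)

print(first_match(r'\d+', 'Room 101 is upstairs'))  # 101
print(first_match(r'\d+', 'No digits here'))        # None
```

Checking for None before calling group() keeps the program from crashing on input that does not contain the pattern.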
16.2 Example 1: Matching Phone Numbers
Let's try a first example and extract all the phone numbers from a text-string. The text string is as follows:
We need to call John (412-233-9876), James (312-323-7658) as well as Jimmy (450-123-1234) to make sure it gets all done.
You could try to run a loop over this string and use some of the string manipulation commands from the earlier chapters to extract the three phone numbers from this text. However, that would be somewhat cumbersome. Since the phone numbers follow a very specific pattern (3 digits, followed by a dash, followed by 3 digits, followed by a dash, followed by 4 digits), regular expressions are a perfect vehicle for content extraction.
We first need to define the pattern of what we are looking for in regular expression syntax. You can think of this as a separate sub-language within Python. Regular expressions can be used across different programming languages. So what you learn here about regular expressions in the Python context will be applicable in a very similar fashion in all other programming languages that support regular expressions like Java, C or Ruby etc.
Regular expressions are not a fully specified programming language, as they lack some of the branching features of real programming languages like Python, C, or Java.
Our task is to match a phone number like 412-233-9876. The pattern of this is ddd-ddd-dddd, which means a digit, followed by another digit, followed by a third digit, followed by a dash, followed by a digit, etc. In regex we have so-called character classes for this, where:
- \d: Digit character (i.e., a number)
- \w: Word character (i.e., letters, numbers, and the underscore)
- \s: Space character (i.e., space, tab, \n)
as well as their logical negatives:
- \D: Non-Digit
- \W: Non-Word
- \S: Non-Space
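As a small illustration of these classes, the sketch below applies a few of them to a made-up sample string (the string and variable names are assumptions, not part of the original example). Each call to re.findall returns every single character that belongs to the given class.

```python
import re

# Made-up sample; '\t' in a normal (non-raw) string is a real tab character
sample = 'Call 412 now_ok!\tBye'

digits = re.findall(r'\d', sample)     # every digit character
spaces = re.findall(r'\s', sample)     # every whitespace character
nonWords = re.findall(r'\W', sample)   # everything that is not a word character

print(digits)    # ['4', '1', '2']
print(spaces)    # [' ', ' ', '\t']
print(nonWords)  # [' ', ' ', '!', '\t']
```

Note that the underscore in now_ok does not show up in the \W result, because Python counts it as a word character.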
You can also create your own character classes by putting them inside brackets:
- [aeiouAEIOU] matches vowels: a or e or i etc.
By putting the caret ^ character in front of your class definition you can negate the meaning so that
- [^aeiouAEIOU] matches all non-vowels: b or c or d etc. So this would include all the consonants, numbers, and other symbols.
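The vowel class and its negation can be tried out on a short string. This is only a sketch with an assumed sample value; it shows that the negated class also picks up the space and the digits, not just consonants.

```python
import re

sample = 'Regex 101'  # made-up sample text

vowels = re.findall(r'[aeiouAEIOU]', sample)      # characters inside the class
nonVowels = re.findall(r'[^aeiouAEIOU]', sample)  # everything outside the class

print(vowels)     # ['e', 'e']
print(nonVowels)  # ['R', 'g', 'x', ' ', '1', '0', '1']
```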
You will often see the group [0-9a-zA-Z], which is almost the same as \w above: it matches either a digit or a lowercase letter or an uppercase letter (\w additionally matches the underscore _). The hyphen character - acts as a range indicator, so 0-9 means any of the digits 0, 1, 2, ... , 9. Similarly, a-z means any of the lowercase characters a, b, ... , z.
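A quick way to see how ranges behave is to compare [0-9a-zA-Z] with \w on a string that contains an underscore. In Python's re module, \w also matches the underscore, so the two are close but not identical. The sample string below is an assumption made up for this sketch, and the + used here means "one or more".

```python
import re

sample = 'user_name42'  # made-up sample text

rangeMatches = re.findall(r'[0-9a-zA-Z]+', sample)  # stops at the underscore
wordMatches = re.findall(r'\w+', sample)            # underscore is included
digitRuns = re.findall(r'[0-9]+', sample)           # only the digit range

print(rangeMatches)  # ['user', 'name42']
print(wordMatches)   # ['user_name42']
print(digitRuns)     # ['42']
```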
Punctuation symbols such as . * ( ) ^ $ | ? \ { } [ ] + have a special meaning in the regular expression "language". If the pattern that you are looking for contains some of these characters, you need to "escape" them first with a backslash in order to match them literally. If you want to match the parentheses ( and ), for instance, you can define your regex pattern as
- [\(\)] which matches an open parenthesis ( or a closed parenthesis )
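The effect of escaping is easiest to see side by side. The sketch below uses made-up sample strings; it also uses re.escape, a standard library helper not mentioned above, which adds the backslashes for you when you want to match a piece of text literally.

```python
import re

sample = '5.00 5x00'  # made-up sample text

# Unescaped, the dot matches any character, so both items are found
unescaped = re.findall(r'5.00', sample)
# Escaped, the dot only matches a literal dot
escaped = re.findall(r'5\.00', sample)
print(unescaped)  # ['5.00', '5x00']
print(escaped)    # ['5.00']

# The bracket pattern from the text matches either parenthesis
parens = re.findall(r'[\(\)]', 'John (412-233-9876)')
print(parens)     # ['(', ')']

# re.escape builds an escaped pattern from literal text
literalPattern = re.escape('(412)')
found = re.search(literalPattern, 'Area code (412) only').group()
print(found)      # (412)
```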
The pattern definition for matching the phone number in the example above is therefore:
r'\d\d\d-\d\d\d-\d\d\d\d'
The r'some string' is the raw string notation in Python, which indicates that all the characters within the string definition are taken "literally." In Python some characters have special meaning, especially the \ backslash character. If we didn't indicate the string as a raw string using the r prefix, we'd have to "escape" all the backslashes that we use inside the string to tell Python that it needs to use them literally in the pattern definition. This would result in a very ugly pattern definition (without the r prefix) of:
'\\d\\d\\d-\\d\\d\\d-\\d\\d\\d\\d'
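To check that the two notations really describe the same pattern, you can compare the strings directly. This is a small sketch; the variable names are made up for illustration.

```python
# The raw string and the manually escaped string hold the same characters
rawPattern = r'\d\d\d-\d\d\d-\d\d\d\d'
escapedPattern = '\\d\\d\\d-\\d\\d\\d-\\d\\d\\d\\d'
print(rawPattern == escapedPattern)  # True

# Why it matters: in a normal string '\n' is one newline character,
# while in a raw string it stays a backslash followed by the letter n
print(len('\n'), len(r'\n'))  # 1 2
```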
If certain character classes get repeated, we can use curly brackets to indicate how often a character gets repeated. For our phone number pattern this would look like:
r'\d{3}-\d{3}-\d{4}'
where the curly brackets indicate how many times the \d character class is repeated in the pattern. Notice how we do not put the backslash escape character \ in front of the curly brackets. Curly brackets have meaning in regular expression language. Once you put the escape character in front of them, the regular expression interpreter thinks you are looking for a literal curly bracket in the text that you are analysing.
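The sketch below checks both points: the long and the short pattern accept the same phone number, and escaped curly brackets are treated as literal characters. It uses fullmatch, which succeeds only if the whole string fits the pattern; the sample strings are assumptions made up for this illustration.

```python
import re

number = '412-233-9876'

longForm = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
shortForm = re.compile(r'\d{3}-\d{3}-\d{4}')

# Both patterns describe the same thing, so both match the whole number
longOk = longForm.fullmatch(number) is not None
shortOk = shortForm.fullmatch(number) is not None
print(longOk, shortOk)  # True True

# With a backslash in front, the curly brackets are matched literally
literalBraces = re.findall(r'\{\d\}', 'a{3} b{4}')
print(literalBraces)    # ['{3}', '{4}']
```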
Now let's combine everything. The following code block will extract the first instance of a phone number pattern match.
import re
# Assign your text info to a string variable
myTextString = """We need to call
John (412-233-9876), James (312-323-7658)
as well as Jimmy (450-123-1234) to make sure it gets all done."""
# Define the pattern, search for the first match, and extract the matched text
myRegex = re.compile(r'\d{3}-\d{3}-\d{4}')
mySearchObject = myRegex.search(myTextString)
myList = mySearchObject.group()
print(myList)
412-233-9876
# Load the stringr library
library(stringr)
# Assign your text info to a string variable
myTextString <- "We need to call John (412-233-9876), James (312-323-7658) as well as Jimmy (450-123-1234) to make sure it gets all done."
# Define the regular expression pattern
myRegexPattern <- "\\d{3}-\\d{3}-\\d{4}"
# Use str_extract to find and extract the first match
myList <- str_extract(myTextString, myRegexPattern)
# Print the result
cat(myList, "\n")
412-233-9876
If you want to return all the phone numbers you can use the findall method.
import re
# Assign your text info to a string variable
myTextString = """We need to call
John (412-233-9876), James (312-323-7658)
as well as Jimmy (450-123-1234) to make sure it gets all done."""
# findall returns a list with every non-overlapping match
myRegex = re.compile(r'\d{3}-\d{3}-\d{4}')
myList = myRegex.findall(myTextString)
print(myList)
['412-233-9876', '312-323-7658', '450-123-1234']
# Load the stringr library
library(stringr)
# Assign your text info to a string variable
myTextString <- "We need to call John (412-233-9876), James (312-323-7658) as well as Jimmy (450-123-1234) to make sure it gets all done."
# Define the regular expression pattern
myRegexPattern <- "\\d{3}-\\d{3}-\\d{4}"
# Use str_extract_all to find and extract all matches
myList <- str_extract_all(myTextString, myRegexPattern)[[1]]
# Print the result
cat(myList, "\n")
412-233-9876 312-323-7658 450-123-1234
There are additional ways to specify how often to repeat a character. Here's a brief summary for the quantity of character classes:
regex expression | Explanation |
---|---|
\d | one digit |
\d? | zero or one digit |
\d* | zero or more digits |
\d+ | one or more digits |
\d{3} | exactly three digits |
\d{3,5} | between 3 and 5 digits |
\d{3,} | 3 or more digits |
Similarly you can do this for your own defined classes such as the vowel class above [aeiou], which results in:
regex expression | Explanation |
---|---|
[aeiou] | one vowel |
[aeiou]? | zero or one vowel |
[aeiou]* | zero or more vowels |
[aeiou]+ | one or more vowels |
[aeiou]{3} | exactly three vowels |
[aeiou]{3,5} | between 3 and 5 vowels |
[aeiou]{3,} | 3 or more vowels |
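The quantifiers from the tables can be compared by testing each one against a handful of strings. The sketch below uses fullmatch, so a string only counts if the pattern covers it completely; the list of test strings is an assumption made up for this illustration.

```python
import re

# Candidate strings; the empty string shows the "zero repetitions" case
tokens = ['', '7', '42', '123', '4567', '890123']
patterns = [r'\d?', r'\d*', r'\d+', r'\d{3}', r'\d{3,5}', r'\d{3,}']

results = {}
for pattern in patterns:
    # Keep the tokens that the pattern matches from start to end
    results[pattern] = [t for t in tokens if re.fullmatch(pattern, t)]
    print(pattern, results[pattern])
```

For example, \d{3,5} keeps only '123' and '4567', while \d{3,} also keeps the six-digit string.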
16.3 Example 2: Matching Email Addresses
We next want to extract email addresses from a text string. Typically the pattern of an email address is: something@somethingelse.extension. We can try to come up with a pattern for this. A good way to "build" regular expressions like these is to use a tool like RegExr, a website where you can see how your regular expression matches certain patterns in real time. In addition, if you hover over the regular expression, a pop-out appears that describes what the regular expression does.
import re
# Assign your text info to a string variable
myTextString = """We need to call
John (foo@demo.net), James (bar.ba@test.co.au)
as well as Jimmy (jjing@towson.edu) and Charles
(ch.ch.43-1_20@towson.students.edu) to make sure it gets all done."""
# Name part, @, domain part, a dot, and an extension of 2 to 5 letters
myRegex = re.compile(r'[\w_\-\.]+@[\w_\-\.]+\.[a-zA-Z]{2,5}')
myList = myRegex.findall(myTextString)
print(myList)
['foo@demo.net', 'bar.ba@test.co.au', 'jjing@towson.edu', 'ch.ch.43-1_20@towson.students.edu']
This R code uses the stringr library and the str_extract_all function to extract email addresses based on the specified regular expression pattern. It then prints the extracted email addresses in a readable format.
# Load the stringr library
library(stringr)
# Assign your text info to a string variable
myTextString <- "We need to call John (foo@demo.net), James (bar.ba@test.co.au) as well as Jimmy (jjing@towson.edu) and Charles (ch.ch.43-1_20@towson.students.edu) to make sure it gets all done."
# Define the regular expression pattern for email addresses
myRegexPattern <- "[\\w_\\-\\.]+@[\\w_\\-\\.]+\\.[a-zA-Z]{2,5}"
# Use str_extract_all to find and extract all matches
myList <- str_extract_all(myTextString, myRegexPattern)[[1]]
# Print the result
cat(myList, "\n")
Or, in Python, print the list in a prettier format:
for i, email in enumerate(myList):
    print('{} Email: {}'.format(i+1, email))
1 Email: foo@demo.net
2 Email: bar.ba@test.co.au
3 Email: jjing@towson.edu
4 Email: ch.ch.43-1_20@towson.students.edu
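The email pattern used above is a practical approximation, not a full validator. The sketch below probes two of its limits with made-up addresses (both are hypothetical test inputs): an extension longer than five letters gets cut off, and consecutive dots are accepted.

```python
import re

emailRegex = re.compile(r'[\w_\-\.]+@[\w_\-\.]+\.[a-zA-Z]{2,5}')

# The extension is limited to 5 letters, so 'museum' is truncated
truncated = emailRegex.findall('write to info@site.museum')
print(truncated)  # ['info@site.museu']

# The bracket classes allow repeated dots, so this odd address is accepted
oddDots = emailRegex.findall('a..b@x..com')
print(oddDots)    # ['a..b@x..com']
```

For quick extraction from known text this is usually fine, but the results should be checked before they are treated as valid addresses.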
16.4 Tutorials
Some of the notes above are summaries of these tutorials:
16.5 References
- [Sweigart2015] Sweigart, Al. "Automate the Boring Stuff with Python." No Starch Press, 2015.
16.6 Key Concepts and Summary
- Regular expressions
- Pattern matching
16.7 Self-Check Questions
- Write up a regular expression that can match a US phone number
- Write up a regular expression that can match a US social security number