Using the Python Regular Expressions Module – import re

8 years ago

Imagine, if you will, a string, "I'm a fighter, I'm a poet, I'm a preacher". And imagine that you wanted to check if a word, "poet", was in that string.

This is not terribly difficult to do with Python's built-in string methods.

TheString = "I'm a fighter, I'm a poet, I'm a preacher"
TheWord = "poet"
TheLoc = TheString.find(TheWord)
if TheLoc != -1: #make sure TheWord was found in TheString
    print(TheString[TheLoc:TheLoc+len(TheWord)], 'begins at', TheLoc)

This prints poet begins at 21, the index where the word poet starts inside the string. There are other ways to get a similar result, such as testing TheWord in TheString if you don’t need to know the position in the string. There are probably even cleverer ways to do it with combinations of string methods.

Now imagine you wanted to get the 6th word in the string.

print(TheString.split()[5]) prints "poet," with the trailing comma included. split() separates the string into words using whitespace as the delimiter and [5] selects the 6th word in the resulting list. You could slice the result to take that comma off by using TheString.split()[5][0:-1].

So far the examples have been doable with string methods. Let’s try a more complicated example. Find all the numbers in the string "genius is 1 percent inspiration and 99 percent perspiration". This would be tricky with the built-in string methods, as you’d have to search for a digit in the string, and then check the characters after the starting digit to see if those characters were digits until you found the end of the number. You could code this with a couple of loops while keeping track of starting and ending positions. What if I told you there was an easier way?
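Before moving on, here is roughly what that loop-based approach could look like. This is only a sketch of one way to do it: the function name find_numbers_manually is made up for illustration, and it happens to need a single loop plus one variable that remembers where the current number started.

```python
def find_numbers_manually(text):
    """Collect runs of consecutive digits using only string methods."""
    numbers = []
    start = None  # index where the current run of digits began, if any
    for index, char in enumerate(text):
        if char.isdigit():
            if start is None:
                start = index  # first digit of a new number
        elif start is not None:
            numbers.append(text[start:index])  # the run of digits just ended
            start = None
    if start is not None:  # the string ended while still inside a number
        numbers.append(text[start:])
    return numbers

TheString = "genius is 1 percent inspiration and 99 percent perspiration"
TheNumbers = find_numbers_manually(TheString)
print(TheNumbers)
```

It works, but that is a fair amount of bookkeeping for such a small task.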

Text searching with regular expressions

A regular expression is a search pattern that is applied to a string to find sequences within the string. A simple, boring and arguably useless regular expression is "and". That exact sequence of characters would be found in the example string "genius is 1 percent inspiration and 99 percent perspiration" because the character sequence "and" appears in it. So far that’s the same as what you could do with string methods: there’s no pattern in "and", just fixed characters.

A more useful regular expression is ".", which is a pattern that matches any single character (except for a newline, although the re.DOTALL flag makes it match newlines too). The "." will match digits too. Look at the regular expression ".s". In the example string it matches "us", "is", "ns" and "rs", and it would match "3s" as well if the string contained it.

Brackets, [], match a set of characters. [abc123] will match an a, b, c, 1, 2 or 3. Brackets can also be used to match ranges, like [a-c1-3p-r], which matches 16 times in the example string "genius is 1 percent inspiration and 99 percent perspiration".

"+" matches 1 or more instances of the preceding regular expression. "is+" will match "is" (both instances) and "iss" in the string "i said the issue is genius is 1 percent inspiration and 99 percent perspiration 3s". "*" matches 0 or more instances of the preceding regular expression. Therefore, in that same string, "is*" matches both instances of "is", the "iss", and every single "i" without a trailing "s", since the "*" allows 0 instances of "s", leaving just the "i".
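If you want to check these patterns for yourself, the following sketch runs each one through re.findall(), which is covered in more detail later. The variable names (LongerString, DotMatches and so on) are only for illustration.

```python
import re

TheString = "genius is 1 percent inspiration and 99 percent perspiration"
LongerString = "i said the issue is " + TheString + " 3s"

DotMatches = re.findall(r".s", TheString)            # any character followed by "s"
SetMatches = re.findall(r"[a-c1-3p-r]", TheString)   # one character from the set
PlusMatches = re.findall(r"is+", LongerString)       # "i" followed by one or more "s"
StarMatches = re.findall(r"is*", LongerString)       # "i" followed by zero or more "s"

print(DotMatches)
print(len(SetMatches))           # how many single characters fell in the ranges
print(PlusMatches)
print(sorted(set(StarMatches)))  # the distinct things "is*" matched
```

Printing the distinct matches for "is*" makes it easy to see that a lone "i" counts as a match too.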

A more complicated regular expression example

Now you can solve a more complicated problem. Find all the numbers in the string "genius is 1 percent inspiration and 99 percent perspiration 3s". You could look for digits using the range [0-9], but there is another way to represent a digit: "\d" (backslash-d). Let’s go with " \d+ " for our regular expression. "\d" matches a digit, and the "+" matches one or more of the preceding expression, the digit in this case. Finally, the enclosing spaces constrain the digits to being standalone and not directly attached to other characters. This regular expression will match the "1" and "99" in the string, but not the "3" in "3s".

Using what we learned in Python

The regular expression module function to do a search is re.search(pattern, string, flags=0). re.search() returns a match object, or None if the pattern isn’t found. Here’s a snippet of how to use it.


import re
TheString = "genius is 1 percent inspiration and 99 percent perspiration 3s"
TheMatch = re.search(r" \d+ ", TheString) #do the regex search
print(TheMatch.group().strip()) #print the result

This prints "1". Take a look at the argument passed for pattern. It’s r" \d+ ". That r before the string makes it a raw string, so Python leaves the backslash alone instead of treating it as the start of an escape sequence. If you didn’t use a raw string you’d need to escape the backslash in the backslash-d with another backslash so it would look like " \\d+ ". Escaping backslashes can get even more complicated if you are looking for an actual backslash character, so it’s much easier to use and understand if you use raw strings. The Python docs have a good HOWTO on the backslash plague. Now you can try the following.
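As a quick aside before that, here is a small sketch comparing the raw and escaped spellings of the pattern. RawPattern and EscapedPattern are names chosen only for this illustration.

```python
import re

RawPattern = r" \d+ "       # raw string: the backslash is kept as typed
EscapedPattern = " \\d+ "   # ordinary string: the backslash has to be doubled

print(RawPattern == EscapedPattern)  # both spell the same five characters
print(len(RawPattern))

TheString = "genius is 1 percent inspiration and 99 percent perspiration 3s"
TheEscapedMatch = re.search(EscapedPattern, TheString)
print(TheEscapedMatch.group().strip())
```

Both spellings hand the same pattern to the re module, so the choice is purely about readability. Back in the main example, the next snippet swaps re.search() for re.match().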

TheMatch = re.match(r" \d+ ", TheString) #do the regex match
print(TheMatch) #print the result

This prints None. What happened? This illustrates the difference between re.search() and re.match(). re.search() looks for a match anywhere in the string, while re.match() only looks at the beginning of the string.
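Because a failed search or match gives you None rather than a match object, calling .group() on the result would raise an AttributeError. A minimal sketch of one way to guard against that follows; the helper name describe is hypothetical.

```python
import re

TheString = "genius is 1 percent inspiration and 99 percent perspiration 3s"

def describe(TheMatch):
    """Return the matched text, or a placeholder when nothing matched."""
    if TheMatch is None:
        return "no match"
    return TheMatch.group().strip()

SearchResult = describe(re.search(r" \d+ ", TheString))  # looks anywhere in the string
MatchResult = describe(re.match(r" \d+ ", TheString))    # looks only at the start
print("search:", SearchResult)
print("match:", MatchResult)
```

When the string does begin with something the pattern can match, re.match() succeeds, as the next snippet shows.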


TheString = "60 percent of the time, it works every time"
TheMatch = re.match(r"\d+ ", TheString) #do the regex match
print(TheMatch.group().strip()) #print the result

And this prints "60", as you’d expect. So far you’ve only found one match, but you wanted to find all the numbers in the string. Use re.findall() or re.finditer() to find all the matches. Let’s go through an re.findall() example.

TheString = "genius is 1 percent inspiration and 99 percent perspiration 3s"
TheMatches = re.findall(r" \d+ ", TheString) #do the regex search and return a list of strings, not a match object
for TheMatch in TheMatches:
    print(TheMatch.strip()) #print the results without the surrounding spaces

And this prints

1
99

re.finditer() is similar but returns an iterator of match objects instead of a list of strings.
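Since each item from re.finditer() is a match object, you can ask it where the match was found as well as what it matched. Here is a short sketch; ThePositions is just an illustrative name for collecting the results.

```python
import re

TheString = "genius is 1 percent inspiration and 99 percent perspiration 3s"

ThePositions = []
for TheMatch in re.finditer(r" \d+ ", TheString):
    TheNumber = TheMatch.group().strip()
    # start() and end() give the slice of TheString that matched,
    # including the spaces on either side of the number
    ThePositions.append((TheNumber, TheMatch.start(), TheMatch.end()))
    print(TheNumber, "found in slice", TheMatch.start(), "to", TheMatch.end())
```

An iterator also avoids building the whole list up front, which can matter for very long strings.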

Regular expression compilation

Up to now you’ve seen how to call the module search functions directly, passing in the regular expression pattern as an argument each time. You could store the regular expression pattern in a string, say by using ThePattern = r" \d+ ". A better way is to “compile” the pattern into a regular expression object, or pattern object, using the re.compile(pattern, flags=0) module function, and then call search() on the pattern object.

ThePattern = re.compile(r" \d+ ")
TheMatch = ThePattern.search(TheString)
print(TheMatch.group().strip()) #print the result

Compiling the regular expression and assigning it to a variable keeps the pattern object around for reuse, which may speed up your program slightly. If you call the module functions directly, Python creates and caches a pattern object for you behind the scenes, even though you can’t reference that pattern object yourself. If you wonder which way to code, look at this quote from the official documentation.

Should you use these module-level functions, or should you get the pattern and call its methods yourself? If you’re accessing a regex within a loop, pre-compiling it will save a few function calls. Outside of loops, there’s not much difference thanks to the internal cache.
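To make the loop case concrete, here is a small sketch that compiles a pattern once and reuses it for several strings. The list TheLines is made-up sample data, and the pattern drops the surrounding spaces so that a number at the very start of a line is found as well.

```python
import re

ThePattern = re.compile(r"\d+")  # compiled once, outside the loop

TheLines = [  # hypothetical sample data
    "genius is 1 percent inspiration and 99 percent perspiration",
    "60 percent of the time, it works every time",
    "no numbers here",
]

TheResults = []
for TheLine in TheLines:
    # pattern objects have their own search(), match(), findall() and finditer()
    TheResults.append(ThePattern.findall(TheLine))
print(TheResults)
```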

More regular expression resources

That demonstrates the power of regular expressions. If you want to learn more about using regular expressions in Python, start with the regular expression HOWTO which gives a "gentler introduction" to the re module. Then read through regular expressions at the Python module of the week. And for all the gory details read the re library documentation.

To test how regular expressions work try out this regex "calculator" at regular expressions 101. For regex tutorials in Python try Google's course and their baby names exercise.