Regex in Python (part 1)
Search for a string in text
Suppose we have the text the quick brown fox jumped over the lazy dog and want to search it for a word, e.g. quick.
import re
text = "the quick brown fox jumped over the lazy dog"
match = re.search("quick", text)
According to the re.search() documentation, the function scans the string for the first location where the pattern matches and returns a re.Match object; if nothing matches, it returns None.
If we print(match), we’ll see <re.Match object; span=(4, 9), match='quick'>, which indicates that the match starts at index 4 and ends just before index 9 (the end index is exclusive).
To get the matched text held by the re.Match object, call .group():
match.group()
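Because re.search() returns None when nothing matches, calling .group() directly on the result can raise an AttributeError. The following is a minimal sketch of a guarded lookup; the helper name find_first is hypothetical and introduced here only for illustration.

```python
import re

text = "the quick brown fox jumped over the lazy dog"


def find_first(pattern, string):
    """Return (matched text, start, end) for the first match, or None."""
    match = re.search(pattern, string)
    if match is None:
        # No match: there is no Match object to call .group() on.
        return None
    return match.group(), match.start(), match.end()


print(find_first("quick", text))  # ('quick', 4, 9)
print(find_first("cat", text))    # None
```

The start and end values are the same pair shown in the span of the printed re.Match object, so text[4:9] gives back 'quick'.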
Find characters by type
Assume we’re now working with a slightly different piece of text from the example above:
import re
text = "the quick brown fox jumped over the lazy dog 1234567890 !@#$%^&*()_"
Find alphanumeric characters
To find all word characters (letters, digits, and the underscore), use the pattern \w.
characters = re.findall(r"\w", text)
Printing characters gives every word character in the text as a list of single-character strings. Spaces and the symbols !@#$%^&*() are not returned because they are not word characters; the underscore _ is, so it is included.
['t','h','e','q','u','i','c','k','b','r','o','w','n','f','o','x','j','u','m','p','e','d','o','v','e','r','t','h','e','l','a','z','y','d','o','g','1','2','3','4','5','6','7','8','9','0','_']
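For ASCII-only text like this one, \w should behave the same as the explicit character class [A-Za-z0-9_]. The sketch below checks that on the example text; it also writes the pattern as a raw string (r"..."), so Python passes the backslash through to the regex engine unchanged.

```python
import re

text = "the quick brown fox jumped over the lazy dog 1234567890 !@#$%^&*()_"

# A raw string keeps the backslash intact for the regex engine.
word_chars = re.findall(r"\w", text)

# For ASCII-only input, \w matches the same characters as this explicit class.
explicit_class = re.findall(r"[A-Za-z0-9_]", text)

print(word_chars == explicit_class)  # True
print(len(word_chars))               # 47 (36 letters, 10 digits, 1 underscore)
```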
Find any characters
To find any character, whether it is a word character or not, use the pattern . (a single dot). By default the dot matches everything except a newline.
any_characters = re.findall(".", text)
Note that the result now also contains the spaces ' ' and the symbols:
['t','h','e',' ','q','u','i','c','k',' ','b','r','o','w','n',' ','f','o','x',' ','j','u','m','p','e','d',' ','o','v','e','r',' ','t','h','e',' ','l','a','z','y',' ','d','o','g',' ','1','2','3','4','5','6','7','8','9','0',' ','!','@','#','$','%','^','&','*','(',')','_']
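One caveat is that the dot skips newline characters unless the re.DOTALL flag is passed. The example text has no newlines, so the sketch below uses a short two-line string (two_lines, made up for this illustration) to show the difference.

```python
import re

# Hypothetical two-line input, used only to demonstrate newline handling.
two_lines = "fox\ndog"

# By default "." matches every character except the newline.
default_result = re.findall(r".", two_lines)
print(default_result)
# ['f', 'o', 'x', 'd', 'o', 'g']

# With re.DOTALL the newline is matched as well.
dotall_result = re.findall(r".", two_lines, flags=re.DOTALL)
print(dotall_result)
# ['f', 'o', 'x', '\n', 'd', 'o', 'g']
```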
Find non-word characters
The opposite of \w is \W (uppercase), which matches every non-word character:
non_word_characters = re.findall(r"\W", text)
The result now contains only spaces and symbols; the underscore is absent because it counts as a word character.
[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', '!', '@', '#', '$', '%', '^', '&', '*', '(', ')']
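Since \W is the complement of \w, every character in the text should land in exactly one of the two result lists. A small sketch that checks this on the example text:

```python
import re

text = "the quick brown fox jumped over the lazy dog 1234567890 !@#$%^&*()_"

word_chars = re.findall(r"\w", text)
non_word_chars = re.findall(r"\W", text)

# Each character is either a word character or a non-word character, never both,
# so the two lists together account for the whole string.
print(len(word_chars) + len(non_word_chars) == len(text))  # True
print(len(non_word_chars))  # 20 (10 spaces and 10 symbols)
```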