Providing a couple of regex examples for commonly encountered situations
This article assumes the reader is already familiar with frequently used regex symbols such as . and *. If you are not, you may refer to this very detailed guide:
https://www3.ntu.edu.sg/home/ehchua/programming/howto/Regexe.html
I will be using the sample text below for all the examples in this article. However, the code in all the examples can be modified according to your specific text.
Code to read in sample text as a txt file
import re
#Use forward slash in file path
file='C:/Users/sample_text.txt'
with open(file, 'r') as a:
    #readlines will read each line into an element in a list
    file_lines=a.readlines()
sample_text="".join(file_lines) # join the lines back into a single string
#Alternatively read in directly as a variable
sample_text= 'start\nstring\nsecond string\nend\n\nAfter months, the profit is 300**'
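As a quick check on the reading step, here is a minimal sketch showing that joining the output of readlines() gives the same string as a single read() call. The temporary folder and the names path, joined and whole are only for illustration and are not part of the original code.

```python
import os
import tempfile

sample_text = 'start\nstring\nsecond string\nend\n\nAfter months, the profit is 300**'

# Write the sample text to a throwaway file so the example runs anywhere
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'sample_text.txt')
    with open(path, 'w') as handle:
        handle.write(sample_text)

    # Approach used above: read a list of lines, then join them
    with open(path, 'r') as handle:
        joined = "".join(handle.readlines())

    # Shorter alternative: read the whole file as one string
    with open(path, 'r') as handle:
        whole = handle.read()

print(joined == whole)
```

Either approach should work for the examples that follow, since they all operate on one string.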
Example 1: Extracting text in the same line after a specific text (‘second’)
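1. Create a pattern with “second”
sample_regex=re.compile(r'(?<=second).*')
(?<=xxx): a lookbehind; it matches only where xxx comes immediately before the characters to be extracted
.* : the rest of the line (by default, . does not match a newline)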
2. Look for the pattern in sample_text
extracted=sample_regex.search(sample_text)
3. Remove the leading/trailing spaces
extracted2=extracted.group().strip()
print(extracted2) #return string
Complete Code
sample_regex=re.compile(r'(?<=second).*')
extracted=sample_regex.search(sample_text)
extracted2=extracted.group().strip()
print(extracted2)
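One thing to watch for: search() returns None when the text is not found, and calling .group() on None raises an error. Below is a minimal sketch of the same lookbehind wrapped in a helper that handles this case. The function name text_after is my own and is only for illustration.

```python
import re

sample_text = 'start\nstring\nsecond string\nend\n\nAfter months, the profit is 300**'

def text_after(keyword, text):
    """Return the rest of the line after keyword, or None if keyword is absent."""
    # re.escape keeps any special characters in keyword literal
    pattern = re.compile(r'(?<=' + re.escape(keyword) + r').*')
    match = pattern.search(text)
    if match is None:
        return None
    return match.group().strip()

found = text_after('second', sample_text)
missing = text_after('third', sample_text)
print(found)    # string
print(missing)  # None
```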
Example 2: Extracting text in the same line after a specific text with variation ('After')
1. Create a pattern with “After months” or “After quarters”
sample_regex=re.compile(r'(After )(months.+|quarters.+)(\s+)(.+)')
extracted=sample_regex.search(sample_text)
print(extracted.group()) #return After months, the profit is 300**
2. Extract group(4) where 300** is:
extracted2=extracted.group(4).strip()
print(extracted2) #return 300**
3. Strip the trailing “*” characters, then remove any leading/trailing spaces
extracted3=extracted2.rstrip('*').strip()
print(extracted3) #return 300
Complete Code
sample_regex=re.compile(r'(After )(months.+|quarters.+)(\s+)(.+)')
extracted=sample_regex.search(sample_text)
print('Example 2')
print(extracted.group())
extracted2=extracted.group(4).strip()
print(extracted2)
extracted3=extracted2.rstrip('*').strip()
print('Example 2')
print(extracted3)
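To see why group(4) holds the number, it can help to print every group captured by the pattern. The sketch below uses the same pattern and sample text; the variable name groups is only for illustration.

```python
import re

sample_text = 'start\nstring\nsecond string\nend\n\nAfter months, the profit is 300**'

sample_regex = re.compile(r'(After )(months.+|quarters.+)(\s+)(.+)')
extracted = sample_regex.search(sample_text)

# groups() returns all captured groups as a tuple, in order
groups = extracted.groups()
for number, value in enumerate(groups, start=1):
    print(number, repr(value))
# 1 'After '
# 2 'months, the profit is'
# 3 ' '
# 4 '300**'
```

Because .+ in group 2 is greedy, it takes as much of the line as it can, so group 4 ends up with only the text after the last space.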
Example 3: Extract all text between two pieces of text (non-inclusive) (“start” and “end”)
1. Create a pattern with “start” and “end”
(?<=xxx): will look for xxx that precedes the characters to be extracted
(?=xxx): will look for xxx that follows the characters to be extracted
extracted=sample_regex.search(sample_text)
print(extracted.group())
#return
string
second string
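The pattern for this example is not shown above, so here is one possible sketch that combines the lookbehind and lookahead just described. It assumes “start” and “end” each sit on their own line, and it uses the re.DOTALL flag so that . also matches newlines. This is one way to do it, not necessarily the pattern originally used.

```python
import re

sample_text = 'start\nstring\nsecond string\nend\n\nAfter months, the profit is 300**'

# (?<=start\n) : the match must come right after "start" and its newline
# .*?          : as few characters as possible (newlines included, due to DOTALL)
# (?=\nend)    : the match must be followed by a newline and "end"
between_regex = re.compile(r'(?<=start\n).*?(?=\nend)', re.DOTALL)
between = between_regex.search(sample_text).group()
print(between)
```

The lazy .*? stops at the first “end” it can reach; a greedy .* would run to the last “end” in the text instead.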
Example 4: Extract all text between two pieces of text (inclusive) (“start” and “end”)
1. Create a pattern with “start” and “end”
{m,n} : m to n times (inclusive); do not put a space after the comma
{m,} : >=m times
{m} : exactly m times
extracted=sample_regex.search(sample_text)
print(extracted.group())
#return
start
string
second string
end
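The pattern for this example is also not shown above. As a hedged sketch, the {m,n} quantifier can be used to match “start”, then a limited number of lines, then “end”. The upper limit of 5 lines is an arbitrary assumption of mine; adjust it to your text.

```python
import re

sample_text = 'start\nstring\nsecond string\nend\n\nAfter months, the profit is 300**'

# start          : the opening text
# (?:.*\n){1,5}? : between 1 and 5 line endings, as few as possible
# end            : the closing text
inclusive_regex = re.compile(r'start(?:.*\n){1,5}?end')
inclusive = inclusive_regex.search(sample_text).group()
print(inclusive)
```

Since “start” and “end” are part of the pattern itself rather than inside lookarounds, they are included in the result.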
Example 5: How do we find the line with only the word “string”?
1. Extract the 2 lines after “start”
sample_regex=re.compile(r'(?<=start)(\n.*){2}')
extracted=sample_regex.search(sample_text)
print(extracted.group())
#return
string
second string
2. Split the extracted text into lines. This returns a list.
line_split=re.split('\n',str(extracted.group()))
print(line_split) #return ['', 'string', 'second string']
3. Check items 1 and 2 in the list
item1= bool(re.search(r'^string$', line_split[1]))
item2= bool(re.search(r'^string$', line_split[2]))
bool(): converts the search result to True or False (a match is True, no match is False)
^ : start of line
$ : end of line
^string$: the entire line should be only “string”
print(item1) #return True
print(item2) #return False
Complete Code
sample_regex=re.compile(r'(?<=start)(\n.*){2}')
extracted=sample_regex.search(sample_text)
line_split=re.split('\n',str(extracted.group()))
item1= bool(re.search(r'^string$', line_split[1]))
item2= bool(re.search(r'^string$', line_split[2]))
print('Example 5')
print(item1)
print(item2)
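If you do not know in advance which lines to check, an alternative sketch is to test every line of the text. With the re.MULTILINE flag, ^ and $ match at the start and end of each line, so the same ^string$ pattern can be run on the whole text. The names exact_matches and line_numbers are only for illustration.

```python
import re

sample_text = 'start\nstring\nsecond string\nend\n\nAfter months, the profit is 300**'

# With re.MULTILINE, ^ and $ apply to every line, not just the whole string
exact_matches = re.findall(r'^string$', sample_text, re.MULTILINE)
print(exact_matches)  # ['string']

# Alternatively, record the index of each line that is exactly "string"
line_numbers = [
    index
    for index, line in enumerate(sample_text.splitlines())
    if re.fullmatch(r'string', line)
]
print(line_numbers)  # [1]
```

The line index starts at 0, so 1 refers to the second line of the sample text.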