Wednesday, 1 June 2022

Regular Expression (Regex) for Common Situations

                                              Photo by Bruno Martins on Unsplash  

Providing a couple of regex examples for commonly encountered situations 

This article assumes the reader is already familiar with frequently used regex symbols such as . and *. If you are not, you may refer to this very detailed guide:

https://www3.ntu.edu.sg/home/ehchua/programming/howto/Regexe.html 

I will be using the sample text below for all the examples in this article. However, the code in all the examples can be modified according to your specific text.  

 


Code to read in the sample text from a txt file


import re
#Use forward slashes in the file path
file='C:/Users/sample_text.txt'
with open(file, 'r') as a:
    #readlines will read each line into an element in a list
    file_lines=a.readlines()
    #join the lines back into a single string
    sample_text="".join(file_lines)

#Alternatively, assign the text directly to a variable
sample_text= 'start\nstring\nsecond string\nend\n\nAfter months, the profit is 300**'
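As a side note, read() returns the whole file as one string, which avoids the readlines-then-join step. The sketch below is only an illustration: it writes the sample text to a temporary file (an assumption, so that it runs on any machine) instead of using the C:/Users path above.

```python
import os
import tempfile

content = 'start\nstring\nsecond string\nend\n\nAfter months, the profit is 300**'

# Write the sample to a temporary file so the sketch runs anywhere
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False,
                                 encoding='utf-8') as tmp:
    tmp.write(content)
    file = tmp.name

# read() returns the whole file as a single string
with open(file, 'r', encoding='utf-8') as a:
    sample_text = a.read()

os.remove(file)  # clean up the temporary file
print(sample_text == content)  # True
```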

Example 1: Extracting text in the same line after a specific text (‘second’) 

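1. Create a pattern with “second”

sample_regex=re.compile(r'(?<=second).*')

(?<=xxx): a lookbehind. The match must be preceded by xxx, but xxx itself is not part of the extracted text

.*: matches the rest of the line, because . does not match a newline by default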

 

2. Look for the pattern in sample_text

extracted=sample_regex.search(sample_text)  

3. Remove the leading/trailing spaces 

extracted2=extracted.group().strip()
print(extracted2) #return string     

Complete Code

sample_regex=re.compile(r'(?<=second).*')
extracted=sample_regex.search(sample_text)
extracted2=extracted.group().strip()
print(extracted2)
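One thing to watch for: search returns None when the pattern is not found, so calling .group() on the result raises an AttributeError. A minimal sketch of a guard follows; extract_after is a hypothetical helper name introduced here for illustration, not part of the re module.

```python
import re

sample_text = 'start\nstring\nsecond string\nend\n\nAfter months, the profit is 300**'

def extract_after(keyword, text):
    """Return the rest of the line after keyword, or None if keyword is absent."""
    # re.escape keeps any regex metacharacters in keyword literal
    pattern = re.compile(r'(?<=' + re.escape(keyword) + r').*')
    match = pattern.search(text)
    return match.group().strip() if match else None

result_found = extract_after('second', sample_text)
result_missing = extract_after('third', sample_text)
print(result_found)    # string
print(result_missing)  # None
```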

Example 2: Extracting text in the same line after a specific text with variation ('After') 

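1. Create a pattern that matches “After months” or “After quarters”

sample_regex=re.compile(r'(After )(months.+|quarters.+)(\s+)(.+)')

Each pair of parentheses is a numbered group, counted from left to right. The | inside group 2 means “months.+” or “quarters.+”.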

extracted=sample_regex.search(sample_text)
print(extracted.group()) #return After months, the profit is 300**

2. Extract group(4) where 300** is: 

extracted2=extracted.group(4).strip()
print(extracted2) #return 300**

3. Strip the trailing “*” characters and any leading/trailing spaces

extracted3=extracted2.rstrip('*').strip()
print(extracted3) #return 300

Complete Code

sample_regex=re.compile(r'(After )(months.+|quarters.+)(\s+)(.+)')
extracted=sample_regex.search(sample_text)
print('Example 2')
print(extracted.group())
extracted2=extracted.group(4).strip()
print(extracted2)
extracted3=extracted2.rstrip('*').strip()
print(extracted3)
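To see why the number sits in group(4), it can help to print every group. The sketch below uses groups() for that, and then shows a shorter pattern that captures only the digits; that second pattern is an assumed alternative, not the one used in this example.

```python
import re

sample_text = 'start\nstring\nsecond string\nend\n\nAfter months, the profit is 300**'

sample_regex = re.compile(r'(After )(months.+|quarters.+)(\s+)(.+)')
extracted = sample_regex.search(sample_text)

# groups() returns all captured groups as a tuple, in order
groups = extracted.groups()
for number, text in enumerate(groups, start=1):
    print(number, repr(text))
# 1 'After '
# 2 'months, the profit is'
# 3 ' '
# 4 '300**'

# Assumed alternative: skip lazily to the first run of digits and capture it
simple = re.search(r'After (?:months|quarters).*?(\d+)', sample_text)
amount = simple.group(1)
print(amount)  # 300
```

Group 2 is greedy, so it first swallows the whole line and then backtracks until the last space, which is why group 4 ends up holding only “300**”.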

Example 3: Extract all text between two pieces of text (non-inclusive) (“start” and “end”)

1. Create a pattern with “start” and “end”


(?<=xxx): will look for xxx that precedes the characters to be extracted 

(?=xxx): will look for xxx that follows the characters to be extracted

extracted=sample_regex.search(sample_text)
print(extracted.group())
#return
string
second string
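The pattern itself is not shown in this step, so the following is only a sketch of one pattern that produces the output above. It assumes the newline next to each keyword should be left out of the result, and it uses re.DOTALL so that . can cross line breaks.

```python
import re

sample_text = 'start\nstring\nsecond string\nend\n\nAfter months, the profit is 300**'

# (?<=start\n): the match must come right after "start" and its newline
# .*?         : as few characters as possible (lazy), newlines included
# (?=\nend)   : the match must be followed by a newline and "end"
sample_regex = re.compile(r'(?<=start\n).*?(?=\nend)', re.DOTALL)
extracted = sample_regex.search(sample_text)

between = extracted.group()
print(between)
# string
# second string
```

The lazy .*? stops at the first “end” line; a greedy .* would run to the last one if the text contained several.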

Example 4: Extract all text between two pieces of text (inclusive) (“start” and “end”)

1. Create a pattern with “start” and “end” 

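{m,n}: m to n times (inclusive). Write it without a space after the comma, otherwise Python's re module treats the braces as literal text

{m,}: at least m times

{m}: exactly m times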

extracted=sample_regex.search(sample_text)
print(extracted.group())

#return
start
string
second string
end
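The pattern for this example is not shown either. Going by the quantifier notes, one plausible sketch is “start” followed by a fixed number of lines; the exact counts below are assumptions chosen to fit the sample text. A keyword-based variant that does not depend on the line count is included for comparison.

```python
import re

sample_text = 'start\nstring\nsecond string\nend\n\nAfter months, the profit is 300**'

# "start" followed by exactly 3 lines (each line is a newline plus the rest of that line)
three_lines = re.search(r'start(?:\n.*){3}', sample_text)
inclusive = three_lines.group()
print(inclusive)
# start
# string
# second string
# end

# {1,2} takes between 1 and 2 lines, and being greedy it takes 2 here
two_lines = re.search(r'start(?:\n.*){1,2}', sample_text).group()
print(repr(two_lines))  # 'start\nstring\nsecond string'

# Variant that stops at the first "end" regardless of how many lines lie between
by_keyword = re.search(r'start.*?end', sample_text, re.DOTALL).group()
print(by_keyword == inclusive)  # True
```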

Example 5: How do we find the line with only the word “string”? 

1. Extract the 2 lines after “start”

sample_regex=re.compile(r'(?<=start)(\n.*){2}')
extracted=sample_regex.search(sample_text)
print(extracted.group())
#return
string
second string

2. Split the extracted text by line. This returns a list.

line_split=re.split('\n',str(extracted.group()))
print(line_split) #return ['', 'string', 'second string']

3. Check items 1 and 2 in the list

item1= bool(re.search(r'^string$', line_split[1]))
item2= bool(re.search(r'^string$', line_split[2]))

bool(): converts the search result to True (a match was found) or False (search returned None)

^ : start of line 

$: end of line 

^string$: the entire line should be only “string”

print(item1) #return True
print(item2) #return False

Complete Code

sample_regex=re.compile(r'(?<=start)(\n.*){2}')
extracted=sample_regex.search(sample_text)
line_split=re.split('\n',str(extracted.group()))
item1= bool(re.search(r'^string$', line_split[1]))
item2= bool(re.search(r'^string$', line_split[2]))
print('Example 5')
print(item1)
print(item2)
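If the position of the line is not known in advance, the same ^string$ check can be run over the whole text. The sketch below is an alternative under that assumption: re.MULTILINE makes ^ and $ match at every line boundary, and re.fullmatch checks each line as a whole.

```python
import re

sample_text = 'start\nstring\nsecond string\nend\n\nAfter months, the profit is 300**'

# With re.MULTILINE, ^ and $ match at the start and end of every line
matches = re.findall(r'^string$', sample_text, flags=re.MULTILINE)
print(matches)  # ['string']

# fullmatch succeeds only if the entire line equals the pattern
line_numbers = [
    number
    for number, line in enumerate(sample_text.splitlines(), start=1)
    if re.fullmatch(r'string', line)
]
print(line_numbers)  # [2]
```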
