Data World of the Red-haired Girl: June 2022

Photo by Bruno Martins on Unsplash

Providing a couple of regex examples for commonly encountered situations

This article assumes the reader is already familiar with frequently used regex symbols such as . and *. If you are not, you may refer to this very detailed guide:

https://www3.ntu.edu.sg/home/ehchua/programming/howto/Regexe.html

I will be using the sample text below for all the examples in this article. However, the code in all the examples can be modified according to your specific text.

Code to read in the sample text from a txt file


import re
#Use forward slashes in the file path
file='C:/Users/sample_text.txt'
with open(file, 'r') as a:
    #readlines will read each line into an element in a list
    file_lines=a.readlines()
    #join the lines back into a single string
    sample_text="".join(file_lines)
    
#Alternatively read in directly as a variable
sample_text= 'start\nstring\nsecond string\nend\n\nAfter months, the profit is 300**'

Example 1: Extracting text in the same line after a specific text (‘second’)

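1. Create a pattern with “second”

sample_regex=re.compile(r'(?<=second).*')

(?<=xxx): positive lookbehind; the extracted text must be preceded by xxx, which is not included in the result

.*: matches everything up to the end of the line, because . does not match a newline by default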

2. Look for the pattern in sample_text

extracted=sample_regex.search(sample_text)

3. Remove the leading/trailing spaces

extracted2=extracted.group().strip()
print(extracted2) #return string

Complete Code

sample_regex=re.compile(r'(?<=second).*')
extracted=sample_regex.search(sample_text)
extracted2=extracted.group().strip()
print(extracted2)
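
One thing the steps above do not cover is what happens when the text is not found: search() returns None, and calling group() on it raises an error. The sketch below wraps Example 1 in a small helper that handles this. The function name text_after and the keyword 'third' are hypothetical and only for illustration.

```python
import re

sample_text = 'start\nstring\nsecond string\nend\n\nAfter months, the profit is 300**'

def text_after(keyword, text):
    """Return the rest of the line after the first occurrence of keyword, or None."""
    # re.escape makes characters such as . or * in the keyword match literally
    pattern = re.compile(r'(?<=' + re.escape(keyword) + r').*')
    match = pattern.search(text)
    # search() returns None when nothing matches, so check before calling group()
    return match.group().strip() if match else None

found = text_after('second', sample_text)
missing = text_after('third', sample_text)
print(found)    # string
print(missing)  # None
```

Returning None keeps the check in one place; the calling code can then decide whether a missing keyword is an error.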

Example 2: Extracting text in the same line after a specific text with variation ('After')

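1. Create a pattern with “After months” or “After quarters” and look for it in sample_text

sample_regex=re.compile(r'(After )(months.+|quarters.+)(\s+)(.+)')
extracted=sample_regex.search(sample_text)
print(extracted.group()) #return After months, the profit is 300**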

2. Extract group(4) where 300** is:

extracted2=extracted.group(4).strip()
print(extracted2) #return 300**

3. Strip the “*” characters on the right and remove any leading/trailing spaces

extracted3=extracted2.rstrip('*').strip()
print(extracted3) #return 300

Complete Code

sample_regex=re.compile(r'(After )(months.+|quarters.+)(\s+)(.+)')
extracted=sample_regex.search(sample_text)
print('Example 2')
print(extracted.group())
extracted2=extracted.group(4).strip()
print(extracted2)
extracted3=extracted2.rstrip('*').strip()
print('Example 2')
print(extracted3)
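
To see why 300** ends up in group(4), it can help to print every captured group at once. The sketch below reuses the pattern from the complete code; the variable names all_groups and last_value are only for illustration.

```python
import re

sample_text = 'start\nstring\nsecond string\nend\n\nAfter months, the profit is 300**'

sample_regex = re.compile(r'(After )(months.+|quarters.+)(\s+)(.+)')
extracted = sample_regex.search(sample_text)

# groups() returns every captured group as a tuple, in the order of the parentheses
all_groups = extracted.groups()
print(all_groups)  # ('After ', 'months, the profit is', ' ', '300**')

# group(0) is the whole match; group(4) is the fourth pair of parentheses
last_value = extracted.group(4).rstrip('*').strip()
print(last_value)  # 300
```

The second group is greedy, so it first takes the rest of the line and then gives characters back until the last space and the text after it can fill groups 3 and 4.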

Example 3: Extract all text between 2 pieces of text (non-inclusive) (“start” and “end”)

1. Create a pattern with “start” and “end”

(?<=xxx): positive lookbehind; the extracted text must be preceded by xxx, which is not included in the result

(?=xxx): positive lookahead; the extracted text must be followed by xxx, which is not included in the result

extracted=sample_regex.search(sample_text)
print(extracted.group())
#return
string
second string
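
The line that compiles the pattern is not shown in this example. A minimal sketch of one pattern that produces the output above is given here; the exact pattern, the re.DOTALL flag and the lazy .*? are my assumptions, not necessarily the original code.

```python
import re

sample_text = 'start\nstring\nsecond string\nend\n\nAfter months, the profit is 300**'

# Assumed pattern: lookbehind for "start", lookahead for "end".
# re.DOTALL lets . match newlines so the match can span several lines;
# .*? is lazy, so it stops at the first "end" it can reach.
sample_regex = re.compile(r'(?<=start).*?(?=end)', re.DOTALL)
extracted = sample_regex.search(sample_text)

# The raw match keeps the newlines next to "start" and "end"; strip() removes them
between = extracted.group().strip()
print(between)
```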

Example 4: Extract all text between 2 pieces of text (inclusive) (“start” and “end”)

1. Create a pattern with “start” and “end”

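{m,n} : the preceding item repeats m to n times (inclusive); write it with no space after the comma

{m,} : the preceding item repeats at least m times

{m} : the preceding item repeats exactly m times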

extracted=sample_regex.search(sample_text)
print(extracted.group())

#return
start
string
second string
end
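
The pattern for this example is also not shown. Below are two sketches that return the output above. Both patterns are assumptions: the first uses the {m} quantifier described in this example, the second does not depend on knowing the number of lines.

```python
import re

sample_text = 'start\nstring\nsecond string\nend\n\nAfter months, the profit is 300**'

# Assumed pattern 1: "start" followed by exactly 3 more lines.
# Each repetition of (?:\n.*) consumes a newline and the rest of that line.
fixed_regex = re.compile(r'start(?:\n.*){3}')
fixed = fixed_regex.search(sample_text).group()

# Assumed pattern 2: lazy match from "start" to the first "end", across lines
lazy_regex = re.compile(r'start.*?end', re.DOTALL)
lazy = lazy_regex.search(sample_text).group()

print(fixed)
print(fixed == lazy)  # True
```

The first pattern only works when “end” is always 3 lines below “start”; the second is more flexible but will stop at the first “end” it finds, even inside a longer word.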

Example 5: How do we find the line with only the word “string”?

1. Extract the 2 lines after “start”

sample_regex=re.compile(r'(?<=start)(\n.*){2}')
extracted=sample_regex.search(sample_text)
print(extracted.group())
#return
string
second string

2. Split the 2 lines by line. Will return a list.

line_split=re.split('\n',str(extracted.group()))
print(line_split) #return ['', 'string', 'second string']

3. Check the items 1 & 2 in the list

item1= bool(re.search(r'^string$', line_split[1]))
item2= bool(re.search(r'^string$', line_split[2]))

bool(): converts the search result to True (a match was found) or False (search returned None)

^ : start of line

$: end of line

^string$: the entire line should be only “string”

print(item1) #return True
print(item2) #return False

Complete Code

sample_regex=re.compile(r'(?<=start)(\n.*){2}')
extracted=sample_regex.search(sample_text)
line_split=re.split('\n',str(extracted.group()))
item1= bool(re.search(r'^string$', line_split[1]))
item2= bool(re.search(r'^string$', line_split[2]))
print('Example 5')
print(item1)
print(item2)
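
As an alternative to splitting the text first, the same check can be run on the whole text with the re.MULTILINE flag. This is a sketch of a different technique from the steps above, not a replacement for them; the variable names are only for illustration.

```python
import re

sample_text = 'start\nstring\nsecond string\nend\n\nAfter months, the profit is 300**'

# re.MULTILINE makes ^ and $ match at the start and end of every line,
# not only at the start and end of the whole text
whole_line_matches = re.findall(r'^string$', sample_text, re.MULTILINE)
print(whole_line_matches)  # ['string']

# 1-based line numbers of the lines that consist only of "string"
line_numbers = [
    number
    for number, line in enumerate(sample_text.splitlines(), start=1)
    if re.fullmatch(r'string', line)
]
print(line_numbers)  # [2]
```

Here “second string” is skipped because ^ requires the line to begin with “string”.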


Wednesday, 1 June 2022

Regular Expression (Regex) for Common Situations
