python – Regex capture group returning wrong value
This is from a script that underscores suburbs in address lists. All suburbs must be underscored. However, when a street has the same name as the suburb, the street is underscored too, which is not wanted. The lines of code shown, copied from the script, are used to correct this.
This is a small sample of a larger data set. The address lists being processed come from text files. Here is a small chunk where Howard Road has been underscored in error.
import re
addressList = ['118 Nile Street _Lower_Hutt Widow','88 _Howard Road _Point_Howard Student','168 Wellington Road _Wainuiomata Driver']
pattern1 = r'(\d?\/?)(\d+[A-z]?\s)_([A-z]\S+)(?=\s)(\s.+)|(\d+[A-z]?)(\d+[A-z]?\s)_([A-z]\S+)(?=\s)(\s.+)'
for i in range(len(addressList)):
    replacedLine = addressList[i]
    result = re.search(pattern1,replacedLine,flags=0)
    if result:
        print("There was a result!")
        if result.group(1) is None or result.group(1) == '':
            print("There was no group 1")
            replacedLine = re.sub(pattern1, r'\2\3\4', replacedLine)
        else:
            print("All groups present")
            replacedLine = re.sub(pattern1, r'\1\2\3\4', replacedLine)
        print("New replaced line:",replacedLine)
A breakdown of the pattern is given in the demo. The pattern must match various street-number formats such as 2/22 and 2A, which is why it includes an alternation (OR). The logic is to return all capture groups, minus the underscore before the street name. If the street number is a single digit, all capture groups except group 1 are returned.
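To make the breakdown concrete, here is a minimal sketch that prints what capture groups 1 to 4 hold for a few street-number formats. Only the 88 _Howard Road line comes from my data; the other sample lines are made up for illustration, and the helper name show_groups is hypothetical.

```python
import re

# Pattern copied unchanged from the script above.
pattern1 = r'(\d?\/?)(\d+[A-z]?\s)_([A-z]\S+)(?=\s)(\s.+)|(\d+[A-z]?)(\d+[A-z]?\s)_([A-z]\S+)(?=\s)(\s.+)'

def show_groups(line):
    """Return capture groups 1-4 for line, or None when the pattern does not match."""
    result = re.search(pattern1, line)
    return result.groups()[:4] if result else None

# Illustrative lines covering the numbering formats mentioned above
# (all but the first are invented for this sketch).
samples = [
    '88 _Howard Road _Point_Howard Student',
    '8 _Howard Road _Point_Howard Student',
    '2/22 _Howard Road _Point_Howard Student',
    '2A _Howard Road _Point_Howard Student',
]

for line in samples:
    print(repr(line), '->', show_groups(line))
```

In this sketch, group 1 comes back as an empty string (not None) for the single-digit and 2A cases, because the first alternative still matches with nothing captured in group 1. For those lines, r'\1\2\3\4' and r'\2\3\4' therefore produce the same replacement.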
This script returns the correct result. The only change to addressList is the removal of the street underscore:
There was a result!
All groups present
New replaced line: 88 Howard Road _Point_Howard Student
The issue is that the results from my actual files contain errors. In every case the leftmost digit of the street number comes back as 1, so the result shown here would be: 18 Howard Road _Point_Howard Student
I have carefully checked the output immediately before and after the code shown here in my full script, and I am sure the problem is in this code, but unfortunately I can't reproduce it in this example. The behavior only occurs when underscores are removed from street numbers greater than 9, so single-digit addresses remain correct. This means the problem only occurs in the All groups present branch.
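Since the sample does not reproduce the fault, one way to narrow it down might be to log exactly what the pattern captures on the real file lines. The sketch below is only a debugging aid under assumptions: the function name remove_street_underscore and its log parameter are hypothetical, and it swaps the two re.sub branches for a single replacement function, which is not how the original script works. Printing repr() of the line and the groups should make hidden characters or an unexpected group split visible.

```python
import re

# Pattern copied unchanged from the script above.
pattern1 = r'(\d?\/?)(\d+[A-z]?\s)_([A-z]\S+)(?=\s)(\s.+)|(\d+[A-z]?)(\d+[A-z]?\s)_([A-z]\S+)(?=\s)(\s.+)'

def remove_street_underscore(line, log=print):
    """Drop the underscore before the street name and log what was captured."""
    def rebuild(match):
        # Groups that did not take part in the match are None; use '' instead.
        groups = [g or '' for g in match.groups()]
        log('line:', repr(line), 'groups:', groups)
        # Groups 1-4 belong to the first alternative and 5-8 to the second,
        # so joining all eight keeps whichever alternative matched.
        return ''.join(groups)
    return re.sub(pattern1, rebuild, line)

# Example with the sample line from above.
print(remove_street_underscore('88 _Howard Road _Point_Howard Student'))
```

Because the replacement is built by joining the captured strings directly, no backreference such as \1 has to be interpreted, which removes one possible source of difference between the sample and the full script.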
Any ideas would be appreciated.