python – What is the proper regex to split a long string using a word inside angle brackets as the delimiter?
I have a few long strings from an HTML file that I need to split, using certain words inside angle brackets as the delimiters. They look somewhat like this, but with more tags, such as <dd> and <h2>:
<span>Some Words Here</span><p>Description Here</p><p>Another part of the description</p>
This is one long string that I want to extract certain information from; that information is inside the span, p, dd, and h5 tags. Everything else is not needed.
I’m looking for an expression that will split the string by those tags so that I can put the pieces in a dictionary like this:
myDict = {'span': '', 'p': '', 'dd': '', ...}
The dictionary part I’m OK with but I just can’t figure out the expression.
So far I’ve tried this regex: <([^>]*)>
but it splits the long string on ANY angle-bracketed tag. Splitting this way leaves a lot of empty strings. They’re not a problem, since I can use a list comprehension to get rid of them, but they wouldn’t be there at all if I could get the expression right.
This is a small snippet of the splitting part:
import re
tag = re.split(r'<([^>]*)>', longString)
tag = [noneEmptyString for noneEmptyString in tag if noneEmptyString.strip()]
And this is the result of that:
>> ['span', 'Some Words Here', '/span', 'p', 'Description Here', '/p', 'p', 'Another part of the description', '/p']
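For reference, here is a minimal sketch of where the empty strings come from before the list comprehension filters them out. It reuses the sample string from above; the names long_string, pieces and empty_count are only illustrative. Because the pattern contains a capturing group, re.split() keeps the captured tag names, and it emits an empty string wherever two tags sit next to each other or the string starts or ends with a tag.

```python
import re

# Sample string from the question.
long_string = (
    "<span>Some Words Here</span>"
    "<p>Description Here</p>"
    "<p>Another part of the description</p>"
)

# The capturing group makes re.split() return the tag names
# along with the text found between the tags.
pieces = re.split(r"<([^>]*)>", long_string)

# Empty strings show up between adjacent tags (e.g. "</span><p>")
# and at both ends, since the string starts and ends with a tag.
empty_count = sum(1 for piece in pieces if piece == "")

print(pieces)
print(empty_count)
```

So the empty strings are a side effect of how re.split() treats adjacent delimiters, not a sign that the pattern itself is malformed.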
Ideally, I want the results to look like this:
>> tagCollection = {'spanTags': ['Some Words Here', ...], 'pTags': ['Description Here', 'Another part of the description', ...]... }
What else do I need to add?
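One possible direction, sketched under some assumptions: instead of splitting, match each wanted tag together with its content using re.findall() and a backreference to the opening tag name. The tag list (span, p, dd, h5) and the "Tags" key suffix are taken from the description above; WANTED_TAGS, pattern and tag_collection are hypothetical names. This assumes the wanted tags are not nested inside a tag of the same name and that the markup is well formed. For messier HTML, a real parser such as the standard library's html.parser is likely to be more reliable than a regex.

```python
import re
from collections import defaultdict

# Sample string from the question.
long_string = (
    "<span>Some Words Here</span>"
    "<p>Description Here</p>"
    "<p>Another part of the description</p>"
)

# Assumption: only these tags matter, as stated in the question.
WANTED_TAGS = ("span", "p", "dd", "h5")

# Group 1 captures the tag name, group 2 captures the text inside it.
# \b stops "p" from matching longer names such as "pre", [^>]* allows
# attributes, and \1 requires the closing tag to match the opening one.
pattern = re.compile(
    r"<(" + "|".join(WANTED_TAGS) + r")\b[^>]*>(.*?)</\1>",
    re.DOTALL,
)

collected = defaultdict(list)
for tag_name, text in pattern.findall(long_string):
    collected[tag_name + "Tags"].append(text.strip())

tag_collection = dict(collected)
print(tag_collection)
```

With the sample string, this groups the span text under 'spanTags' and both paragraphs under 'pTags', and no empty strings are produced because nothing is being split.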