python – What is the proper regex to split a long string using a word inside angle brackets as the delimiter?

I have a few long strings that came from an HTML file, and I need to split them using certain words inside angle brackets as the delimiters. They look somewhat like this, but with more tags such as <dd> and <h2>:

<span>Some Words Here</span><p>Description Here</p><p>Another part of the description</p>

This is one long string that I want to extract certain information from; that information sits inside the span, p, dd, and h5 tags. Everything else is not needed.

I’m looking for an expression that will split the string by the tags instead so I can put them in a dictionary format like so:

myDict = {'span': '', 'p': '', 'dd': '', ...}

The dictionary part I’m OK with but I just can’t figure out the expression.

So far I’ve tried this regex: <([^>]*)> but it splits the long string on ANY angle bracket. When it splits like this it leaves a lot of empty strings, but they’re not a problem since I can use a list comprehension to get rid of them. However, those empty strings wouldn’t even be there if I could get the expression right.

This is a small snippet of the splitting part:

tag = re.split(r'<([^>]*)>', longString)
tag = [noneEmptyString for noneEmptyString in tag if noneEmptyString.strip()]

And this is the result of that:

>> ['span', 'Some Words Here', '/span', 'p', 'Description Here', '/p', 'p', 'Another part of the description', '/p']

Ideally, I want the results to look like this:

>> tagCollection = {'spanTags': ['Some Words Here', ...], 'pTags': ['Description Here', 'Another part of the description', ...]... }

What else do I need to add?
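
For illustration, here is a minimal sketch of one possible way to get from the string to the dictionary. Instead of splitting, it matches each wanted element as a whole with re.findall and groups the text by tag name. It assumes the wanted tags are not nested inside each other and that every opening tag has a matching closing tag; the names long_string, TAG_PATTERN and collect_tags are made up for this example, and the tag list (span, p, dd, h5) is taken from the description above.

```python
import re
from collections import defaultdict

# Hypothetical sample input, modelled on the string shown in the question.
long_string = (
    "<span>Some Words Here</span>"
    "<p>Description Here</p>"
    "<p>Another part of the description</p>"
)

# Match only the wanted tags. Group 1 is the tag name, group 2 the text inside.
# \b stops "p" from matching "<pre>", [^>]* tolerates attributes, and the
# backreference \1 requires the closing tag to have the same name.
# Assumption: elements of the same tag are not nested inside each other.
TAG_PATTERN = re.compile(
    r"<(span|p|dd|h5)\b[^>]*>(.*?)</\1>",
    re.DOTALL | re.IGNORECASE,
)

def collect_tags(html):
    """Group the text of each wanted tag under a '<name>Tags' key."""
    collection = defaultdict(list)
    for name, text in TAG_PATTERN.findall(html):
        collection[name.lower() + "Tags"].append(text.strip())
    return dict(collection)

tag_collection = collect_tags(long_string)
print(tag_collection)
# {'spanTags': ['Some Words Here'], 'pTags': ['Description Here', 'Another part of the description']}
```

Because only whole elements are matched, no empty strings are produced and no cleanup pass is needed. If the markup is nested or irregular, a real HTML parser such as the standard library's html.parser module is likely a safer choice than a regex.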
