python – Capturing inner string in lark using regex
Ok, so this is probably a somewhat convoluted solution and I should probably use something simpler, but it is what it is.
I have a huge text I want to parse using Python’s Lark, and I’m piecing together certain parts. Now I’m trying to get the text within the quotes.
Using regex that would be something like:
([\"\'])(.*)(\1)
- capture first quote
- capture all
- until you capture the last quote that is the same as first
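To illustrate what the backreference buys me, here is a minimal sketch using only Python's standard `re` module, outside of Lark. The name `QUOTED` is just for illustration; the point is that a string only matches when the closing quote is the same character as the opening one.

```python
import re

# Group 1: the opening quote; group 2: everything in between (greedy);
# group 3: a backreference that must repeat the opening quote.
QUOTED = re.compile(r"""(["'])(.*)(\1)""")

# The last candidate opens with a double quote but closes with a single one.
candidates = ['"sample"', "'sample'", "\"sample'"]
results = [bool(QUOTED.fullmatch(text)) for text in candidates]

for text, matched in zip(candidates, results):
    print(text, matched)
```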
However, if I add that to a Lark grammar, for instance like this:
grammar = r"""
start: contained_string
contained_string: quoted | parenthesis | braces | brackets
quoted : /([\"\'])(.*)(\1)/
parenthesis : "(" CONT_STRING+ ")"
braces : "{" CONT_STRING+ "}"
brackets : "[" CONT_STRING+ "]"
CONT_STRING: /[\w.,!?:\- ]/
"""
And test it out on a list like this:
samples = ['"sample"', "'sam'ple'", """'samp"le'"""]
I get these as outputs:
"sample"
'sam'ple'
'samp"le'
Which is OK: it doesn’t get tripped up by the extra quote, whether it is the same type or a different one. But it keeps the outer quotes. Normally I could specify that I want the second captured group, but I’m not sure how to do that within Lark like this.
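For reference, this is what I mean by picking the second group, sketched in plain `re` rather than Lark. The helper name `inner_text` is hypothetical. I assume something similar could be applied to the matched text in a post-processing step (for example in a Lark `Transformer` callback for the `quoted` rule), but that part is an assumption and is not shown here.

```python
import re

QUOTED = re.compile(r"""(["'])(.*)(\1)""")
samples = ['"sample"', "'sam'ple'", """'samp"le'"""]

def inner_text(text):
    """Return the text between the outer quotes, or None if there is no match."""
    match = QUOTED.fullmatch(text)
    # Group 2 is the part between the opening quote and its matching close.
    return match.group(2) if match else None

inner = [inner_text(sample) for sample in samples]
for value in inner:
    print(value)
```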