python – Capturing inner string in lark using regex

Ok, so this is probably a somewhat convoluted solution and I should probably use something simpler, but it is what it is.

I have a huge text I want to parse using Python’s Lark library, and I’m piecing the grammar together part by part. Now I’m trying to get the text within quotes.

Using regex that would be something like:

([\"\'])(.*)(\1)
  • capture first quote
  • capture all
  • until you capture the last quote that is the same as first
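
For reference, here is a minimal sketch of how that pattern behaves in plain Python with the standard `re` module, outside of Lark. The helper name `inner` is made up for illustration, and the strings are the same samples used further down.

```python
import re

# Group 1 = opening quote, group 2 = content,
# group 3 = closing quote, which must equal group 1 (backreference \1).
QUOTED = re.compile(r"""(["'])(.*)(\1)""")

samples = ['"sample"', "'sam'ple'", """'samp"le'"""]

def inner(text):
    """Return the text between matching outer quotes, or None if no match."""
    match = QUOTED.fullmatch(text)
    return match.group(2) if match else None

for s in samples:
    print(inner(s))
# prints:
# sample
# sam'ple
# samp"le
```

Because `.*` is greedy, the content runs up to the last quote that matches the opening one, so inner quotes of either kind end up inside group 2.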

However, when I add that to a Lark grammar, for instance like this:

grammar = r"""
start: contained_string

contained_string: quoted | parenthesis | braces | brackets

quoted      : /([\"\'])(.*)(\1)/
parenthesis : "(" CONT_STRING+ ")"
braces      : "{" CONT_STRING+ "}"
brackets    : "[" CONT_STRING+ "]"

CONT_STRING: /[\w.,!?:\- ]/
"""

and test it on a list of samples like this:

samples = ['"sample"', "'sam'ple'", """'samp"le'"""]

I get these outputs:

"sample"
'sam'ple'
'samp"le'

Which is ok: it doesn’t get tripped up by the extra quote, whether it is the same type as the outer ones or a different one. But it keeps the outer quotes. Normally I could specify that I want the second captured group, but I’m not sure how to do that within Lark.
