Python regex to match pattern if not in double quotes or equal to list of keywords

I have a regex pattern which represents a valid variable name in a language I’m trying to parse:

R'\b([A-Z][A-Z0-9_]{0,35})\b' (e.g. VAR_NAME, TABLE_READ, SOME_OTHER_VAR, etc.)

However, I don’t want to capture this pattern if it’s:

  1. From a list of keywords: OR, AND, IF, THEN, ELSE, READ_GENERIC_TABLE
  2. Preceded by a semicolon (;) because that starts an in-line comment in this language
  3. Immediately surrounded by double quotes (e.g. VAR_NAME is valid but "VAR_NAME" is not)
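
One way to read the three rules above as code, offered as a minimal sketch rather than a definitive solution: cut each line at its first semicolon, then apply a pattern that uses only fixed-width lookarounds, so the standard-library re module is enough. The names KEYWORDS, VAR_NAME and find_variables are hypothetical. The sketch assumes that a semicolon never appears inside a quoted string and that a double quote on either side of a name is enough to reject it.

```python
import re

# Keywords that must never be reported as variable names (rule 1).
KEYWORDS = ("OR", "AND", "IF", "THEN", "ELSE", "READ_GENERIC_TABLE")
_keyword_alt = "|".join(KEYWORDS)

# (?<!")  : the name is not directly preceded by a double quote (rule 3)
# (?!...) : the name starting here is not a whole keyword (rule 1)
# (?!")   : the name is not directly followed by a double quote (rule 3)
VAR_NAME = re.compile(
    r'(?<!")\b(?!(?:' + _keyword_alt + r')\b)([A-Z][A-Z0-9_]{0,35})\b(?!")'
)


def find_variables(source):
    """Return the distinct variable names in source, in order of first use."""
    found = []
    for line in source.splitlines():
        # Rule 2: everything after the first semicolon is a comment.
        # Assumption: a semicolon never occurs inside a quoted string.
        code = line.split(";", 1)[0]
        for name in VAR_NAME.findall(code):
            if name not in found:
                found.append(name)
    return found


sample = 'IF X_1 > 0 THEN "Y" ELSE READ_GENERIC_TABLE(TBL) ; IGNORED_VAR'
print(find_variables(sample))  # ['X_1', 'TBL']
```

Handling comments before matching avoids the variable-length lookbehind, which the standard-library module does not support. A multi-word quoted string such as "A B C" would still yield B, so a language with such strings would need the quoted spans removed first.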

So far, using the third-party regex module (github.com/mrabarnett/mrab-regex, v2023.8.8), I’ve come up with the following:

R'(?!\bOR\b|\bAND\b|\bIF\b|\bTHEN\b|\bELSE\b|\bREAD_GENERIC_TABLE\b)(?<!;.*)\b([A-Z][A-Z0-9_]{0,35})\b'

But I can’t quite figure out how to handle the double quotes part (#3). Would anyone be able to provide me with some direction?

I’ve been running against the following text as a test:

test_string = """ IF t = 0 OR t > POL_TERM_M THEN 0 ELSE IF mult(t+11,12) AND POLICY_YEAR <= TBL_VAL_INT_Y THEN READ_GENERIC_TABLE(TBL_VAL_INT, "Y", ENTRY_YEAR, MVA_OPT, "VAR_NAME", POLICY_YEAR(t)) ELSE VALINT_EL_PC(t-1) """

Which should capture the set of variable names:
POL_TERM_M, POLICY_YEAR, TBL_VAL_INT_Y, TBL_VAL_INT, ENTRY_YEAR, MVA_OPT and VALINT_EL_PC

I tried the following expression but it then started capturing the keywords:

R'(?!\bOR\b|\bAND\b|\bIF\b|\bTHEN\b|\bELSE\b|\bREAD_GENERIC_TABLE\b)(?<!;.*)[^"]\b([A-Z][A-Z0-9_]{0,35})\b[^"]'
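
A likely reason the keywords come back, sketched here as an illustration rather than a confirmed diagnosis: `[^"]` consumes one character, so the negative lookahead is tested at the character before the name (usually a space) instead of at the name itself. The snippet uses the standard-library re module and therefore leaves out `(?<!;.*)`, which needs the third-party module. KW, NAME, attempt, zero_width and text are illustrative names only.

```python
import re

KW = r'\bOR\b|\bAND\b|\bIF\b|\bTHEN\b|\bELSE\b|\bREAD_GENERIC_TABLE\b'
NAME = r'([A-Z][A-Z0-9_]{0,35})'

# Same shape as the attempt above: [^"] eats one character on each side,
# and the keyword lookahead runs before that character, not before the name.
attempt = re.compile(r'(?!' + KW + r')[^"]\b' + NAME + r'\b[^"]')

# Zero-width variant: quote checks consume nothing, and the keyword
# lookahead sits right after \b, i.e. at the first letter of the name.
zero_width = re.compile(r'(?<!")\b(?!' + KW + r')' + NAME + r'\b(?!")')

text = ' IF A_1 > 0 THEN "B_2" ELSE C_3 '

# Each tuple is (full match, captured name); note the spaces that were consumed.
print([(m.group(0), m.group(1)) for m in attempt.finditer(text)])
# [(' IF ', 'IF'), (' THEN ', 'THEN'), (' ELSE ', 'ELSE')]

print(zero_width.findall(text))
# ['A_1', 'C_3']
```

The consumed trailing character also explains missed names: once ' IF ' is matched, the space that should precede A_1 is gone, so the next match cannot start there.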
