regex – Why does R appear to be a lazy match

You are calling stri_replace_all_regex with four arguments:

a is length 3. That’s the str argument.

"\\b" %s+% b %s+% "\\S+" is length 5. (It would be a lot easier to read if you had used paste0("\\b", b, "\\S+"), but that’s beside the point.) That’s the pattern argument.

b is length 5. That’s the replacement argument.

The last argument is vectorize_all=FALSE.
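As a quick check of the pattern argument, here is a minimal sketch comparing the two ways of building it. The contents of b are hypothetical stand-ins: only its length of 5 and the values "ab", "mnb" and "mn" at positions 1, 4 and 5 are implied by this answer.

```r
library(stringi)

# Hypothetical replacement vector; "cd" and "ef" are made up for illustration.
b <- c("ab", "cd", "ef", "mnb", "mn")

# The pattern as written in the question, using stringi's %s+% operator.
pattern_op <- "\\b" %s+% b %s+% "\\S+"

# The same pattern built with base R's paste0().
pattern_paste <- paste0("\\b", b, "\\S+")

print(pattern_paste)                        # one pattern per element of b
print(identical(pattern_op, pattern_paste)) # do both spellings agree?
```

Both forms are vectorised over b, so each yields one pattern per replacement string.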

What it tries to do is documented as follows:

However, for stri_replace_all*, if vectorize_all is FALSE, then each
substring matching any of the supplied patterns is replaced by a
corresponding replacement string. In such a case, the vectorization is
over str, and – independently – over pattern and replacement. In other
words, this is equivalent to something like for (i in 1:npatterns) str <- stri_replace_all(str, pattern[i], replacement[i]). Note that you
must set length(pattern) >= length(replacement).

That’s pretty sloppy documentation (I want to know what it does, not “something like” what it does!), but I think the process is as follows:

Your first pattern is "\\bab\\S+". That says “word boundary followed by ab followed by one or more non-whitespace chars”. That matches all of a[1], so a[1] is replaced by b[1], which is "ab". It then tries the four other patterns, but none of them match, so you get "ab" as output.

The handling of a[3] is more complicated. The first three patterns don’t match it. pattern[4] does, so it is replaced by b[4], which is "mnb". Then a second replacement happens, because "mnb" also matches pattern[5], and it gets changed again to b[5], which is "mn".
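The walkthrough above can be reproduced with a short sketch that runs the documented loop and keeps a snapshot after every pattern. The values of a and b are assumptions chosen to fit the description (a has length 3, b has length 5, a[1] starts with "ab", a[3] starts with "mnb"); they are not the asker's actual data.

```r
library(stringi)

# Hypothetical inputs; everything except the lengths and the "ab"/"mnb"/"mn"
# entries is made up for illustration.
a <- c("abxyz", "hello", "mnbxyz")
b <- c("ab", "cd", "ef", "mnb", "mn")
pattern <- paste0("\\b", b, "\\S+")

# The single call under discussion.
result <- stri_replace_all_regex(a, pattern, b, vectorize_all = FALSE)

# The loop from the documentation, recording the strings after each pattern.
steps <- matrix(a, nrow = 1)
current <- a
for (i in seq_along(pattern)) {
  current <- stri_replace_all_regex(current, pattern[i], b[i])
  steps <- rbind(steps, current)
}
rownames(steps) <- c("input", paste0("after pattern[", seq_along(pattern), "]"))
print(steps)

# Does the loop reproduce the vectorize_all = FALSE call?
print(identical(result, current))
```

With these inputs the third column stays "mnbxyz" until pattern[4] turns it into "mnb", and pattern[5] then shortens it to "mn".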

When you say R defaults to greedy matching, that applies within a single regular expression match. You’re doing five separate greedy matches, each one applied to the output of the one before, not one big greedy match.
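To see the difference, here is a hedged sketch that contrasts a single greedy match, a single lazy match, and a one-pass replacement using an alternation. The input strings and the combined pattern are illustrative assumptions. The one-pass form is only one possible way to avoid the chained replacements, should that be what was intended.

```r
library(stringi)

x <- "mnbxyz"  # hypothetical input

# Single match, greedy quantifier: \S+ consumes every non-whitespace character.
greedy <- stri_extract_first_regex(x, "\\bmn\\S+")

# Single match, lazy quantifier: \S+? stops after one character.
lazy <- stri_extract_first_regex(x, "\\bmn\\S+?")

print(c(greedy = greedy, lazy = lazy))

# One pass over each string: alternatives are tried left to right, so "mnb"
# is listed before "mn". "$1" refers to the captured prefix.
one_pass <- stri_replace_all_regex(
  c("abxyz", "hello", "mnbxyz"),
  "\\b(ab|cd|ef|mnb|mn)\\S+",
  "$1"
)
print(one_pass)
```

Because there is only one replacement pass here, the "mnb" produced for the third string is never fed back through the "mn" pattern.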
