regex – Why does R appear to be a lazy match
You are calling stri_replace_all_regex with four arguments:
a is length 3. That’s the str argument.
"\\b" %s+% b %s+% "\\S+" is length 5. (It would be a lot easier to read if you had used paste0("\\b", b, "\\S+"), but that’s beside the point.) That’s the pattern argument.
b is length 5. That’s the replacement argument.
The last argument is vectorize_all=FALSE.
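To make the pattern argument concrete, here is a small sketch of how that vector is built. The vector b below is a hypothetical stand-in: only "ab", "mnb" and "mn" (elements 1, 4 and 5) can be inferred from this answer, and the other two values are made up for illustration.

```r
library(stringi)

# Hypothetical stand-in for the asker's b; elements 2 and 3 are invented.
b <- c("ab", "cd", "ef", "mnb", "mn")

# The construction used in the question (stringi's concatenation operator)...
pattern_stringi <- "\\b" %s+% b %s+% "\\S+"

# ...and the base R spelling suggested above.
pattern_base <- paste0("\\b", b, "\\S+")

# Both approaches build one regex per element of b.
identical(pattern_stringi, pattern_base)
length(pattern_base)
cat(pattern_base, sep = "\n")
```

Either way, the call receives one pattern per element of b, not a single combined regex.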
What it tries to do is documented as follows:
However, for stri_replace_all*, if vectorize_all is FALSE, then each
substring matching any of the supplied patterns is replaced by a
corresponding replacement string. In such a case, the vectorization is
over str, and – independently – over pattern and replacement. In other
words, this is equivalent to something like
for (i in 1:npatterns) str <- stri_replace_all(str, pattern[i], replacement[i]).
Note that you must set length(pattern) >= length(replacement).
That’s pretty sloppy documentation (I want to know what it does, not “something like” what it does!), but I think the process is as follows:
Your first pattern is "\\bab\\S+". That says “word boundary followed by ab followed by one or more non-whitespace chars”. That matches all of a[1], so a[1] is replaced by b[1], which is "ab". It then tries the four other patterns, but none of them match, so you get "ab" as output.
The handling of a[3] is more complicated. The first match replaces it with "mnb", based on pattern[4]. Then a second replacement happens, because "mnb" matches pattern[5], and it gets changed again to "mn".
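The sequential process described above can be sketched as an explicit loop. This is only an illustration of the documented "something like" behaviour, not stringi's actual implementation, and both a and b are hypothetical stand-ins, since the asker's real vectors are not reproduced in this answer.

```r
library(stringi)

# Hypothetical inputs chosen to mirror the cases discussed above.
a <- c("abcd", "xyz", "mnbq")
b <- c("ab", "cd", "ef", "mnb", "mn")
pattern <- paste0("\\b", b, "\\S+")

# Apply the five replacements one after another, printing each intermediate state.
result <- a
for (i in seq_along(pattern)) {
  result <- stri_replace_all_regex(result, pattern[i], b[i])
  cat("after pattern", i, ":", paste(result, collapse = " | "), "\n")
}

# The single call from the question, for comparison with the loop.
one_call <- stri_replace_all_regex(a, pattern, b, vectorize_all = FALSE)
identical(result, one_call)
```

With these stand-in vectors, the trace shows the third element becoming "mnb" after pattern 4 and then "mn" after pattern 5, which is the double replacement described above.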
When you say R defaults to greedy matching, that applies within a single regular expression match. You’re doing five separate greedy replacements, one after another, not one big greedy match.
