regex – Why does R appear to be a lazy match
You are calling `stri_replace_all_regex` with four arguments:
- `a` is length 3. That’s the `str` argument.
- `"\\b" %s+% b %s+% "\\S+"` is length 5. (It would be a lot easier to read if you had used `paste0("\\b", b, "\\S+")`, but that’s beside the point.) That’s the `pattern` argument.
- `b` is length 5. That’s the `replacement` argument.
- The last argument is `vectorize_all=FALSE`.
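As a small illustration of why the pattern argument has one element per element of `b`, here is a sketch using `paste0`, which is vectorized over its arguments. The vector `b` below is hypothetical and shorter than the one in the question; only its first and last two values are taken from the discussion here.

```r
# Hypothetical replacement vector (the asker's real `b` has five elements).
b <- c("ab", "mnb", "mn")

# paste0() recycles the scalar strings, producing one pattern per element of b.
pattern <- paste0("\\b", b, "\\S+")

print(pattern)
print(length(pattern) == length(b))
```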
What it tries to do is documented as follows:
> However, for `stri_replace_all*`, if `vectorize_all` is `FALSE`, then each substring matching any of the supplied patterns is replaced by a corresponding replacement string. In such a case, the vectorization is over `str`, and – independently – over `pattern` and `replacement`. In other words, this is equivalent to something like `for (i in 1:npatterns) str <- stri_replace_all(str, pattern[i], replacement[i])`. Note that you must set `length(pattern) >= length(replacement)`.
That’s pretty sloppy documentation (I want to know what it does, not “something like” what it does!), but I think the process is as follows:
Your first pattern is `"\\bab\\S+"`. That says “word boundary followed by `ab` followed by one or more non-whitespace chars”. That matches all of `a[1]`, so `a[1]` is replaced by `b[1]`, which is `"ab"`. It then tries the four other patterns, but none of them match, so you get `"ab"` as output.
The handling of `a[3]` is more complicated. The first match replaces it with `"mnb"`, based on `pattern[4]`. Then a second replacement happens, because `"mnb"` matches `pattern[5]`, and it gets changed again to `"mn"`.
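The chained replacement can be traced with an explicit loop. This is only a sketch of the process as I understand it: the input string is hypothetical, only the last two patterns are included, and base R’s `gsub()` stands in for `stri_replace_all_regex` so that it runs without stringi installed.

```r
# Hypothetical input that starts with "mnb"; not the asker's actual a[3].
x <- "mnbvcx"

# The two relevant replacements, with patterns built the same way as in the question.
replacement <- c("mnb", "mn")
pattern <- paste0("\\b", replacement, "\\S+")

# Apply the patterns one at a time, recording the string after each step.
trace <- x
for (i in seq_along(pattern)) {
  x <- gsub(pattern[i], replacement[i], x, perl = TRUE)
  trace <- c(trace, x)
}

print(trace)
```

The second pattern only matches because the first replacement has already shortened the string to something beginning with `mn`.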
When you say R defaults to greedy matching, that’s when doing a single regular expression match. You’re doing five separate greedy matches, not one big greedy match.
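For comparison, here is what greedy versus lazy means inside a single match. The input is hypothetical, and the sketch uses base R with `perl = TRUE` rather than stringi’s ICU engine; for patterns this simple I would expect both engines to behave the same way.

```r
# Hypothetical input string, chosen only for illustration.
x <- "mnbvcx"

# Greedy: \S+ consumes as many non-whitespace characters as it can.
greedy <- regmatches(x, regexpr("\\bmn\\S+", x, perl = TRUE))

# Lazy: \S+? stops as soon as the overall match can succeed (one character here).
lazy <- regmatches(x, regexpr("\\bmn\\S+?", x, perl = TRUE))

print(greedy)
print(lazy)
```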