-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
<regex>
: Fix depth-first and leftmost-longest matching rules
#5218
base: main
Are you sure you want to change the base?
Conversation
After this PR, the matcher even seems to be compatible with Boost down to bugs. As Berglund et al. (2018) argue (and I agree with their assessment), Boost's leftmost-longest rule as specified implies that I think the fix is relatively straightforward (although I haven't tested it sufficiently yet), but I'm undecided whether we should apply it as this would mean deviation from Boost's behavior again: diff --git a/stl/inc/regex b/stl/inc/regex
index 98519f61..da509f6e 100644
--- a/stl/inc/regex
+++ b/stl/inc/regex
@@ -3262,8 +3262,14 @@ bool _Matcher<_BidIt, _Elem, _RxTraits, _It>::_Do_rep(_Node_rep* _Node, bool _Gr
if (0 <= _Node->_Max && _Node->_Max <= _Init_idx) {
_Matched0 = _Match_pat(_Node->_End_rep->_Next); // reps done, try tail
} else if (_Init_idx < _Node->_Min) { // try a required rep
- if (!_Progress) {
- _Matched0 = _Match_pat(_Node->_End_rep->_Next); // empty, try tail
+ if (!_Progress) { // empty match^M
+ if (_Longest) { // try one last repetition^M
+ _Psav->_Loop_idx = _Node->_Min;^M
+ _Psav->_Loop_iter = _STD addressof(_Cur_iter);^M
+ _Matched0 = _Match_pat(_Node->_Next);^M
+ } else { // try tail^M
+ _Matched0 = _Match_pat(_Node->_End_rep->_Next);^M
+ }^M
} else { // try another required match
_Psav->_Loop_idx = _Init_idx + 1;
_Psav->_Loop_iter = _STD addressof(_Cur_iter);
|
I just noticed that this PR fixes #996 as well. I will quickly add the test case to the PR. |
Fixes four bugs in the implementation of the ECMAScript (depth-first) and POSIX (leftmost-longest) matching rules. Fixes #731. Fixes #996.
Bug in ECMAScript matching rule implementation
ECMAScript prescribes depth-first matching. This effectively means: In a disjunction, try left alternative first; for a greedy quantifier, try greatest possible number of repetitions first; for a non-greedy quantifier, try smallest possible number of repetitions first. The matcher is supposed to yield the first match it finds using this strategy.
So why do we get incorrect match results in #731's test cases? Because the matcher deviates from ECMAScript evaluation rules: An optimization has been implemented in
_Matcher::_Do_rep0()
, the main purpose of which seems to be to reduce stack usage. This optimization essentially tries smallest number of repetitions first for greedy quantifiers as well, so the first matches are discovered as if a non-greedy quantifier is used. Consider the test case from #731 (comment) where "AAA BBB" is to be matched with regular expression(A+)\s*(B+)?\s*B*
. If quantifiers are processed smallest-first, then the first discovered full match is the following:(A+)
matches "AAA", first\s*
matches "" (smallest number of repetitions),(\B+)?
matches "", second\s*
matches " " andB*
matches "BBB". This is the wrong result produced byregex_match()
in the test case.However, the implementation of
_Matcher::_Do_rep0()
is somewhat aware that it is matching using the wrong strategy for greedy quantifiers: It doesn't just immediately stop matching when it encounters a first match, but keeps trying to add more repetitions until matching fails. Thus, the matcher will find the correct result as well:(A+)
matches "AAA", first\s*
matches " " (maximal number of repetitions),(\B+)?
matches "BBB" (maximal number of repetitions), second\s*
matches "" andB*
matches "".But when a new match is finally recognized when processing the
_N_end
node in_Matcher::_Match_pat()
, there is no clear way to tell whether this one should have been encountered before during depth-first search. The test in_Matcher::_Better_match()
is certainly wrong for this purpose, since it's implementing leftmost-longest rule (but incorrectly, see below), not the depth-first rule. One needs information about the path taken through the NFA to make this determination correctly, but the required information is buried somewhere in the call stack.This observation suggests the fix: In ECMAScript mode, all recursively called functions in the matcher except
_Matcher::_Do_rep0()
return immediately when a match has been encountered, so the first match is described by_Tgt_state
._Matcher::_Do_rep0()
does not return immediately for greedy quantifiers, since it's processing in smallest-first order, but it keeps track of the value of_Tgt_state
for the greatest number of successfully matching repetitions in the local variable_Final
and finally assigns_Final
to_Tgt_state
before it returns. So this function also ensures that_Tgt_state
describes the first match encountered under depth-first matching rules.So to make this long story short: The fix is that
_Matcher::_Match()
must construct the final result from_Tgt_state
in ECMAScript mode, not_Res
.Since
_Res
is effectively no longer used and_Matcher::_Better_match()
is only relevant for matching according to the leftmost-longest rule, we can also skip the assignment to_Res
and the call of_Matcher::_Better_match()
in_Matcher::_Match_pat()
when depth-first matching is applied.Two bugs in
_Matcher::_Better_match()
's implementation of leftmost-longest matching ruleInterestingly, POSIX regular expressions yield the same wrong result for the (equivalent) regular expression
(A+)[[:space:]]*(B+)?[[:space:]]*B*
: Matching "AAA BBB", the first capture group is "AAA" and the second is unmatched, even though the second capture group should have been "BBB". This is due to bugs in_Matcher::_Better_match()
which were already observed in #1547 (comment):_Better_match()
. After adding this rule, "AAA BBB" gets matched correctly by regular expression(A+)[[:space:]]*(B+)?[[:space:]]*B*
._Better_match()
kind of implemented the opposite rule, "rightmost-longest", because<
was used instead of>
. Matching "aabaac" with(aa|aabaac|ba|b|c)*
exposes the bug: The leftmost-longest match for the first capture group is "aabaac", but "c" is returned without this PR because it is the right-most choice of the first capture group when the whole string is matched. (First repetition matches "aa", second matches "b", third matches "aa", and finally the last one matches "c", which is assigned to the capture group.)Bug in
_Matcher::_Do_rep()
's implementation of leftmost-longest matching ruleWithout this PR,
_Matcher::_Do_rep()
only performs greedy matching when searching for a match under the leftmost-longest rule: When it finds a match with some number of repetitions, it does not check whether a match with fewer repetitions is better (e.g., longer) under the leftmost-longest rule. After this PR,_Matcher::_Do_rep()
now checks any possible number of repetitions.Remarks
MSVC STL's matcher uses the same interpretation of leftmost-longest as Boost (leftmost-longest capture groups, where the capture groups refer to the last matched strings), but the implementation has been quite buggy. This PR fixes the bugs in the implementation, so that the match results now agree with Boost's on the added test cases.
When applying the leftmost-longest rule, we still don't agree with libstdc++ (which sometimes seems to produce results closer to depth-first matching) and libc++ (namely for regular expression
(aa|aabaac|ba|b|c)*
as well as^[[:blank:]]*#([^\\n]*\\\\[[:space:]]+)*[^\\n]*
in awk mode and its sibling in extended mode) on some added test cases. I haven't checked, but there is a good chance that they implement some different interpretation of the leftmost-longest rule. The POSIX standard is surprisingly terse in describing leftmost-longest matching, leaving much room for interpretation.