Difference in multi-line match/pattern between ugrep and ripgrep/pcregrep/pcre2grep? #391

AndydeCleyre · 2024-05-14T15:42:28Z

I have a pattern that I use with the non-ugrep tools mentioned to print the entire paragraph/block containing a matched term.

sample.forth:

6 7 * .        \ 42 ok
1360 23 - .    \ 1337 ok
12 12 / .      \ 1 ok
13 2 mod .     \ 1 ok

99 negate .    \ -99 ok
-99 abs .      \ 99 ok
52 23 max .    \ 52 ok
52 23 min .    \ 23 ok

$ rg --multiline '(^[^\n]+\n)*[^\n]*'1337'[^\n]*(\n[^\n]+)*' sample.forth
1:6 7 * .        \ 42 ok
2:1360 23 - .    \ 1337 ok
3:12 12 / .      \ 1 ok
4:13 2 mod .     \ 1 ok

$ pcregrep --multiline '(^[^\n]+\n)*[^\n]*'1337'[^\n]*(\n[^\n]+)*' sample.forth
6 7 * .        \ 42 ok
1360 23 - .    \ 1337 ok
12 12 / .      \ 1 ok
13 2 mod .     \ 1 ok

$ pcre2grep --multiline '(^[^\n]+\n)*[^\n]*'1337'[^\n]*(\n[^\n]+)*' sample.forth
6 7 * .        \ 42 ok
1360 23 - .    \ 1337 ok
12 12 / .      \ 1 ok
13 2 mod .     \ 1 ok

$ ugrep '(^[^\n]+\n)*[^\n]*'1337'[^\n]*(\n[^\n]+)*' sample.forth
6 7 * .        \ 42 ok
1360 23 - .    \ 1337 ok

Is the pattern syntax different, especially when it comes to multiline matching? Or is this a bug in ugrep?

Thanks for any help understanding!

Answered by genivia-inc

genivia-inc (Maintainer) · 2024-05-14T16:01:58Z

The anchor ^ is inside the initial parenthesized repetition, which causes some ambiguity, so move it outside:

$ ugrep '^([^\n]+\n)*[^\n]*'1337'[^\n]*(\n[^\n]+)*' sample.forth
6 7 * .        \ 42 ok
1360 23 - .    \ 1337 ok
12 12 / .      \ 1 ok
13 2 mod .     \ 1 ok

Why is this? Please note that ugrep's default pattern matching is POSIX, which puts some restrictions on regex and anchors because of the internal matching machinery used. I may be able to work around the ^ anchor placement issue in a future update, but I'm not 100% sure.

It is simpler to write this regex with a dot (.) instead of [^\n], because the dot does not match newlines unless explicitly forced to do so with --dotall. That looks nice and tidy:

$ ugrep '^(.+\n)*.*'1337'.*(\n.+)*' sample.forth
6 7 * .        \ 42 ok
1360 23 - .    \ 1337 ok
12 12 / .      \ 1 ok
13 2 mod .     \ 1 ok

Use -P for Perl matching, and then you can keep the ^ inside the parentheses:

$ ugrep -P '(^.+\n)*.*'1337'.*(\n.+)*' sample.forth
6 7 * .        \ 42 ok
1360 23 - .    \ 1337 ok
12 12 / .      \ 1 ok
13 2 mod .     \ 1 ok
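
For anyone who wants to compare the pattern variants outside of the grep tools, here is a minimal sketch using Python's re module. It is an assumption that re with re.MULTILINE is a fair stand-in for the Perl-style matchers (pcregrep, pcre2grep, ugrep -P); it says nothing about ugrep's default POSIX matcher. The helper names paragraph_patterns and find_paragraph are made up for this illustration.

```python
import re

# sample.forth from the question, kept as a raw string so the Forth "\" comments survive.
SAMPLE = r"""6 7 * .        \ 42 ok
1360 23 - .    \ 1337 ok
12 12 / .      \ 1 ok
13 2 mod .     \ 1 ok

99 negate .    \ -99 ok
-99 abs .      \ 99 ok
52 23 max .    \ 52 ok
52 23 min .    \ 23 ok
"""


def paragraph_patterns(term):
    """Build the three pattern variants from this thread around an escaped term."""
    t = re.escape(term)
    return {
        "original, ^ inside the repetition": rf"(^[^\n]+\n)*[^\n]*{t}[^\n]*(\n[^\n]+)*",
        "^ moved outside the repetition": rf"^([^\n]+\n)*[^\n]*{t}[^\n]*(\n[^\n]+)*",
        "dot instead of [^\\n]": rf"^(.+\n)*.*{t}.*(\n.+)*",
    }


def find_paragraph(pattern, text):
    """Return the first match as a string, or None if nothing matches."""
    # re.MULTILINE makes ^ match at the start of every line, not only at offset 0.
    m = re.search(pattern, text, flags=re.MULTILINE)
    return m.group(0) if m else None


if __name__ == "__main__":
    for label, pattern in paragraph_patterns("1337").items():
        print(f"--- {label}")
        print(find_paragraph(pattern, SAMPLE))
```

Under a backtracking engine, all three variants should select the same block, which is consistent with the -P output shown above.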

AndydeCleyre (Author) · 2024-05-14T16:08:45Z

The ^ was intentionally inside the repetition, marking the beginning of each line. But I'm not sure if that's necessary for all tools and cases. I chose the pattern a while ago, aiming for a cross-tool compatible one.

Thank you for this, I will experiment with my use cases and these patterns to see if I can use a single pattern for ripgrep+pcregrep+pcre2grep+ugrep.

genivia-inc (Maintainer) · 2024-05-14T16:24:43Z

Yes, it is an interesting little twist in the way POSIX versus Perl matching differ that can be a bit surprising.

Note that the ^ anchor is not required for your regex pattern, because the pattern starts by matching non-newline characters. The match will begin at a non-newline character either at the start of the input or at the start of a line (i.e. right after a \n), and those are exactly the positions where ^ matches.
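
A quick way to check this claim is to compare the anchored and unanchored patterns on the sample. This is only a sketch with Python's re module (a backtracking engine, assumed here to behave like the Perl-style tools for this pattern); the function name match_spans is hypothetical.

```python
import re

# sample.forth from the question, as a raw string because of the "\" comments.
SAMPLE = r"""6 7 * .        \ 42 ok
1360 23 - .    \ 1337 ok
12 12 / .      \ 1 ok
13 2 mod .     \ 1 ok

99 negate .    \ -99 ok
-99 abs .      \ 99 ok
52 23 max .    \ 52 ok
52 23 min .    \ 23 ok
"""


def match_spans(term, text=SAMPLE):
    """Return (anchored_span, unanchored_span) for the paragraph pattern."""
    t = re.escape(term)
    body = rf"([^\n]+\n)*[^\n]*{t}[^\n]*(\n[^\n]+)*"
    anchored = re.search("^" + body, text, flags=re.MULTILINE)
    unanchored = re.search(body, text, flags=re.MULTILINE)
    return (
        anchored.span() if anchored else None,
        unanchored.span() if unanchored else None,
    )


if __name__ == "__main__":
    for term in ("1337", "42", "min", "negate"):
        with_anchor, without_anchor = match_spans(term)
        print(term, with_anchor, without_anchor, with_anchor == without_anchor)
```

The two spans agree because the leftmost position from which the unanchored pattern can succeed is the first line of the block: nothing in the pattern can cross the empty line in front of it.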

genivia-inc (Maintainer) · 2024-05-14T16:51:00Z

It's a bit more complicated to explain why the ^ causes ambiguity in this regex pattern when used inside a repetition, which "confuses" the pattern matcher. The pattern matcher uses an efficient DFA. The DFA is:

[Figure: DFA diagram for the pattern]

There are two back edges (dashed arrows) labeled BOL (beginning of line, i.e. ^). The first back edge consumes the part before the '1337'. The second back edge goes back to the start state after '1337' is matched. This is the problematic back edge that cuts the match off "too soon". This BOL back edge is ambiguous, because we may or may not have to take that edge, i.e. it is not deterministic.

By contrast, Perl matching with ugrep option -P for PCRE2 uses a backtracking matcher (essentially an NFA) that keeps matching when possible and can deal with this case by backtracking when it reaches a dead end. A backtracking matcher is much slower than a DFA-based matcher.

Now, ugrep does backtrack on anchors and word boundaries to a limited extent to match them when used within regex patterns, but doesn't do it as aggressively as Perl-based backtracking matchers do.
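
To make the backtracking visible, here is a small sketch using Python's re module as a stand-in for a Perl-style backtracking matcher (an assumption; this is not ugrep's matcher and not PCRE2 itself). The leading repetition could greedily take all four lines, but in the final match its last capture is only the first line, because the engine has to give lines back until 1337 can match. The function name leading_capture is made up for this example.

```python
import re

# First paragraph of sample.forth from the question (raw string for the "\" comments).
SAMPLE = r"""6 7 * .        \ 42 ok
1360 23 - .    \ 1337 ok
12 12 / .      \ 1 ok
13 2 mod .     \ 1 ok
"""

# The original pattern, with ^ inside the leading repetition.
PATTERN = r"(^[^\n]+\n)*[^\n]*1337[^\n]*(\n[^\n]+)*"


def leading_capture(text=SAMPLE):
    """Return (last capture of the leading group, whole match)."""
    m = re.search(PATTERN, text, flags=re.MULTILINE)
    # group(1) holds the last iteration of (^[^\n]+\n)* that is part of the
    # final match, i.e. what is left after the engine backtracked.
    return m.group(1), m.group(0)


if __name__ == "__main__":
    last_leading_line, whole_match = leading_capture()
    print("last line kept by the leading repetition:", repr(last_leading_line))
    print("whole match:")
    print(whole_match)
```

A DFA-based matcher has no such "give back and retry" step, which is one way to picture why the BOL back edge described above is a problem for it.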

AndydeCleyre (Author) · 2024-05-14T16:53:09Z

Thanks so much for this!

AndydeCleyre (Author) · 2024-05-14T18:14:28Z

Everything seems all straightened out now, even in my more complicated use cases. I'll just note while I'm here that when using --format, with multiple matches, I needed to explicitly add %~ to the end of the format string to get newlines between matches, whereas the other tools seem to do that implicitly. This is in no way a problem, I'm mentioning it in case it helps others when similarly porting.

Thanks again!

genivia-inc (Maintainer) · 2024-05-14T18:52:10Z

Yes, explicit newlines in formats with %~ are needed because there are use cases where we don't want them added implicitly. With bash you can also use the escaped form \n, as in `--format=$'something\n'`.
