sub and gsub when the second argument is a stream #89

pkoppstein · 2023-07-04T07:50:17Z

Consider the discrepancy between the number of results produced by these two invocations of jaq:

jaq -n '"x"|sub("x"; "a", "b")'
"a"
"b"

whereas:

jaq -n '"x"|sub("y"; "a", "b")'
"x"

In the language of "regularity" (*), we can say that the sub and gsub
filters are regular in the second argument whenever a match is made.
Given that this is clearly intended, it is apparent that the behavior for
non-matches is at best anomalous, and at worst a bug.

Unfortunately, the waters are muddied here somewhat by the fact
that currently this anomaly can be found in jq and gojq as well.

What should the resolution of this anomaly be?

(*) If E is an expression, f a jq filter of arity 1, A a non-empty
array of JSON values, then f is said to be regular in the first
argument if, for all E and A:

[E | f(A[])] == [E | A[] as $a | f($a)]

The concept generalizes to filters of any arity greater than 0, and to an position.

The text was updated successfully, but these errors were encountered:

baodrate · 2023-07-06T23:38:52Z

Did a bit of trawling through the git histories because I wanted to see if
there was a clear intent and to compare the implementations. In case anyone
else is interested in the implementation history of this filter:

sub was first implemented in jq as part of: jqlang/jq@a696c6b

The specific lines:

def sub(re; s):
  . as $in
  | [match(re)]
  | .[0]
  | . as $r
  # create the \"capture\" object:
  | reduce ( $r | .captures | .[] | select(.name != null) | { (.name) : .string } ) as $pair
      ({}; . + $pair)
  | if . == {} then $in | .[0:$r.offset]+s+.[$r.offset+$r.length:]
    else (. | s)
    end ;

with the description:

sub(regex; tostring)

Emit the string obtained by replacing the first match of regex in the
input string with tostring, after interpolation. tostring should
be a jq string, and may contain references to named captures. The
named captures are, in effect, presented as a JSON object (as
constructed by capture) to tostring, so a reference to a captured
variable named "x" would take the form: "(.x)".

Other (minor) updates

length check bug: jqlang/jq#586
fixed in jqlang/jq@85f0e30

diff --git a/builtin.c b/builtin.c
index 6f66894..7ca0f81 100644
--- a/builtin.c
+++ b/builtin.c
@@ -1051,17 +1051,19 @@ static const char* const jq_builtins[] = {
   "def sub($re; s):"
   "  . as $in"
   "  | [match($re)]"
-  "  | .[0]"
-  "  | . as $r"
-     //  # create the \"capture\" object:
-  "  | reduce ( $r | .captures | .[] | select(.name != null) | { (.name) : .string } ) as $pair"
-  "      ({}; . + $pair)"
-  "  | if . == {} then $in | .[0:$r.offset]+s+.[$r.offset+$r.length:]"
-  "    else (. | s)"
+  "  | if length == 0 then $in"
+  "    else .[0]"
+  "    | . as $r"
+       //  # create the \"capture\" object:
+  "    | reduce ( $r | .captures | .[] | select(.name != null) | { (.name) : .string } ) as $pair"
+  "        ({}; . + $pair)"
+  "    | if . == {} then $in | .[0:$r.offset]+s+.[$r.offset+$r.length:]"
+  "      else (. | s)"
+  "      end"
   "    end ;",

Substitution bug: jqlang/jq#600
jqlang/jq@30e0082

diff --git a/builtin.c b/builtin.c
index ddd207b..0629ca9 100644
--- a/builtin.c
+++ b/builtin.c
@@ -1057,9 +1057,7 @@ static const char* const jq_builtins[] = {
        //  # create the \"capture\" object:
   "    | reduce ( $r | .captures | .[] | select(.name != null) | { (.name) : .string } ) as $pair"
   "        ({}; . + $pair)"
-  "    | if . == {} then $in | .[0:$r.offset]+s+.[$r.offset+$r.length:]"
-  "      else (. | s)"
-  "      end"
+  "    | $in[0:$r.offset] + s + $in[$r.offset+$r.length:]"
   "    end ;",
   //
   // repeated substitution of re (which may contain named captures)

Meanwhile sub was originally implemented in jaq in 491b332

def sub(re; f; flags): . as $s |
  reduce captures(re; flags)[], null as $m
  ( { i: 0, s: "" };
    if $m == null
    then .s += $s[.i:]
    else .s += $s[.i:$m[0].offset] + ($m | capture_of_match | f) |
         .i  = ($m[0] | .offset + .length)
    end
  ) | .s;

Split into separate functions

c05055f

make_builtin!("split_matches", 2, |r, f| Self::Matches(r, f, (true, true))),

...

Self::Matches(re, flags, sm) => {
    Box::new(Self::cartesian(flags, re, (cv.0, cv.1.clone())).map(
        move |(flags, re)| Ok(Val::Arr(cv.1.captures(&re?, &flags?, *sm)?.into())),
    ))
}

...

pub fn captures(&self, re: &Self, flags: &Self, sm: (bool, bool)) -> Result<Vec<Val>, Error> {
    let s = self.as_str()?;
    let flags = Flags::new(flags.as_str()?).map_err(Error::RegexFlag)?;
    let re = flags.regex(re.as_str()?).unwrap();
    let (split, matches) = sm;

    let mut last_byte = 0;
    let mut bc = ByteChar::new(s);

    let captures = re.captures_iter(s);
    let len = re.captures_len();

    let cap = if split { len + 1 } else { 0 } + if matches { len } else { 0 };
    let mut out = Vec::with_capacity(cap);

    for c in captures {
        let whole = c.get(0).unwrap();
        if whole.start() >= s.len() || (flags.n && whole.as_str().is_empty()) {
            continue;
        }
        let vs = c
            .iter()
            .zip(re.capture_names())
            .filter_map(|(match_, name)| Some(capture(&mut bc, match_?, name)));
        if split {
            out.push(Val::Str(s[last_byte..whole.start()].to_string().into()));
            last_byte = whole.end();
        }
        if matches {
            out.push(Val::Arr(vs.collect::<Vec<_>>().into()));
        }
        if !flags.g {
            break;
        }
    }
    if split {
        out.push(Val::Str(s[last_byte..].to_string().into()));
    }
    Ok(out)
}

def sub(re; f; flags): reduce split_matches(re; flags)[] as $x
  (""; . + if $x | isstring then $x else $x | capture_of_match | f end);

baodrate · 2023-07-06T23:50:20Z

Ah, I totally missed your recent addition to jq: jqlang/jq#2641

That (and the linked issues/comments) def helps clarify things

pkoppstein · 2023-07-07T07:26:11Z

@baud-rate - Currently (July 7), both jq and jqMaster give the same results as jaq.
This was surprising to me as jaq has a reputation for correctness.
That is, given the general agreement that the second arg should be "regular" in the case of matches,
the jaq behavior in the case of a mismatch struck me as very odd, so I was wondering if this behavior
was chosen for consistency with jq and/or gojq. Thanks.

baodrate · 2023-07-07T23:42:04Z

Yes, it just wasn't clear to me the significance of how "regular" the function was, until I read the discussion in this issue: jqlang/jq#513

Although I think I'm still missing a few logical steps from reaching the conclusion that the regularity you described is "clearly intended". These are the cases described in the documentation:

"abc" | sub("b"; "x")
"axc"

"abc" | sub("(?<x>b)c"; "\(.x)Y")
"abY"

Naively, there is no situation I can think of where it's useful for sub, intended to perform substring substitutions, to consume a value and produce multiple. Implementation wise I'm not sure how we would enforce otherwise, however, given the requirement that sub/2 support both simple string parameters as well as a filter that consumes a capture object. But this doesn't imply that it's either intended or useful for sub to support filters that produce multiple values. Do you mind elaborating on this?

pkoppstein · 2023-07-08T03:07:33Z

@baod-rate - On the question of clarity: In the case of a match, the def of sub (whether in jq or gojq) makes it clear that (barring errors), sub(_; $x1, $x2, ..., $xn) should emit n values. gojq's author emphasized that to me too. What's not clear (at least to me) is whether there is any reason why sub should behave differently in the case of a mis-match.(*) As I indicated, I was mainly interested in whether the decision about jaq's behavior in the case of a mismatch was influenced by jq's def or behavior.

On the question of utility - I cannot really say much except that the documentation should be clear :-)

For reference, in the issue you cited (Syntactic sugar for distinguishing value from closure arguments jqlang/jq#513) stedolan wrote:

There should be no functions that operate on "one value", only those that operate on one value at a time, and produce as many results as their arguments produce.

01mf02 · 2023-07-17T05:41:13Z

I would say that this issue scratches the top of the iceberg, because the behaviour of multiple values for regex filters is quite strange even within jq itself. Observe:

$ ./jq -nc '"abAB" | [scan("a", "b"; "", "i")]'
["a","a","A","b","b","B"]
$ ./jq -nc '"abAB" | [("a", "b") as $re | ("", "i") as $flags | scan($re; $flags)]'
["a","a","A","b","b","B"]

Here, when we pass multiple values as regex and flags, we get the same as binding (1) regex and (2) flags to variables, then evaluate the filter. Fair enough. Let's try match:

$ ./jq -nc '"ABab" | [match("a", "b"; "", "i") | .string]'
["a","b","A","B"]
$ ./jq -nc '"ABab" | [("a", "b") as $re | ("", "i") as $flags | match($re; $flags) | .string]'
["a","A","b","B"]

What? Doing the same thing as above, we get different results. Why?
It is because jq for match first evaluates the flags, and only then the regex, whereas for scan, it was the other way around.

This is of course utterly confusing --- however, this probably got unnoticed, because I suppose that few people actually tried to venture into the land of horror where regex filters and multiple output values are joined in unholy union.

I would propose to fix an evaluation order for all regex filters (test, scan, match, capture, split, sub), namely evaluating arguments from left to right. That means that for sub(re, string, flags), its output should be equivalent to re as $re | string as $string | flags as $flags | sub($re, $string, $flags). Similar for the other regex filters.

It would be best if jq/gojq could be made to do the same.

What do you think?

01mf02 · 2023-07-17T06:20:20Z

When I'm already at it: The next source of confusion is implicit global search.
For example, I just discovered that match does not imply g, whereas scan does:

$ jq -nc '"aa" | [match("a"; "").string], [match("a"; "g").string]'
["a"]
["a","a"]
$ jq -nc '"aa" | [scan("a"; "")], [scan("a"; "g")]'
["a","a"]
["a","a"]

This is, as far as I can see, also not documented. jaq currently implements most filters without g on by default, except for split. Until now, I was not even aware that there were other filters than split that enabled g by default. So the current state of affairs even in jaq is not satisfactory.

If I could break backwards compatibility in jq, then I would disable g by default for all regex filters, and require g to be passed if it should be enabled. But this would probably break so many filters that people use, because even split has g on by default ...

01mf02 · 2023-07-17T10:15:41Z

I would propose to fix an evaluation order for all regex filters (test, scan, match, capture, split, sub), namely evaluating arguments from left to right. That means that for sub(re, string, flags), its output should be equivalent to re as $re | string as $string | flags as $flags | sub($re, $string, $flags). Similar for the other regex filters.

Oh, I just realised by @pkoppstein's comment (#75 (comment)) that the string argument to sub is actually given as input the matching parts of the string input of sub.

So let me ask you: What should be the output of:

"a1bc2" | gsub("(?<x>[a-z])[0-9]"; .x, .x+"X")

The current output of jaq: "abc", "abcX", "aXbc", "aXbcX". (This is because for every output of the string argument, jaq branches and combines it with all other outcomes, similarly to (1,2) + (3, 4).)
The current output of jq: "abc", "aXbcX". (I have no idea why this is output.)
Taking only the first output of string: "abc".
Concatenating all outputs of string: "aXbcX".
Better ideas?

I opted for 1) because it is very simple to implement and does not incur overhead if string outputs only one value, which is the most likely use case.

pkoppstein · 2023-07-17T10:32:58Z

What should be the output of:

'"a1bc2" | gsub("(?[a-z])[0-9]"; .x, .x+"X")'

Evidently you meant: (?<x> … )
[EDIT: nonsense removed]There isn’t supposed to be anything magical or bizarre about gsub.

01mf02 · 2023-07-17T11:02:03Z

Evidently you meant: (?<x> … )

Yes, thanks for spotting this!

E | gsub($string; stream) should be just E | stream as $s | gsub($string; $s), as you’d expect for any filter that’s regular in the second argument. There isn’t supposed to be anything magical or bizarre about gsub.

That's also what I thought first, but then your example (#75 (comment)) does not work anymore: Consider the version before and after the transformation that you propose:

$ ./jq -n '"a1bc2" | gsub("(?<x>[a-z])[0-9]"; .x|ascii_upcase)'
"AbC"
$ ./jq -n '"a1bc2" | (.x|ascii_upcase) as $s | gsub("(?<x>[a-z])[0-9]"; $s)'
jq: error (at <unknown>): Cannot index string with string "x"

pkoppstein · 2023-07-17T11:27:15Z

My apologies. My attempt to formalize the main idea was wrong. But there is not supposed to be anything magical about gsub. Let me try again.

Given gsub($regex; s1, s2, …), the result should be the sequence gsub($regex; si) as i ranges from 1 on.

01mf02 · 2023-08-29T09:00:34Z

Given gsub($regex; s1, s2, …), the result should be the sequence gsub($regex; si) as i ranges from 1 on.

This might work if si would always be constant strings, but they might depend on their input; that is, substrings matching the regex.
That's why I think what you would like is not compatible with the current behaviour of jq.

There's nothing magical about sub, by the way. It is implemented by definition in jaq in three (hopefully comprehensible) lines. You may look at its implementation in #110.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sub and gsub when the second argument is a stream #89

sub and gsub when the second argument is a stream #89

pkoppstein commented Jul 4, 2023

baodrate commented Jul 6, 2023

baodrate commented Jul 6, 2023

pkoppstein commented Jul 7, 2023

baodrate commented Jul 7, 2023

pkoppstein commented Jul 8, 2023

01mf02 commented Jul 17, 2023

01mf02 commented Jul 17, 2023

01mf02 commented Jul 17, 2023 •

edited

Loading

pkoppstein commented Jul 17, 2023 •

edited

Loading

01mf02 commented Jul 17, 2023 •

edited

Loading

pkoppstein commented Jul 17, 2023

01mf02 commented Aug 29, 2023 •

edited

Loading

sub and gsub when the second argument is a stream #89

sub and gsub when the second argument is a stream #89

Comments

pkoppstein commented Jul 4, 2023

baodrate commented Jul 6, 2023

baodrate commented Jul 6, 2023

pkoppstein commented Jul 7, 2023

baodrate commented Jul 7, 2023

pkoppstein commented Jul 8, 2023

01mf02 commented Jul 17, 2023

01mf02 commented Jul 17, 2023

01mf02 commented Jul 17, 2023 • edited Loading

pkoppstein commented Jul 17, 2023 • edited Loading

01mf02 commented Jul 17, 2023 • edited Loading

pkoppstein commented Jul 17, 2023

01mf02 commented Aug 29, 2023 • edited Loading

01mf02 commented Jul 17, 2023 •

edited

Loading

pkoppstein commented Jul 17, 2023 •

edited

Loading

01mf02 commented Jul 17, 2023 •

edited

Loading

01mf02 commented Aug 29, 2023 •

edited

Loading