
The streaming implementation does not take into account that a token may be split across two chunks #632

Open
agladysh opened this issue Apr 4, 2023 · 4 comments

agladysh commented Apr 4, 2023

Consider this code in the nearleyc implementation: https://github.com/kach/nearley/blob/master/bin/nearleyc.js#L29

It feeds the fs stream, via a StreamWrapper instance, to the parser's feed() function as-is.

The Parser.feed() implementation calls lexer.reset() with the new chunk.

If the fs stream splits a token between two chunks and we are lucky, we get a lexer error, either at the end of the previous chunk or at the beginning of the next one. (Otherwise the result is valid, but may not be what the user expects, since two tokens silently appear in the result where the input contained one.)
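To make the failure mode concrete, here is a minimal sketch (grammar.js is a hypothetical file standing in for any nearleyc-compiled grammar whose lexer matches FOOBAR as a single keyword token):

const nearley = require('nearley');
const grammar = require('./grammar.js'); // hypothetical compiled grammar

const parser = new nearley.Parser(nearley.Grammar.fromCompiled(grammar));

// A stream boundary happens to fall in the middle of one token:
parser.feed('FOO'); // internally calls lexer.reset('FOO', lexerState)
parser.feed('BAR'); // internally calls lexer.reset('BAR', lexerState)

// Depending on the grammar, this either throws a lexer/parser error or
// silently produces two tokens ('FOO', 'BAR') where the input had one.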

Ideally, fs streams should be handled at the lexer level, not at the parser level. Barring that, the easiest solution would probably be to drop streaming support, at least in nearleyc, and deprecate streaming in the API.

In practice, this issue affects users with longer grammars and longer symbol names.

Here is code to reproduce the problem (it can probably be whittled down to something simpler):

gen-long-grammar.ts

// Emit a grammar whose first rule has many long alternatives, so that
// the output is large enough to be split across stream chunks.
process.stdout.write(`
grammar ->
`);

// Four rule names, each padded to over 100 characters.
const rules = ['FOO', 'FOOBAR', 'FOOBARBAZ', 'FOOBARBAZQUO'].map(
  rule => rule + '_' + 'X'.repeat(100)
);

// 400 alternatives, each a sequence of 1-50 random rule references
// separated by literal " " tokens.
for (let i = 0; i < 400; ++i) {
  process.stdout.write(i === 0 ? ' ' : '|');
  for (let j = 0; j < (Math.random() * 50 | 0) + 1; ++j) {
    process.stdout.write(' ');
    if (j > 0) {
      process.stdout.write('" " ');
    }
    process.stdout.write(rules[rules.length * Math.random() | 0]);
  }
  process.stdout.write('\n');
}

// Define each long rule as a literal string match.
for (const rule of rules) {
  process.stdout.write(`\n${rule} -> "${rule}"`);
}

process.stdout.write('\n');

Running this:

npx tsx gen-long-grammar.ts | npx nearleyc

results in an error similar to this:

Unexpected word token: "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX". 
Instead, I was expecting to see one of the following:
<...>

naraen commented Apr 5, 2023

One way to fix it: the parser already expects production rules to be delimited by newlines. Parser.feed() could stash the characters after the last newline in a chunk and prepend them to the next chunk for processing.

Making the change below at https://github.com/kach/nearley/blob/master/lib/nearley.js#L279 works for your grammar.

This should work without additional ripple effects, since nearleyc appends a newline at the end of the feed for good measure: https://github.com/kach/nearley/blob/master/bin/nearleyc.js#L31.

    var tailPrevChunk = '';
    Parser.prototype.feed = function(chunk) {
        var idxLastNewline = chunk.lastIndexOf('\n');
        if (idxLastNewline === -1) {
            // No newline in this chunk: buffer it whole and wait.
            tailPrevChunk += chunk;
            return;
        }

        // Feed everything up to the last newline, prefixed with the
        // leftover from the previous chunk; hold back the rest.
        var alteredChunk = tailPrevChunk + chunk.substring(0, idxLastNewline);
        tailPrevChunk = chunk.substring(idxLastNewline);

        var lexer = this.lexer;
        lexer.reset(alteredChunk, this.lexerState);

         . . .

Having shared that, it is worth debating whether supporting arbitrarily long production rules is a good idea. It seems to me that a defined maximum length would be a good idea.


agladysh commented Apr 5, 2023

@naraen Thank you, that is a good idea. It will likely fix parsing longer .ne files, but will it fix arbitrary user data fed to a Parser that nearleyc compiled?

One can imagine a valid data format without any newlines (or even without any fixed delimiters). For such corner cases, though, we could let the user configure where the current chunk may safely be cut, as in the sketch below.
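As a purely hypothetical sketch of that idea (feedBuffered and findSafeCut are made-up names, not nearley API):

// Buffer input and only feed the parser up to a caller-defined safe
// cut point, e.g. the last delimiter the format guarantees.
function feedBuffered(parser, state, chunk) {
  state.buffer += chunk;
  const cut = state.findSafeCut(state.buffer); // user-supplied callback
  if (cut <= 0) return; // nothing safe to feed yet
  parser.feed(state.buffer.slice(0, cut));
  state.buffer = state.buffer.slice(cut);
}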

@naraen
Copy link

naraen commented Apr 6, 2023

Agreed. This isn't the right place in the code flow for the fix; I just wanted a quick proof of concept.
Wondering whether https://github.com/kach/nearley/blob/master/lib/stream.js#L14-L15 might be a better place?
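For instance, the same buffering idea could live in StreamWrapper._write, roughly like this (an untested sketch; _final is the standard Node.js Writable hook for end-of-stream):

    StreamWrapper.prototype._write = function write(chunk, encoding, callback) {
        this._buffer = (this._buffer || '') + chunk.toString();
        var idx = this._buffer.lastIndexOf('\n');
        if (idx !== -1) {
            // Feed only complete lines; hold back the partial tail.
            this._parser.feed(this._buffer.slice(0, idx + 1));
            this._buffer = this._buffer.slice(idx + 1);
        }
        callback();
    };

    // Flush whatever remains once the stream ends.
    StreamWrapper.prototype._final = function(callback) {
        if (this._buffer) this._parser.feed(this._buffer);
        callback();
    };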


naraen commented Apr 6, 2023

Opened a PR with the change in StreamWrapper. As far as I can tell, only nearleyc uses the StreamWrapper, so this should fix the identified issue without changing the behavior of generated parsers.
