
The streaming implementation does not take into account that a token may be split across two chunks #632

Open
agladysh opened this issue Apr 4, 2023 · 4 comments

agladysh commented Apr 4, 2023

Consider this code in the nearleyc implementation: https://github.com/kach/nearley/blob/master/bin/nearleyc.js#L29

It feeds the fs stream, via a StreamWrapper instance, to the parser's feed() function as-is.

The Parser.feed() implementation calls lexer.reset() with the new chunk.

If the fs stream splits a token between two chunks and we are lucky, we get a lexer error, either at the end of the previous chunk or at the beginning of the next one. (Otherwise the result is valid, but may not be what the user expects, since two tokens silently appear in the result where the input contained one.)
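To make the failure mode concrete, here is a minimal sketch (grammar.js is a hypothetical file standing in for any nearleyc-compiled grammar whose lexer matches FOOBAR as a single keyword token):

const nearley = require('nearley');
const grammar = require('./grammar.js'); // hypothetical compiled grammar

const parser = new nearley.Parser(nearley.Grammar.fromCompiled(grammar));

// A stream boundary happens to fall in the middle of one token:
parser.feed('FOO'); // internally calls lexer.reset('FOO', lexerState)
parser.feed('BAR'); // internally calls lexer.reset('BAR', lexerState)

// Depending on the grammar, this either throws a lexer/parser error or
// silently produces two tokens ('FOO', 'BAR') where the input had one.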

Ideally, fs streams should be handled at the lexer level, not at the parser level. Barring that, the easiest solution would probably be to drop streaming support, at least in nearleyc, and deprecate streaming in the API.

In practice, this issue affects users with longer grammars and longer symbol names.

Here is code to reproduce the problem (it can probably be whittled down to something simpler):

gen-long-grammar.ts

// Emit a grammar whose first rule has many long alternatives, so that
// the output is large enough to be split across stream chunks.
process.stdout.write(`
grammar ->
`);

// Four rule names, each padded to over 100 characters.
const rules = ['FOO', 'FOOBAR', 'FOOBARBAZ', 'FOOBARBAZQUO'].map(
  rule => rule + '_' + 'X'.repeat(100)
);

// 400 alternatives, each a sequence of 1-50 random rule references
// separated by literal " " tokens.
for (let i = 0; i < 400; ++i) {
  process.stdout.write(i === 0 ? ' ' : '|');
  for (let j = 0; j < (Math.random() * 50 | 0) + 1; ++j) {
    process.stdout.write(' ');
    if (j > 0) {
      process.stdout.write('" " ');
    }
    process.stdout.write(rules[rules.length * Math.random() | 0]);
  }
  process.stdout.write('\n');
}

// Define each long rule as a literal string match.
for (const rule of rules) {
  process.stdout.write(`\n${rule} -> "${rule}"`);
}

process.stdout.write('\n');

Running this:

npx tsx gen-long-grammar.ts | npx nearleyc

results in an error similar to this:

Unexpected word token: "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX". 
Instead, I was expecting to see one of the following:
<...>

naraen commented Apr 5, 2023

One way to fix it: the parser already expects production rules to be delimited by newlines. Parser.feed() could stash the characters after the last newline in a chunk and prepend them to the next chunk for processing.

Making the change below at https://github.com/kach/nearley/blob/master/lib/nearley.js#L279 works for your grammar.

This should work without additional ripple effects, since nearleyc appends a newline at the end of the feed for good measure: https://github.com/kach/nearley/blob/master/bin/nearleyc.js#L31.

    var tailPrevChunk = '';
    Parser.prototype.feed = function(chunk) {
        var idxLastNewline = chunk.lastIndexOf('\n');
        if (idxLastNewline === -1) {
            // No newline in this chunk: buffer it whole and wait.
            tailPrevChunk += chunk;
            return;
        }

        // Feed everything up to the last newline, prefixed with the
        // leftover from the previous chunk; hold back the rest.
        var alteredChunk = tailPrevChunk + chunk.substring(0, idxLastNewline);
        tailPrevChunk = chunk.substring(idxLastNewline);

        var lexer = this.lexer;
        lexer.reset(alteredChunk, this.lexerState);

         . . .

Having shared that, it is worth debating whether supporting arbitrarily long production rules is a good idea. It seems to me that a defined maximum length would be a good idea.


agladysh commented Apr 5, 2023

@naraen Thank you, that is a good idea. It will likely fix parsing longer .ne files, but will it fix arbitrary user data fed to a Parser that nearleyc compiled?

One can imagine a valid data format without any newlines (or even without any fixed delimiters). For such corner cases, though, we could let the user configure where the current chunk may safely be cut, as in the sketch below.
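As a purely hypothetical sketch of that idea (feedBuffered and findSafeCut are made-up names, not nearley API):

// Buffer input and only feed the parser up to a caller-defined safe
// cut point, e.g. the last delimiter the format guarantees.
function feedBuffered(parser, state, chunk) {
  state.buffer += chunk;
  const cut = state.findSafeCut(state.buffer); // user-supplied callback
  if (cut <= 0) return; // nothing safe to feed yet
  parser.feed(state.buffer.slice(0, cut));
  state.buffer = state.buffer.slice(cut);
}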

@naraen
Copy link

naraen commented Apr 6, 2023

Agreed. This isn't the right place in the code flow for the fix; I just wanted a quick proof of concept.
Wondering whether https://github.com/kach/nearley/blob/master/lib/stream.js#L14-L15 might be a better place?
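For instance, the same buffering idea could live in StreamWrapper._write, roughly like this (an untested sketch; _final is the standard Node.js Writable hook for end-of-stream):

    StreamWrapper.prototype._write = function write(chunk, encoding, callback) {
        this._buffer = (this._buffer || '') + chunk.toString();
        var idx = this._buffer.lastIndexOf('\n');
        if (idx !== -1) {
            // Feed only complete lines; hold back the partial tail.
            this._parser.feed(this._buffer.slice(0, idx + 1));
            this._buffer = this._buffer.slice(idx + 1);
        }
        callback();
    };

    // Flush whatever remains once the stream ends.
    StreamWrapper.prototype._final = function(callback) {
        if (this._buffer) this._parser.feed(this._buffer);
        callback();
    };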


naraen commented Apr 6, 2023

Opened a PR with the change in StreamWrapper. As far as I can tell, only nearleyc uses the StreamWrapper, so this should fix the identified issue without changing the behavior of generated parsers.
