
State of discarding tokenizers is sometimes not saved #628

Open
rantvm opened this issue Nov 23, 2022 · 0 comments
rantvm commented Nov 23, 2022

I have observed that the parser sometimes ignores the state of tokenizers that silently discard some tokens. In particular, the state is ignored if the first input chunk(s) consist only of discarded tokens. This causes the position information of subsequent tokens to become desynchronized from the input. Below is an example of a tokenizer next() method that exhibits this behaviour.

const discard = { "whitespace": true, "comment": true };

function next() {
    let token;
    do {
        // readToken() stands in for pulling the next raw token
        // from the input buffer.
        token = readToken();
    } while (token && discard[token.type]);
    return token;
}
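
For context, here is a minimal sketch of a reproduction. It assumes a hypothetical compiled grammar module (./grammar.js) and uses a stripped-down stand-in lexer implementing nearley's documented custom-lexer interface (reset/next/save/formatError/has); a real tokenizer would of course be more involved.

const nearley = require("nearley");
// "./grammar.js" is a hypothetical compiled grammar module.
const grammar = require("./grammar.js");

const discard = { "whitespace": true };

// A toy lexer that discards whitespace tokens without returning
// them to the parser, as in the next() snippet above.
const lexer = {
    reset(chunk, info) {
        this.buffer = chunk;
        this.index = 0;
        this.line = info ? info.line : 1;
        this.col = info ? info.col : 1;
        return this;
    },
    save() {
        return { line: this.line, col: this.col };
    },
    next() {
        let token;
        do {
            token = this.readToken();
        } while (token && discard[token.type]);
        return token;
    },
    // Toy single-character tokens, for brevity.
    readToken() {
        if (this.index >= this.buffer.length) return undefined;
        const value = this.buffer[this.index++];
        const token = {
            type: /\s/.test(value) ? "whitespace" : "char",
            value: value,
            line: this.line,
            col: this.col,
        };
        this.col++; // single-line input, for brevity
        return token;
    },
    formatError(token) {
        return "at line " + token.line + " col " + token.col;
    },
    has(name) {
        return true;
    },
};

const parser = new nearley.Parser(nearley.Grammar.fromCompiled(grammar), { lexer: lexer });

// The first chunk consists only of discarded tokens, so the parser never
// receives a token and (per this issue) never calls lexer.save(). The
// second feed() therefore resets the lexer with a stale (undefined) state,
// and the "a" token reports col 1 instead of col 4.
parser.feed("   ");
parser.feed("ab");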

The cause appears to be the if-statement below, in combination with the defined behaviour of lexer.reset(chunk, info).

nearley/lib/nearley.js

Lines 356 to 358 in 6e24450

if (column) {
    this.lexerState = lexer.save()
}

This statement seems to assume that if there have been no tokens so far, there is no tokenizer state worth saving. Simply always executing this.lexerState = lexer.save() resolves the issue. There may be circumstances (of which I am unaware) where the current behaviour is required, so it may be prudent to define a parser option that causes the tokenizer state to always be stored.
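
Concretely, the suggested change would replace the guarded save shown above with an unconditional one:

// replacing the `if (column)` guard shown above
this.lexerState = lexer.save()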
