Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Removal of zero-width matches in generated tree-sitter grammar files, preliminary implementation #475

Open
wants to merge 9 commits into
base: master
Choose a base branch
from

Conversation

chaserhkj
Copy link
Contributor

Tree-sitter has a known limitation of not handling symbols that can match zero-width (a.k.a. "empty") strings. (Documented here) The only exception seems to be the start symbol.

However, in BNFC, zero-width matching symbols are supported and used in many examples and test cases. (e.g. The list macros in BNFC) It would be impractical to not support them in the tree-sitter backend.

It is possible to eliminate all zero-width matching symbols from a grammar: First, all symbols that could match zero-width need to be identified. Then these symbols can be converted to valid tree-sitter symbols by removing the RHS empty branches from its rules. Then all rules of the entire grammar need to be processed, wrapping all of the references to the aforementioned symbol with optional(..). Finally since the addition of optional(..) can populate zero-width-matching to other symbols, this needs to be done repeatedly until converging on a fixed point or the zero-width-matching is populated to the start symbol.

This PR implements a preliminary version of the above-described algorithm. The converging process is not done yet and currently, it only handles simple cases of zero-width-matching where there is only one layer of matching. It should be sufficient enough for most cases like the list-related internal macros provided in BNFC.

@andreasabel
Copy link
Member

andreasabel commented May 30, 2024

Thanks, @chaserhkj!
Sorry for the delay, I was busy teaching in early Spring, and then lost track of this PR.

I gave the testing a try but it seems like most (all?) of the tree-sitter tests fail.

What would be the way to use this backend? Could you write some backend documentation at https://github.com/BNFC/bnfc/blob/master/docs/user_guide.rst ?
Proposed content:

  • how to install tree-sitter (on macOS it is just brew install tree-sitter)
  • how to invoke BNFC to generate a tree-sitter parser (bnfc --tree-sitter GRAMMARFILE)
  • how to further process the generated grammar.js with tree-sitter to obtain a parser
  • maybe how to us the parser in some other code (Calculator example)

Also, we need some CHANGELOG entry.

@andreasabel andreasabel added the tree-sitter Concerning the tree-sitter backend label May 30, 2024
@andreasabel andreasabel added this to the 2.9.6 milestone May 30, 2024
@andreasabel
Copy link
Member

So it seems that the fix here is not complete, e.g. with CF file

separator Integer "  "

and the grammar.js generated by bnfc --tree-sitter ...

$ tree-sitter generate ./grammar.js 
The rule `list_token_Integer` matches the empty string.

Tree-sitter does not support syntactic rules that match the empty string
unless they are used only as the grammar's start rule

@chaserhkj
Copy link
Contributor Author

Sorry I wasn't able to look at this again in the past few months and I kinda forgot where I was.

I think I at least implemented the simplest cases, but I might be wrong. I might not have the immediate time to work on this, but I'll see if I could get back to this (and the documentations) in one week or two.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
tree-sitter Concerning the tree-sitter backend
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants