only allowing valid tokenizations in grammars #1

mmoskal opened this issue Jul 26, 2024 · 0 comments

See https://vivien000.github.io/blog/journal/llm-decoding-with-regex-constraints.html and https://github.com/vivien000/regex-constrained-decoding/blob/main/technical_appendix.pdf

Thoughts (unorganized):

- there is an automaton that recognizes all valid sequences of tokens, and moreover this automaton only depends on the previous token; that is, there is a function from a token to the set of valid next tokens (see the sketch after this list)
- by itself this is probably not that useful, since the automaton doesn't exclude enough states; for example, if the regex is for a JSON string and the previous token is `foo`, then the set of allowed next tokens includes `"` and `",`; however, after sampling `"` it will no longer allow `,` (in other words, we would have to look ahead in this automaton)
- [ ] implement `.forced_byte()` method in derivre
- [ ] use this for cheap `.forced_byte()` impl in llguidance
- [ ] while walking the token trie, remember all forced paths (there shouldn't be too many of them)
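
A minimal sketch of the first point, assuming a generic tokenizer interface (the `Tokenizer` trait and `valid_next_tokens` are hypothetical, not toktrie/llguidance APIs): a pair `(p, t)` is treated as valid exactly when re-tokenizing the concatenated bytes gives back `[p, t]`.

```rust
use std::collections::HashMap;

/// Hypothetical tokenizer interface; the real toktrie/llguidance types differ.
trait Tokenizer {
    fn n_tokens(&self) -> u32;
    fn token_bytes(&self, tok: u32) -> &[u8];
    /// Canonical tokenization of `bytes` (e.g. greedy/BPE).
    fn tokenize(&self, bytes: &[u8]) -> Vec<u32>;
}

/// Precompute, for every token `p`, the tokens `t` that may follow it in a
/// canonical tokenization, under the simplifying assumption from the linked
/// post that validity of the pair depends only on `p` and `t`.
/// Quadratic in vocabulary size; purely illustrative.
fn valid_next_tokens(tok: &dyn Tokenizer) -> HashMap<u32, Vec<u32>> {
    let mut map = HashMap::new();
    for p in 0..tok.n_tokens() {
        let mut allowed = Vec::new();
        for t in 0..tok.n_tokens() {
            let mut bytes = tok.token_bytes(p).to_vec();
            bytes.extend_from_slice(tok.token_bytes(t));
            // (p, t) is valid iff re-tokenizing the concatenation yields [p, t].
            if tok.tokenize(&bytes) == [p, t] {
                allowed.push(t);
            }
        }
        map.insert(p, allowed);
    }
    map
}
```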

The tokens we most need to discard will be along the forced path; for example, after `"` the `,` is forced. Note that if the grammar allows white space between `"` and `,`, there are no forced paths, and moreover the token `"` should still be allowed (unless there are tokens `" `, `"\n`, `"\t`, etc. covering all of white space; but I think this is very unlikely).
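
For the `.forced_byte()` tasks above, a minimal sketch of the intended semantics, assuming a generic byte-transition interface (derivre's actual API is likely different): a byte is forced when it is the only byte that keeps the match alive.

```rust
/// Hypothetical regex-state interface; derivre's actual API may differ.
trait RegexState: Sized {
    /// State reached after consuming byte `b`, or None if no match can follow.
    fn transition(&self, b: u8) -> Option<Self>;
}

/// A byte is "forced" when it is the only byte that keeps the match alive.
/// Returns Some(b) in that case, None if zero or several bytes are viable.
fn forced_byte<S: RegexState>(state: &S) -> Option<u8> {
    let mut forced = None;
    for b in 0..=255u8 {
        if state.transition(b).is_some() {
            if forced.is_some() {
                return None; // at least two viable bytes: nothing is forced
            }
            forced = Some(b);
        }
    }
    forced
}
```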

In the toktrie walk, if we encounter a forced byte, we go into forced mode, where we just chase all forced bytes. The first token we find on this path is put on a separate list. We do not add any of these tokens to the allow set.
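
A sketch of that forced-mode walk, with hypothetical `TrieNode` and `Constraint` types standing in for the real toktrie and grammar state:

```rust
use std::collections::HashMap;

/// Hypothetical trie node; the real toktrie layout is a flat array, not this.
#[derive(Default)]
struct TrieNode {
    children: HashMap<u8, TrieNode>,
    token_id: Option<u32>, // set when the path from the root spells a token
}

/// Hypothetical grammar/constraint state exposing the forced-byte query.
trait Constraint: Sized {
    /// Some(b) if `b` is the only byte the grammar allows next.
    fn forced_byte(&self) -> Option<u8>;
    /// State after consuming `b`, or None if `b` is not allowed.
    fn advance(&self, b: u8) -> Option<Self>;
}

/// Once a forced byte is seen, chase the chain of forced bytes and return the
/// first token ending on that path. The caller keeps it on a side list rather
/// than adding it (or anything longer on this path) to the allow set.
fn chase_forced<C: Constraint>(mut node: &TrieNode, mut state: C) -> Option<u32> {
    while let Some(b) = state.forced_byte() {
        node = node.children.get(&b)?; // forced path may leave the trie
        state = state.advance(b)?;     // should succeed for a forced byte
        if let Some(tok) = node.token_id {
            return Some(tok);          // first token on the forced path
        }
    }
    None
}
```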

Then, after the token trie walk, for every token on this list we re-create the forced byte string, tokenize it, chop off excessive tokens, add the first token of the tokenization to the allow set, and record the remaining tokens (if any) as a conditional splice.
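
A sketch of that post-walk step, with a hypothetical `Splice` record; in particular, "chop excessive tokens" is read here as capping the splice length, which is an assumption.

```rust
/// Hypothetical splice record; names are illustrative, not llguidance's API.
struct Splice {
    when_sampled: u32,    // token added to the allow set
    then_force: Vec<u32>, // tokens appended automatically if it is sampled
}

/// For each forced byte string collected during the trie walk: tokenize it,
/// chop excessive tokens (read here as capping the splice length), allow the
/// first token, and record the rest as a conditional splice.
fn splices_for_forced_paths(
    tokenize: impl Fn(&[u8]) -> Vec<u32>,
    forced_strings: &[Vec<u8>],
    max_splice_len: usize,
    allow_set: &mut Vec<u32>,
) -> Vec<Splice> {
    let mut splices = Vec::new();
    for bytes in forced_strings {
        let mut toks = tokenize(bytes);
        toks.truncate(max_splice_len);
        if let Some((&first, rest)) = toks.split_first() {
            allow_set.push(first);
            splices.push(Splice { when_sampled: first, then_force: rest.to_vec() });
        }
    }
    splices
}
```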

Transferred from hudson-ai/guidance#13
