Circumvent Broken llama.cpp Pre-Tokenizer #892
Merged
Fixes #820
**Problem:**

llama.cpp's pre-tokenizer doesn't handle Unicode properly (draft fix: ggerganov/llama.cpp#5613). This results in tokens that are incompatible with Outlines' byte-wise FSM and causes the error in #820.
**Solution:**

If `models.llamacpp()` specifies a `LlamaHFTokenizer`, populate the vocabulary used in index construction with `tokenizer.get_vocab()`. This takes advantage of huggingface's working pre-tokenizer.

Usage of `LlamaHFTokenizer`:
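A sketch of the intended usage, assuming `llama-cpp-python`'s `LlamaHFTokenizer`; the checkpoint names are illustrative:

```python
import llama_cpp
from outlines import models

# Load GGUF weights through llama.cpp, but tokenize with huggingface's
# pre-tokenizer by passing a LlamaHFTokenizer (checkpoint names illustrative).
model = models.llamacpp(
    "Qwen/Qwen1.5-0.5B-Chat-GGUF",
    "*q8_0.gguf",
    tokenizer=llama_cpp.llama_tokenizer.LlamaHFTokenizer.from_pretrained(
        "Qwen/Qwen1.5-0.5B-Chat"
    ),
)
```

With a tokenizer like this attached, index construction can read the vocabulary from `tokenizer.get_vocab()` instead of from llama.cpp's tokenizer.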
**Debug Notes / Observations:**
- Token `29333` (`b' \xef\xbf\xbd'`) appears with `models.llamacpp`, but not with `models.transformers`.
- `AutoTokenizer`'s `get_vocab()` is inconsistent with its `encode`/`decode` output:
  - `get_vocab()[b'\xc4\xa0\xc3\xaf\xc2\xbf\xc2\xbd'.decode()]` = `29333`
  - `get_vocab()[b' \xef\xbf\xbd'.decode()]` -> `KeyError`
  - `encode(b'\xc4\xa0\xc3\xaf\xc2\xbf\xc2\xbd'.decode())` = `[144242, 37572, 30182, 26062]`
  - `encode(b' \xef\xbf\xbd'.decode())` = `[29333]`
- `tokenizer.get_vocab()` has a distinct mapping from `tokenizer.decode` due to the pre-tokenizer, as the sketch after this list illustrates:
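A minimal reproduction sketch of the last two observations. The notes above don't name the model, so the checkpoint below is a placeholder; the ids in the comments are the ones from the debug session and depend on that model's vocabulary:

```python
from transformers import AutoTokenizer

# Placeholder checkpoint; the debug notes don't name the model used.
tokenizer = AutoTokenizer.from_pretrained("<model-under-test>")
vocab = tokenizer.get_vocab()

surface = b'\xc4\xa0\xc3\xaf\xc2\xbf\xc2\xbd'.decode()  # 'Ġï¿½', byte-level form
raw = b' \xef\xbf\xbd'.decode()                          # ' �', the actual text

# get_vocab() keys are the pre-tokenizer's byte-level surface forms,
# so only the surface form is a valid key:
print(vocab.get(surface))  # 29333
print(vocab.get(raw))      # None (direct indexing raises KeyError)

# encode() runs the pre-tokenizer first, so the behaviour is reversed:
# the raw text maps to one token while the surface form gets split apart:
print(tokenizer.encode(raw))      # [29333]
print(tokenizer.encode(surface))  # [144242, 37572, 30182, 26062]

# decode() likewise returns the raw text, not the vocab key, which is why an
# index built byte-wise from get_vocab() disagrees with decode() output:
inverse_vocab = {i: t for t, i in vocab.items()}
print(inverse_vocab[29333])       # 'Ġï¿½'
print(tokenizer.decode([29333]))  # ' �'
```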