Circumvent Broken llama.cpp Pre-Tokenizer #892
Merged
Fixes #820
**Problem:**

llama.cpp's pre-tokenizer doesn't handle Unicode properly (draft fix: ggerganov/llama.cpp#5613). This results in tokens that are incompatible with Outlines' byte-wise FSM and causes the error in #820.
**Solution:**

If `models.llamacpp()` specifies a `LlamaHFTokenizer`, populate the vocabulary used in index construction with `tokenizer.get_vocab()`. This takes advantage of huggingface's working pre-tokenizer.

Usage of `LlamaHFTokenizer`:
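A sketch of the intended usage, assuming `llama-cpp-python`'s `LlamaHFTokenizer`; the checkpoint names are illustrative:

```python
import llama_cpp
from outlines import models

# Load GGUF weights through llama.cpp, but tokenize with huggingface's
# pre-tokenizer by passing a LlamaHFTokenizer (checkpoint names illustrative).
model = models.llamacpp(
    "Qwen/Qwen1.5-0.5B-Chat-GGUF",
    "*q8_0.gguf",
    tokenizer=llama_cpp.llama_tokenizer.LlamaHFTokenizer.from_pretrained(
        "Qwen/Qwen1.5-0.5B-Chat"
    ),
)
```

With a tokenizer like this attached, index construction can read the vocabulary from `tokenizer.get_vocab()` instead of from llama.cpp's tokenizer.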
**Debug Notes / Observations:**
- Token `29333` (`b' \xef\xbf\xbd'`) appears with `models.llamacpp`, but not with `models.transformers`.
- `AutoTokenizer`'s `get_vocab()` is inconsistent with its `encode`/`decode` output:
  - `get_vocab()[b'\xc4\xa0\xc3\xaf\xc2\xbf\xc2\xbd'.decode()]` = `29333`
  - `get_vocab()[b' \xef\xbf\xbd'.decode()]` -> `KeyError`
  - `encode(b'\xc4\xa0\xc3\xaf\xc2\xbf\xc2\xbd'.decode())` = `[144242, 37572, 30182, 26062]`
  - `encode(b' \xef\xbf\xbd'.decode())` = `[29333]`
- `tokenizer.get_vocab()` has a distinct mapping from `tokenizer.decode` due to the pre-tokenizer, as the sketch after this list illustrates:
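A minimal reproduction sketch of the last two observations. The notes above don't name the model, so the checkpoint below is a placeholder; the ids in the comments are the ones from the debug session and depend on that model's vocabulary:

```python
from transformers import AutoTokenizer

# Placeholder checkpoint; the debug notes don't name the model used.
tokenizer = AutoTokenizer.from_pretrained("<model-under-test>")
vocab = tokenizer.get_vocab()

surface = b'\xc4\xa0\xc3\xaf\xc2\xbf\xc2\xbd'.decode()  # 'Ġï¿½', byte-level form
raw = b' \xef\xbf\xbd'.decode()                          # ' �', the actual text

# get_vocab() keys are the pre-tokenizer's byte-level surface forms,
# so only the surface form is a valid key:
print(vocab.get(surface))  # 29333
print(vocab.get(raw))      # None (direct indexing raises KeyError)

# encode() runs the pre-tokenizer first, so the behaviour is reversed:
# the raw text maps to one token while the surface form gets split apart:
print(tokenizer.encode(raw))      # [29333]
print(tokenizer.encode(surface))  # [144242, 37572, 30182, 26062]

# decode() likewise returns the raw text, not the vocab key, which is why an
# index built byte-wise from get_vocab() disagrees with decode() output:
inverse_vocab = {i: t for t, i in vocab.items()}
print(inverse_vocab[29333])       # 'Ġï¿½'
print(tokenizer.decode([29333]))  # ' �'
```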