A missing space between <s> and [INST] #3

keskival · 2024-04-19T10:46:50Z

I believe there should be a space before [INST] here:

mistral-common/src/mistral_common/tokens/tokenizers/sentencepiece.py

Line 167 in fcf0316

message_txt = f"[INST] {content} [/INST]"

This is validated by testing. We cannot make the model emit its proper instruct template because it seems to be disinclined to emit it, probably because of training details.

However, the model is very keen on emitting spurious </s> if this space is omitted, especially after numbers such as phone numbers, address numbers and dates.

Notably, this page gives two different chat templates, one matching the one in this repository written out as such, and the pseudocode below it which is missing the spaces after [INST] and after [/INST]:
https://docs.mistral.ai/getting-started/open_weight_models/#chat-template

I believe both of these examples are actually incorrect.

Then there is this source:
https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1

It shows the space between <s> and [INST], which I believe is correct. Adding that space to the template gives superior output quality based on qualitative tests.

However, the pseudocode given in that source matches the pseudocode from Mistral documentation and is different from what is shown, and I believe incorrect as well.

Please look into this, and be clear what the chat template actually is; whitespace is very very important.

The text was updated successfully, but these errors were encountered:

keskival · 2024-04-19T10:53:30Z

Here is an example message which consistently cuts off with Mixtral-8x7B after numbers when a bad chat template without the space is used:
example_message.txt

The information in the message is synthetic; no real personal information. Message embedded in Python code to remove ambiguities about escapings and linefeeds.

timlacroix · 2024-04-19T16:45:27Z

Hello ! Agreed whitespaces are frustratingly important in templates (which is why we've moved to control tokens in our newest templates).

I am far from an HF chat template expert but my understanding is :

"<s>[INST]" won't be tokenized as [bos] + encode("[INST]") by HF tokenizer. To achieve this you need a space between the two strings.
in our template, we explicitely create the list of tokens as [bos] + encode("[INST] ... [/INST]") which is fine.

Your question prompted me to run the same test we have here : https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1#instruct-tokenizer but with v1 compared with the mixtral-8x7b-instruct-v0.1 on HF.

The template on HF is wrong. Our ref implementation in this repo is correct. This is what I tested:

from mistral_common.protocol.instruct.messages import (
    AssistantMessage,
    UserMessage,
)
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from mistral_common.tokens.instruct.normalize import ChatCompletionRequest

from transformers import AutoTokenizer

tokenizer_v1 = MistralTokenizer.v1()

mistral_query = ChatCompletionRequest(
    messages=[
        UserMessage(content="1"),
        AssistantMessage(content="2"),
        UserMessage(content="3"),
    ],
    model="test",
)
tokenized = tokenizer_v1.encode_chat_completion(mistral_query)

tokenizer_hf = AutoTokenizer.from_pretrained('mistralai/Mixtral-8x7B-Instruct-v0.1')
hf_messages = mistral_query.model_dump()['messages']
tokenized_hf = tokenizer_hf.apply_chat_template(hf_messages, tokenize=True)

print("MISTRAL_COMMON")
print(tokenized.text)
print(tokenizer_hf.convert_ids_to_tokens(tokenized_mistral))
print("HF : MIXTRAL-8x7B")
print(tokenizer_v1.instruct_tokenizer.tokenizer.to_string(tokenized_hf))
print(tokenizer_hf.convert_ids_to_tokens(tokenized_hf))

tokenized_mistral = tokenized.tokens
assert tokenized_hf == tokenized_mistral

The output is

MISTRAL_COMMON
<s>▁[INST]▁1▁[/INST]▁2</s>▁[INST]▁3▁[/INST]
['<s>', '▁[', 'INST', ']', '▁', '1', '▁[', '/', 'INST', ']', '▁', '2', '</s>', '▁[', 'INST', ']', '▁', '3', '▁[', '/', 'INST', ']']
HF : MIXTRAL-8x7B
<s>▁[INST]▁1▁[/INST]2</s>▁▁[INST]▁3▁[/INST]
['<s>', '▁[', 'INST', ']', '▁', '1', '▁[', '/', 'INST', ']', '2', '</s>', '▁[', 'INST', ']', '▁', '3', '▁[', '/', 'INST', ']']

The only error I can see in the chat template on HF is the missing space before [/INST] and the assistant message
Fun fact, the template for Mistral-7B-Instruct is wronger, there is an extra space between and [INST] ...

I would extremely gladly accept a PR that fixes it on all our 7B / 8x7B repos, hf tokenizers are a bit mysterious to me !

Another question since you're saying 8x7B behaves badly with mistral-common tokenization :

Once you've tokenized things , how exactly are you calling the model ?

Thanks !

keskival · 2024-04-25T12:56:56Z

We are calling the model through VLLM with a custom chat template.
As a tokenizer we use transformers and AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1").

We don't actually add the <s>, because it's added by the tokenizer. Regardless, it seems the space between the <s> and the [INST] is important.

What I need here is the confirmation that the chat template we use is correct from Mistral team, because they are the only ones who are able to check what their real chat template is in the training pipelines.
Also, it would be great if they corrected the documentations which are at the very least conflictory if not incorrect.

pandora-s-git · 2024-06-17T13:51:49Z

It actually depends on the tokenizer, after some verifications, it seems like almost all HF templates need to be rewritten (im working on it using mistral-common as ground truth) but basically there are a few differences between the tokenizer, and so, between the versions of each model.
Here are some examples with each tokenizer and a temporary Jinja chat template that I will latelly add to the HF repos:
V3 Output:

<s>[INST]▁Hello[/INST]▁Hi▁there!</s>[INST]▁How▁are▁you?[/INST]▁Fine▁and▁you?</s>[INST]▁Fine▁thank▁you.[/INST]

V3 Temporary ChatTemplte:

{{ bos_token }}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if message['role'] == 'user' %}{{ '[INST] ' + message['content'] + '[/INST] ' }}{% elif message['role'] == 'assistant' %}{{ message['content'] + eos_token}}{% else %}{{ raise_exception('Only user and assistant roles are supported!') }}{% endif %}{% endfor %}

V3 Chat template Output:

<s>[INST] Hello[/INST] Hi there!</s>[INST] How are you?[/INST] Fine and you?</s>[INST] Fine thank you.[/INST]

V2 Output:

<s>[INST]▁Hello[/INST]▁Hi▁there!</s>[INST]▁How▁are▁you?[/INST]▁Fine▁and▁you?</s>[INST]▁Fine▁thank▁you.[/INST]

V2 Temp chat template:

{{ bos_token }}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if message['role'] == 'user' %}{{ '[INST] ' + message['content'] + '[/INST] ' }}{% elif message['role'] == 'assistant' %}{{ message['content'] + eos_token}}{% else %}{{ raise_exception('Only user and assistant roles are supported!') }}{% endif %}{% endfor %}

V2 Chat template Output:

<s>[INST] Hello[/INST] Hi there!</s>[INST] How are you?[/INST] Fine and you?</s>[INST] Fine thank you.[/INST]

V1 Output

<s>▁[INST]▁Hello▁[/INST]▁Hi▁there!</s>▁[INST]▁How▁are▁you?▁[/INST]▁Fine▁and▁you?</s>▁[INST]▁Fine▁thank▁you.▁[/INST]

V3 Temp chat template:

{{bos_token}}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if message['role'] == 'user' %}{{ ' [INST] ' + message['content'] + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ ' ' + message['content'] + eos_token}}{% else %}{{ raise_exception('Only user and assistant roles are supported!') }}{% endif %}{% endfor %}

V3 Chat template output:

<s> [INST] Hello [/INST] Hi there!</s> [INST] How are you? [/INST] Fine and you?</s> [INST] Fine thank you. [/INST]

I will keep u updated, hopefully this will help a bit for now!

pandora-s-git · 2024-06-21T09:52:00Z

@keskival Following this, I've updated most of the chat templates on HF, the ones I provided previously are not 100% accurate so I recommend using the ones I've added to the HF repos. They are still not 100% perfect, as there seems to be some issues with the tokenizers themselves on some repos, but they are far better than before and should for most of them match with mistral_common now. But basically it's recommended to use mistral_common and take it as ground truth.

vince62s · 2024-10-07T17:12:29Z

when trying to address a similar issue, I am thinking that the thing is still unclear as per which flags to use with HF.
huggingface/transformers#31513 (comment)

pandora-s-git · 2024-10-08T10:20:16Z

Maybe this will help: https://github.com/mistralai/cookbook/blob/main/concept-deep-dive/tokenization/chat_templates.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

A missing space between <s> and [INST] #3

A missing space between <s> and [INST] #3

keskival commented Apr 19, 2024 •

edited

Loading

keskival commented Apr 19, 2024 •

edited

Loading

timlacroix commented Apr 19, 2024 •

edited

Loading

keskival commented Apr 25, 2024 •

edited

Loading

pandora-s-git commented Jun 17, 2024

pandora-s-git commented Jun 21, 2024

vince62s commented Oct 7, 2024

pandora-s-git commented Oct 8, 2024

A missing space between <s> and [INST] #3

A missing space between <s> and [INST] #3

Comments

keskival commented Apr 19, 2024 • edited Loading

keskival commented Apr 19, 2024 • edited Loading

timlacroix commented Apr 19, 2024 • edited Loading

keskival commented Apr 25, 2024 • edited Loading

pandora-s-git commented Jun 17, 2024

pandora-s-git commented Jun 21, 2024

vince62s commented Oct 7, 2024

pandora-s-git commented Oct 8, 2024

keskival commented Apr 19, 2024 •

edited

Loading

keskival commented Apr 19, 2024 •

edited

Loading

timlacroix commented Apr 19, 2024 •

edited

Loading

keskival commented Apr 25, 2024 •

edited

Loading