Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A missing space between <s> and [INST] #3

Open
keskival opened this issue Apr 19, 2024 · 7 comments
Open

A missing space between <s> and [INST] #3

keskival opened this issue Apr 19, 2024 · 7 comments

Comments

@keskival
Copy link

keskival commented Apr 19, 2024

I believe there should be a space before [INST] here:

message_txt = f"[INST] {content} [/INST]"

This is validated by testing. We cannot make the model emit its proper instruct template because it seems to be disinclined to emit it, probably because of training details.

However, the model is very keen on emitting spurious </s> if this space is omitted, especially after numbers such as phone numbers, address numbers and dates.

Notably, this page gives two different chat templates, one matching the one in this repository written out as such, and the pseudocode below it which is missing the spaces after [INST] and after [/INST]:
https://docs.mistral.ai/getting-started/open_weight_models/#chat-template

I believe both of these examples are actually incorrect.

Then there is this source:
https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1

It shows the space between <s> and [INST], which I believe is correct. Adding that space to the template gives superior output quality based on qualitative tests.

However, the pseudocode given in that source matches the pseudocode from Mistral documentation and is different from what is shown, and I believe incorrect as well.

Please look into this, and be clear what the chat template actually is; whitespace is very very important.

@keskival
Copy link
Author

keskival commented Apr 19, 2024

Here is an example message which consistently cuts off with Mixtral-8x7B after numbers when a bad chat template without the space is used:
example_message.txt

The information in the message is synthetic; no real personal information. Message embedded in Python code to remove ambiguities about escapings and linefeeds.

@timlacroix
Copy link
Contributor

timlacroix commented Apr 19, 2024

Hello ! Agreed whitespaces are frustratingly important in templates (which is why we've moved to control tokens in our newest templates).

I am far from an HF chat template expert but my understanding is :

  • "<s>[INST]" won't be tokenized as [bos] + encode("[INST]") by HF tokenizer. To achieve this you need a space between the two strings.
  • in our template, we explicitely create the list of tokens as [bos] + encode("[INST] ... [/INST]") which is fine.

Your question prompted me to run the same test we have here : https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1#instruct-tokenizer but with v1 compared with the mixtral-8x7b-instruct-v0.1 on HF.

The template on HF is wrong. Our ref implementation in this repo is correct. This is what I tested:

from mistral_common.protocol.instruct.messages import (
    AssistantMessage,
    UserMessage,
)
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from mistral_common.tokens.instruct.normalize import ChatCompletionRequest

from transformers import AutoTokenizer

tokenizer_v1 = MistralTokenizer.v1()

mistral_query = ChatCompletionRequest(
    messages=[
        UserMessage(content="1"),
        AssistantMessage(content="2"),
        UserMessage(content="3"),
    ],
    model="test",
)
tokenized = tokenizer_v1.encode_chat_completion(mistral_query)

tokenizer_hf = AutoTokenizer.from_pretrained('mistralai/Mixtral-8x7B-Instruct-v0.1')
hf_messages = mistral_query.model_dump()['messages']
tokenized_hf = tokenizer_hf.apply_chat_template(hf_messages, tokenize=True)

print("MISTRAL_COMMON")
print(tokenized.text)
print(tokenizer_hf.convert_ids_to_tokens(tokenized_mistral))
print("HF : MIXTRAL-8x7B")
print(tokenizer_v1.instruct_tokenizer.tokenizer.to_string(tokenized_hf))
print(tokenizer_hf.convert_ids_to_tokens(tokenized_hf))

tokenized_mistral = tokenized.tokens
assert tokenized_hf == tokenized_mistral

The output is

MISTRAL_COMMON
<s>▁[INST]▁1▁[/INST]▁2</s>▁[INST]▁3▁[/INST]
['<s>', '▁[', 'INST', ']', '▁', '1', '▁[', '/', 'INST', ']', '▁', '2', '</s>', '▁[', 'INST', ']', '▁', '3', '▁[', '/', 'INST', ']']
HF : MIXTRAL-8x7B
<s>▁[INST]▁1▁[/INST]2</s>▁▁[INST]▁3▁[/INST]
['<s>', '▁[', 'INST', ']', '▁', '1', '▁[', '/', 'INST', ']', '2', '</s>', '▁[', 'INST', ']', '▁', '3', '▁[', '/', 'INST', ']']

The only error I can see in the chat template on HF is the missing space before [/INST] and the assistant message
Fun fact, the template for Mistral-7B-Instruct is wronger, there is an extra space between and [INST] ...

I would extremely gladly accept a PR that fixes it on all our 7B / 8x7B repos, hf tokenizers are a bit mysterious to me !

Another question since you're saying 8x7B behaves badly with mistral-common tokenization :

  • Once you've tokenized things , how exactly are you calling the model ?

Thanks !

@keskival
Copy link
Author

keskival commented Apr 25, 2024

We are calling the model through VLLM with a custom chat template.
As a tokenizer we use transformers and AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1").

We don't actually add the <s>, because it's added by the tokenizer. Regardless, it seems the space between the <s> and the [INST] is important.

What I need here is the confirmation that the chat template we use is correct from Mistral team, because they are the only ones who are able to check what their real chat template is in the training pipelines.
Also, it would be great if they corrected the documentations which are at the very least conflictory if not incorrect.

@pandora-s-git
Copy link
Contributor

It actually depends on the tokenizer, after some verifications, it seems like almost all HF templates need to be rewritten (im working on it using mistral-common as ground truth) but basically there are a few differences between the tokenizer, and so, between the versions of each model.
Here are some examples with each tokenizer and a temporary Jinja chat template that I will latelly add to the HF repos:
V3 Output:

<s>[INST]▁Hello[/INST]▁Hi▁there!</s>[INST]▁How▁are▁you?[/INST]▁Fine▁and▁you?</s>[INST]▁Fine▁thank▁you.[/INST]

V3 Temporary ChatTemplte:

{{ bos_token }}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if message['role'] == 'user' %}{{ '[INST] ' + message['content'] + '[/INST] ' }}{% elif message['role'] == 'assistant' %}{{ message['content'] + eos_token}}{% else %}{{ raise_exception('Only user and assistant roles are supported!') }}{% endif %}{% endfor %}

V3 Chat template Output:

<s>[INST] Hello[/INST] Hi there!</s>[INST] How are you?[/INST] Fine and you?</s>[INST] Fine thank you.[/INST] 

V2 Output:

<s>[INST]▁Hello[/INST]▁Hi▁there!</s>[INST]▁How▁are▁you?[/INST]▁Fine▁and▁you?</s>[INST]▁Fine▁thank▁you.[/INST]

V2 Temp chat template:

{{ bos_token }}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if message['role'] == 'user' %}{{ '[INST] ' + message['content'] + '[/INST] ' }}{% elif message['role'] == 'assistant' %}{{ message['content'] + eos_token}}{% else %}{{ raise_exception('Only user and assistant roles are supported!') }}{% endif %}{% endfor %}

V2 Chat template Output:

<s>[INST] Hello[/INST] Hi there!</s>[INST] How are you?[/INST] Fine and you?</s>[INST] Fine thank you.[/INST] 

V1 Output

<s>▁[INST]▁Hello▁[/INST]▁Hi▁there!</s>▁[INST]▁How▁are▁you?▁[/INST]▁Fine▁and▁you?</s>▁[INST]▁Fine▁thank▁you.▁[/INST]

V3 Temp chat template:

{{bos_token}}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if message['role'] == 'user' %}{{ ' [INST] ' + message['content'] + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ ' ' + message['content'] + eos_token}}{% else %}{{ raise_exception('Only user and assistant roles are supported!') }}{% endif %}{% endfor %}

V3 Chat template output:

<s> [INST] Hello [/INST] Hi there!</s> [INST] How are you? [/INST] Fine and you?</s> [INST] Fine thank you. [/INST]

I will keep u updated, hopefully this will help a bit for now!

@pandora-s-git
Copy link
Contributor

@keskival Following this, I've updated most of the chat templates on HF, the ones I provided previously are not 100% accurate so I recommend using the ones I've added to the HF repos. They are still not 100% perfect, as there seems to be some issues with the tokenizers themselves on some repos, but they are far better than before and should for most of them match with mistral_common now. But basically it's recommended to use mistral_common and take it as ground truth.

@vince62s
Copy link

vince62s commented Oct 7, 2024

when trying to address a similar issue, I am thinking that the thing is still unclear as per which flags to use with HF.
huggingface/transformers#31513 (comment)

@pandora-s-git
Copy link
Contributor

Maybe this will help: https://github.com/mistralai/cookbook/blob/main/concept-deep-dive/tokenization/chat_templates.md

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants