
Validation script fix: conversation error #80

Open
wants to merge 3 commits into base: main
Conversation

stefkarmakov

When trying to validate a dataset whose messages contain only one system message followed by one assistant message, such as

{
  "messages": [
    {"role": "system", "content": "System prompt"},
    {"role": "assistant", "content": "Assistant prompt"}
  ]
}

the utils/validate_data.py script results in a confusing error: "The data in line <line_num> of dataset <path/to/dataset.jsonl> is incorrectly formatted. Conversation must start with a user message or system message", even though the messages do start with a system message.

The issue arises because finetune/data/tokenize.py passes the messages without the system prompt to mistral_common/protocol/instruct/validator.py, which then throws the error because of the if-condition in the function _validate_message_list_structure on line 253.
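For context, the failing check is roughly of the following shape (a simplified sketch of the validator logic, using plain dicts in place of mistral_common's typed message classes; not the verbatim source):

# Simplified sketch of the structural check in
# mistral_common/protocol/instruct/validator.py (illustrative, not verbatim).
def _validate_message_list_structure(messages: list[dict]) -> None:
    # Because tokenize.py strips the system prompt before validation,
    # a [system, assistant] conversation arrives here as [assistant]
    # and trips this condition.
    if messages and messages[0]["role"] not in ("user", "system"):
        raise ValueError(
            "Conversation must start with a user message or system message"
        )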

In many training scenarios, the system prompt is directly followed by the model/assistant message, and that is the end of the conversation. The current validation flags this as an erroneous structure, though it should be a valid one. To fix this inconsistency, this PR adds the system prompt back into the messages for the validation step and removes it right after the checks.
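Conceptually, the change amounts to something like the following (a minimal sketch; the wrapper name, dict-based messages, and the validate_fn parameter are illustrative, not the exact diff):

# Minimal sketch of the proposed fix (illustrative names, not the exact diff).
def validate_messages(system_prompt, messages, validate_fn):
    # Re-attach the system prompt that tokenize.py stripped out, so a
    # [system, assistant] conversation is validated starting with "system".
    if system_prompt is not None:
        messages = [{"role": "system", "content": system_prompt}] + messages
    validate_fn(messages)  # raises on an invalid structure
    # Remove the system message right after the checks, restoring the
    # message list that the tokenizer actually consumes.
    if system_prompt is not None:
        messages = messages[1:]
    return True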

If the messages contain only one system message followed by one assistant message, the validation script results in a confusing error.

This change adds the system prompt back into the messages for the validation step to fix this issue and returns True if the structure of the messages is correct.
@pandora-s-git
Collaborator

Hi @stefkarmakov, I'm not sure I'm understanding: in what scenarios would it be required to have a system prompt without any instruction? Wouldn't the system prompt just be an instruction in those cases?

@stefkarmakov
Author

stefkarmakov commented Jul 15, 2024

Hi @pandora-s-git, I'm not sure, but I'm assuming that by "system prompt without any instruction" you mean a system prompt that is not followed by a user prompt instructing the model what to do?

In some cases, you can have the system prompt act as both the system and the user/instruction prompt, telling the model how to behave and what to do. But you don't want to label that prompt as "user" for the fine-tuning process because:

  • the dataset's "user" prompts could be much simpler and not contain as many "lower-level" model instructions. For fine-tuning, it'd be important to keep consistency, otherwise the model might not learn well.
  • you might want to teach the model to pay attention to instructions in the system prompt too, not only to those coming from user prompts
  • it simply gives the user more versatility to experiment with and control the fine-tuning process by controlling how information is distributed across the dataset roles

Hope this answers your question.
