Validation script fix: conversation error #80

stefkarmakov · 2024-07-12T18:36:25Z

When trying to validate a dataset, if the dataset messages only contain 1 system message, followed by 1 assistant message, such as

{
"messages": [
    {"role": "system", "content": "System prompt"}, 
    {"role": "assistant", "content": "Assistant prompt"}
]}

the utils/validate_data.py script results in a confusing error The data in line <line_num> of dataset <path/to/dataset.jsonl> is incorrectly formatted.Conversation must start with a user message or system message, even though the messages start with system.

The issues arises because finetune/data/tokenize.py passes the messages without the system prompt to the mistral_common/protocol/instruct/validator.py script, which then throws the error because of if-condition in func _validate_message_list_structure on line 253.

In many training scenarios, the system prompt is directly followed by the model/assistant message and that is the end of the conversation. The current validation flags this as being an erroneous structure, though it should be a valid one. To fix this inconsistency, this PR change adds the system prompt back into the messages for the validation step to fix this issue, and removes it right after the checks.

If messages only contain 1 system message, followed by 1 assistant message, the validation script results in a confusing error. This change adds the system prompt back into the messages for the validation step to fix this issue and return True if the structure of the messages is correct.

pandora-s-git · 2024-07-15T08:01:30Z

Hi @stefkarmakov ,I'm not sure if Im understanding, in what scenarios would it be required to have a system prompt without any instruction? Wouldnt the system prompt just be an instruction in those cases?

stefkarmakov · 2024-07-15T20:00:11Z

Hi @pandora-s-git, I'm not sure, but I'm assuming by "system prompt without any instruction" you mean a system prompt that is not followed by a user prompt to instruct the model what to do?

In some cases, you can have the system prompt to act as both the system and user/instruction prompt, telling the model how to behave and what to do. But you don't want to label that prompt as "user" for the fine-tuning process because:

the dataset "user" prompts could be much simpler and don't contain so much "lower level" model instructions. For fine-tuning, It'd be important to keep consistency, otherwise the model might not learn well.
you might want to teach the model to pay attention to any instructions in the system prompt too, not only coming from user prompts
simply gives more versatility to the user to experiment and control the fine-tuning process by controlling information distribution within the dataset roles

Hope this answers your question.

stefkarmakov added 3 commits July 6, 2024 14:50

Remove print message

a69e7b1

Adding comment for clarity

7f56d9f

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Validation script fix: conversation error #80

Validation script fix: conversation error #80

stefkarmakov commented Jul 12, 2024

pandora-s-git commented Jul 15, 2024

stefkarmakov commented Jul 15, 2024 •

edited

Loading

Validation script fix: conversation error #80

Are you sure you want to change the base?

Validation script fix: conversation error #80

Conversation

stefkarmakov commented Jul 12, 2024

pandora-s-git commented Jul 15, 2024

stefkarmakov commented Jul 15, 2024 • edited Loading

stefkarmakov commented Jul 15, 2024 •

edited

Loading