Completion-only fine-tuning of instruction models with collections of HF datasets #1103

Open · wants to merge 45 commits into main

Conversation

@chimezie (Contributor) commented Nov 10, 2024

Merges #825 and #1090 and fixes #1095; combining them benefits fine-tuning of instruction models with HF completion datasets.

chimezie added 30 commits June 7, 2024 12:35
… an updated attempt to better sync with iterate_batches logic
Renamed the batch iteration function (iterate_delineated_batches -> iterate_completion_batches).
Ensure completion batching doesn't duplicate the BOS token for instruction/chat models whose tokenizer configurations have ```add_bos_token = True``` (see: #1095)
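A minimal sketch of the BOS-deduplication idea from that commit; the helper name is illustrative, not the PR's actual code:

```python
def encode_without_bos_dupe(tokenizer, text):
    """Tokenize `text`, dropping a duplicated BOS token that can appear when
    the tokenizer has add_bos_token=True and the chat template already
    begins with the BOS string."""
    tokens = tokenizer.encode(text)
    bos = tokenizer.bos_token_id
    # If both the tokenizer and the template contributed a BOS, the first
    # two ids are identical BOS tokens: keep only one.
    if bos is not None and len(tokens) > 1 and tokens[0] == bos and tokens[1] == bos:
        tokens = tokens[1:]
    return tokens
```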
@awni (Member) commented Nov 11, 2024

Should we close the other three PRs in favor of this one?

@chimezie (Contributor, Author)

> Should we close the other three PRs in favor of this one?

Yes, I'll do that. I just wasn't sure whether they made more sense done all at once or piecemeal.

@ivanfioravanti (Contributor)

Yesterday I was having many issues during fine-tuning with the error: `ValueError: No chat template is set for this processor. Please either set the chat_template attribute, or provide a chat template as an argument.`
Does this solve them? I will try this now.

@ivanfioravanti (Contributor)

chat_template was missing from tokenizer_config.json (old model); adding it solved the problem!
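For reference, a minimal sketch of supplying a missing chat template by hand; the model path and template string below are illustrative, not tied to any particular model:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("path/to/old-model")  # illustrative path

# Older tokenizer_config.json files may lack a chat_template entry;
# assigning one directly lets apply_chat_template work again.
if tokenizer.chat_template is None:
    tokenizer.chat_template = (
        "{% for message in messages %}"
        "{{ message['role'] }}: {{ message['content'] }}\n"
        "{% endfor %}"
        "{% if add_generation_prompt %}assistant: {% endif %}"
    )

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Hello"}],
    tokenize=False,
    add_generation_prompt=True,
)
```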

@ivanfioravanti (Contributor)

Is there an ETA for this PR? It's really useful for simplifying training on existing HF datasets.

For use in calculating the mask for everything up to the point after the response prompt (i.e., the start of the continuation/completion)
Follow the example of trl's DataCollatorForCompletionOnlyLM: use a response template to identify the beginning of the completion/continuation tokens so that the other tokens can be masked out during loss calculation
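A rough sketch of that masking strategy; the function below is hypothetical and only illustrates the idea, not the PR's implementation:

```python
def completion_only_mask(input_ids, response_template_ids):
    """Build a 0/1 loss mask that keeps only the tokens after the response
    template (the completion/continuation), masking out the prompt, as in
    trl's DataCollatorForCompletionOnlyLM."""
    mask = [0] * len(input_ids)
    n = len(response_template_ids)
    # Locate the first occurrence of the response-template token sequence.
    for i in range(len(input_ids) - n + 1):
        if input_ids[i:i + n] == response_template_ids:
            # Everything after the template contributes to the loss.
            for j in range(i + n, len(input_ids)):
                mask[j] = 1
            break
    return mask
```

For a Llama-style instruction format, `response_template_ids` would be the tokenization of the marker that precedes the model's answer (e.g. `[/INST]`).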
@chimezie (Contributor, Author) commented Dec 8, 2024

@ivanfioravanti , the last thing I needed to do (which is now complete) was to update how the completion mask is identified and calculated from either the string or the corresponding token sequence. I took a look at DataCollatorForCompletionOnlyLM and axolotl for guidance, and the latter had the most straightforward solution (see: #28950).

I was hoping to rely on a more standard approach via the return_assistant_tokens_mask keyword argument of apply_chat_template, but that only works for chat templates that support it via the {% generation %} keyword, which doesn't appear to be widely adopted yet.
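For context, that more standard approach looks roughly like the following in recent transformers versions, and only works when the model's chat template marks assistant turns with `{% generation %} ... {% endgeneration %}`:

```python
out = tokenizer.apply_chat_template(
    [
        {"role": "user", "content": "What is MLX?"},
        {"role": "assistant", "content": "A machine learning framework for Apple silicon."},
    ],
    tokenize=True,
    return_dict=True,
    return_assistant_tokens_mask=True,
)
# out["assistant_masks"] is 1 for tokens inside {% generation %} blocks
# (the assistant's reply) and 0 elsewhere, which is exactly the completion
# mask needed for the loss calculation.
```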

In any case, it is ready for review from @awni, etc.

@ivanfioravanti (Contributor)

Amazing job 🤩

@chimezie (Contributor, Author) commented Dec 9, 2024

> Amazing job 🤩

Thank you.
