Completion-only fine-tuning of instruction models with collections of HF datasets #1103

Open · wants to merge 45 commits into main

Conversation

@chimezie (Contributor) commented Nov 10, 2024

Merges #825 and #1090 and fixes #1095; combining them benefits fine-tuning of instruction models with HF completion datasets.

chimezie added 30 commits June 7, 2024 12:35
… an updated attempt to better sync with iterate_batches logic
Renamed the batch iteration function (iterate_delineated_batches -> iterate_completion_batches).
Ensure completion batching doesn't duplicate the BOS token for instruction/chat models whose tokenizer configurations have ```add_bos_token = True``` (see: #1095)
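A minimal sketch of the BOS-deduplication idea from that commit; the helper name is illustrative, not the PR's actual code:

```python
def encode_without_bos_dupe(tokenizer, text):
    """Tokenize `text`, dropping a duplicated BOS token that can appear when
    the tokenizer has add_bos_token=True and the chat template already
    begins with the BOS string."""
    tokens = tokenizer.encode(text)
    bos = tokenizer.bos_token_id
    # If both the tokenizer and the template contributed a BOS, the first
    # two ids are identical BOS tokens: keep only one.
    if bos is not None and len(tokens) > 1 and tokens[0] == bos and tokens[1] == bos:
        tokens = tokens[1:]
    return tokens
```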
@awni (Member) commented Nov 11, 2024

Should we close the other three PRs in favor of this one?

@chimezie (Contributor, Author)

> Should we close the other three PRs in favor of this one?

Yes, I'll do that. I just wasn't sure whether they made more sense done all at once or piecemeal.

@ivanfioravanti (Contributor)

Yesterday I was having many issues during fine-tuning with the error: `ValueError: No chat template is set for this processor. Please either set the chat_template attribute, or provide a chat template as an argument.`
Does this solve them? I will try this now.

@ivanfioravanti (Contributor)

chat_template was missing from tokenizer_config.json (old model); adding it solved the problem!
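For reference, a minimal sketch of supplying a missing chat template by hand; the model path and template string below are illustrative, not tied to any particular model:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("path/to/old-model")  # illustrative path

# Older tokenizer_config.json files may lack a chat_template entry;
# assigning one directly lets apply_chat_template work again.
if tokenizer.chat_template is None:
    tokenizer.chat_template = (
        "{% for message in messages %}"
        "{{ message['role'] }}: {{ message['content'] }}\n"
        "{% endfor %}"
        "{% if add_generation_prompt %}assistant: {% endif %}"
    )

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Hello"}],
    tokenize=False,
    add_generation_prompt=True,
)
```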

@ivanfioravanti (Contributor)

Is there an ETA for this PR? It's really useful for simplifying training on existing HF datasets.

For use in calculating the mask for everything up to the point after the response prompt (i.e., the start of the continuation/completion)
Follow the example of trl's DataCollatorForCompletionOnlyLM: use a response template to identify the beginning of the completion/continuation tokens so that the other tokens can be masked out during loss calculation
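A rough sketch of that masking strategy; the function below is hypothetical and only illustrates the idea, not the PR's implementation:

```python
def completion_only_mask(input_ids, response_template_ids):
    """Build a 0/1 loss mask that keeps only the tokens after the response
    template (the completion/continuation), masking out the prompt, as in
    trl's DataCollatorForCompletionOnlyLM."""
    mask = [0] * len(input_ids)
    n = len(response_template_ids)
    # Locate the first occurrence of the response-template token sequence.
    for i in range(len(input_ids) - n + 1):
        if input_ids[i:i + n] == response_template_ids:
            # Everything after the template contributes to the loss.
            for j in range(i + n, len(input_ids)):
                mask[j] = 1
            break
    return mask
```

For a Llama-style instruction format, `response_template_ids` would be the tokenization of the marker that precedes the model's answer (e.g. `[/INST]`).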
@chimezie (Contributor, Author) commented Dec 8, 2024

@ivanfioravanti , the last thing I needed to do (which is now complete) was to update how the completion mask is identified and calculated from either the string or the corresponding token sequence. I took a look at DataCollatorForCompletionOnlyLM and axolotl for guidance, and the latter had the most straightforward solution (see: #28950).

I was hoping to rely on a more standard approach via the return_assistant_tokens_mask keyword argument of apply_chat_template, but that only works for chat templates that support it via the {% generation %} keyword, which doesn't appear to be widely adopted yet.
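For context, that more standard approach looks roughly like the following in recent transformers versions, and only works when the model's chat template marks assistant turns with `{% generation %} ... {% endgeneration %}`:

```python
out = tokenizer.apply_chat_template(
    [
        {"role": "user", "content": "What is MLX?"},
        {"role": "assistant", "content": "A machine learning framework for Apple silicon."},
    ],
    tokenize=True,
    return_dict=True,
    return_assistant_tokens_mask=True,
)
# out["assistant_masks"] is 1 for tokens inside {% generation %} blocks
# (the assistant's reply) and 0 elsewhere, which is exactly the completion
# mask needed for the loss calculation.
```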

In any case, it is ready for review from @awni, etc.

@ivanfioravanti (Contributor)

Amazing job 🤩

@chimezie (Contributor, Author) commented Dec 9, 2024

> Amazing job 🤩

Thank you.
