10 add evaluation pipeline #25
Conversation
It would be nice if the tests captured some of what we were trying to think through earlier today, e.g. checking that the truth ratio of one of the dummy forget models is larger than that of the dummy fine-tuned model, and similar. I haven't checked back through the tests, so it might be that you've already done that.
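A minimal sketch of such a test, assuming hypothetical file paths and that the per-question truth ratios are saved as numpy arrays (the repo's actual layout may differ):

```python
import numpy as np

# Hypothetical paths and file names; placeholders, not the repo's actual layout.
FORGET_RATIOS = "dummy_forget/eval/truth_ratios.npy"
FINETUNED_RATIOS = "dummy_finetuned/eval/truth_ratios.npy"

def test_forget_truth_ratio_larger():
    forget = np.load(FORGET_RATIOS)
    finetuned = np.load(FINETUNED_RATIOS)
    # The dummy forget model should score a larger mean truth ratio
    # than the dummy fine-tuned model on the forget set.
    assert forget.mean() > finetuned.mean()
```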
I've made some changes to my pull request now: I've added a function in
It outputs a dictionary containing:
This should be everything we want to track in wandb. I've added some tests for it, and I've moved the old scripts I wrote into a
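For reference, logging that dictionary to wandb is a one-liner; the key names below are placeholders, not the function's actual output schema:

```python
import wandb

# Placeholder metrics standing in for the dictionary the new function returns.
metrics = {"truth_ratio": 0.42, "forget_loss": 1.3}

run = wandb.init(project="unlearning-eval")  # project name is an assumption
wandb.log(metrics)  # each key becomes a tracked series in the dashboard
run.finish()
```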
Just something minor I didn't explicitly point out above: the path for the base model truth ratios should currently be the relative path to where the forget truth ratios are stored. The all_eval script will calculate and save these, provided you give it the forget dataset. These are the only values that need to be stored locally for evaluate_model to run; everything else should be calculated within the function.
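A sketch of what that local dependency looks like in practice; the directory layout and file names here are assumptions, not the repo's actual paths:

```python
import numpy as np

# Assumed layout: the base model truth ratios sit at a path relative to the
# directory holding the forget truth ratios.
forget_dir = "output/forget_model/eval"
forget_truth_ratios = np.load(f"{forget_dir}/truth_ratios.npy")
base_truth_ratios = np.load(f"{forget_dir}/base_truth_ratios.npy")
```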
I haven't got my head around it fully yet; I will look again next week. I left a few comments about things being hardcoded, but not for everything. For this PR they can be kept hardcoded, but if so we should open an issue listing everything that's left outstanding or will need to be changed for future runs/experiments.
- …d losses across potential answers
- …(defaults to forget)
- …the start token of the answer, including the formatting tokens between the question and answer
- …mmy_model not generating eos token quirk
- Fix random generator not being used in eval dataset
- Move the transfer of tensors to device into the collator rather than the dataset (see the sketch below)
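A minimal sketch of that last change, assuming the dataset yields dicts of equal-length 1-D CPU tensors (the actual batch structure may differ):

```python
import torch

class EvalCollator:
    """Stacks per-example tensors and moves them to the target device once
    per batch, so the dataset itself stays device-agnostic."""

    def __init__(self, device: torch.device):
        self.device = device

    def __call__(self, batch):
        # batch is a list of dicts, e.g. {"input_ids": ..., "labels": ...}
        return {
            key: torch.stack([example[key] for example in batch]).to(self.device)
            for key in batch[0]
        }
```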
This was a test, please ignore
Evaluation pipeline
Has a few utils files containing metrics and utility functions, and some scripts which perform evaluation on a selected model. I will briefly go over the evaluation scripts as they currently stand, and the changes I made to the evaluation dataset class. The evaluation scripts can be run periodically throughout training to get a clearer picture of model performance as it is being trained.
quantitative_eval.py
Performs quantitative evaluation over a test set with a selected model; it compares ground-truth inputs against perturbed inputs, as in the paper. Unlike the paper, we don't generate these; rather, they are answers to different questions, randomly sampled from within the same author. In the future we can change this according to our work package. The script outputs the truth ratio values and the raw losses, which are output as a numpy array for further processing. Currently, if run as main, these are saved to a .np file in a separate folder within the parent folder of the one where the model weights are stored.
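As a rough illustration, a truth ratio of this kind can be computed from the mean per-token losses; the exact normalisation below is an assumption (see the paper for the precise definition), not necessarily what the script does:

```python
import numpy as np

def truth_ratio(gt_loss: float, perturbed_losses: np.ndarray) -> float:
    """Sketch of a TOFU-style truth ratio from mean per-token NLL losses.

    exp(-loss) is the length-normalised answer probability, so this is the
    average perturbed-answer probability over the ground-truth probability.
    """
    return float(np.mean(np.exp(-perturbed_losses)) / np.exp(-gt_loss))
```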
qualitative_eval.py
This performs a qualitative evaluation of the model. It loops over the test data and generates an output answer for each input question. The question and the generated answer are both printed along with the target answer to allow qualitative comparison.
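Schematically, the loop looks something like the following; the model and data here are placeholders, not the actual ones used in the script:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")

test_data = [{"question": "Who wrote the novel?", "answer": "Jane Doe"}]  # dummy

for example in test_data:
    inputs = tokenizer(example["question"], return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=50)
    generated = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    print("Question: ", example["question"])
    print("Generated:", generated)
    print("Target:   ", example["answer"])
```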
EvalQADataset() Changes
I made some changes to allow the quantitative evaluation script to work. Namely, I added a batch formatter which, when given a question, outputs input IDs, labels, and attention masks with appropriate padding for batch computation. Furthermore, I added a method which locates perturbed answers: given a question index, it finds a random question pertaining to the same author which can be used as a perturbed answer.
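A sketch of those two additions, assuming tokenised sequences arrive as 1-D tensors and each question carries an author id (the names and data layout are illustrative, not the class's actual API):

```python
import random
import torch
from torch.nn.utils.rnn import pad_sequence

def format_batch(token_id_sequences, pad_token_id=0):
    """Pad variable-length sequences into input_ids/labels/attention_mask."""
    input_ids = pad_sequence(token_id_sequences, batch_first=True,
                             padding_value=pad_token_id)
    attention_mask = (input_ids != pad_token_id).long()
    labels = input_ids.clone()
    labels[attention_mask == 0] = -100  # ignore padding positions in the loss
    return {"input_ids": input_ids, "labels": labels,
            "attention_mask": attention_mask}

def perturbed_answer_index(question_idx, authors, rng=random):
    """Pick a random *different* question by the same author, whose answer
    serves as the perturbed answer for question_idx."""
    candidates = [i for i, a in enumerate(authors)
                  if a == authors[question_idx] and i != question_idx]
    return rng.choice(candidates)
```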