10 add evaluation pipeline #25
Conversation
It would be nice if the tests captured some of what we were trying to think through earlier today, e.g. checking that the truth ratio of one of the dummy forget models is larger than that of the dummy fine-tuned model, and similar. I haven't checked back through the tests, so it might be that you've already done that.
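A minimal sketch of such a test, assuming hypothetical file paths and that the per-question truth ratios are saved as numpy arrays (the repo's actual layout may differ):

```python
import numpy as np

# Hypothetical paths and file names; placeholders, not the repo's actual layout.
FORGET_RATIOS = "dummy_forget/eval/truth_ratios.npy"
FINETUNED_RATIOS = "dummy_finetuned/eval/truth_ratios.npy"

def test_forget_truth_ratio_larger():
    forget = np.load(FORGET_RATIOS)
    finetuned = np.load(FINETUNED_RATIOS)
    # The dummy forget model should score a larger mean truth ratio
    # than the dummy fine-tuned model on the forget set.
    assert forget.mean() > finetuned.mean()
```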
I've made some changes to my pull request now: I've added a function in
It outputs a dictionary containing:
This should be everything we want to track in wandb. I've added some tests for it, and I've moved the old scripts I wrote into a
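For reference, logging that dictionary to wandb is a one-liner; the key names below are placeholders, not the function's actual output schema:

```python
import wandb

# Placeholder metrics standing in for the dictionary the new function returns.
metrics = {"truth_ratio": 0.42, "forget_loss": 1.3}

run = wandb.init(project="unlearning-eval")  # project name is an assumption
wandb.log(metrics)  # each key becomes a tracked series in the dashboard
run.finish()
```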
Just something minor I didn't explicitly point out above: the path for the base model truth ratios should currently be the relative path to where the forget truth ratios are stored. The all_eval script will calculate and save these, provided you give it the forget dataset. These are the only values that need to be stored locally for evaluate_model to run; everything else should be calculated within the function.
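A sketch of what that local dependency looks like in practice; the directory layout and file names here are assumptions, not the repo's actual paths:

```python
import numpy as np

# Assumed layout: the base model truth ratios sit at a path relative to the
# directory holding the forget truth ratios.
forget_dir = "output/forget_model/eval"
forget_truth_ratios = np.load(f"{forget_dir}/truth_ratios.npy")
base_truth_ratios = np.load(f"{forget_dir}/base_truth_ratios.npy")
```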
I haven't got my head around it fully yet; I will look again next week. I left a few comments about things being hardcoded, but not for everything. For this PR they can be kept hardcoded, but if so we should open an issue listing everything that's left outstanding or will need to be changed for future runs/experiments.
- …d losses across potential answers
- …(defaults to forget)
- …the start token of the answer, including the formatting tokens between the question and answer
- …mmy_model not generating eos token quirk
- Fix random generator not being used in eval dataset
- Move the transfer of tensors to device into the collator rather than the dataset (see the sketch below)
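A minimal sketch of that last change, assuming the dataset yields dicts of equal-length 1-D CPU tensors (the actual batch structure may differ):

```python
import torch

class EvalCollator:
    """Stacks per-example tensors and moves them to the target device once
    per batch, so the dataset itself stays device-agnostic."""

    def __init__(self, device: torch.device):
        self.device = device

    def __call__(self, batch):
        # batch is a list of dicts, e.g. {"input_ids": ..., "labels": ...}
        return {
            key: torch.stack([example[key] for example in batch]).to(self.device)
            for key in batch[0]
        }
```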
This was a test, please ignore
Evaluation pipeline
Has a few utils files containing metrics and utility functions, and some scripts which perform evaluation on a selected model. I will briefly go over the evaluation scripts as they currently stand, and the changes I made to the evaluation dataset class. The evaluation scripts can be run periodically throughout training to get a clearer picture of model performance as it is being trained.
quantitative_eval.py
Performs quantitative evaluation over a test set with a selected model; it compares ground-truth inputs against perturbed inputs, as in the paper. Unlike the paper, we don't generate these; rather, they are answers to different questions, randomly sampled from within the same author. In the future we can change this according to our work package. The script outputs the truth ratio values and the raw losses, which are output as a numpy array for further processing. Currently, if run as main, these are saved to a .np file in a separate folder within the parent folder of the one where the model weights are stored.
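As a rough illustration, a truth ratio of this kind can be computed from the mean per-token losses; the exact normalisation below is an assumption (see the paper for the precise definition), not necessarily what the script does:

```python
import numpy as np

def truth_ratio(gt_loss: float, perturbed_losses: np.ndarray) -> float:
    """Sketch of a TOFU-style truth ratio from mean per-token NLL losses.

    exp(-loss) is the length-normalised answer probability, so this is the
    average perturbed-answer probability over the ground-truth probability.
    """
    return float(np.mean(np.exp(-perturbed_losses)) / np.exp(-gt_loss))
```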
qualitative_eval.py
This performs a qualitative evaluation of the model. It loops over the test data and generates an output answer for each input question. The question and the generated answer are both printed along with the target answer to allow qualitative comparison.
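Schematically, the loop looks something like the following; the model and data here are placeholders, not the actual ones used in the script:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")

test_data = [{"question": "Who wrote the novel?", "answer": "Jane Doe"}]  # dummy

for example in test_data:
    inputs = tokenizer(example["question"], return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=50)
    generated = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    print("Question: ", example["question"])
    print("Generated:", generated)
    print("Target:   ", example["answer"])
```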
EvalQADataset() Changes
I made some changes to allow the quantitative evaluation script to work. Namely, I added a batch formatter which, when given a question, outputs input IDs, labels, and attention masks with appropriate padding for batch computation. Furthermore, I added a method which locates perturbed answers: given a question index, it finds a random question pertaining to the same author which can be used as a perturbed answer.
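A sketch of those two additions, assuming tokenised sequences arrive as 1-D tensors and each question carries an author id (the names and data layout are illustrative, not the class's actual API):

```python
import random
import torch
from torch.nn.utils.rnn import pad_sequence

def format_batch(token_id_sequences, pad_token_id=0):
    """Pad variable-length sequences into input_ids/labels/attention_mask."""
    input_ids = pad_sequence(token_id_sequences, batch_first=True,
                             padding_value=pad_token_id)
    attention_mask = (input_ids != pad_token_id).long()
    labels = input_ids.clone()
    labels[attention_mask == 0] = -100  # ignore padding positions in the loss
    return {"input_ids": input_ids, "labels": labels,
            "attention_mask": attention_mask}

def perturbed_answer_index(question_idx, authors, rng=random):
    """Pick a random *different* question by the same author, whose answer
    serves as the perturbed answer for question_idx."""
    candidates = [i for i, a in enumerate(authors)
                  if a == authors[question_idx] and i != question_idx]
    return rng.choice(candidates)
```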