[Bug]: Validation F1 score is consistently 0 across training epochs but test F1 is ~0.85 #3439
Comments
Hi @choomegan, could you check whether the best model was saved in a previous epoch? I hypothesize that a best model with a non-zero F1 score was found in an earlier epoch, which would explain why you achieve a non-zero F1 score on the test set. Maybe you can post the full log output here :)
Hi @stefan-it, I have attached the full logs here: flair_finetune.log. I ran inference with … Seems like the best model is not saved, as only …
Hello @choomegan. Have you found a solution to the issue yet? I have the same problem with validation F1 being ~0 in the …
Hi @Aakame, I have not found a solution to the issue yet :( @stefan-it would you be able to assist? Thanks!
I've downgraded Flair to version 12.2, and it appears that the …
The only difference between the faulty DEV evaluations that happen after each epoch and the correct final TEST evaluation is the storage of the embeddings, which doesn't happen in the latter case:
I found out that when I set the embeddings_storage_mode to "none", the DEV evaluation happens correctly again and the score becomes higher than zero. @stefan-it I guess the gold labels get wiped out as part of data_point.clear_embeddings()?
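The hypothesis above, that gold labels get wiped alongside the embeddings during per-datapoint cleanup, can be illustrated with a self-contained toy example. This is not Flair's actual code; the classes, the `also_clear_labels` flag, and the micro-F1 helper are all hypothetical stand-ins to show why F1 would collapse to exactly 0 if evaluation scores predictions against emptied gold labels:

```python
class DataPoint:
    """Toy stand-in for a data point carrying an embedding and gold labels."""

    def __init__(self, gold_labels):
        self.embedding = [0.0] * 4          # stand-in for a stored embedding
        self.gold_labels = list(gold_labels)

    def clear_embeddings(self, also_clear_labels=False):
        # Hypothetical buggy behavior: wiping labels along with embeddings.
        self.embedding = []
        if also_clear_labels:
            self.gold_labels = []


def micro_f1(points, predictions):
    """Micro-averaged F1 over label sets (toy implementation)."""
    tp = fp = fn = 0
    for point, preds in zip(points, predictions):
        gold, pred = set(point.gold_labels), set(preds)
        tp += len(gold & pred)
        fp += len(pred - gold)
        fn += len(gold - pred)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


points = [DataPoint(["B-SHI"]), DataPoint(["O"])]
predictions = [["B-SHI"], ["O"]]            # perfect predictions

# With gold labels intact, perfect predictions score F1 = 1.0.
f1_ok = micro_f1(points, predictions)

# Simulate the suspected bug: cleanup wipes the gold labels before scoring,
# so every prediction counts as a false positive and F1 collapses to 0.0.
for p in points:
    p.clear_embeddings(also_clear_labels=True)
f1_buggy = micro_f1(points, predictions)
```

The toy run shows the signature seen in the logs: the model predicts fine, but the dev-time score is identically zero because the references it is compared against are empty.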
Describe the bug
When training a SequenceTagger for NER with the last layer of RoBERTa embeddings, the micro-average F1 score on the validation set is consistently 0, even though the training loss decreases as expected. However, the test set F1 score is 0.8490. This suggests the validation F1 is being computed or logged incorrectly, rather than the model actually failing on the dev set.
My dataset only has 3 possible tags: B-SHI, I-SHI and O.
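For context, a minimal sketch of a comparable fine-tuning setup is below. All paths, the data folder layout, the hidden size, and the learning rate are placeholders rather than values from the original report; the `embeddings_storage_mode="none"` argument reflects the workaround reported later in the thread. This is a configuration sketch, not a verified reproduction:

```python
from flair.datasets import ColumnCorpus
from flair.embeddings import TransformerWordEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# Hypothetical CoNLL-style data folder with train/dev/test splits.
corpus = ColumnCorpus("data/", {0: "text", 1: "ner"})
tag_dictionary = corpus.make_label_dictionary(label_type="ner")

# Last layer of RoBERTa, as described in the bug report.
embeddings = TransformerWordEmbeddings("roberta-base", layers="-1", fine_tune=True)

tagger = SequenceTagger(
    hidden_size=256,                 # placeholder hyperparameter
    embeddings=embeddings,
    tag_dictionary=tag_dictionary,
    tag_type="ner",
)

trainer = ModelTrainer(tagger, corpus)
trainer.fine_tune(
    "resources/taggers/ner-shi",     # placeholder output path
    max_epochs=150,
    embeddings_storage_mode="none",  # workaround reported in the comments
)
```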
To Reproduce
Expected behavior
Non-zero validation F1 scores, since the training loss is decreasing. The validation F1 score near the end of the 150 epochs should be comparable to the test set F1 (~0.85).
Logs and Stack traces