Leaderboard: Missing results for C-MTEB #1797
I'm looking into this a bit more in detail now. I have found an interesting case; I don't know how common it is:
{
"dataset_revision": null,
"mteb_dataset_name": "Cmnli",
"mteb_version": "1.1.1.dev0",
"validation": {
"cos_sim": {
"accuracy": 0.6508719182200842,
"accuracy_threshold": 0.9246084690093994,
"ap": 0.7212026503090375,
"f1": 0.7057953600147888,
"f1_threshold": 0.8966324329376221,
"precision": 0.5836135738306328,
"recall": 0.8926817862988076
},
"dot": {
"accuracy": 0.5950691521346964,
"accuracy_threshold": 21.76702117919922,
"ap": 0.6483602531599347,
"f1": 0.6818997878243839,
"f1_threshold": 19.672679901123047,
"precision": 0.5237557979190172,
"recall": 0.9768529342997428
},
"euclidean": {
"accuracy": 0.6535177390258569,
"accuracy_threshold": 1.93505859375,
"ap": 0.7207827996588388,
"f1": 0.7047511312217195,
"f1_threshold": 2.1808691024780273,
"precision": 0.5904280524403728,
"recall": 0.8739770867430442
},
"evaluation_time": 7.19,
"manhattan": {
"accuracy": 0.6514732411304871,
"accuracy_threshold": 30.713165283203125,
"ap": 0.7190222188188342,
"f1": 0.7038932553463717,
"f1_threshold": 34.67675018310547,
"precision": 0.577794448612153,
"recall": 0.90039747486556
},
"max": {
"accuracy": 0.6535177390258569,
"ap": 0.7212026503090375,
"f1": 0.7057953600147888
}
}
}
This prevents it from being loaded, as we only load …
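For context, here is a minimal sketch of how the older nested layout above could be flattened into the flat `{sim}_{metric}` keys that the loaded scores later in this thread use. The helper name and approach are illustrative only, not MTEB's actual loader:

```python
import json
from pathlib import Path

# Similarity-function keys used in the old (pre-#1037) nested layout.
SIM_FUNCTIONS = {"cos_sim", "dot", "euclidean", "manhattan", "max"}

def flatten_old_split(split_scores: dict) -> dict:
    """Flatten {"cos_sim": {"ap": ...}} into {"cos_sim_ap": ...} (illustrative only)."""
    flat = {}
    for key, value in split_scores.items():
        if key in SIM_FUNCTIONS and isinstance(value, dict):
            for metric, score in value.items():
                flat[f"{key}_{metric}"] = score
        else:
            # Scalar entries such as "evaluation_time" are kept as-is.
            flat[key] = value
    return flat

old_result = json.loads(Path("cmnli.json").read_text())
print(flatten_old_split(old_result["validation"]))
```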
Hmm, another interesting thing is that the …
The results metrics structure was changed in #1037.
Hmm, I know, but for some reason it still doesn't get loaded. There must be something about loading older results that doesn't work. @KennethEnevoldsen, any thoughts on this?
If I try to load just this JSON file, I get …
Which is made even stranger by the fact that this loads fine:
cmnli = TaskResult.from_disk(Path("cmnli.json"))
print(cmnli.scores)
{'validation': [{'main_score': None,
'hf_subset': 'default',
'languages': ['cmn-Hans'],
'cos_sim_accuracy': 0.6508719182200842,
'cos_sim_accuracy_threshold': 0.9246084690093994,
'cos_sim_ap': 0.7212026503090375,
'cos_sim_f1': 0.7057953600147888,
'cos_sim_f1_threshold': 0.8966324329376221,
'cos_sim_precision': 0.5836135738306328,
'cos_sim_recall': 0.8926817862988076,
'dot_accuracy': 0.5950691521346964,
'dot_accuracy_threshold': 21.76702117919922,
'dot_ap': 0.6483602531599347,
'dot_f1': 0.6818997878243839,
'dot_f1_threshold': 19.672679901123047,
'dot_precision': 0.5237557979190172,
'dot_recall': 0.9768529342997428,
'euclidean_accuracy': 0.6535177390258569,
'euclidean_accuracy_threshold': 1.93505859375,
'euclidean_ap': 0.7207827996588388,
'euclidean_f1': 0.7047511312217195,
'euclidean_f1_threshold': 2.1808691024780273,
'euclidean_precision': 0.5904280524403728,
'euclidean_recall': 0.8739770867430442,
'manhattan_accuracy': 0.6514732411304871,
'manhattan_accuracy_threshold': 30.713165283203125,
'manhattan_ap': 0.7190222188188342,
'manhattan_f1': 0.7038932553463717,
'manhattan_f1_threshold': 34.67675018310547,
'manhattan_precision': 0.577794448612153,
'manhattan_recall': 0.90039747486556,
'max_accuracy': 0.6535177390258569,
'max_ap': 0.7212026503090375,
'max_f1': 0.7057953600147888}]}
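The `main_score: None` in that output looks like the sort of thing that could make an entry get dropped. A minimal sketch of backfilling it, continuing from the `cmnli` object above, and assuming (illustratively) that the leaderboard skips entries without a `main_score` and that `max_ap` is the relevant main metric for this pair-classification task; neither assumption is confirmed MTEB behaviour:

```python
# Hypothetical backfill: copy the assumed main metric into main_score
# wherever it is missing. "max_ap" is an assumption for this task.
MAIN_METRIC = "max_ap"

for split_entries in cmnli.scores.values():
    for entry in split_entries:
        if entry.get("main_score") is None and MAIN_METRIC in entry:
            entry["main_score"] = entry[MAIN_METRIC]

print(cmnli.scores["validation"][0]["main_score"])  # 0.7212026503090375
```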
It seems to me that the problem is threefold:
On second thoughts, we should probably be using the v1 clustering tasks, since those are in the original publication.
Ooooh, I think I know what's going on.
#1801 fixes some of this.
From #1801, Model: intfloat/multilingual-e5-small
I think all tasks in the Chinese benchmark are Chinese only, so selecting the correct hf_subsets shouldn't be an issue.
Do we want to fix Cmnli and Ocnli, or keep them as is?
For …
Ooh right, should I add that to the benchmark description, or would you like to do it?
I think it would be better if you did.
I've also checked again, and the results we're missing now do indeed seem to be missing from the old benchmark.
Okey doke, will do.
I'll compile a list of the model metas we're missing.
Seems like you did quite a bit of work on this; let me know if there is anything you want me to take a look at.
@KennethEnevoldsen, I'd appreciate it if you could take care of the stella models in issue #1803 (if you have the time and energy, of course; otherwise I can do it).
For some reason we're missing a lot of results on the Chinese benchmark.
I have been doing some investigations into this, and it seems that some of the results do not get loaded at all, rather than it being a case of get_scores() failing because of missing splits.
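For compiling the missing-results list mentioned above, a rough sketch along these lines could work. The results-repo layout (results/<model>/<revision>/<Task>.json) and the task names are assumptions for illustration only:

```python
from pathlib import Path

# Illustrative only: assumes a local results checkout laid out as
# results/<model_name>/<revision>/<TaskName>.json; adjust to the real layout.
CMTEB_TASKS = {"Cmnli", "Ocnli", "TNews", "IFlyTek"}  # partial list, for illustration

results_root = Path("results")
for model_dir in sorted(results_root.iterdir()):
    if not model_dir.is_dir():
        continue
    for revision_dir in model_dir.iterdir():
        if not revision_dir.is_dir():
            continue
        present = {path.stem for path in revision_dir.glob("*.json")}
        missing = CMTEB_TASKS - present
        if missing:
            print(f"{model_dir.name} ({revision_dir.name}): missing {sorted(missing)}")
```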