Leaderboard: Missing results for C-MTEB #1797

Open
x-tabdeveloping opened this issue Jan 14, 2025 · 22 comments

Comments

@x-tabdeveloping
Collaborator

For some reason we're missing a lot of results on the Chinese benchmark.
I've been investigating this, and it seems that some of the results do not get loaded at all, rather than it being a case of get_scores() failing because of missing splits.

@x-tabdeveloping
Collaborator Author

I'm looking into this in a bit more detail now. I have found an interesting case, though I don't know how common it is:
intfloat/multilingual-e5-small is missing Cmnli.
I have checked, and there is a file for it in the results repo. It seems, however, that it is missing a main_score field:

{
  "dataset_revision": null,
  "mteb_dataset_name": "Cmnli",
  "mteb_version": "1.1.1.dev0",
  "validation": {
    "cos_sim": {
      "accuracy": 0.6508719182200842,
      "accuracy_threshold": 0.9246084690093994,
      "ap": 0.7212026503090375,
      "f1": 0.7057953600147888,
      "f1_threshold": 0.8966324329376221,
      "precision": 0.5836135738306328,
      "recall": 0.8926817862988076
    },
    "dot": {
      "accuracy": 0.5950691521346964,
      "accuracy_threshold": 21.76702117919922,
      "ap": 0.6483602531599347,
      "f1": 0.6818997878243839,
      "f1_threshold": 19.672679901123047,
      "precision": 0.5237557979190172,
      "recall": 0.9768529342997428
    },
    "euclidean": {
      "accuracy": 0.6535177390258569,
      "accuracy_threshold": 1.93505859375,
      "ap": 0.7207827996588388,
      "f1": 0.7047511312217195,
      "f1_threshold": 2.1808691024780273,
      "precision": 0.5904280524403728,
      "recall": 0.8739770867430442
    },
    "evaluation_time": 7.19,
    "manhattan": {
      "accuracy": 0.6514732411304871,
      "accuracy_threshold": 30.713165283203125,
      "ap": 0.7190222188188342,
      "f1": 0.7038932553463717,
      "f1_threshold": 34.67675018310547,
      "precision": 0.577794448612153,
      "recall": 0.90039747486556
    },
    "max": {
      "accuracy": 0.6535177390258569,
      "ap": 0.7212026503090375,
      "f1": 0.7057953600147888
    }
  }
}

This prevents it from being loaded, as we only load main_score when using the leaderboard, to avoid OOM errors (mteb.load_results(only_main_score=True)).
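
For context, this is roughly how the leaderboard-style loading is invoked (a sketch; only_main_score is the flag mentioned above, and the exact signature may differ between mteb versions):

```python
import mteb

# Leaderboard-style loading: keep only main_score values to avoid OOM errors.
# Result files like the Cmnli one above, which lack a main_score, end up
# without a usable score here.
results = mteb.load_results(only_main_score=True)
```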

@x-tabdeveloping changed the title from "Loads of results missing for C-MTEB" to "Missing results for C-MTEB" on Jan 14, 2025
@x-tabdeveloping
Collaborator Author

Hmm, another interesting thing is that the main_score of the task object is max_accuracy, which is obviously not present here; instead we have {"max": {"accuracy": score}}.
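
As a quick sanity check, the expected main score can be read off the task metadata (a sketch, assuming the current get_tasks API):

```python
import mteb

# Inspect which main_score the Cmnli task object declares. This is expected
# to print "max_accuracy", which has no direct counterpart in the old nested
# result format shown above ({"max": {"accuracy": ...}}).
task = mteb.get_tasks(tasks=["Cmnli"])[0]
print(task.metadata.main_score)
```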

@Samoed
Collaborator

Samoed commented Jan 14, 2025

The results metrics structure was changed in #1037.

@x-tabdeveloping changed the title from "Missing results for C-MTEB" to "Leaderboard: Missing results for C-MTEB" on Jan 14, 2025
@x-tabdeveloping
Collaborator Author

Hmm, I know, but for some reason it still doesn't get loaded. There must be something in the loading of older results that doesn't work. @KennethEnevoldsen, any thoughts on this?

@x-tabdeveloping
Collaborator Author

If I try to load just this JSON file, I get "Main score max_accuracy not found in scores".

@x-tabdeveloping
Collaborator Author

Which is made even stranger by the fact that max_accuracy is actually there:

from pathlib import Path

from mteb import TaskResult  # assumed import location; may differ across mteb versions

cmnli = TaskResult.from_disk(Path("cmnli.json"))
print(cmnli.scores)

{'validation': [{'main_score': None,
   'hf_subset': 'default',
   'languages': ['cmn-Hans'],
   'cos_sim_accuracy': 0.6508719182200842,
   'cos_sim_accuracy_threshold': 0.9246084690093994,
   'cos_sim_ap': 0.7212026503090375,
   'cos_sim_f1': 0.7057953600147888,
   'cos_sim_f1_threshold': 0.8966324329376221,
   'cos_sim_precision': 0.5836135738306328,
   'cos_sim_recall': 0.8926817862988076,
   'dot_accuracy': 0.5950691521346964,
   'dot_accuracy_threshold': 21.76702117919922,
   'dot_ap': 0.6483602531599347,
   'dot_f1': 0.6818997878243839,
   'dot_f1_threshold': 19.672679901123047,
   'dot_precision': 0.5237557979190172,
   'dot_recall': 0.9768529342997428,
   'euclidean_accuracy': 0.6535177390258569,
   'euclidean_accuracy_threshold': 1.93505859375,
   'euclidean_ap': 0.7207827996588388,
   'euclidean_f1': 0.7047511312217195,
   'euclidean_f1_threshold': 2.1808691024780273,
   'euclidean_precision': 0.5904280524403728,
   'euclidean_recall': 0.8739770867430442,
   'manhattan_accuracy': 0.6514732411304871,
   'manhattan_accuracy_threshold': 30.713165283203125,
   'manhattan_ap': 0.7190222188188342,
   'manhattan_f1': 0.7038932553463717,
   'manhattan_f1_threshold': 34.67675018310547,
   'manhattan_precision': 0.577794448612153,
   'manhattan_recall': 0.90039747486556,
   'max_accuracy': 0.6535177390258569,
   'max_ap': 0.7212026503090375,
   'max_f1': 0.7057953600147888}]}

@x-tabdeveloping
Collaborator Author

It seems to me that the problem is threefold:

  1. We are indeed missing some results for some models; this is not as problematic, since these are not present in the old C-MTEB leaderboard either.
  2. Some benchmark tasks were specified with the old version while the results use v2; we'll have to fix that in benchmarks.py.
  3. Results from before 1.1.1.0 don't get loaded properly. I suspect this is because the field names are not renamed before scores = scores[task.metadata.main_score] is called (see the sketch below).
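
A minimal, self-contained sketch of the renaming step I suspect is missing for the old format (the dict and names below are illustrative, not the actual mteb internals):

```python
# Old-format split scores nest metrics per similarity function; the task's
# main_score ("max_accuracy") only exists after flattening to "<sim>_<metric>".
legacy = {"max": {"accuracy": 0.6535177390258569}}
flattened = {
    f"{sim}_{metric}": value
    for sim, metrics in legacy.items()
    for metric, value in metrics.items()
}
# The lookup scores[task.metadata.main_score] can only succeed after this step.
assert flattened["max_accuracy"] == 0.6535177390258569
```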

@Samoed mentioned this issue Jan 14, 2025
@x-tabdeveloping
Collaborator Author

On second thought, we should probably be using the v1 clustering tasks, since those are the ones in the original publication.

@x-tabdeveloping
Collaborator Author

Ooooh, I think I know what's going on.
Some of the folders that contain old results do not have a model_meta.json, and we do not allow those to load, because the parameter require_model_meta is set to True by default on load_results().
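
So for those older folders, the check can presumably be relaxed when loading (a sketch; require_model_meta is the parameter named above, and defaults may differ between mteb versions):

```python
import mteb

# Allow result folders that lack a model_meta.json to be loaded as well.
results = mteb.load_results(require_model_meta=False)
```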

@x-tabdeveloping
Collaborator Author

#1801 fixes some of this

@Samoed
Collaborator

Samoed commented Jan 14, 2025

From #1801

Model: intfloat/multilingual-e5-small

Task Name | Old Leaderboard | New Leaderboard
--- | --- | ---
T2Retrieval | 71.39 | 71.39
MMarcoRetrieval | 73.17 | 73.17
DuRetrieval | 81.35 | 81.35
CovidRetrieval | 72.82 | 72.82
CmedqaRetrieval | 24.38 | 24.38
EcomRetrieval | 53.56 | 53.56
MedicalRetrieval | 44.84 | 44.84
VideoRetrieval | 58.09 | 58.09
T2Reranking | 65.24 | 65.24
MMarcoReranking | 24.33 | 24.33
CMedQAv1-reranking | N/A | 63.44
CMedQAv2-reranking | N/A | 62.41
Ocnli | 60.77 | 58.69
Cmnli | 72.12 | 65.35
CLSClusteringS2S | 37.79 | 37.79
CLSClusteringP2P | 39.14 | 39.14
ThuNewsClusteringS2S | 48.93 | 48.93
ThuNewsClusteringP2P | 55.18 | 55.18
ATEC | 35.14 | 35.14
BQ | 21.51 | 43.27
LCQMC | 72.7 | 72.7
PAWSX | 11.01 | 11.01
STSB | 84.11 | 77.73
AFQMC | 25.21 | 25.21
QBQTC | 30.25 | 30.25
TNews | 48.38 | 48.38
IFlyTek | 47.35 | 47.35
Waimai | 83.9 | 83.9
OnlineShopping | 88.73 | 88.73
MultilingualSentiment | 64.74 | 66.34
JDReview | 79.34 | 79.34

@x-tabdeveloping
Collaborator Author

I think all tasks in the Chinese benchmark are Chinese-only, so selecting the correct hf_subsets shouldn't be an issue.
Both MultilingualSentiment and STSB only have a "default" hf_subset, and they are both Chinese-only.
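
A quick way to sanity-check this (a sketch; the metadata attribute names follow current mteb conventions and may differ between versions):

```python
import mteb

# Print the evaluation languages of both tasks; each should only cover Chinese.
for task in mteb.get_tasks(tasks=["MultilingualSentiment", "STSB"]):
    print(task.metadata.name, task.metadata.eval_langs)
```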

@x-tabdeveloping
Collaborator Author

Do we want to fix Cmnli and Ocnli, or keep them as they are?

@Samoed
Collaborator

Samoed commented Jan 14, 2025

For MultilingualSentiment I meant to use only the test split, without validation. For Cmnli and Ocnli I think we should leave them as they are.
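
Something along these lines in benchmarks.py, presumably (a sketch; I'm assuming get_tasks accepts eval_splits in the current version, and the actual benchmark definition may look different):

```python
import mteb

# Restrict MultilingualSentiment to its test split when assembling the
# benchmark's task list.
tasks = mteb.get_tasks(tasks=["MultilingualSentiment"], eval_splits=["test"])
```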

@x-tabdeveloping
Collaborator Author

ooh right, should I add that in the benchmark description or would you like to do it?

@Samoed
Collaborator

Samoed commented Jan 14, 2025

I think it would be better if you did.

@x-tabdeveloping
Collaborator Author

I've also checked again, and the results we're missing now do indeed seem to be missing from the old benchmark as well.
I think the reason the old Chinese leaderboard seems more complete is that we haven't annotated a bunch of Chinese models with metadata.

@x-tabdeveloping
Collaborator Author

Okey doke, will do.

@x-tabdeveloping
Collaborator Author

I'll compile a list of model metas we're missing

@KennethEnevoldsen
Contributor

Seems like you did quite a bit of work on this; let me know if there is anything you want me to take a look at.

@x-tabdeveloping
Collaborator Author

@KennethEnevoldsen I'd appreciate it if you could take care of the stella models in issue #1803 (if you have the time and energy, of course; otherwise I can do it).
