Leaderboard: Missing results for C-MTEB #1797

Open
x-tabdeveloping opened this issue Jan 14, 2025 · 22 comments

Comments

@x-tabdeveloping
Collaborator

For some reason we're missing a lot of results on the Chinese benchmark.
I've been investigating this, and it seems that some of the results do not get loaded at all, rather than it being a case of get_scores() failing because of missing splits.

@x-tabdeveloping
Collaborator Author

I'm looking into this in a bit more detail now. I have found an interesting case, though I don't know how common it is:
intfloat/multilingual-e5-small is missing Cmnli.
I have checked, and there is a file for it in the results repo. It seems, however, that it is missing a main_score field:

{
  "dataset_revision": null,
  "mteb_dataset_name": "Cmnli",
  "mteb_version": "1.1.1.dev0",
  "validation": {
    "cos_sim": {
      "accuracy": 0.6508719182200842,
      "accuracy_threshold": 0.9246084690093994,
      "ap": 0.7212026503090375,
      "f1": 0.7057953600147888,
      "f1_threshold": 0.8966324329376221,
      "precision": 0.5836135738306328,
      "recall": 0.8926817862988076
    },
    "dot": {
      "accuracy": 0.5950691521346964,
      "accuracy_threshold": 21.76702117919922,
      "ap": 0.6483602531599347,
      "f1": 0.6818997878243839,
      "f1_threshold": 19.672679901123047,
      "precision": 0.5237557979190172,
      "recall": 0.9768529342997428
    },
    "euclidean": {
      "accuracy": 0.6535177390258569,
      "accuracy_threshold": 1.93505859375,
      "ap": 0.7207827996588388,
      "f1": 0.7047511312217195,
      "f1_threshold": 2.1808691024780273,
      "precision": 0.5904280524403728,
      "recall": 0.8739770867430442
    },
    "evaluation_time": 7.19,
    "manhattan": {
      "accuracy": 0.6514732411304871,
      "accuracy_threshold": 30.713165283203125,
      "ap": 0.7190222188188342,
      "f1": 0.7038932553463717,
      "f1_threshold": 34.67675018310547,
      "precision": 0.577794448612153,
      "recall": 0.90039747486556
    },
    "max": {
      "accuracy": 0.6535177390258569,
      "ap": 0.7212026503090375,
      "f1": 0.7057953600147888
    }
  }
}

This prevents it from being loaded, as we only load main_score when using the leaderboard, to avoid OOM errors (mteb.load_results(only_main_score=True)).
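
For context, this is roughly how the leaderboard-style loading is invoked (a sketch; only_main_score is the flag mentioned above, and the exact signature may differ between mteb versions):

```python
import mteb

# Leaderboard-style loading: keep only main_score values to avoid OOM errors.
# Result files like the Cmnli one above, which lack a main_score, end up
# without a usable score here.
results = mteb.load_results(only_main_score=True)
```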

@x-tabdeveloping changed the title from "Loads of results missing for C-MTEB" to "Missing results for C-MTEB" on Jan 14, 2025
@x-tabdeveloping
Collaborator Author

Hmm, another interesting thing is that the main_score of the task object is max_accuracy, which is obviously not present here; instead we have {"max": {"accuracy": score}}.
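
As a quick sanity check, the expected main score can be read off the task metadata (a sketch, assuming the current get_tasks API):

```python
import mteb

# Inspect which main_score the Cmnli task object declares. This is expected
# to print "max_accuracy", which has no direct counterpart in the old nested
# result format shown above ({"max": {"accuracy": ...}}).
task = mteb.get_tasks(tasks=["Cmnli"])[0]
print(task.metadata.main_score)
```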

@Samoed
Collaborator

Samoed commented Jan 14, 2025

The results metrics structure was changed in #1037.

@x-tabdeveloping changed the title from "Missing results for C-MTEB" to "Leaderboard: Missing results for C-MTEB" on Jan 14, 2025
@x-tabdeveloping
Collaborator Author

Hmm, I know, but for some reason it still doesn't get loaded. There must be something in the loading of older results that doesn't work. @KennethEnevoldsen, any thoughts on this?

@x-tabdeveloping
Collaborator Author

If I try to load just this JSON file, I get "Main score max_accuracy not found in scores".

@x-tabdeveloping
Collaborator Author

Which is made even stranger by the fact that max_accuracy is actually there:

from pathlib import Path

from mteb import TaskResult  # assumed import location; may differ across mteb versions

cmnli = TaskResult.from_disk(Path("cmnli.json"))
print(cmnli.scores)

{'validation': [{'main_score': None,
   'hf_subset': 'default',
   'languages': ['cmn-Hans'],
   'cos_sim_accuracy': 0.6508719182200842,
   'cos_sim_accuracy_threshold': 0.9246084690093994,
   'cos_sim_ap': 0.7212026503090375,
   'cos_sim_f1': 0.7057953600147888,
   'cos_sim_f1_threshold': 0.8966324329376221,
   'cos_sim_precision': 0.5836135738306328,
   'cos_sim_recall': 0.8926817862988076,
   'dot_accuracy': 0.5950691521346964,
   'dot_accuracy_threshold': 21.76702117919922,
   'dot_ap': 0.6483602531599347,
   'dot_f1': 0.6818997878243839,
   'dot_f1_threshold': 19.672679901123047,
   'dot_precision': 0.5237557979190172,
   'dot_recall': 0.9768529342997428,
   'euclidean_accuracy': 0.6535177390258569,
   'euclidean_accuracy_threshold': 1.93505859375,
   'euclidean_ap': 0.7207827996588388,
   'euclidean_f1': 0.7047511312217195,
   'euclidean_f1_threshold': 2.1808691024780273,
   'euclidean_precision': 0.5904280524403728,
   'euclidean_recall': 0.8739770867430442,
   'manhattan_accuracy': 0.6514732411304871,
   'manhattan_accuracy_threshold': 30.713165283203125,
   'manhattan_ap': 0.7190222188188342,
   'manhattan_f1': 0.7038932553463717,
   'manhattan_f1_threshold': 34.67675018310547,
   'manhattan_precision': 0.577794448612153,
   'manhattan_recall': 0.90039747486556,
   'max_accuracy': 0.6535177390258569,
   'max_ap': 0.7212026503090375,
   'max_f1': 0.7057953600147888}]}

@x-tabdeveloping
Collaborator Author

It seems to me that the problem is threefold:

  1. We are indeed missing some results for some models; this is not as problematic, since these are not present in the old C-MTEB leaderboard either.
  2. Some benchmark tasks were specified with the old version while the results use v2; we'll have to fix that in benchmarks.py.
  3. Results from before 1.1.1.0 don't get loaded properly. I suspect this is because the field names are not renamed before scores = scores[task.metadata.main_score] is called (see the sketch below).
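
A minimal, self-contained sketch of the renaming step I suspect is missing for the old format (the dict and names below are illustrative, not the actual mteb internals):

```python
# Old-format split scores nest metrics per similarity function; the task's
# main_score ("max_accuracy") only exists after flattening to "<sim>_<metric>".
legacy = {"max": {"accuracy": 0.6535177390258569}}
flattened = {
    f"{sim}_{metric}": value
    for sim, metrics in legacy.items()
    for metric, value in metrics.items()
}
# The lookup scores[task.metadata.main_score] can only succeed after this step.
assert flattened["max_accuracy"] == 0.6535177390258569
```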

@Samoed mentioned this issue Jan 14, 2025
@x-tabdeveloping
Collaborator Author

On second thought, we should probably be using the v1 clustering tasks, since those are the ones in the original publication.

@x-tabdeveloping
Collaborator Author

Ooooh, I think I know what's going on.
Some of the folders that contain old results do not have a model_meta.json, and we do not allow those to load, because the parameter require_model_meta is set to True by default on load_results().
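
So for those older folders, the check can presumably be relaxed when loading (a sketch; require_model_meta is the parameter named above, and defaults may differ between mteb versions):

```python
import mteb

# Allow result folders that lack a model_meta.json to be loaded as well.
results = mteb.load_results(require_model_meta=False)
```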

@x-tabdeveloping
Collaborator Author

#1801 fixes some of this

@Samoed
Collaborator

Samoed commented Jan 14, 2025

From #1801

Model: intfloat/multilingual-e5-small

Task Name | Old Leaderboard | New Leaderboard
--- | --- | ---
T2Retrieval | 71.39 | 71.39
MMarcoRetrieval | 73.17 | 73.17
DuRetrieval | 81.35 | 81.35
CovidRetrieval | 72.82 | 72.82
CmedqaRetrieval | 24.38 | 24.38
EcomRetrieval | 53.56 | 53.56
MedicalRetrieval | 44.84 | 44.84
VideoRetrieval | 58.09 | 58.09
T2Reranking | 65.24 | 65.24
MMarcoReranking | 24.33 | 24.33
CMedQAv1-reranking | N/A | 63.44
CMedQAv2-reranking | N/A | 62.41
Ocnli | 60.77 | 58.69
Cmnli | 72.12 | 65.35
CLSClusteringS2S | 37.79 | 37.79
CLSClusteringP2P | 39.14 | 39.14
ThuNewsClusteringS2S | 48.93 | 48.93
ThuNewsClusteringP2P | 55.18 | 55.18
ATEC | 35.14 | 35.14
BQ | 21.51 | 43.27
LCQMC | 72.7 | 72.7
PAWSX | 11.01 | 11.01
STSB | 84.11 | 77.73
AFQMC | 25.21 | 25.21
QBQTC | 30.25 | 30.25
TNews | 48.38 | 48.38
IFlyTek | 47.35 | 47.35
Waimai | 83.9 | 83.9
OnlineShopping | 88.73 | 88.73
MultilingualSentiment | 64.74 | 66.34
JDReview | 79.34 | 79.34

@x-tabdeveloping
Collaborator Author

I think all tasks in the Chinese benchmark are Chinese-only, so selecting the correct hf_subsets shouldn't be an issue.
Both MultilingualSentiment and STSB only have a "default" hf_subset, and they are both Chinese-only.
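
A quick way to sanity-check this (a sketch; the metadata attribute names follow current mteb conventions and may differ between versions):

```python
import mteb

# Print the evaluation languages of both tasks; each should only cover Chinese.
for task in mteb.get_tasks(tasks=["MultilingualSentiment", "STSB"]):
    print(task.metadata.name, task.metadata.eval_langs)
```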

@x-tabdeveloping
Collaborator Author

Do we want to fix Cmnli and Ocnli, or keep them as they are?

@Samoed
Collaborator

Samoed commented Jan 14, 2025

For MultilingualSentiment I meant to use only the test split, without validation. For Cmnli and Ocnli I think we should leave them as they are.
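
Something along these lines in benchmarks.py, presumably (a sketch; I'm assuming get_tasks accepts eval_splits in the current version, and the actual benchmark definition may look different):

```python
import mteb

# Restrict MultilingualSentiment to its test split when assembling the
# benchmark's task list.
tasks = mteb.get_tasks(tasks=["MultilingualSentiment"], eval_splits=["test"])
```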

@x-tabdeveloping
Collaborator Author

ooh right, should I add that in the benchmark description or would you like to do it?

@Samoed
Collaborator

Samoed commented Jan 14, 2025

I think it would be better if you did.

@x-tabdeveloping
Collaborator Author

I've also checked again, and the results we're missing now do indeed seem to be missing from the old benchmark as well.
I think the reason the old Chinese leaderboard seems more complete is that we haven't annotated a bunch of Chinese models with metadata.

@x-tabdeveloping
Collaborator Author

Okey doke, will do.

@x-tabdeveloping
Collaborator Author

I'll compile a list of model metas we're missing

@KennethEnevoldsen
Contributor

Seems like you did quite a bit of work on this; let me know if there is anything you want me to take a look at.

@x-tabdeveloping
Collaborator Author

@KennethEnevoldsen I'd appreciate it if you could take care of the stella models in issue #1803 (if you have the time and energy, of course; otherwise I can do it).
