Running 7b models on the benchmark #269
9 comments · 12 replies
-
Hi @wendlerc, it really depends on the implementation and hardware. I did the Scandinavian segment in less than an hour on a V100 (just to give you an approximation), and that was in 16-bit using the naive transformers implementation. @Muennighoff might have a better estimate or suggestions for how to do it fast.
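The "less than an hour on a V100" figure can be sanity-checked with a back-of-envelope calculation. Everything below (corpus size, average length, throughput) is an illustrative assumption, not a measurement:

```python
# Rough runtime estimate for embedding a benchmark with a 7B model,
# assuming one forward pass per text. All numbers are illustrative.

def estimate_hours(num_texts: int, avg_tokens: int, tokens_per_sec: float) -> float:
    """Estimate wall-clock hours to embed num_texts documents."""
    total_tokens = num_texts * avg_tokens
    return total_tokens / tokens_per_sec / 3600

# Hypothetical: ~100k texts, ~100 tokens each, ~5k tokens/s in fp16 on a V100.
hours = estimate_hours(num_texts=100_000, avg_tokens=100, tokens_per_sec=5_000)
print(f"{hours:.1f} h")  # roughly half an hour under these assumptions
```

Plugging in your own corpus size and measured throughput gives a quick feasibility check before committing GPU time.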
-
I think I might have a training-free method to turn any autoregressive model into a sentence embedding model and would like to do a quick evaluation. If there is a representative subset of tasks that people particularly care about, that would also be helpful, or if anybody wants to get on board and help with running evals.
-
I have specifically designed the Scandinavian embedding benchmark to be small. For testing purposes, I would probably go for that one, assuming you have your method wrapped in an encode interface. Alternatively, you can use the table on the front page to select ~10 random datasets with relatively few samples.
-
Will implement the encode method tomorrow & keep you posted.
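For reference, MTEB-style evaluators expect a model object exposing an `encode(sentences, **kwargs)` method that returns one embedding per input. A minimal sketch of that interface is below; the actual LLM forward pass is stubbed out with a placeholder, and the class and dimension are hypothetical:

```python
import numpy as np

class EmbeddingWrapper:
    """Minimal sketch of the encode() interface an MTEB-style
    evaluator expects. The real forward pass is stubbed out."""

    def __init__(self, dim: int = 4096, batch_size: int = 32):
        self.dim = dim
        self.batch_size = batch_size

    def _embed_batch(self, batch: list[str]) -> np.ndarray:
        # Placeholder: a real implementation would run the autoregressive
        # model here and pool its hidden states into one vector per text.
        return np.zeros((len(batch), self.dim), dtype=np.float32)

    def encode(self, sentences: list[str], **kwargs) -> np.ndarray:
        out = []
        for i in range(0, len(sentences), self.batch_size):
            out.append(self._embed_batch(sentences[i:i + self.batch_size]))
        return np.concatenate(out, axis=0)

model = EmbeddingWrapper(dim=8)
emb = model.encode(["hello", "world", "foo"])
print(emb.shape)  # (3, 8)
```

Once `_embed_batch` is backed by a real model, the same object can be passed straight into the benchmark's evaluation loop.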
-
As promised, here is a quick and dirty implementation: https://github.com/wendlerc/llama2-embeddings Batch size is currently hard-coded, as is the maximum sequence length in the tokenization step.
-
I guess maybe the method needs some more work.
-
How does it compare to raw Llama 2? You might want to add that approach as well. Btw, this seems more like a discussion than an issue, so I will just move it over.
-
I did some basic tests with echo embeddings, and without training they perform similarly to the method I proposed. I.e., taking the sum alone does not do the trick, though it gets slightly better.
-
How long does it take to run a Llama2-sized model over the benchmark?
Best,
Chris