[v2] convert category from length descriptor to modality in task metadata #1767

Open
KennethEnevoldsen opened this issue Jan 11, 2025 · 7 comments

Comments

@KennethEnevoldsen
Contributor

KennethEnevoldsen commented Jan 11, 2025

Not sure if this is a good idea. Currently it is already somewhat vaguely defined.

I believe the original intention was to tell us something about the length (s2p: sentence to paragraph), but we now have the descriptive statistics, which are a much better source.

However, in MIEB it is used for modality, e.g. "t2i" (text to image).

@Muennighoff would love to know what you think:

Here is a sample from the descriptive statistics:

...
        "average_document_length": 20.28592186371801,
        "max_document_length": 214210,
        "unique_documents": 1005474,
        "min_query_length": 2,
        "average_query_length": 38.259317745096176,
...

@isaac-chung you have also been involved greatly in both parts.

(An alternative is to convert the annotation in MIEB into "s2i", meaning sentence to image.)

@Muennighoff
Contributor

Muennighoff commented Jan 11, 2025

Converting it to modality makes sense to me! s2p, p2p are much less specific than the actual lengths!

@isaac-chung
Collaborator

Yes! This would align MTEB and MIEB in a much better way. The change I see from this is:

  • Update "s2p" and "p2s" -> "t2t"
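A minimal sketch of that conversion, assuming task categories are plain strings. The names `LEGACY_TO_MODALITY` and `to_modality_category` are hypothetical and not part of mteb; only the "s2p" and "p2s" entries come from the comment above.

```python
# Hypothetical mapping from length-style categories to modality-style ones.
# Only the two entries named in the comment above are included.
LEGACY_TO_MODALITY = {
    "s2p": "t2t",
    "p2s": "t2t",
}


def to_modality_category(category: str) -> str:
    """Return the modality-style category, leaving unknown values unchanged."""
    return LEGACY_TO_MODALITY.get(category, category)


# A length-style category is rewritten; a MIEB-style one passes through.
print(to_modality_category("s2p"))
print(to_modality_category("t2i"))
```

Values that are already modality-style (such as "t2i") are returned as-is, so the same helper could be applied to both MTEB and MIEB tasks.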

@Samoed
Collaborator

Samoed commented Jan 12, 2025

I think s2p is still relevant because some models don’t use prompts for passages, and it can be helpful to differentiate between s2s and s2p.

@isaac-chung
Collaborator

  1. Hmm true. Could you share an example of a model using a task's category to determine whether to use prompts or not? Might help us find a better way forward.
  2. What alternative do you propose? I see an option with t2t being a parent category, and s2p being a child category, e.g. {"t2t": "s2p"}

@Samoed
Collaborator

Samoed commented Jan 12, 2025

For now this is used in NV-Embed, but in a simple way:

    instruction = ""
    if prompt_type == PromptType.query:
        instruction = self.get_instruction(task_name, prompt_type)

I also created a slightly more complicated version for jasper:

    if prompt_type == PromptType.passage and task.metadata.type == "s2p":
        instruction = None

(I will add this as a new sentence instruct wrapper in #1768.)

@KennethEnevoldsen
Contributor Author

I can't see that it is used in NV-Embed. Am I missing something?

I think s2p is still relevant because some models don’t use prompts for passages, and it can be helpful to differentiate between s2s and s2p.

I reviewed #1768 and I am not quite sure why s2s or s2p is required here. I read the model card for jasper but couldn't find any such case.

I might be missing something, but queries and passages can then be disambiguated by the prompt. As I understand it, the p in s2p stands for paragraph, not passage.

@isaac-chung
Collaborator

Agree with Kenneth above, and p does mean paragraph in the paper. Based on the discussion in #1768 and here, I feel the best course of action for us now is to:

  1. continue relying on PromptType to determine prompts (and not category)
  2. Move forward with the change proposed in this comment
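As a rough illustration of point 1, the sketch below selects an instruction from the prompt type alone and never consults the task category. The `PromptType` enum here is a local stand-in so the example is self-contained, and `INSTRUCTIONS`, `select_instruction`, and the task name are hypothetical.

```python
from enum import Enum
from typing import Optional


class PromptType(str, Enum):
    # Local stand-in for the PromptType referenced in the thread.
    query = "query"
    passage = "passage"


# Hypothetical per-task instructions, for illustration only.
INSTRUCTIONS = {
    "ExampleRetrieval": "Given a question, retrieve passages that answer it",
}


def select_instruction(task_name: str, prompt_type: Optional[PromptType]) -> str:
    """Return an instruction for queries only; the category is not consulted."""
    if prompt_type == PromptType.query:
        return INSTRUCTIONS.get(task_name, "")
    # Passages (and calls without a prompt type) get no instruction.
    return ""


print(repr(select_instruction("ExampleRetrieval", PromptType.query)))
print(repr(select_instruction("ExampleRetrieval", PromptType.passage)))
```

Because the decision depends only on `prompt_type`, changing the category from a length descriptor to a modality would not affect prompt selection in a wrapper written this way.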
