Hello there, I was examining the RAG guide:
https://github.com/anthropics/anthropic-cookbook/blob/main/skills/retrieval_augmented_generation/guide.ipynb
In particular, I was trying to understand the evaluation logic and the data the guide uses for evaluation purposes.
As I continued my examination, I noticed some problems with the labeling. Here is an example:
An Example of False Positive Labeling
Eval Item
{
  "id": "efc09699",
  "question": "How can you create multiple test cases for an evaluation in the Anthropic Evaluation tool?",
  "correct_chunks": [
    "https://docs.anthropic.com/en/docs/test-and-evaluate/eval-tool#creating-test-cases",
    "https://docs.anthropic.com/en/docs/build-with-claude/develop-tests#building-evals-and-test-cases"
  ],
  "correct_answer": "To create multiple test cases in the Anthropic Evaluation tool, click the 'Add Test Case' button, fill in values for each variable in your prompt, and repeat the process to create additional test case scenarios."
}
Corresponding Positively Labeled Document
{
  "chunk_link": "https://docs.anthropic.com/en/docs/build-with-claude/develop-tests#building-evals-and-test-cases",
  "chunk_heading": "Building evals and test cases",
  "text": "Building evals and test cases\n\n\n"
}
Problem
As you can see, although this positively labeled document shares some keywords with the question, it does not actually answer it: its text is nothing but the heading "Building evals and test cases". I am afraid there are other similar examples. I believe this affects both the retrieval performance evaluation and the end-to-end performance evaluation.
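To make the concern concrete, here is a small sketch of how one could scan the dataset for positive labels that point to heading-only chunks like the one above. The file paths, field names, and length threshold are my assumptions based on the snippets in this issue, not necessarily what the guide actually uses:

import json

# Assumed layout: a list of eval items like the one above, plus a list of chunk
# records with "chunk_link", "chunk_heading", and "text" fields (hypothetical paths).
with open("docs_evaluation_dataset.json") as f:
    eval_items = json.load(f)
with open("anthropic_docs.json") as f:
    chunks = {c["chunk_link"]: c for c in json.load(f)}

suspicious = []
for item in eval_items:
    for link in item["correct_chunks"]:
        chunk = chunks.get(link)
        if chunk is None:
            continue
        # Drop the heading from the chunk body; if almost nothing is left,
        # the chunk cannot really support an answer to the question.
        body = chunk["text"].replace(chunk["chunk_heading"], "", 1).strip()
        if len(body) < 50:  # arbitrary cut-off for "heading-only" chunks
            suspicious.append((item["id"], link))

for item_id, link in suspicious:
    print(f"{item_id}: positive label {link} has (almost) no content")

Running something like this over the whole dataset would give a rough count of how widespread the issue is.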
My Question
May I ask how exactly this eval dataset was obtained? I first assumed it was human generated, but perhaps it is not? I am also wondering whether there can be "false negative" labels, meaning documents that are actually relevant to the question but not labeled as positive (a toy illustration of why this would matter is sketched below).
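For context on why false negatives would worry me: under plain set-based precision/recall, a retriever that returns a genuinely relevant but unlabeled chunk is penalized exactly as if it had returned noise. A toy illustration (the metric definitions here are the standard set-based ones, not necessarily the exact formulas used in the guide):

# Toy illustration with standard set-based precision/recall over chunk links.
labeled_correct = {"link_A"}          # dataset's positive labels
retrieved = {"link_A", "link_B"}      # suppose link_B is relevant but unlabeled

precision = len(retrieved & labeled_correct) / len(retrieved)      # 0.5
recall = len(retrieved & labeled_correct) / len(labeled_correct)   # 1.0
print(precision, recall)

# If link_B were correctly labeled, precision would be 1.0 instead of 0.5,
# so a genuinely better retriever can score worse under incomplete labels.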
Thank you for your response in advance. :)

FYI, further examination showed me that, in addition to false positive labels, there are false negative labels (passages that are actually related to the question but not labeled as related) as well. So I have serious doubts about whether improving scores on this eval set actually means a better retrieval method.