Hello there, I was examining the RAG guide:
https://github.com/anthropics/anthropic-cookbook/blob/main/skills/retrieval_augmented_generation/guide.ipynb
In particular, I was trying to understand the evaluation logic and the data the guide uses for evaluation purposes.
As I continued my examination, I noticed some problems with the labeling. Here is an example:
An Example of False Positive Labeling
Eval Item
{
  "id": "efc09699",
  "question": "How can you create multiple test cases for an evaluation in the Anthropic Evaluation tool?",
  "correct_chunks": [
    "https://docs.anthropic.com/en/docs/test-and-evaluate/eval-tool#creating-test-cases",
    "https://docs.anthropic.com/en/docs/build-with-claude/develop-tests#building-evals-and-test-cases"
  ],
  "correct_answer": "To create multiple test cases in the Anthropic Evaluation tool, click the 'Add Test Case' button, fill in values for each variable in your prompt, and repeat the process to create additional test case scenarios."
}
Corresponding Positively Labeled Document
{
  "chunk_link": "https://docs.anthropic.com/en/docs/build-with-claude/develop-tests#building-evals-and-test-cases",
  "chunk_heading": "Building evals and test cases",
  "text": "Building evals and test cases\n\n\n"
}
Problem
As you can see, although this positively labeled document shares some keywords with the question, it does not actually answer it: its text is nothing but the heading "Building evals and test cases". I am afraid there are other similar examples. I believe this affects both the retrieval performance evaluation and the end-to-end performance evaluation.
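To make the concern concrete, here is a small sketch of how one could scan the dataset for positive labels that point to heading-only chunks like the one above. The file paths, field names, and length threshold are my assumptions based on the snippets in this issue, not necessarily what the guide actually uses:

import json

# Assumed layout: a list of eval items like the one above, plus a list of chunk
# records with "chunk_link", "chunk_heading", and "text" fields (hypothetical paths).
with open("docs_evaluation_dataset.json") as f:
    eval_items = json.load(f)
with open("anthropic_docs.json") as f:
    chunks = {c["chunk_link"]: c for c in json.load(f)}

suspicious = []
for item in eval_items:
    for link in item["correct_chunks"]:
        chunk = chunks.get(link)
        if chunk is None:
            continue
        # Drop the heading from the chunk body; if almost nothing is left,
        # the chunk cannot really support an answer to the question.
        body = chunk["text"].replace(chunk["chunk_heading"], "", 1).strip()
        if len(body) < 50:  # arbitrary cut-off for "heading-only" chunks
            suspicious.append((item["id"], link))

for item_id, link in suspicious:
    print(f"{item_id}: positive label {link} has (almost) no content")

Running something like this over the whole dataset would give a rough count of how widespread the issue is.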
My Question
May I ask how exactly this eval dataset was obtained? I first assumed it was human generated, but perhaps it is not? I am also wondering whether there can be "false negative" labels, meaning documents that are actually relevant to the question but not labeled as positive (a toy illustration of why this would matter is sketched below).
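For context on why false negatives would worry me: under plain set-based precision/recall, a retriever that returns a genuinely relevant but unlabeled chunk is penalized exactly as if it had returned noise. A toy illustration (the metric definitions here are the standard set-based ones, not necessarily the exact formulas used in the guide):

# Toy illustration with standard set-based precision/recall over chunk links.
labeled_correct = {"link_A"}          # dataset's positive labels
retrieved = {"link_A", "link_B"}      # suppose link_B is relevant but unlabeled

precision = len(retrieved & labeled_correct) / len(retrieved)      # 0.5
recall = len(retrieved & labeled_correct) / len(labeled_correct)   # 1.0
print(precision, recall)

# If link_B were correctly labeled, precision would be 1.0 instead of 0.5,
# so a genuinely better retriever can score worse under incomplete labels.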
Thank you for your response in advance. :)

FYI, further examination showed me that, in addition to false positive labels, there are false negative labels (passages that are actually related to the question but not labeled as related) as well. So I have serious doubts about whether improving scores on this eval set actually means a better retrieval method.