
[R-259] Which is the best LLM for evaluation? #981

Closed · yadavshashank opened this issue May 21, 2024 · 1 comment
Labels: answered 🤖 (The question has been answered. Will be closed automatically if no new comments) · linear (Created by Linear-GitHub Sync) · question (Further information is requested)
Milestone: v.5

yadavshashank commented May 21, 2024

I checked the documentation and related resources and couldn't find an answer to my question.

Your Question
Do the Ragas prompts work equally well with other LLMs such as Claude 3 Sonnet and Llama 3? If not, which model should I choose?
Also, is there a way to print and modify the prompts?
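
For context, here is roughly how I swap in a non-default judge and dump a metric's prompts (a minimal sketch based on my reading of the Ragas 0.1.x docs; the `LangchainLLMWrapper` usage and the prompt attribute scan are assumptions, so names may differ in your version):

```python
# Minimal sketch (Ragas 0.1.x, assumed API): use a non-OpenAI judge and
# print whatever prompt objects a metric carries.
from datasets import Dataset
from langchain_anthropic import ChatAnthropic
from ragas import evaluate
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import faithfulness

# Any LangChain chat model can serve as the judge (Claude 3 Sonnet here).
judge = LangchainLLMWrapper(ChatAnthropic(model="claude-3-sonnet-20240229"))

# Tiny toy dataset in the column layout Ragas 0.1.x expects.
data = Dataset.from_dict({
    "question": ["What is the capital of France?"],
    "answer": ["Paris is the capital of France."],
    "contexts": [["Paris has been the capital of France since 987 CE."]],
    "ground_truth": ["Paris"],
})

result = evaluate(data, metrics=[faithfulness], llm=judge)
print(result)

# Print whatever prompt-like attributes the metric carries; exact attribute
# names vary across releases, so scan rather than hard-code them. My
# understanding is these are Prompt objects that can be edited before
# calling evaluate() again.
for name, value in vars(faithfulness).items():
    if "prompt" in name.lower():
        print(f"--- faithfulness.{name} ---\n{value}\n")
```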

Additional context

  • I see large variation in scores across the different models.
  • GPT-3.5 gives me higher scores than the other models on every metric; Claude 3 Sonnet's scores fall roughly midway between Llama 3's and GPT-4 Turbo's. Llama 3 and Cohere Command often return NaN for some metrics (a sketch for surfacing those records follows this list).
  • My evaluation set has 19 records.
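
To show what I mean by NaN output, this is how I surface the failing records (`result.to_pandas()` is the export Ragas documents; the metric column names are whichever metrics were run, so treat them as placeholders):

```python
# Surface records where the judge returned NaN for any metric.
# `result` is the object returned by ragas.evaluate() for one judge model.
df = result.to_pandas()  # one row per record, one column per metric

metric_cols = ["faithfulness", "answer_relevancy"]  # match the metrics you ran
nan_rows = df[df[metric_cols].isna().any(axis=1)]

print(f"{len(nan_rows)} of {len(df)} records have at least one NaN score")
print(nan_rows[["question", *metric_cols]])
```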

[Figure: ragas_radar_model_comp — radar chart comparing metric scores across the evaluated models]

R-259

shahules786 (Member) commented May 27, 2024

Hey @yadavshashank, this is a very interesting analysis, and a problem we have been thinking about as well. Fundamentally, what matters is which of these LLMs gives you the best alignment between automated scores and your own manual scores. We can address that by adding some form of UI component before automated scoring, so developers can do a round of manual checking and confirm the scores align with their judgment. We will be working in this direction, but that will come later.
As of today, my best recommendation is to go with the best overall-performing LLMs: GPT-4 or Claude among closed-source models, and Llama 3 among open-source ones.
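
Until that UI lands, a quick proxy is to hand-score a handful of records and check which judge correlates best with your own labels. A minimal sketch of that idea (all numbers below are made-up placeholders, not real results; you would collect the manual labels and per-judge scores yourself):

```python
# Toy alignment check: correlate each judge's automated faithfulness scores
# with manual labels. Every number here is a placeholder for illustration.
import numpy as np

human = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1], dtype=float)  # your manual labels

judge_scores = {  # hypothetical per-record scores from each candidate judge
    "gpt-4-turbo": np.array([0.9, 1.0, 0.2, 0.8, 0.1, 0.9, 1.0, 0.3, 0.7, 0.9]),
    "claude-3-sonnet": np.array([0.8, 0.9, 0.4, 0.9, 0.3, 0.8, 0.9, 0.4, 0.8, 0.8]),
}

for model, scores in judge_scores.items():
    mask = ~np.isnan(scores)  # drop records the judge failed on (NaN)
    r = np.corrcoef(human[mask], scores[mask])[0, 1]
    print(f"{model}: Pearson r vs manual labels = {r:.2f}")
```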

I'd love to hop on a call to chat and help if you're open to it. My cal is here.
