
[R-259] Which is the best LLM for evaluation? #981

Closed · yadavshashank opened this issue May 21, 2024 · 1 comment
Labels: answered 🤖 (The question has been answered. Will be closed automatically if no new comments) · linear (Created by Linear-GitHub Sync) · question (Further information is requested)
Milestone: v.5

yadavshashank commented May 21, 2024

I checked the documentation and related resources and couldn't find an answer to my question.

Your Question
Do the Ragas prompts work equally well with other LLMs such as Claude 3 Sonnet and Llama 3? If not, which model should I choose?
Also, is there a way to print and modify the prompts?
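
For context, here is roughly how I swap in a non-default judge and dump a metric's prompts (a minimal sketch based on my reading of the Ragas 0.1.x docs; the `LangchainLLMWrapper` usage and the prompt attribute scan are assumptions, so names may differ in your version):

```python
# Minimal sketch (Ragas 0.1.x, assumed API): use a non-OpenAI judge and
# print whatever prompt objects a metric carries.
from datasets import Dataset
from langchain_anthropic import ChatAnthropic
from ragas import evaluate
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import faithfulness

# Any LangChain chat model can serve as the judge (Claude 3 Sonnet here).
judge = LangchainLLMWrapper(ChatAnthropic(model="claude-3-sonnet-20240229"))

# Tiny toy dataset in the column layout Ragas 0.1.x expects.
data = Dataset.from_dict({
    "question": ["What is the capital of France?"],
    "answer": ["Paris is the capital of France."],
    "contexts": [["Paris has been the capital of France since 987 CE."]],
    "ground_truth": ["Paris"],
})

result = evaluate(data, metrics=[faithfulness], llm=judge)
print(result)

# Print whatever prompt-like attributes the metric carries; exact attribute
# names vary across releases, so scan rather than hard-code them. My
# understanding is these are Prompt objects that can be edited before
# calling evaluate() again.
for name, value in vars(faithfulness).items():
    if "prompt" in name.lower():
        print(f"--- faithfulness.{name} ---\n{value}\n")
```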

Additional context

  • I see large variation in scores across the different models.
  • GPT-3.5 gives me higher scores than the other models on every metric; Claude 3 Sonnet's scores fall roughly midway between Llama 3's and GPT-4 Turbo's. Llama 3 and Cohere Command often return NaN for some metrics (a sketch for surfacing those records follows this list).
  • My evaluation set has 19 records.
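
To show what I mean by NaN output, this is how I surface the failing records (`result.to_pandas()` is the export Ragas documents; the metric column names are whichever metrics were run, so treat them as placeholders):

```python
# Surface records where the judge returned NaN for any metric.
# `result` is the object returned by ragas.evaluate() for one judge model.
df = result.to_pandas()  # one row per record, one column per metric

metric_cols = ["faithfulness", "answer_relevancy"]  # match the metrics you ran
nan_rows = df[df[metric_cols].isna().any(axis=1)]

print(f"{len(nan_rows)} of {len(df)} records have at least one NaN score")
print(nan_rows[["question", *metric_cols]])
```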

[Figure: ragas_radar_model_comp — radar chart comparing metric scores across the evaluated models]

R-259

shahules786 (Member) commented May 27, 2024

Hey @yadavshashank, this is a very interesting analysis, and a problem we have been thinking about as well. Fundamentally, what matters is which of these LLMs gives you the best alignment between automated scores and your own manual scores. We can address that by adding some form of UI component before automated scoring, so developers can do a round of manual checking and confirm the scores align with their judgment. We will be working in this direction, but that will come later.
As of today, my best recommendation is to go with the best overall-performing LLMs: GPT-4 or Claude among closed-source models, and Llama 3 among open-source ones.
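
Until that UI lands, a quick proxy is to hand-score a handful of records and check which judge correlates best with your own labels. A minimal sketch of that idea (all numbers below are made-up placeholders, not real results; you would collect the manual labels and per-judge scores yourself):

```python
# Toy alignment check: correlate each judge's automated faithfulness scores
# with manual labels. Every number here is a placeholder for illustration.
import numpy as np

human = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1], dtype=float)  # your manual labels

judge_scores = {  # hypothetical per-record scores from each candidate judge
    "gpt-4-turbo": np.array([0.9, 1.0, 0.2, 0.8, 0.1, 0.9, 1.0, 0.3, 0.7, 0.9]),
    "claude-3-sonnet": np.array([0.8, 0.9, 0.4, 0.9, 0.3, 0.8, 0.9, 0.4, 0.8, 0.8]),
}

for model, scores in judge_scores.items():
    mask = ~np.isnan(scores)  # drop records the judge failed on (NaN)
    r = np.corrcoef(human[mask], scores[mask])[0, 1]
    print(f"{model}: Pearson r vs manual labels = {r:.2f}")
```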

I'd love to hop on a call to chat and help if you're open to it. My cal is here.
