
Add gpt4facts Eval #1363

Open · mmtmn wants to merge 6 commits into main

Conversation

@mmtmn (Contributor) commented Sep 25, 2023

Eval details 📑

Eval name

gpt4facts

Eval description

Evaluate the model's ability to recall and provide accurate facts about GPT-4.

What makes this a useful eval?

2309240522114RI6KBA7_gpt-3.5-turbo_gpt4facts.jsonl
[2023-09-23 23:24:29,221] [oaieval.py:245] Final report:
[2023-09-23 23:24:29,222] [oaieval.py:247] counts/B: 48
[2023-09-23 23:24:29,222] [oaieval.py:247] counts/D: 30
[2023-09-23 23:24:29,222] [oaieval.py:247] counts/A: 24

The above results were produced using fact.yaml, a factual-consistency model-graded eval which, given a completion a and a reference answer b, returns:

  • "A" if a $\subseteq$ b, i.e., the submitted answer is a subset of the expert answer and is fully consistent with it.
  • "B" if a $\supseteq$ b, i.e., the submitted answer is a superset of the expert answer and is fully consistent with it.
  • "C" if a $=$ b, i.e., the submitted answer contains all the same details as the expert answer.
  • "D" if a $\neq$ b, i.e., there is a disagreement between the submitted answer and the expert answer.
  • "E" if a $\approx$ b, i.e., the answers differ, but these differences don't matter from the perspective of factuality.

There was also interest in this eval in its previous review. I was originally using the basic Match eval; model-graded evaluation was requested, so this version uses the model-graded fact eval instead.
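
As a rough sketch of that switch, a registry entry wired to the model-graded fact spec conventionally looks like the following; the id, version suffix, and file paths here are illustrative assumptions, not necessarily the exact contents of this PR:

# evals/registry/evals/gpt4facts.yaml (illustrative sketch only)
gpt4facts:
  id: gpt4facts.dev.v0
  description: Evaluate the model's ability to recall and provide accurate facts about GPT-4.
  metrics: [accuracy]
gpt4facts.dev.v0:
  class: evals.elsuite.modelgraded.classify:ModelBasedClassify
  args:
    samples_jsonl: gpt4facts/samples.jsonl   # one JSON object per line, as in the samples below
    eval_type: cot_classify                  # reason step by step, then pick one of A-E
    modelgraded_spec: fact                   # the factual-consistency rubric described above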

Criteria for a good eval ✅

  • Thematically consistent: The eval should be thematically consistent. We'd like to see a number of prompts all demonstrating some particular failure mode. For example, we can create an eval on cases where the model fails to reason about the physical world.
  • Contains failures where a human can do the task, but either GPT-4 or GPT-3.5-Turbo could not.
  • Includes good signal around what is the right behavior. This means either a correct answer for Basic evals or the Fact Model-graded eval, or an exhaustive rubric for evaluating answers for the Criteria Model-graded eval.
  • Include at least 15 high-quality examples.

Eval structure 🏗️

  • Check that your data is in evals/registry/data/{name}
  • Check that your YAML is registered at evals/registry/evals/{name}.yaml
  • Ensure you have the right to use the data you submit via this eval

Final checklist 👀

Submission agreement

By contributing to Evals, you are agreeing to make your evaluation logic and data under the same MIT license as this repository. You must have adequate rights to upload any data used in an Eval. OpenAI reserves the right to use this data in future service improvements to our product. Contributions to OpenAI Evals will be subject to our usual Usage Policies (https://platform.openai.com/docs/usage-policies).

  • I agree that my submission will be made available under an MIT license and complies with OpenAI's usage policies.

Email address validation

If your submission is accepted, we will be granting GPT-4 access to a limited number of contributors. Access will be given to the email address associated with the commits on the merged pull request.

  • I acknowledge that GPT-4 access will only be granted, if applicable, to the email address used for my merged pull request.

Limited availability acknowledgment

We know that you might be excited to contribute to OpenAI's mission, help improve our models, and gain access to GPT-4. However, due to the requirements mentioned above and the high volume of submissions, we will not be able to accept all submissions and thus not grant everyone who opens a PR GPT-4 access. We know this is disappointing, but we hope to set the right expectation before you open this PR.

  • I understand that opening a PR, even if it meets the requirements above, does not guarantee the PR will be merged nor GPT-4 access be granted.

Submit eval

  • I have filled out all required fields of this form
  • I have used Git LFS for the Eval JSON data
  • (Ignore if not submitting code) I have run pip install pre-commit; pre-commit install and have verified that mypy, black, isort, and autoflake are running when I commit and push

Failure to fill out all required fields will result in the PR being closed.

Eval JSON data

Since we are using Git LFS, we ask eval submitters to include a representative set of Eval Samples (at least 5) from their contribution here:

Eval

{"input": [{"role": "system", "content": "The user will ask you a question about gpt-4, please respond to the best of your abilities."}, {"role": "user", "content": "Why was reinforcement learning from human feedback used in GPT-4's fine-tuning?"}], "ideal": "Reinforcement learning from human feedback was used in GPT-4's fine-tuning to improve the model's performance and align it better with human values and expectations."}
{"input": [{"role": "system", "content": "The user will ask you a question about gpt-4, please respond to the best of your abilities."}, {"role": "user", "content": "What is the purpose of GPT-4's multimodal capabilities?"}], "ideal": "GPT-4's multimodal capabilities enable it to understand and generate responses based on a variety of inputs, including images and text, providing more versatile and accurate results in different contexts."}
{"input": [{"role": "system", "content": "The user will ask you a question about gpt-4, please respond to the best of your abilities."}, {"role": "user", "content": "How does GPT-4's performance compare in terms of passing standardized tests and bar exams?"}], "ideal": "GPT-4 has shown the ability to pass a bar exam and several standardized tests, demonstrating its improved capabilities compared to previous models."}
{"input": [{"role": "system", "content": "The user will ask you a question about gpt-4, please respond to the best of your abilities."}, {"role": "user", "content": "Did OpenAI provide detailed technical information about GPT-4?"}], "ideal": "OpenAI adopted a closed approach regarding GPT-4's technical details, refraining from specifying the model size, architecture, hardware, or training method due to the competitive landscape and safety implications of large-scale models."}
{"input": [{"role": "system", "content": "The user will ask you a question about gpt-4, please respond to the best of your abilities."}, {"role": "user", "content": "What are the rumored parameter counts for GPT-4 compared to GPT-3?"}], "ideal": "Rumors suggested that GPT-4 would substantially increase the parameter count from GPT-3's 175 billion to 100 trillion, but OpenAI CEO Sam Altman described these rumors as 'complete bullshit'."}

The description was outdated; it has now been updated.
@usama-openai (Collaborator) commented:

Updated version of #255.

@usama-openai (Collaborator) left a comment:

Thanks for submitting this eval! This PR looks good. I'm approving this PR.

@mmtmn (Contributor, Author) commented Nov 9, 2023:

I suppose this needs a samples.jsonl update after Dev Day.

@logankilpatrick removed their request for review on January 3, 2024 at 16:39.