
Add gpt4facts Eval #1363

Open · mmtmn wants to merge 6 commits into main

Conversation

@mmtmn (Contributor) commented Sep 25, 2023

Eval details 📑

Eval name

gpt4facts

Eval description

Evaluate the model's ability to recall and provide accurate facts about GPT-4.

What makes this a useful eval?

2309240522114RI6KBA7_gpt-3.5-turbo_gpt4facts.jsonl
[2023-09-23 23:24:29,221] [oaieval.py:245] Final report:
[2023-09-23 23:24:29,222] [oaieval.py:247] counts/B: 48
[2023-09-23 23:24:29,222] [oaieval.py:247] counts/D: 30
[2023-09-23 23:24:29,222] [oaieval.py:247] counts/A: 24

The above results were produced using fact.yaml, a factual-consistency model-graded eval which, given a completion a and a reference answer b, returns:

  • "A" if a $\subseteq$ b, i.e., the submitted answer is a subset of the expert answer and is fully consistent with it.
  • "B" if a $\supseteq$ b, i.e., the submitted answer is a superset of the expert answer and is fully consistent with it.
  • "C" if a $=$ b, i.e., the submitted answer contains all the same details as the expert answer.
  • "D" if a $\neq$ b, i.e., there is a disagreement between the submitted answer and the expert answer.
  • "E" if a $\approx$ b, i.e., the answers differ, but these differences don't matter from the perspective of factuality.

There was also interest in this eval in its previous review. I was originally using the basic Match eval; model-graded evaluation was requested, so this version uses the model-graded fact eval instead.
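
As a rough sketch of that switch, a registry entry wired to the model-graded fact spec conventionally looks like the following; the id, version suffix, and file paths here are illustrative assumptions, not necessarily the exact contents of this PR:

# evals/registry/evals/gpt4facts.yaml (illustrative sketch only)
gpt4facts:
  id: gpt4facts.dev.v0
  description: Evaluate the model's ability to recall and provide accurate facts about GPT-4.
  metrics: [accuracy]
gpt4facts.dev.v0:
  class: evals.elsuite.modelgraded.classify:ModelBasedClassify
  args:
    samples_jsonl: gpt4facts/samples.jsonl   # one JSON object per line, as in the samples below
    eval_type: cot_classify                  # reason step by step, then pick one of A-E
    modelgraded_spec: fact                   # the factual-consistency rubric described above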

Criteria for a good eval ✅

  • Thematically consistent: The eval should be thematically consistent. We'd like to see a number of prompts all demonstrating some particular failure mode. For example, we can create an eval on cases where the model fails to reason about the physical world.
  • Contains failures where a human can do the task, but either GPT-4 or GPT-3.5-Turbo could not.
  • Includes good signal around what is the right behavior. This means either a correct answer for Basic evals or the Fact Model-graded eval, or an exhaustive rubric for evaluating answers for the Criteria Model-graded eval.
  • Include at least 15 high-quality examples.

Eval structure 🏗️

  • Check that your data is in evals/registry/data/{name}
  • Check that your YAML is registered at evals/registry/evals/{name}.yaml
  • Ensure you have the right to use the data you submit via this eval

Final checklist 👀

Submission agreement

By contributing to Evals, you are agreeing to make your evaluation logic and data under the same MIT license as this repository. You must have adequate rights to upload any data used in an Eval. OpenAI reserves the right to use this data in future service improvements to our product. Contributions to OpenAI Evals will be subject to our usual Usage Policies (https://platform.openai.com/docs/usage-policies).

  • I agree that my submission will be made available under an MIT license and complies with OpenAI's usage policies.

Email address validation

If your submission is accepted, we will be granting GPT-4 access to a limited number of contributors. Access will be given to the email address associated with the commits on the merged pull request.

  • I acknowledge that GPT-4 access will only be granted, if applicable, to the email address used for my merged pull request.

Limited availability acknowledgment

We know that you might be excited to contribute to OpenAI's mission, help improve our models, and gain access to GPT-4. However, due to the requirements mentioned above and the high volume of submissions, we will not be able to accept all submissions and thus not grant everyone who opens a PR GPT-4 access. We know this is disappointing, but we hope to set the right expectation before you open this PR.

  • I understand that opening a PR, even if it meets the requirements above, does not guarantee the PR will be merged nor GPT-4 access be granted.

Submit eval

  • I have filled out all required fields of this form
  • I have used Git LFS for the Eval JSON data
  • (Ignore if not submitting code) I have run pip install pre-commit; pre-commit install and have verified that mypy, black, isort, and autoflake are running when I commit and push

Failure to fill out all required fields will result in the PR being closed.

Eval JSON data

Since we are using Git LFS, we ask eval submitters to include a representative set of Eval Samples (at least 5) from their contribution here:

Eval

{"input": [{"role": "system", "content": "The user will ask you a question about gpt-4, please respond to the best of your abilities."}, {"role": "user", "content": "Why was reinforcement learning from human feedback used in GPT-4's fine-tuning?"}], "ideal": "Reinforcement learning from human feedback was used in GPT-4's fine-tuning to improve the model's performance and align it better with human values and expectations."}
{"input": [{"role": "system", "content": "The user will ask you a question about gpt-4, please respond to the best of your abilities."}, {"role": "user", "content": "What is the purpose of GPT-4's multimodal capabilities?"}], "ideal": "GPT-4's multimodal capabilities enable it to understand and generate responses based on a variety of inputs, including images and text, providing more versatile and accurate results in different contexts."}
{"input": [{"role": "system", "content": "The user will ask you a question about gpt-4, please respond to the best of your abilities."}, {"role": "user", "content": "How does GPT-4's performance compare in terms of passing standardized tests and bar exams?"}], "ideal": "GPT-4 has shown the ability to pass a bar exam and several standardized tests, demonstrating its improved capabilities compared to previous models."}
{"input": [{"role": "system", "content": "The user will ask you a question about gpt-4, please respond to the best of your abilities."}, {"role": "user", "content": "Did OpenAI provide detailed technical information about GPT-4?"}], "ideal": "OpenAI adopted a closed approach regarding GPT-4's technical details, refraining from specifying the model size, architecture, hardware, or training method due to the competitive landscape and safety implications of large-scale models."}
{"input": [{"role": "system", "content": "The user will ask you a question about gpt-4, please respond to the best of your abilities."}, {"role": "user", "content": "What are the rumored parameter counts for GPT-4 compared to GPT-3?"}], "ideal": "Rumors suggested that GPT-4 would substantially increase the parameter count from GPT-3's 175 billion to 100 trillion, but OpenAI CEO Sam Altman described these rumors as 'complete bullshit'."}

The description was outdated; it has now been updated.
@usama-openai (Collaborator) commented:

Updated version of #255.

@usama-openai (Collaborator) left a comment:

Thanks for submitting this eval! This PR looks good. I'm approving this PR.

@mmtmn (Contributor, Author) commented Nov 9, 2023:

I suppose this needs a samples.jsonl update after Dev Day.

@logankilpatrick removed their request for review on January 3, 2024 at 16:39.