Add GPT-4V as evaluator #276

drcege · 2024-03-22T13:51:54Z

Initial version that enriches the multimodal evaluation features by using the GPT-4V API to assess models.
Further testing and refinement are welcome.

drcege · 2024-03-25T12:29:51Z

@HYLcool Tested and improved with @zhijianma

drcege · 2024-03-25T12:39:05Z

Maybe postpone the merge until the sandbox builds the pipeline.

BeachWang · 2024-03-26T06:25:09Z

tools/mm_eval/gpt4v/README_ZH.md

+Taking the image-to-text generation task as an example, each JSON object should include the `image` and `text` keys. A sample input file looks like this:
+
+```JSON
+{"image": "/path/to/image0", "text": "generated caption"}


Should this stay consistent with data-juicer's JSONL structure? It would probably be better for the whole sandbox pipeline to use a single data structure.

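Both are JSONL. Do you mean the keys differ? That will need minor adjustment once the sandbox settles on a format.
The text/image/video/audio keys in DJ are not hard-coded at the moment either; they can be specified by passing parameters such as text_key / image_key / ...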

There are two related questions here:

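@HYLcool The default values of image_key / video_key / audio_key currently use the plurals images/videos/audios and seem to always be defined as lists. In evaluation scenarios, we usually generate one image/video from an input prompt, or generate a caption from a given image/video, so each test sample should have only one image/video output. Does it always have to be wrapped in a list? If that is the right reading, it looks rather cumbersome; I lean toward changing the default keys to the singular, denoting only the category/modality, and allowing either a single element or a list.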

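@BeachWang I have also implemented a pairwise comparison evaluation method that compares two outputs for the same input (a head-to-head match). For example, the text-to-image task needs three keys: text, image_0, and image_1. This necessarily differs from DJ's default output structure, so users are expected to construct it themselves.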

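Could this take two JSON files as input instead, with the samples kept in the same order? That would keep it consistent with the DJ format. I think the sandbox needs to settle on a unified data format first @HYLcool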
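A minimal sketch of the two-file idea, assuming each model's outputs are stored in their own JSONL file with samples in the same order. The helper names `parse_jsonl` and `merge_pairwise` and the `input_key` / `output_key` parameters are hypothetical and not part of this PR; the merged records use the `text`, `image_0`, `image_1` layout described for the pairwise comparison.

```python
import json


def parse_jsonl(text):
    """Parse JSONL text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def merge_pairwise(records_a, records_b, input_key="text", output_key="image"):
    """Zip two order-aligned record lists into pairwise-comparison records.

    Hypothetical helper: each merged record keeps the shared input and
    stores the two candidate outputs under `<output_key>_0` / `<output_key>_1`.
    """
    if len(records_a) != len(records_b):
        raise ValueError("both files must contain the same number of samples")
    merged = []
    for idx, (a, b) in enumerate(zip(records_a, records_b)):
        # Relying on order alone is fragile, so check that the inputs match.
        if a[input_key] != b[input_key]:
            raise ValueError(f"sample {idx}: inputs differ, files are not aligned")
        merged.append({
            input_key: a[input_key],
            f"{output_key}_0": a[output_key],
            f"{output_key}_1": b[output_key],
        })
    return merged
```

The alignment check is a design choice of this sketch: it turns a silent ordering mistake between the two files into an explicit error.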

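> @HYLcool The default values of image_key / video_key / audio_key currently use the plurals images/videos/audios and seem to always be defined as lists. In evaluation scenarios, we usually generate one image/video from an input prompt, or generate a caption from a given image/video, so each test sample should have only one image/video output. Does it always have to be wrapped in a list? If that is the right reading, it looks rather cumbersome; I lean toward changing the default keys to the singular, denoting only the category/modality, and allowing either a single element or a list.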

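The main issue is that if a dataset contains both single elements and lists, that column is treated as having mismatched types and cannot be loaded correctly, so we chose lists at the time to accommodate these different cases. Most datasets (including evaluation datasets) do usually contain only one multimodal item per sample, but judging from the dataset composition in some recent MLLM work, a single sample can also contain multiple multimodal items.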
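One way to reconcile both views is to accept a single element or a list from users and normalize to a list before loading, so the column always has one type. The sketch below is only an illustration under that assumption; `normalize_modality_field` is a hypothetical helper, not an existing data-juicer function.

```python
def normalize_modality_field(sample, key="images"):
    """Return a copy of `sample` whose `key` field is always a list.

    Hypothetical helper: a missing or None value becomes an empty list,
    a single element is wrapped, and an existing list/tuple is copied.
    """
    value = sample.get(key)
    if value is None:
        items = []
    elif isinstance(value, (list, tuple)):
        items = list(value)
    else:
        items = [value]
    # Build a new dict so the caller's sample is left untouched.
    return {**sample, key: items}
```

With this step applied to every sample, a dataset that mixes `"images": "/path/a.png"` and `"images": ["/path/b.png", "/path/c.png"]` ends up with a uniformly list-typed column.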

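> Could this take two JSON files as input instead, with the samples kept in the same order? That would keep it consistent with the DJ format. I think the sandbox needs to settle on a unified data format first @HYLcool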

@yxdyc Please take a look at this.

BeachWang · 2024-03-26T07:50:04Z

tools/mm_eval/gpt4v/compare.py

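The compare function should randomize positions: for example, randomly swap text0 and text1, record the winner, then map it back to the original order. Prior work has shown that LLMs are biased by order, so we should ensure E(eval(texts0, texts1)) = E(eval(texts1, texts0)).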
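The suggestion can be sketched as follows. This is not the code in `compare.py`: `compare_debiased` and the `judge` callable are hypothetical, and `judge` is assumed to return `0` or `1` for the position of the preferred output, or `None` for a tie.

```python
import random


def compare_debiased(judge, prompt, output_0, output_1, rng=random):
    """Run one pairwise comparison with a randomized presentation order.

    `judge(prompt, first, second)` is an assumed interface returning 0 or 1
    for the winning position, or None for a tie. The returned index always
    refers to the original order (0 -> output_0, 1 -> output_1).
    """
    swapped = rng.random() < 0.5
    first, second = (output_1, output_0) if swapped else (output_0, output_1)
    verdict = judge(prompt, first, second)
    if verdict is None or not swapped:
        return verdict
    # The judge saw the outputs swapped, so map its answer back.
    return 1 - verdict
```

Because each order is used with probability 0.5, a judge that only favors one position credits each output equally often in expectation, while a judge that decides on content returns the same winner in either order.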

github-actions · 2024-04-17T09:31:57Z

This PR is marked as stale because there has been no activity for 21 days. Remove stale label or add new comments or this PR will be closed in 3 day.

github-actions · 2024-04-20T09:31:59Z

Close this stale PR.

yxdyc

LGTM. Plz implement GPT-4V Evaluator accordingly in sandbox later

github-actions · 2024-05-14T09:31:58Z

This PR is marked as stale because there has been no activity for 21 days. Remove stale label or add new comments or this PR will be closed in 3 day.

github-actions · 2024-05-18T09:31:59Z

Close this stale PR.

drcege added 11 commits March 20, 2024 15:48

init gpt4v eval libs

ba367b0

add base64 tool

c3b58bf

add compare_prompt_to_frames

ed5a3d9

refine prompts

202b2f6

organize

d060b71

Merge branch 'main' into eval/gpt4v

02e930b

wrap grade input/output

2b5a2dc

reorganize & update prompt

85c3f72

add docs

c4de2ce

fix config

1a7e62a

minor

c6563db

drcege added the enhancement (New feature or request) and dj:multimodal (issues/PRs about multimodal data processing) labels Mar 22, 2024

drcege added this to the DJ-SORA milestone Mar 22, 2024

drcege requested review from chenhesen, BeachWang, zhijianma and yxdyc March 22, 2024 13:51

drcege self-assigned this Mar 22, 2024

zhijianma and others added 2 commits March 25, 2024 11:01

add json dump

1fb2136

Improve documentation

ab37011

drcege requested a review from HYLcool March 25, 2024 12:30

Merge branch 'main' into eval/gpt4v

ac69094

refine eval output

7676e28

BeachWang reviewed Mar 26, 2024

github-actions bot added the stale-pr label Apr 17, 2024

github-actions bot closed this Apr 20, 2024

HYLcool reopened this Apr 22, 2024

HYLcool removed the stale-pr label Apr 22, 2024

yxdyc approved these changes Apr 22, 2024

github-actions bot added the stale-pr label May 14, 2024

github-actions bot closed this May 18, 2024

HYLcool reopened this Oct 31, 2024

BeachWang Mar 26, 2024

drcege Mar 26, 2024 (edited)

BeachWang Mar 27, 2024

HYLcool Mar 27, 2024

HYLcool Mar 27, 2024

BeachWang Mar 26, 2024
