Does InternVL support multi-image interleaved conversations: #153

Open
irexyc opened this issue May 8, 2024 · 7 comments

irexyc commented May 8, 2024

According to the demo code in the README, the images are passed in the first round of the chat, and the image tokens are placed in front of the question.

# Demo code from the README.

# multi-round multi-image conversation
pixel_values1 = load_image('./examples/image1.jpg', max_num=6).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=6).to(torch.bfloat16).cuda()
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)

question = "详细描述这两张图片" # Describe the two pictures in detail
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
print(question, response)

question = "这两张图片的相同点和区别分别是什么" # What are the similarities and differences between these two pictures
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=history, return_history=True)
print(question, response)
# prompt looks like this:
# <|im_start|>system\n{system_message}<|im_end|><|im_start|>user\n<img>placeholder ... </img>\n{question}<|im_end|><|im_start|>assistant\n
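For reference, the first-turn prompt assembly described above can be reproduced as a plain string operation. This is a sketch based on the `chat()` code quoted later in this thread; `num_image_token=256` per tile and the system message are assumptions, not values from this issue:

```python
# Sketch of how chat() assembles the first-turn prompt from the README demo.
# Assumptions: 256 context tokens per image tile, InternLM2-style template.
IMG_START, IMG_END, IMG_CONTEXT = '<img>', '</img>', '<IMG_CONTEXT>'
num_image_token = 256   # context tokens per tile (assumed)
image_bs = 2            # dynamic ViT batch size: tiles from both images

def build_first_turn_prompt(system_message: str, question: str) -> str:
    # All image tiles are concatenated into a single <img>...</img> span
    # that is prepended to the first user question.
    image_tokens = IMG_START + IMG_CONTEXT * num_image_token * image_bs + IMG_END
    user_content = image_tokens + '\n' + question
    return (f'<|im_start|>system\n{system_message}<|im_end|>'
            f'<|im_start|>user\n{user_content}<|im_end|>'
            f'<|im_start|>assistant\n')

prompt = build_first_turn_prompt('You are a helpful assistant.',
                                 'Describe the two pictures in detail')
```

Note that all `<IMG_CONTEXT>` placeholders land in the first user message regardless of how many images are passed, which is the behavior this issue is asking about.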

I want to know whether InternVL-Chat supports interleaved text-and-image conversations like DeepSpeed-VisualChat does. If so, where should the image tokens be inserted in each round of the conversation? An example would be appreciated.

# Does InternVL support something like this? (I know pixel_values should be passed,
# but I can't find demo code showing how to pass pixel_values in interleaved text-and-image conversations.)


pixel_values1 = load_image('./examples/image1.jpg', max_num=6).to(torch.bfloat16).cuda()
question = "Describe this picture in detail"
response, history = model.chat(tokenizer, pixel_values1, question, generation_config, history=None, return_history=True)
print(question, response)

pixel_values2 = load_image('./examples/image2.jpg', max_num=6).to(torch.bfloat16).cuda()
question = "Describe this picture in detail"
response, history = model.chat(tokenizer, pixel_values2, question, generation_config, history=history, return_history=True)
print(question, response)

question = "What is the difference between the two images?"
response, history = model.chat(tokenizer, None, question, generation_config, history=history, return_history=True)
print(question, response)
hjh0119 (Contributor) commented May 8, 2024

model.chat only supports passing in new images when history is None:

    def chat(self, tokenizer, pixel_values, question, generation_config, history=None, return_history=False,
             IMG_START_TOKEN='<img>', IMG_END_TOKEN='</img>', IMG_CONTEXT_TOKEN='<IMG_CONTEXT>'):

        img_context_token_id = tokenizer.convert_tokens_to_ids(IMG_CONTEXT_TOKEN)
        self.img_context_token_id = img_context_token_id
        if tokenizer.convert_tokens_to_ids('<|im_end|>') != 0:
            eos_token_id = tokenizer.convert_tokens_to_ids('<|im_end|>')  # 92542, InternLM2
        else:
            eos_token_id = tokenizer.eos_token_id

        from .conversation import get_conv_template

        template = get_conv_template(self.template)
        image_bs = pixel_values.shape[0]
        print(f'dynamic ViT batch size: {image_bs}')
        if history is None:
            history = []
            image_tokens = IMG_START_TOKEN + IMG_CONTEXT_TOKEN * self.num_image_token * image_bs + IMG_END_TOKEN
            question = image_tokens + '\n' + question
        else:
            for (old_question, old_answer) in history:
                template.append_message(template.roles[0], old_question)
                template.append_message(template.roles[1], old_answer)

You could wrap a generate-based method modeled on the chat method.
You could also try the swift framework, see #129.
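A per-turn variant of the prompt assembly could look like the hypothetical sketch below. This is not InternVL's API: `build_interleaved_prompt` and the turn structure are made up for illustration, and `NUM_IMAGE_TOKEN=256` is an assumption. Each round's user message carries its own `<img>...</img>` span, so a new image can be supplied on any turn:

```python
# Hypothetical sketch (not InternVL's actual code): a prompt where each
# round's user message carries its own image span.
IMG_START, IMG_END, IMG_CONTEXT = '<img>', '</img>', '<IMG_CONTEXT>'
NUM_IMAGE_TOKEN = 256  # context tokens per tile (assumed)

def image_span(num_tiles: int) -> str:
    # One <img>...</img> span covering all tiles of one image.
    return IMG_START + IMG_CONTEXT * NUM_IMAGE_TOKEN * num_tiles + IMG_END

def build_interleaved_prompt(system_message, turns):
    # turns: list of (question, answer_or_None, num_tiles); num_tiles == 0
    # means the turn has no image. The last turn's answer is None.
    parts = [f'<|im_start|>system\n{system_message}<|im_end|>']
    for question, answer, num_tiles in turns:
        user = (image_span(num_tiles) + '\n' + question) if num_tiles else question
        parts.append(f'<|im_start|>user\n{user}<|im_end|>')
        if answer is not None:
            parts.append(f'<|im_start|>assistant\n{answer}<|im_end|>')
    parts.append('<|im_start|>assistant\n')
    return ''.join(parts)

prompt = build_interleaved_prompt(
    'You are a helpful assistant.',
    [('Describe this picture in detail', 'A cat.', 1),
     ('Describe this picture in detail', None, 1)])
```

The matching `pixel_values` for `generate` would then be the concatenation of all images' tiles in the order their spans appear in the prompt.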

irexyc commented May 8, 2024

@hjh0119
The current code doesn't support the usage I described; as you said, it probably places some restrictions on the input.

@czczup
My question is whether InternVL-Chat is capable of interleaved image-text conversations, i.e. whether I can supply image input in any round (similar to the illustration given by DeepSpeed-VisualChat), or whether images can currently only be inserted in the first round.

hjh0119 (Contributor) commented May 8, 2024


Interleaved image-text conversations are possible; you can refer to here.

irexyc commented May 8, 2024

@hjh0119

I took a look at your code. The way the prompt is assembled appears to be the same as the internvl demo: everything is placed inside the first round's user message, which is different from what I mean by "interleaved". My understanding of interleaved is like your handling of deepseek-vl, where the image tokens live in each round's user message rather than being concentrated in the first round's user message.

So I'd still like to confirm with the InternVL authors: for multi-round conversations with images, what is the correct way for internvl to handle them?

hjh0119 (Contributor) commented May 8, 2024

@irexyc
If I understand correctly, by "interleaved" you mean that every input can carry a new image? Like in this example:

<<< Describe this image.
Input a media path or URL <<<  http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/cat.png
This is a high-resolution image of a kitten. The kitten has striking blue eyes and a fluffy white and grey coat. The fur pattern suggests that it may be a Maine Coon or a similar breed. The kitten's ears are perked up, and it has a curious and innocent expression. The background is blurred, which brings the focus to the kitten's face.
--------------------------------------------------
<<< How many sheep are in the picture?
Input a media path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/animal.png
There are four sheep in the picture.
--------------------------------------------------
<<< What is the calculation result?
Input a media path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/math.png
The calculation result is 59,856.

irexyc commented May 8, 2024

@hjh0119

For internvl:
In your code the input looks interleaved, with a new image every time, but you are actually maintaining a list of images, and the final prompt is still assembled with this function, which places them all in the very first user message.

For deepseek-vl:
You don't maintain an image_list; instead you insert the image embeddings according to <image_placeholder>, and the <image_placeholder> appears in each round's user message.

With the former, if a new round of the conversation contains an image, the historical prompt changes (the kv-cache can't be reused and must be recomputed). With the latter, it doesn't change. I don't think the two are the same.
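The kv-cache point can be made concrete with a toy sketch. The placeholder strings below are simplified stand-ins, not real tokens, and both "strategies" are illustrations rather than either project's actual code:

```python
# Strategy A (all-images-first): every image is re-concatenated into the
# first user message, so adding an image in round 2 rewrites the round-1
# prefix and invalidates its kv-cache.
def prompt_all_first(images, questions):
    imgs = ''.join(f'<img>{i}</img>' for i in images)
    body = ''.join(f'U:{q};' for q in questions)
    return imgs + body

# Strategy B (per-turn placeholders): each image stays in its own round,
# so the round-1 prefix is unchanged and its kv-cache is reusable.
def prompt_per_turn(rounds):
    # rounds: list of (image_or_None, question)
    out = ''
    for img, q in rounds:
        if img is not None:
            out += f'<img>{img}</img>'
        out += f'U:{q};'
    return out

def common_prefix_len(a, b):
    # Length of the shared prefix, i.e. how much kv-cache survives.
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

# Round 1 uses img1 only; round 2 adds img2.
a1 = prompt_all_first(['img1'], ['q1'])
a2 = prompt_all_first(['img1', 'img2'], ['q1', 'q2'])
b1 = prompt_per_turn([('img1', 'q1')])
b2 = prompt_per_turn([('img1', 'q1'), ('img2', 'q2')])
```

Under strategy A, `a2` no longer starts with `a1`, so the round-1 cache must be recomputed; under strategy B, `b2` extends `b1` exactly.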

hjh0119 (Contributor) commented May 8, 2024

I see now. The main issue is indeed how the historical image tokens are handled, and I haven't seen a way to handle this in the official code either.
