Does InternVL support multi-image interleaved conversations: #153

Open
irexyc opened this issue May 8, 2024 · 7 comments

irexyc commented May 8, 2024

According to the demo code in the README, the images are passed in the first round of the chat, and the image tokens are placed in front of the question.

# Demo code from the README.

# multi-round multi-image conversation
pixel_values1 = load_image('./examples/image1.jpg', max_num=6).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=6).to(torch.bfloat16).cuda()
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)

question = "详细描述这两张图片" # Describe the two pictures in detail
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
print(question, response)

question = "这两张图片的相同点和区别分别是什么" # What are the similarities and differences between these two pictures
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=history, return_history=True)
print(question, response)
# prompt looks like this:
# <|im_start|>system\n{system_message}<|im_end|><|im_start|>user\n<img>placeholder ... </img>\n{question}<|im_end|><|im_start|>assistant\n
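For reference, the first-turn prompt assembly described above can be reproduced as a plain string operation. This is a sketch based on the `chat()` code quoted later in this thread; `num_image_token=256` per tile and the system message are assumptions, not values from this issue:

```python
# Sketch of how chat() assembles the first-turn prompt from the README demo.
# Assumptions: 256 context tokens per image tile, InternLM2-style template.
IMG_START, IMG_END, IMG_CONTEXT = '<img>', '</img>', '<IMG_CONTEXT>'
num_image_token = 256   # context tokens per tile (assumed)
image_bs = 2            # dynamic ViT batch size: tiles from both images

def build_first_turn_prompt(system_message: str, question: str) -> str:
    # All image tiles are concatenated into a single <img>...</img> span
    # that is prepended to the first user question.
    image_tokens = IMG_START + IMG_CONTEXT * num_image_token * image_bs + IMG_END
    user_content = image_tokens + '\n' + question
    return (f'<|im_start|>system\n{system_message}<|im_end|>'
            f'<|im_start|>user\n{user_content}<|im_end|>'
            f'<|im_start|>assistant\n')

prompt = build_first_turn_prompt('You are a helpful assistant.',
                                 'Describe the two pictures in detail')
```

Note that all `<IMG_CONTEXT>` placeholders land in the first user message regardless of how many images are passed, which is the behavior this issue is asking about.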

I want to know whether InternVL-Chat supports interleaved text-and-image conversations like DeepSpeed-VisualChat does. If so, where should the image tokens be inserted in each round of the conversation? An example would be appreciated.

# Does InternVL support something like this? (I know pixel_values should be passed,
# but I can't find demo code showing how to pass pixel_values in interleaved text-and-image conversations.)


pixel_values1 = load_image('./examples/image1.jpg', max_num=6).to(torch.bfloat16).cuda()
question = "Describe this picture in detail"
response, history = model.chat(tokenizer, pixel_values1, question, generation_config, history=None, return_history=True)
print(question, response)

pixel_values2 = load_image('./examples/image2.jpg', max_num=6).to(torch.bfloat16).cuda()
question = "Describe this picture in detail"
response, history = model.chat(tokenizer, pixel_values2, question, generation_config, history=history, return_history=True)
print(question, response)

question = "What is the difference between the two images?"
response, history = model.chat(tokenizer, None, question, generation_config, history=history, return_history=True)
print(question, response)
hjh0119 (Contributor) commented May 8, 2024

model.chat only supports passing in new images when history is None:

    def chat(self, tokenizer, pixel_values, question, generation_config, history=None, return_history=False,
             IMG_START_TOKEN='<img>', IMG_END_TOKEN='</img>', IMG_CONTEXT_TOKEN='<IMG_CONTEXT>'):

        img_context_token_id = tokenizer.convert_tokens_to_ids(IMG_CONTEXT_TOKEN)
        self.img_context_token_id = img_context_token_id
        if tokenizer.convert_tokens_to_ids('<|im_end|>') != 0:
            eos_token_id = tokenizer.convert_tokens_to_ids('<|im_end|>')  # 92542, InternLM2
        else:
            eos_token_id = tokenizer.eos_token_id

        from .conversation import get_conv_template

        template = get_conv_template(self.template)
        image_bs = pixel_values.shape[0]
        print(f'dynamic ViT batch size: {image_bs}')
        if history is None:
            history = []
            image_tokens = IMG_START_TOKEN + IMG_CONTEXT_TOKEN * self.num_image_token * image_bs + IMG_END_TOKEN
            question = image_tokens + '\n' + question
        else:
            for (old_question, old_answer) in history:
                template.append_message(template.roles[0], old_question)
                template.append_message(template.roles[1], old_answer)

You could wrap a generate-based method modeled on the chat method.
You could also try the swift framework, see #129.
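A per-turn variant of the prompt assembly could look like the hypothetical sketch below. This is not InternVL's API: `build_interleaved_prompt` and the turn structure are made up for illustration, and `NUM_IMAGE_TOKEN=256` is an assumption. Each round's user message carries its own `<img>...</img>` span, so a new image can be supplied on any turn:

```python
# Hypothetical sketch (not InternVL's actual code): a prompt where each
# round's user message carries its own image span.
IMG_START, IMG_END, IMG_CONTEXT = '<img>', '</img>', '<IMG_CONTEXT>'
NUM_IMAGE_TOKEN = 256  # context tokens per tile (assumed)

def image_span(num_tiles: int) -> str:
    # One <img>...</img> span covering all tiles of one image.
    return IMG_START + IMG_CONTEXT * NUM_IMAGE_TOKEN * num_tiles + IMG_END

def build_interleaved_prompt(system_message, turns):
    # turns: list of (question, answer_or_None, num_tiles); num_tiles == 0
    # means the turn has no image. The last turn's answer is None.
    parts = [f'<|im_start|>system\n{system_message}<|im_end|>']
    for question, answer, num_tiles in turns:
        user = (image_span(num_tiles) + '\n' + question) if num_tiles else question
        parts.append(f'<|im_start|>user\n{user}<|im_end|>')
        if answer is not None:
            parts.append(f'<|im_start|>assistant\n{answer}<|im_end|>')
    parts.append('<|im_start|>assistant\n')
    return ''.join(parts)

prompt = build_interleaved_prompt(
    'You are a helpful assistant.',
    [('Describe this picture in detail', 'A cat.', 1),
     ('Describe this picture in detail', None, 1)])
```

The matching `pixel_values` for `generate` would then be the concatenation of all images' tiles in the order their spans appear in the prompt.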

irexyc commented May 8, 2024

@hjh0119
The current code doesn't support the usage I described; as you said, it probably places some restrictions on the input.

@czczup
My question is whether InternVL-Chat is capable of interleaved image-text conversations, i.e. whether I can supply image input in any round (similar to the illustration given by DeepSpeed-VisualChat), or whether images can currently only be inserted in the first round.

hjh0119 (Contributor) commented May 8, 2024


Interleaved image-text conversations are possible; you can refer to here.

irexyc commented May 8, 2024

@hjh0119

I took a look at your code. The way the prompt is assembled appears to be the same as the internvl demo: everything is placed inside the first round's user message, which is different from what I mean by "interleaved". My understanding of interleaved is like your handling of deepseek-vl, where the image tokens live in each round's user message rather than being concentrated in the first round's user message.

So I'd still like to confirm with the InternVL authors: for multi-round conversations with images, what is the correct way for internvl to handle them?

hjh0119 (Contributor) commented May 8, 2024

@irexyc
If I understand correctly, by "interleaved" you mean that every input can carry a new image? Like in this example:

<<< Describe this image.
Input a media path or URL <<<  http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/cat.png
This is a high-resolution image of a kitten. The kitten has striking blue eyes and a fluffy white and grey coat. The fur pattern suggests that it may be a Maine Coon or a similar breed. The kitten's ears are perked up, and it has a curious and innocent expression. The background is blurred, which brings the focus to the kitten's face.
--------------------------------------------------
<<< How many sheep are in the picture?
Input a media path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/animal.png
There are four sheep in the picture.
--------------------------------------------------
<<< What is the calculation result?
Input a media path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/math.png
The calculation result is 59,856.

irexyc commented May 8, 2024

@hjh0119

For internvl:
In your code the input looks interleaved, with a new image every time, but you are actually maintaining a list of images, and the final prompt is still assembled with this function, which places them all in the very first user message.

For deepseek-vl:
You don't maintain an image_list; instead you insert the image embeddings according to <image_placeholder>, and the <image_placeholder> appears in each round's user message.

With the former, if a new round of the conversation contains an image, the historical prompt changes (the kv-cache can't be reused and must be recomputed). With the latter, it doesn't change. I don't think the two are the same.
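The kv-cache point can be made concrete with a toy sketch. The placeholder strings below are simplified stand-ins, not real tokens, and both "strategies" are illustrations rather than either project's actual code:

```python
# Strategy A (all-images-first): every image is re-concatenated into the
# first user message, so adding an image in round 2 rewrites the round-1
# prefix and invalidates its kv-cache.
def prompt_all_first(images, questions):
    imgs = ''.join(f'<img>{i}</img>' for i in images)
    body = ''.join(f'U:{q};' for q in questions)
    return imgs + body

# Strategy B (per-turn placeholders): each image stays in its own round,
# so the round-1 prefix is unchanged and its kv-cache is reusable.
def prompt_per_turn(rounds):
    # rounds: list of (image_or_None, question)
    out = ''
    for img, q in rounds:
        if img is not None:
            out += f'<img>{img}</img>'
        out += f'U:{q};'
    return out

def common_prefix_len(a, b):
    # Length of the shared prefix, i.e. how much kv-cache survives.
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

# Round 1 uses img1 only; round 2 adds img2.
a1 = prompt_all_first(['img1'], ['q1'])
a2 = prompt_all_first(['img1', 'img2'], ['q1', 'q2'])
b1 = prompt_per_turn([('img1', 'q1')])
b2 = prompt_per_turn([('img1', 'q1'), ('img2', 'q2')])
```

Under strategy A, `a2` no longer starts with `a1`, so the round-1 cache must be recomputed; under strategy B, `b2` extends `b1` exactly.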

hjh0119 (Contributor) commented May 8, 2024

I see now. The main issue is indeed how the historical image tokens are handled, and I haven't seen a way to handle this in the official code either.
