Question about choosing multi-image input mode and replacing image decoder #279
Labels: area:dataset (dataset related)
Comments
Thank you so much for your reply. We'll continue with our SD experiment.
Maybe one solution is injecting the parameters of "microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224" into the Otter checkpoint.
Accomplishing a multi-image response with a single instruction can be easily done by adhering to the dataset format found here:
Otter/pipeline/mimicit_utils/mimicit_dataset.py
Line 432 in 9b34a44
To achieve this, you may follow the steps in the linked comment; they apply because your instruction uses the same data format (multi-image, one conversation) as the "Spot-the-difference" data.
If you have any further inquiries, don't hesitate to reach out via email. We can also add you to our Slack community for more immediate communication.
Originally posted by @ZhangYuanhan-AI in #234 (comment)
I was delighted to stumble upon this remarkable project. Thank you for your valuable contribution.
I am now working on a medical image captioning task (multiple slices and one description per patient). Following the comment above, I formed the training data MED.json and MED_instruction. Here is what the instruction JSON looks like:
{
"meta": {
"version": "",
"time": "",
"author": ""
},
"data": {
"test_INS_00000": {
"instruction": "",
"answer": ".\n ",
"image_ids": [
"MED_IMG_1",
"MED_IMG_2",
"MED_IMG_3",
"MED_IMG_4",
"MED_IMG_5",
"MED_IMG_6",
"MED_IMG_7",
"MED_IMG_8",
"MED_IMG_9",
"MED_IMG_10",
"MED_IMG_11",
"MED_IMG_12",
"MED_IMG_13",
"MED_IMG_14",
"MED_IMG_15",
"MED_IMG_16",
"MED_IMG_17",
"MED_IMG_18",
"MED_IMG_19",
"MED_IMG_20",
"MED_IMG_21",
"MED_IMG_22",
"MED_IMG_23",
"MED_IMG_24"
],
"rel_ins_ids": []
},
.....
}
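For reference, here is a minimal sketch of how such a MED.json / MED_instruction.json pair could be assembled, assuming one folder of slices per patient and a base64-encoded image store; the folder layout, report.txt file, instruction text, and file names are placeholders, and the exact schema should be checked against mimicit_dataset.py:

```python
# Minimal sketch (assumptions noted): build MED.json (image store) and
# MED_instruction.json (instruction/answer pairs) from one folder per patient.
# The folder layout, report.txt file, and instruction text are hypothetical;
# verify the expected schema against pipeline/mimicit_utils/mimicit_dataset.py.
import base64
import json
from pathlib import Path

root = Path("data/patients")  # hypothetical layout: data/patients/<patient_id>/*.png
images, data = {}, {}
img_counter = 0

for ins_idx, patient_dir in enumerate(sorted(p for p in root.iterdir() if p.is_dir())):
    image_ids = []
    for slice_path in sorted(patient_dir.glob("*.png")):
        img_counter += 1
        image_id = f"MED_IMG_{img_counter}"
        images[image_id] = base64.b64encode(slice_path.read_bytes()).decode("utf-8")
        image_ids.append(image_id)

    # Hypothetical per-patient report holding the ground-truth description.
    report = (patient_dir / "report.txt").read_text().strip()
    data[f"test_INS_{ins_idx:05d}"] = {
        "instruction": "Describe the findings in this series of slices.",  # placeholder
        "answer": report,
        "image_ids": image_ids,
        "rel_ins_ids": [],
    }

with open("MED.json", "w") as f:
    json.dump(images, f)
with open("MED_instruction.json", "w") as f:
    json.dump({"meta": {"version": "0.0.1", "time": "", "author": ""}, "data": data}, f)
```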
The version of Otter I'm using is the 8/17 commit, and I've successfully generated captions and evaluated them with BLEU and CIDEr. However, I accidentally discovered that VQA mode performs on par with SD mode, and that different instructions lead to noticeably different performance. Does that mean SD mode doesn't suit my training scenario, and that VQA mode can help me test my instructions?
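As a side note on the BLEU/CIDEr numbers, this is roughly the kind of scoring I mean; a minimal sketch using pycocoevalcap, where the file names and ID scheme are placeholders rather than my actual evaluation code:

```python
# Minimal BLEU/CIDEr scoring sketch with pycocoevalcap.
# The file names and the "test_INS_*" keys are placeholders.
import json

from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider

with open("references.json") as f:   # {"test_INS_00000": "ground-truth caption", ...}
    refs = json.load(f)
with open("predictions.json") as f:  # {"test_INS_00000": "generated caption", ...}
    preds = json.load(f)

# pycocoevalcap expects dicts mapping each id to a list of caption strings.
gts = {k: [v] for k, v in refs.items()}
res = {k: [preds[k]] for k in refs}

bleu_scores, _ = Bleu(4).compute_score(gts, res)  # BLEU-1..4
cider_score, _ = Cider().compute_score(gts, res)

print("BLEU-4:", bleu_scores[3])
print("CIDEr:", cider_score)
```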
Furthermore, I'm trying to use the BiomedCLIP image encoder, as the LLaVA-Med paper did. However, the 0817 instruction_following.py has no customized_config statement, and adding the customized_config statements from the 0830 commit's instruction_following.py does nothing: the resulting checkpoint config still records CLIP.
Here's the config.json I created as the 0830 commit suggested.
{
"model_type": "otter",
"cross_attn_every_n_layers": 4,
"tie_word_embeddings": false,
"use_media_placement_augmentation": true,
"only_attend_previous": true,
"text_config": {
"_name_or_path": "luodian/llama-7b-hf",
"model_type": "llama"
},
"vision_config": {
"_name_or_path": "microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224",
"model_type": "clip_vision_model",
"hidden_size": 768,
"intermediate_size": 3072,
"num_attention_heads": 12,
"num_hidden_layers": 12,
"image_size": 224,
"patch_size": 16
}
}
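On the suggestion above about injecting the BiomedCLIP parameters into the Otter checkpoint, here is a rough starting point; a sketch only, since the checkpoint path and the vision_encoder key prefix are assumptions about the Otter state dict, and BiomedCLIP's timm ViT keys do not map one-to-one onto HF CLIPVisionModel keys (timm fuses q/k/v), so an explicit key mapping would still be needed:

```python
# Rough sketch (not a working converter): inspect BiomedCLIP's vision weights
# side by side with the Otter checkpoint's vision-encoder weights before
# attempting any parameter injection. The checkpoint path and the
# "vision_encoder." prefix are assumptions; timm's fused attn.qkv weights
# would still need to be split to match HF CLIPVisionModel's q/k/v layout.
import torch
import open_clip

biomedclip, _ = open_clip.create_model_from_pretrained(
    "hf-hub:microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224"
)
vision_sd = biomedclip.visual.state_dict()  # timm ViT-B/16 parameters

ckpt = torch.load("path/to/otter_checkpoint.pt", map_location="cpu")  # hypothetical path
otter_sd = ckpt.get("model_state_dict", ckpt)  # depends on how the checkpoint was saved
otter_vision_keys = [k for k in otter_sd if k.startswith("vision_encoder.")]  # assumed prefix

print("BiomedCLIP vision params:", len(vision_sd))
print("Otter vision params:", len(otter_vision_keys))
for k, v in list(vision_sd.items())[:5]:
    print("biomedclip:", k, tuple(v.shape))
for k in otter_vision_keys[:5]:
    print("otter:     ", k, tuple(otter_sd[k].shape))
```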
Looking forward to exploring this topic further and to citing you and your colleagues in any resulting publication!