Question about choosing multi-image input mode and replacing image decoder #279
Labels: area:dataset (dataset related)
Comments
Thank you so much for your reply. We'll continue with our SD experiment.
Maybe one solution is injecting the parameters of "microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224" into the Otter checkpoint.
Accomplishing a multi-image response with a single instruction can be easily done by adhering to the dataset format found here:
Otter/pipeline/mimicit_utils/mimicit_dataset.py
Line 432 in 9b34a44
To achieve this, you may follow the steps in the linked comment; they apply because your instruction uses the same data format (multi-image, one conversation) as the "Spot-the-difference" data.
If you have any further inquiries, don't hesitate to reach out via email. We can also add you to our Slack community for more immediate communication.
Originally posted by @ZhangYuanhan-AI in #234 (comment)
I was delighted to stumble upon this remarkable project. Thank you for your valuable contribution.
I am now working on a medical image captioning task (multiple slices and one description per patient). Following the comment above, I formed the training data MED.json and MED_instruction. Here is what the instruction JSON looks like:
{
"meta": {
"version": "",
"time": "",
"author": ""
},
"data": {
"test_INS_00000": {
"instruction": "",
"answer": ".\n ",
"image_ids": [
"MED_IMG_1",
"MED_IMG_2",
"MED_IMG_3",
"MED_IMG_4",
"MED_IMG_5",
"MED_IMG_6",
"MED_IMG_7",
"MED_IMG_8",
"MED_IMG_9",
"MED_IMG_10",
"MED_IMG_11",
"MED_IMG_12",
"MED_IMG_13",
"MED_IMG_14",
"MED_IMG_15",
"MED_IMG_16",
"MED_IMG_17",
"MED_IMG_18",
"MED_IMG_19",
"MED_IMG_20",
"MED_IMG_21",
"MED_IMG_22",
"MED_IMG_23",
"MED_IMG_24"
],
"rel_ins_ids": []
},
.....
}
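For reference, here is a minimal sketch of how such a MED.json / MED_instruction.json pair could be assembled, assuming one folder of slices per patient and a base64-encoded image store; the folder layout, report.txt file, instruction text, and file names are placeholders, and the exact schema should be checked against mimicit_dataset.py:

```python
# Minimal sketch (assumptions noted): build MED.json (image store) and
# MED_instruction.json (instruction/answer pairs) from one folder per patient.
# The folder layout, report.txt file, and instruction text are hypothetical;
# verify the expected schema against pipeline/mimicit_utils/mimicit_dataset.py.
import base64
import json
from pathlib import Path

root = Path("data/patients")  # hypothetical layout: data/patients/<patient_id>/*.png
images, data = {}, {}
img_counter = 0

for ins_idx, patient_dir in enumerate(sorted(p for p in root.iterdir() if p.is_dir())):
    image_ids = []
    for slice_path in sorted(patient_dir.glob("*.png")):
        img_counter += 1
        image_id = f"MED_IMG_{img_counter}"
        images[image_id] = base64.b64encode(slice_path.read_bytes()).decode("utf-8")
        image_ids.append(image_id)

    # Hypothetical per-patient report holding the ground-truth description.
    report = (patient_dir / "report.txt").read_text().strip()
    data[f"test_INS_{ins_idx:05d}"] = {
        "instruction": "Describe the findings in this series of slices.",  # placeholder
        "answer": report,
        "image_ids": image_ids,
        "rel_ins_ids": [],
    }

with open("MED.json", "w") as f:
    json.dump(images, f)
with open("MED_instruction.json", "w") as f:
    json.dump({"meta": {"version": "0.0.1", "time": "", "author": ""}, "data": data}, f)
```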
The version of Otter I'm using is the 8/17 commit, and I've successfully generated captions and evaluated them with BLEU and CIDEr. However, I accidentally discovered that VQA mode performs on par with SD mode, and that different instructions lead to noticeably different performance. Does that mean SD mode doesn't suit my training scenario, and that VQA mode can help me test my instructions?
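As a side note on the BLEU/CIDEr numbers, this is roughly the kind of scoring I mean; a minimal sketch using pycocoevalcap, where the file names and ID scheme are placeholders rather than my actual evaluation code:

```python
# Minimal BLEU/CIDEr scoring sketch with pycocoevalcap.
# The file names and the "test_INS_*" keys are placeholders.
import json

from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider

with open("references.json") as f:   # {"test_INS_00000": "ground-truth caption", ...}
    refs = json.load(f)
with open("predictions.json") as f:  # {"test_INS_00000": "generated caption", ...}
    preds = json.load(f)

# pycocoevalcap expects dicts mapping each id to a list of caption strings.
gts = {k: [v] for k, v in refs.items()}
res = {k: [preds[k]] for k in refs}

bleu_scores, _ = Bleu(4).compute_score(gts, res)  # BLEU-1..4
cider_score, _ = Cider().compute_score(gts, res)

print("BLEU-4:", bleu_scores[3])
print("CIDEr:", cider_score)
```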
Furthermore, I'm trying to use the BiomedCLIP image encoder, as the LLaVA-Med paper did. However, the 0817 instruction_following.py has no customized_config statement, and adding the customized_config statements from the 0830 commit's instruction_following.py does nothing: the resulting checkpoint config still records CLIP.
Here's the config.json I created as the 0830 commit suggested.
{
"model_type": "otter",
"cross_attn_every_n_layers": 4,
"tie_word_embeddings": false,
"use_media_placement_augmentation": true,
"only_attend_previous": true,
"text_config": {
"_name_or_path": "luodian/llama-7b-hf",
"model_type": "llama"
},
"vision_config": {
"_name_or_path": "microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224",
"model_type": "clip_vision_model",
"hidden_size": 768,
"intermediate_size": 3072,
"num_attention_heads": 12,
"num_hidden_layers": 12,
"image_size": 224,
"patch_size": 16
}
}
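On the suggestion above about injecting the BiomedCLIP parameters into the Otter checkpoint, here is a rough starting point; a sketch only, since the checkpoint path and the vision_encoder key prefix are assumptions about the Otter state dict, and BiomedCLIP's timm ViT keys do not map one-to-one onto HF CLIPVisionModel keys (timm fuses q/k/v), so an explicit key mapping would still be needed:

```python
# Rough sketch (not a working converter): inspect BiomedCLIP's vision weights
# side by side with the Otter checkpoint's vision-encoder weights before
# attempting any parameter injection. The checkpoint path and the
# "vision_encoder." prefix are assumptions; timm's fused attn.qkv weights
# would still need to be split to match HF CLIPVisionModel's q/k/v layout.
import torch
import open_clip

biomedclip, _ = open_clip.create_model_from_pretrained(
    "hf-hub:microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224"
)
vision_sd = biomedclip.visual.state_dict()  # timm ViT-B/16 parameters

ckpt = torch.load("path/to/otter_checkpoint.pt", map_location="cpu")  # hypothetical path
otter_sd = ckpt.get("model_state_dict", ckpt)  # depends on how the checkpoint was saved
otter_vision_keys = [k for k in otter_sd if k.startswith("vision_encoder.")]  # assumed prefix

print("BiomedCLIP vision params:", len(vision_sd))
print("Otter vision params:", len(otter_vision_keys))
for k, v in list(vision_sd.items())[:5]:
    print("biomedclip:", k, tuple(v.shape))
for k in otter_vision_keys[:5]:
    print("otter:     ", k, tuple(otter_sd[k].shape))
```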
Looking forward to exploring this topic further and to citing you and your colleagues in any resulting publication!