To perform parameter-efficient visual-language classification, we select the VQA foundation model BLIP-2. As shown in the figures below, we propose FLYPE, a parameter-efficient prompt-based learning method for visual-language understanding in computational social science (CSS).
Figure 1. The universal model architecture of FLYPE, cross-modal prompt tuning for large visual language models.

## Guidelines

The task is offered in Arabic and English. The scripts of our winning model should be able to run on a single GPU. Both sections are marked: Section A counts for 30% and Section B for 70%.

## Section A: Run the winning model (Check-Worthiness)

Download the dataset, unzip it into the structure below, and run the merge script to build symbolic links for image processing:
bash script/run_merge.sh
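For reference, here is a minimal sketch of what such a merge step can look like, assuming per-split jsonl files that carry a `tweet_id` field; the field name and paths are illustrative and not necessarily those used by `run_merge.sh`:

```python
import json
import os

# Illustrative merge step (not the actual run_merge.sh): build a "merge" split by
# symlinking the images referenced in the train and dev_test jsonl files into one
# folder. The "tweet_id" field and the paths below are assumptions.
root = "data/en/train_data"
merge_dir = os.path.join(root, "images_labeled", "merge")
os.makedirs(merge_dir, exist_ok=True)

for split in ["train", "dev_test"]:
    jsonl = os.path.join(root, f"CT23_1A_checkworthy_multimodal_english_{split}.jsonl")
    with open(jsonl) as f:
        for line in f:
            tweet_id = json.loads(line)["tweet_id"]
            src = os.path.join(root, "images_labeled", split, f"{tweet_id}.jpg")
            dst = os.path.join(merge_dir, f"{tweet_id}.jpg")
            if os.path.exists(src) and not os.path.exists(dst):
                os.symlink(os.path.abspath(src), dst)
```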
Your data directory should look as follows:
data
└── en
    ├── features
    ├── test_data
    │   ├── images_labeled
    │   │   ├── 925657889473052672.jpg
    │   │   ├── 925746311772692481.jpg
    │   │   ├── 925887908996714497.jpg
    │   │   └── ...
    │   ├── CT23_1A_checkworthy_multimodal_english_test_gold.jsonl
    │   ├── CT23_1A_checkworthy_multimodal_english_test.jsonl
    │   └── features
    │       └── test_feats.json
    └── train_data
        ├── images_labeled
        │   ├── dev
        │   │   ├── 1032635895864877056.jpg
        │   │   └── ...
        │   ├── dev_test
        │   │   ├── 1032635895864877056.jpg
        │   │   └── ...
        │   ├── merge
        │   │   ├── 1032635895864877056.jpg
        │   │   └── ...
        │   └── train
        │       ├── 1032635895864877056.jpg
        │       └── ...
        ├── CT23_1A_checkworthy_multimodal_english_dev.jsonl
        ├── CT23_1A_checkworthy_multimodal_english_dev_test.jsonl
        ├── CT23_1A_checkworthy_multimodal_english_merge.jsonl
        ├── CT23_1A_checkworthy_multimodal_english_train.jsonl
        └── features
            ├── dev_feats.json
            ├── dev_test_feats.json
            ├── merge_feats.json
            └── train_feats.json
The features shipped under `en` were produced by older models, so please extract your own features with the following script:
bash scripts/run_feature_extraction_full.sh
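For orientation, here is a minimal sketch of the kind of call `run_feature_extraction_full.sh` automates, assuming the LAVIS `blip2_feature_extractor` checkpoint; the example image path, the input text, and the output format actually used by the repo are assumptions:

```python
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

# Load BLIP-2 in feature-extraction mode (LAVIS), then extract multimodal features
# for one image/text pair. The repo's scripts batch this over every split.
device = "cuda" if torch.cuda.is_available() else "cpu"
model, vis_processors, txt_processors = load_model_and_preprocess(
    name="blip2_feature_extractor", model_type="pretrain", is_eval=True, device=device
)

raw_image = Image.open(
    "data/en/train_data/images_labeled/train/1032635895864877056.jpg"
).convert("RGB")
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
text = txt_processors["eval"]("example tweet text")

features = model.extract_features({"image": image, "text_input": [text]}, mode="multimodal")
# features.multimodal_embeds: (1, num_query_tokens, hidden_dim) — these are the vectors
# that end up in the *_feats.json files (exact serialization is an assumption).
print(features.multimodal_embeds.shape)
```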
After extracting features into the `features` folder, use bash commands to arrange the extracted features and the data into the following structure (a copy-step sketch follows the tree):
prompt_ocr_adapter
├── test_data
│ ├── CT23_1A_checkworthy_multimodal_english_test_gold.jsonl
│ ├── CT23_1A_checkworthy_multimodal_english_test.jsonl
│ └── features
│ └── test_feats.json
└── train_data
├── CT23_1A_checkworthy_multimodal_english_dev.jsonl
├── CT23_1A_checkworthy_multimodal_english_dev_test.jsonl
├── CT23_1A_checkworthy_multimodal_english_merge.jsonl
├── CT23_1A_checkworthy_multimodal_english_train.jsonl
└── features
├── dev_feats.json
├── dev_test_feats.json
├── merge_feats.json
└── train_feats.json
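A minimal Python sketch of that copy step (plain `mkdir -p` and `cp` work just as well), assuming your newly extracted `*_feats.json` files have already replaced the old ones under `data/en/*/features`:

```python
import os
import shutil

# Copy the jsonl files and the (re-extracted) feature files into the
# prompt_ocr_adapter layout expected by the training command below.
src_root = "data/en"
dst_root = "data/prompt_ocr_adapter"

for split_dir in ["train_data", "test_data"]:
    os.makedirs(os.path.join(dst_root, split_dir, "features"), exist_ok=True)
    src = os.path.join(src_root, split_dir)
    for name in os.listdir(src):
        if name.endswith(".jsonl"):
            shutil.copy(os.path.join(src, name), os.path.join(dst_root, split_dir, name))
    for name in os.listdir(os.path.join(src, "features")):
        if name.endswith("_feats.json"):
            shutil.copy(os.path.join(src, "features", name),
                        os.path.join(dst_root, split_dir, "features", name))
```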
Run the following command to train a transformer fusion layer with the english_dev set as the monitor. Record the performance on dev_test.
python blip_feature_extractor.py --train-data-dir ./data/prompt_ocr_adapter \
-d data/prompt_ocr_adapter/train_data \
-s dev \
-tr CT23_1A_checkworthy_multimodal_english_train.jsonl \
-te CT23_1A_checkworthy_multimodal_english_dev.jsonl \
-l en \
--lr 1e-3 \
--train-batch-size 64 \
--heads 12 \
--d 480 \
--model-type adapter \
--num-layers 1
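For orientation, the fusion model can be pictured roughly as follows. This is a simplified sketch matching `--d 480 --heads 12 --num-layers 1`, not the exact architecture in `blip_feature_extractor.py`:

```python
import torch
import torch.nn as nn

# Simplified transformer fusion head over precomputed BLIP-2 features: the image and
# text feature vectors are treated as a two-token sequence, fused by self-attention,
# pooled, and classified. (Illustrative only; the repo's model may differ.)
class FusionClassifier(nn.Module):
    def __init__(self, d_model=480, heads=12, num_layers=1, num_classes=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.classifier = nn.Linear(d_model, num_classes)

    def forward(self, image_feat, text_feat):
        # image_feat, text_feat: (batch, d_model)
        tokens = torch.stack([image_feat, text_feat], dim=1)  # (batch, 2, d_model)
        fused = self.encoder(tokens).mean(dim=1)              # (batch, d_model)
        return self.classifier(fused)

model = FusionClassifier()
logits = model(torch.randn(64, 480), torch.randn(64, 480))   # matches --train-batch-size 64
```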
With these hyperparameters, now run the following script on all samples and record the performance on the english_test_gold set. Note that the monitor is still the dev set. You are welcome to experiment with the hyperparameters, but be aware of overfitting; the test score is not important, as the challenge has finished.
python blip_feature_extractor.py --train-data-dir ./data/prompt_ocr_adapter \
-d data/prompt_ocr_adapter/train_data \
-s dev \
-tr CT23_1A_checkworthy_multimodal_english_train.jsonl \
-te CT23_1A_checkworthy_multimodal_english_dev.jsonl \
-l en \
--lr 1e-3 \
--train-batch-size 64 \
--heads 12 \
--d 480 \
--model-type adapter \
--num-layers 1
Run the experiments with different random seeds. Well done if you see results similar to the following:
| Split | F1 | Accuracy | Precision | Recall |
|---|---|---|---|---|
| dev_test | 0.7075471698 | 0.764945652 | 0.673333333 | 0.8620689655 |
| test | 0.7167902098 | 0.775815217 | 0.678343949 | 0.7689530686 |
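If you recompute or average these metrics across seeds yourself, a minimal scikit-learn sketch (the label lists below are placeholders):

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Compute the same four numbers as in the table above from gold and predicted
# 0/1 labels (binary positive class assumed to be 1).
def report(y_true, y_pred):
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="binary", pos_label=1
    )
    return {
        "F1": f1,
        "Accuracy": accuracy_score(y_true, y_pred),
        "Precision": precision,
        "Recall": recall,
    }

print(report([1, 0, 1, 1, 0], [1, 0, 0, 1, 1]))  # placeholder labels
```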
Remove the soft prompt from BLIP-2. Repeat the experiments and record the results. What do you observe?
bash scripts/run_feature_extraction_full.sh
bash scripts/train_transformer_fusion.sh
bash scripts/train_transformer_merge.sh
Hint: Read `Blip2Qformer` and see what you can do with the attention over the query embeds.
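To prototype the ablation before touching the training scripts, here is a toy PyTorch illustration of the idea behind the hint; it is not the FLYPE code, and the soft prompt tensor here is hypothetical:

```python
import torch

# Toy illustration: the Q-Former receives query_embeds plus an attention mask, so a
# soft prompt appended to the learned query tokens can be ablated by dropping both
# the prompt rows and the corresponding mask entries.
batch, n_query, n_prompt, dim = 2, 32, 8, 768
query_tokens = torch.randn(batch, n_query, dim)                     # BLIP-2 learned queries
soft_prompt = torch.randn(1, n_prompt, dim).expand(batch, -1, -1)   # hypothetical soft prompt

query_embeds_full = torch.cat([query_tokens, soft_prompt], dim=1)
attn_mask_full = torch.ones(batch, n_query + n_prompt, dtype=torch.long)

# Ablation: keep only the original query tokens and shrink the mask to match;
# otherwise self-attention would still attend to stale prompt positions.
query_embeds_ablated = query_embeds_full[:, :n_query, :]
attn_mask_ablated = attn_mask_full[:, :n_query]
print(query_embeds_ablated.shape, attn_mask_ablated.shape)
```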
Record your results. Do not waste your time overfitting the test set, as your results will be checked.
Likewise, record your results.
Given only 64 examples across the two classes, propose some solutions to the problem in Step 2. Specifically, how would you change the model?
Congratulations on finishing Section A!
## Section B

Again, run the merge script to build symbolic links for image processing:
bash script/YOUR_MERGE_SCRIPT_NOW.sh
Your data directories should be similar to the following tree
├── features
├── test_data
│   └── images_labeled
│       ├── 1_72_0.jpg
│       ├── 2_29_3.jpg
│       └── ...
└── train_data
    ├── images_labeled
    │   ├── dev
    │   │   ├── 1032635895864877056.jpg
    │   │   └── ...
    │   ├── dev_test
    │   │   └── ...
    │   ├── merge
    │   │   └── ...
    │   └── train
    │       ├── 0_3_4.jpg
    │       └── ...
    ├── dataset.csv
    ├── test_dataset.csv
    ├── test_data.csv
    ├── train_data.csv
    ├── train_emotion_labels.csv
    ├── train_transcriptions.json
    └── val_data.csv
To preprocess the data into a format similar to our baseline, use tools such as the csv or json readers. The closer the data is to the check-worthiness format, the less effort will be required to customize the dataset. Note that Step 7 requires cross-task validation, so we do not expect the data format to change much. Please save a copy of your preprocessing script for our format-checking test.
Hint: For OCR, use one of the SOTA models, easyocr. Check OCR/easyocr.ipynb and their awesome git repo.
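A minimal preprocessing sketch along these lines, combining a csv reader with easyocr; the column names, image path pattern, and output fields are assumptions and should be adjusted to the actual CSV headers:

```python
import csv
import json
import easyocr

# Convert EmoRegCom rows into a jsonl format close to the check-worthiness files,
# attaching OCR text extracted with easyocr. Field names here are illustrative.
reader = easyocr.Reader(["en"])  # downloads detection/recognition models on first use

with open("train_data.csv", newline="") as f_in, open("emoregcom_train.jsonl", "w") as f_out:
    for row in csv.DictReader(f_in):
        image_path = f"train_data/images_labeled/train/{row['image_id']}.jpg"  # assumed column
        ocr_text = " ".join(reader.readtext(image_path, detail=0))  # detail=0 -> plain strings
        f_out.write(json.dumps({
            "tweet_id": row["image_id"],        # reuse the check-worthiness field names
            "tweet_text": row.get("text", ""),  # assumed column
            "ocr_text": ocr_text,
            "image_path": image_path,
        }) + "\n")
```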
mm_feature_extractor.py is a feature extractor adapted from blip_feature_extractor.py for a different benchmark dataset. Please modify the script for EmoRegCom_DATA.
As in Step 2 of Section A, train a transformer fusion layer. Keep the model architecture intact and find a set of useful hyperparameters. Record the average performance on this multimodal NLU dataset and the checkpoints.
Hint: Check the Wandb hyperparameter-search notebook. Also check sweep_fc.yaml.
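If you prefer to launch the sweep from Python rather than the YAML file, a minimal sketch with the wandb API (the search space and metric name are assumptions, not the contents of sweep_fc.yaml):

```python
import wandb

# Illustrative random-search sweep over the fusion-layer hyperparameters.
sweep_config = {
    "method": "random",
    "metric": {"name": "dev_f1", "goal": "maximize"},  # assumed metric name
    "parameters": {
        "lr": {"values": [1e-3, 5e-4, 1e-4]},
        "train_batch_size": {"values": [32, 64]},
        "heads": {"values": [8, 12]},
        "num_layers": {"values": [1, 2]},
    },
}

def train():
    wandb.init()
    cfg = wandb.config
    # ... call the fusion-layer training with cfg.lr, cfg.train_batch_size, etc. ...
    wandb.log({"dev_f1": 0.0})  # replace with the real dev F1

sweep_id = wandb.sweep(sweep_config, project="flype-emoregcom")  # project name is illustrative
wandb.agent(sweep_id, function=train, count=20)
```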
Do not change the hyperparameters now. Likewise, record the average performance and the checkpoints.
Likewise, record the average performance and the checkpoints.
Likewise, record the average performance and the checkpoints.
Test this multimodal model on the features from Section A, Step 2. Record the results.
Test the model from Section A, Step 2 on the features from Section B, Step 3. Record the results.
Please write a brief README file on how to run your script.
zip -r submission.zip ./PATH_TO_RESULT_AND_SCRIPT
Congratulations on finishing Section B! You did a great job!
FLYPE ranks in the top 3 in CheckThat! Task 1A!
Figure 2. Results for CheckThat! Task 1A.