Flype 🥏 : Foundational Models for Parameter Efficient Visual Language Understanding

For parameter-efficient visual language classification, we build on the VQA foundation model BLIP-2. As shown in Figure 1, we propose FLYPE, a parameter-efficient prompt-based learning method for visual language understanding in computational social science (CSS).

Figure 1. The universal model architecture of FLYPE: cross-modal prompt tuning for large visual language models.

Guidelines

The task is offered in Arabic and English. The scripts of our winning model should run on a single GPU. Both sections are marked, with Section A worth 30% and Section B worth 70%.

Section A: Run the winning model (Check-Worthiness)

Step 0: Prepare the dataset

Download the dataset, unzip it into the structure shown below, and run the merge script to build symbolic links for image processing (a minimal sketch of the merge step appears after the directory tree).

bash script/run_merge.sh
Your data directory should look as follows:

data
└── en
    ├── features
    ├── test_data
    │   ├── images_labeled
    │   │   ├── 925657889473052672.jpg
    │   │   ├── 925746311772692481.jpg
    │   │   ├── 925887908996714497.jpg
    │   │   └── ...
    │   ├── CT23_1A_checkworthy_multimodal_english_test_gold.jsonl
    │   ├── CT23_1A_checkworthy_multimodal_english_test.jsonl
    │   └── features
    │       └── test_feats.json
    └── train_data
        ├── images_labeled
        │   ├── dev
        │   │   ├── 1032635895864877056.jpg
        │   │   └── ...
        │   ├── dev_test
        │   │   ├── 1032635895864877056.jpg
        │   │   └── ...
        │   ├── merge
        │   │   ├── 1032635895864877056.jpg
        │   │   └── ...
        │   └── train
        │       ├── 1032635895864877056.jpg
        │       └── ...
        ├── CT23_1A_checkworthy_multimodal_english_dev.jsonl
        ├── CT23_1A_checkworthy_multimodal_english_dev_test.jsonl
        ├── CT23_1A_checkworthy_multimodal_english_merge.jsonl
        ├── CT23_1A_checkworthy_multimodal_english_train.jsonl
        └── features
            ├── dev_feats.json
            ├── dev_test_feats.json
            ├── merge_feats.json
            └── train_feats.json
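If you ever need to re-create the merge split by hand, the sketch below shows the general idea: linking each split's images into a shared merge directory. The split list and paths are assumptions taken from the tree above; script/run_merge.sh remains the reference implementation.

# merge_sketch.py -- minimal sketch of the merge step (assumed paths); run_merge.sh is the reference.
import os

SRC_ROOT = "data/en/train_data/images_labeled"   # assumed source root, see the tree above
SPLITS = ["train", "dev", "dev_test"]            # assumed splits that feed the merge directory
DST = os.path.join(SRC_ROOT, "merge")

os.makedirs(DST, exist_ok=True)
for split in SPLITS:
    split_dir = os.path.join(SRC_ROOT, split)
    for name in os.listdir(split_dir):
        link = os.path.join(DST, name)
        if not os.path.exists(link):
            # symlink rather than copy, so no image is duplicated on disk
            os.symlink(os.path.abspath(os.path.join(split_dir, name)), link)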

Please use the following method to extract your own features, as the features currently under en come from older models.

Step 1: Feature Extraction (All)

bash scripts/run_feature_extraction_full.sh

After extracting features into the features folder, use shell commands to arrange the extracted features and the data into the following structure (one possible sketch follows the tree):

prompt_ocr_adapter
├── test_data
│   ├── CT23_1A_checkworthy_multimodal_english_test_gold.jsonl
│   ├── CT23_1A_checkworthy_multimodal_english_test.jsonl
│   └── features
│       └── test_feats.json
└── train_data
    ├── CT23_1A_checkworthy_multimodal_english_dev.jsonl
    ├── CT23_1A_checkworthy_multimodal_english_dev_test.jsonl
    ├── CT23_1A_checkworthy_multimodal_english_merge.jsonl
    ├── CT23_1A_checkworthy_multimodal_english_train.jsonl
    └── features
        ├── dev_feats.json
        ├── dev_test_feats.json
        ├── merge_feats.json
        └── train_feats.json
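One possible way to do the rearrangement is sketched below; the source and destination paths are assumptions based on the two trees above, so adapt them to your setup.

# arrange_features.py -- sketch only; paths are assumptions based on the directory trees above.
import os
import shutil

SRC = "data/en"                    # where the jsonl files and freshly extracted features live (assumed)
DST = "data/prompt_ocr_adapter"    # layout consumed by blip_feature_extractor.py

for split_dir in ["train_data", "test_data"]:
    os.makedirs(os.path.join(DST, split_dir, "features"), exist_ok=True)
    src_split = os.path.join(SRC, split_dir)
    for name in os.listdir(src_split):
        if name.endswith(".jsonl"):
            shutil.copy(os.path.join(src_split, name), os.path.join(DST, split_dir, name))
        elif name == "features":
            for feats in os.listdir(os.path.join(src_split, name)):
                shutil.copy(os.path.join(src_split, name, feats),
                            os.path.join(DST, split_dir, "features", feats))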

Step 2: Train a Transformer Fusion Layer

Run the following command to train a transformer fusion layer with the english_dev set as the monitor. Record the performance on dev_test.

python blip_feature_extractor.py --train-data-dir ./data/prompt_ocr_adapter \
-d data/prompt_ocr_adapter/train_data \
-s dev \
-tr CT23_1A_checkworthy_multimodal_english_train.jsonl \
-te CT23_1A_checkworthy_multimodal_english_dev.jsonl \
-l en \
--lr 1e-3 \
--train-batch-size 64 \
--heads 12 \
--d 480 \
--model-type adapter \
--num-layers 1
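For orientation, the fusion head trained by this command is a small transformer over the pre-extracted BLIP-2 features. The sketch below is only an assumption of its shape, instantiated with the hyperparameters above (--d 480, --heads 12, --num-layers 1); the actual model is defined in blip_feature_extractor.py.

# fusion_head_sketch.py -- illustrative only; the real architecture lives in blip_feature_extractor.py.
import torch
import torch.nn as nn

class TransformerFusionHead(nn.Module):
    """Fuses pre-extracted multimodal features and predicts check-worthiness (2 classes)."""
    def __init__(self, d_model=480, n_heads=12, num_layers=1, num_classes=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.classifier = nn.Linear(d_model, num_classes)

    def forward(self, features):
        # features: [batch, num_tokens, d_model], e.g. concatenated image/text/OCR feature tokens
        fused = self.encoder(features)
        return self.classifier(fused.mean(dim=1))   # mean-pool over tokens, then classify

# usage sketch: a batch of 64 examples with 8 feature tokens each
logits = TransformerFusionHead()(torch.randn(64, 8, 480))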

With these hyperparameters, now run the following script with all samples and record the performance on the english_test_gold set. Note that the monitor is still the dev set. You are welcome to play around with the hyperparameters, but be wary of overfitting; the test set is no longer important since the challenge has finished.

python blip_feature_extractor.py --train-data-dir ./data/prompt_ocr_adapter \
-d data/prompt_ocr_adapter/train_data \
-s dev \
-tr CT23_1A_checkworthy_multimodal_english_train.jsonl \
-te CT23_1A_checkworthy_multimodal_english_dev.jsonl \
-l en \
--lr 1e-3 \
--train-batch-size 64 \
--heads 12 \
--d 480 \
--model-type adapter \
--num-layers 1

Run the experiments with different random seeds. Well done if you see results similar to the following (a small aggregation sketch follows the table):

| Split    | F1           | Accuracy    | Precision   | Recall       |
|----------|--------------|-------------|-------------|--------------|
| dev_test | 0.7075471698 | 0.764945652 | 0.673333333 | 0.8620689655 |
| test     | 0.7167902098 | 0.775815217 | 0.678343949 | 0.7689530686 |
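A tiny aggregation sketch for the multi-seed runs, assuming each run writes its metrics to a JSON file (the file names and seeds below are assumptions):

# average_seeds.py -- sketch; result file names and the seed list are assumptions.
import json
import statistics

f1_scores = []
for seed in (13, 42, 2023):
    with open(f"results/dev_test_seed{seed}.json") as f:   # assumed per-run metrics dump
        f1_scores.append(json.load(f)["f1"])

print(f"dev_test F1: mean={statistics.mean(f1_scores):.4f}, std={statistics.stdev(f1_scores):.4f}")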

Step 3: Train an additional Transformer Fusion Layer with soft prompt removal

Remove the soft prompt of BLIP-2, repeat the experiments, and record the results. What do you observe?

bash scripts/run_feature_extraction_full.sh
bash scripts/train_transformer_fusion.sh
bash scripts/train_transformer_merge.sh

Hint: Read Blip2Qformer and see what you can do with the attention over the query embeds (a hypothetical sketch follows).
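The snippet below is a purely hypothetical illustration of what the ablation could look like once you find where Blip2Qformer builds its attention mask: the prompt length is an assumption, and the mask would still have to be wired into the Q-Former's forward pass.

# soft_prompt_ablation.py -- hypothetical sketch; the real change belongs inside Blip2Qformer (LAVIS).
import torch

NUM_QUERY_TOKENS = 32     # BLIP-2's default number of query tokens
NUM_PROMPT_TOKENS = 4     # assumed length of the learned soft prompt

seq_len = NUM_QUERY_TOKENS + NUM_PROMPT_TOKENS
attention_mask = torch.ones(1, seq_len, dtype=torch.long)
attention_mask[:, NUM_QUERY_TOKENS:] = 0    # stop the query embeds from attending to the soft prompt
# pass this mask wherever the Q-Former assembles attention over [query tokens | soft prompt]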

Step 4: Train an additional Transformer Fusion Layer with image removal

Record your results; do not waste time overfitting the test set, as your results will be checked (a combined sketch for Steps 4 and 5 follows Step 5).

Step 5: Train an additional Transformer Fusion Layer with text removal

Likewise, record your results.
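For Steps 4 and 5, one simple way to run the ablations on the already-extracted features is to zero out one modality before fusion. The feature-file format and the key names below ("image_feats", "text_feats") are assumptions; adapt them to whatever blip_feature_extractor.py actually writes.

# modality_ablation.py -- sketch; key names are assumptions about the extracted-feature format.
import json

def load_features(path, drop=None):
    """Load extracted features, optionally zeroing out one modality ("image" or "text")."""
    with open(path) as f:
        records = json.load(f)
    for rec in records:
        if drop == "image":
            rec["image_feats"] = [0.0] * len(rec["image_feats"])
        elif drop == "text":
            rec["text_feats"] = [0.0] * len(rec["text_feats"])
    return records

# Step 4: image removal; Step 5: text removal
train_no_image = load_features("data/prompt_ocr_adapter/train_data/features/train_feats.json", drop="image")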

Step 6: Brainstorm

Given only 64 examples across two classes, propose some solutions to the problem in Step 2. Specifically, how would you change the model?

Congratulations on finishing Section A!

Section B: Train a baseline on EmoRegCom_DATA

Step 0: Prepare the dataset

Again, run a merge script (your own, adapted to this dataset) to build symbolic links for image processing:

bash script/YOUR_MERGE_SCRIPT_NOW.sh
Your data directories should be similar to the following tree:

├── features
├── test_data
│   └── images_labeled
│       ├── 1_72_0.jpg
│       ├── 2_29_3.jpg
│       └── ...
├── train_data
│   └── images_labeled
│       ├── dev
│       │   ├── 1032635895864877056.jpg
│       │   └── ...
│       ├── dev_test
│       │   └── ...
│       ├── merge
│       │   └── ...
│       └── train
│           ├── 0_3_4.jpg
│           └── ...
├── dataset.csv
├── test_dataset.csv
├── test_data.csv
├── train_data.csv
├── train_emotion_labels.csv
├── train_transcriptions.json
└── val_data.csv

Step 1: Preprocessing

To preprocess the data into a format similar to our baseline, use tools such as the csv or json readers. The closer the data is to the check-worthy format, the less effort will be required to customize the dataset. Note that Step 7 requires cross-task validation, so we do not expect the data format to change much. Please save a copy of your preprocessing script for our format-checking test.

Hint: For OCR, use one of the SOTA models, easyocr. Check OCR/easyocr.ipynb and their git repo. A hedged preprocessing sketch follows.
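As one possible starting point, the sketch below converts an EmoRegCom CSV into a CT23-style jsonl and runs easyocr on each image. The column names, file paths, and output fields are assumptions and will need to be adapted to the actual EmoRegCom files.

# emoregcom_preprocess.py -- sketch; CSV columns, paths, and output fields are assumptions.
import csv
import json
import easyocr

reader = easyocr.Reader(['en'])   # the OCR model suggested in the hint

with open("EmoRegCom_DATA/train_data.csv", newline='') as f_in, \
     open("EmoRegCom_DATA/train.jsonl", "w") as f_out:
    for row in csv.DictReader(f_in):
        image_path = f"train_data/images_labeled/train/{row['img_id']}.jpg"    # assumed id column and layout
        ocr_text = " ".join(reader.readtext(image_path, detail=0))             # detail=0 returns plain strings
        record = {
            "tweet_id": row["img_id"],            # reuse the check-worthy field names to minimise changes
            "tweet_text": row.get("text", ""),    # assumed transcription/text column
            "ocr_text": ocr_text,
            "image_path": image_path,
            "class_label": row["label"],          # assumed emotion label column
        }
        f_out.write(json.dumps(record) + "\n")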

Step 2: Feature extraction for EmoRegCom_DATA

mm_feature_extractor.py is a feature extractor adapted from blip_feature_extractor.py for a different benchmark dataset. Please modify the script to work with EmoRegCom_DATA.

Step 3: Train a Transformer Fusion Layer

As in Step 2 of Section A, train a transformer fusion layer. Keep the model architecture intact and find a good set of hyperparameters. Record the average performance on this multimodal NLU dataset along with the checkpoints.

Hint: Check the Wandb hyperparameter-search notebook and sweep_fc.yaml (a Python sweep sketch follows).
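If you prefer driving the sweep from Python instead of the yaml, a sketch along these lines should behave similarly; the parameter ranges, metric name, and project name are assumptions, not the contents of sweep_fc.yaml.

# sweep_sketch.py -- Python-side sketch of a sweep; ranges and names are assumptions, see sweep_fc.yaml.
import wandb

sweep_config = {
    "method": "bayes",
    "metric": {"name": "dev_f1", "goal": "maximize"},
    "parameters": {
        "lr": {"values": [1e-4, 5e-4, 1e-3]},
        "train_batch_size": {"values": [32, 64]},
        "num_layers": {"values": [1, 2]},
    },
}

def train():
    run = wandb.init()
    cfg = wandb.config
    # ... train the fusion layer with cfg.lr, cfg.train_batch_size, cfg.num_layers ...
    wandb.log({"dev_f1": 0.0})   # placeholder: log the real dev F1 here
    run.finish()

sweep_id = wandb.sweep(sweep_config, project="flype-emoregcom")   # project name is an assumption
wandb.agent(sweep_id, function=train, count=10)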

Step 4: Train an additional Transformer Fusion Layer with soft prompt removal

Do not change the hyperparameters now. Likewise, record the average performance and the checkpoints.

Step 5: Train an additional Transformer Fusion Layer with image removal

Likewise, record the average performance and the checkpoints.

Step 6: Train an additional Transformer Fusion Layer with text removal

Likewise, record the average performance and the checkpoints.

Step 7: Cross task validation

Test this multimodal model on the features from Section A, Step 2, and record the results.

Test the model from Section A, Step 2 on the features from Section B, Step 3, and record the results. A sketch of this cross-task evaluation follows.
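The sketch below shows what the cross-task evaluation could look like; the checkpoint path and the feature-file layout are assumptions, and the real entry points remain blip_feature_extractor.py and mm_feature_extractor.py.

# cross_task_eval.py -- sketch; checkpoint path and feature layout are assumptions.
import json
import torch

# model trained on EmoRegCom features (Section B, Step 3), assumed saved as a full module
model = torch.load("checkpoints/emoregcom_fusion.pt", map_location="cpu")
model.eval()

# features from Section A, Step 2
with open("data/prompt_ocr_adapter/test_data/features/test_feats.json") as f:
    test_feats = json.load(f)    # assumed: list of records, each with a "features" field

predictions = []
with torch.no_grad():
    for rec in test_feats:
        x = torch.tensor(rec["features"]).unsqueeze(0)   # [1, num_tokens, d]
        predictions.append(model(x).argmax(dim=-1).item())
# ... compare predictions against CT23_1A_checkworthy_multimodal_english_test_gold.jsonl ...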

Please write a brief README file on how to run your script.

zip -r submission.zip ./PATH_TO_RESULT_AND_SCRIPT 

Congratulations on finishing Section B! You did a great job!

Final results

Flype ranks in the top 3 of CheckThat! Task 1A!

Figure 2. Results for CheckThat! Task1A.
