You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Hi @Tangshengku,
I've been trying to obtain the MuE weights following the steps provided, but I seem to be encountering some issues. I would greatly appreciate it if you could take a look at my process and point out any potential problems or share your script, which might be helpful to me.
train_caption_stage1_base_MuE.sh
#!/usr/bin/env
# The port for communication. Note that if you want to run multiple tasks on the same machine,
# you need to specify different port numbers.
export MASTER_PORT=1061
log_dir=./stg1_8_3e-6_MuE
save_dir=./stg1_8_3e-6_chk_MuE
mkdir -p $log_dir $save_dir
bpe_dir=../../utils/BPE
user_dir=../../ofa_module
data_dir=../../dataset/caption_data
data=${data_dir}/caption_stage1_train.tsv,${data_dir}/caption_val.tsv
restore_file=../../checkpoints/ofa_base.pt
selected_cols=0,4,2
task=caption
arch=ofa_base
criterion=MuE_Task_Loss
label_smoothing=0.1
lr=3e-5
max_epoch=6
warmup_ratio=0.06
batch_size=8
update_freq=4
resnet_drop_path_rate=0.0
encoder_drop_path_rate=0.1
decoder_drop_path_rate=0.1
dropout=0.1
attention_dropout=0.0
max_src_length=80
max_tgt_length=20
num_bins=1000
patch_image_size=480
eval_cider_cached=${data_dir}/cider_cached_tokens/coco-valid-words.p
drop_worst_ratio=0.2
for max_epoch in {6,}; do
echo "max_epoch "${max_epoch}
for warmup_ratio in {0.06,}; do
echo "warmup_ratio "${warmup_ratio}
for drop_worst_after in {6000,}; do
echo "drop_worst_after "${drop_worst_after}
log_file=${log_dir}/${max_epoch}"_"${warmup_ratio}"_"${drop_worst_after}".log"
save_path=${save_dir}/${max_epoch}"_"${warmup_ratio}"_"${drop_worst_after}
mkdir -p $save_path
CUDA_VISIBLE_DEVICES=0,1 python3 -m torch.distributed.launch --nproc_per_node=2 --master_port=${MASTER_PORT} ../../train.py \
$data \
--selected-cols=${selected_cols} \
--bpe-dir=${bpe_dir} \
--user-dir=${user_dir} \
--restore-file=${restore_file} \
--reset-optimizer --reset-dataloader --reset-meters \
--save-dir=${save_path} \
--task=${task} \
--arch=${arch} \
--criterion=${criterion} \
--label-smoothing=${label_smoothing} \
--batch-size=${batch_size} \
--update-freq=${update_freq} \
--encoder-normalize-before \
--decoder-normalize-before \
--share-decoder-input-output-embed \
--share-all-embeddings \
--layernorm-embedding \
--patch-layernorm-embedding \
--code-layernorm-embedding \
--resnet-drop-path-rate=${resnet_drop_path_rate} \
--encoder-drop-path-rate=${encoder_drop_path_rate} \
--decoder-drop-path-rate=${decoder_drop_path_rate} \
--dropout=${dropout} \
--attention-dropout=${attention_dropout} \
--weight-decay=0.01 --optimizer=adam --adam-betas="(0.9,0.999)" --adam-eps=1e-08 --clip-norm=1.0 \
--lr-scheduler=polynomial_decay --lr=${lr} \
--max-epoch=${max_epoch} --warmup-ratio=${warmup_ratio} \
--log-format=simple --log-interval=10 \
--fixed-validation-seed=7 \
--no-epoch-checkpoints --keep-best-checkpoints=1 \
--save-interval=1 --validate-interval=1 \
--save-interval-updates=500 --validate-interval-updates=500 \
--eval-cider \
--eval-cider-cached-tokens=${eval_cider_cached} \
--eval-args='{"beam":5,"max_len_b":16,"no_repeat_ngram_size":3}' \
--best-checkpoint-metric=cider --maximize-best-checkpoint-metric \
--max-src-length=${max_src_length} \
--max-tgt-length=${max_tgt_length} \
--find-unused-parameters \
--freeze-encoder-embedding \
--freeze-decoder-embedding \
--add-type-embedding \
--scale-attn \
--scale-fc \
--scale-heads \
--disable-entangle \
--num-bins=${num_bins} \
--patch-image-size=${patch_image_size} \
--drop-worst-ratio=${drop_worst_ratio} \
--drop-worst-after=6000 \
--fp16 \
--fp16-scale-window=512 \
--train_mue\
--num-workers=0 > ${log_file} 2>&1
done
done
done
train_caption_stage2_base.sh
#!/usr/bin/env
# The port for communication. Note that if you want to run multiple tasks on the same machine,
# you need to specify different port numbers.
export MASTER_PORT=1062
log_dir=./stg2_bs8_lr1e-5_uf4_ngpu2
save_dir=./stg2_bs8_lr1e-5_uf4_ngpu2_check
mkdir -p $log_dir $save_dir
bpe_dir=../../utils/BPE
user_dir=../../ofa_module
data_dir=../../dataset/caption_data
data=${data_dir}/caption_stage2_train.tsv,${data_dir}/caption_val.tsv
restore_file=stage1/stg1_bs8_lr1e-5_uf4_nmgpu2/stg1_10_1e-5_chk_MuE/best3/checkpoint_best.pt
selected_cols=1,4,2
task=caption
arch=ofa_base
criterion=scst_reward_criterion
label_smoothing=0.1
lr=1e-5
max_epoch=5
warmup_ratio=0.06
batch_size=8
update_freq=4
resnet_drop_path_rate=0.0
encoder_drop_path_rate=0.0
decoder_drop_path_rate=0.0
dropout=0.0
attention_dropout=0.0
max_src_length=80
max_tgt_length=20
num_bins=1000
patch_image_size=480
eval_cider_cached=${data_dir}/cider_cached_tokens/coco-valid-words.p
scst_cider_cached=${data_dir}/cider_cached_tokens/coco-train-words.p
for lr in {1e-5,}; do
echo "lr "${lr}
for max_epoch in {5,}; do
echo "max_epoch "${max_epoch}
log_file=${log_dir}/${lr}"_"${max_epoch}".log"
save_path=${save_dir}/${lr}"_"${max_epoch}
mkdir -p $save_path
CUDA_VISIBLE_DEVICES=0,1 python3 -m torch.distributed.launch --nproc_per_node=2 --master_port=${MASTER_PORT} ../../train.py \
$data \
--selected-cols=${selected_cols} \
--bpe-dir=${bpe_dir} \
--user-dir=${user_dir} \
--restore-file=${restore_file} \
--reset-optimizer --reset-dataloader --reset-meters \
--save-dir=${save_path} \
--task=${task} \
--arch=${arch} \
--criterion=${criterion} \
--batch-size=${batch_size} \
--update-freq=${update_freq} \
--encoder-normalize-before \
--decoder-normalize-before \
--share-decoder-input-output-embed \
--share-all-embeddings \
--layernorm-embedding \
--patch-layernorm-embedding \
--code-layernorm-embedding \
--resnet-drop-path-rate=${resnet_drop_path_rate} \
--encoder-drop-path-rate=${encoder_drop_path_rate} \
--decoder-drop-path-rate=${decoder_drop_path_rate} \
--dropout=${dropout} \
--attention-dropout=${attention_dropout} \
--weight-decay=0.01 --optimizer=adam --adam-betas="(0.9,0.999)" --adam-eps=1e-08 --clip-norm=1.0 \
--lr-scheduler=polynomial_decay --lr=${lr} \
--max-epoch=${max_epoch} --warmup-ratio=${warmup_ratio} \
--log-format=simple --log-interval=10 \
--fixed-validation-seed=7 \
--no-epoch-checkpoints --keep-best-checkpoints=1 \
--save-interval=1 --validate-interval=1 \
--save-interval-updates=500 --validate-interval-updates=500 \
--eval-cider \
--eval-cider-cached-tokens=${eval_cider_cached} \
--eval-args='{"beam":5,"max_len_b":16,"no_repeat_ngram_size":3}' \
--best-checkpoint-metric=cider --maximize-best-checkpoint-metric \
--max-src-length=${max_src_length} \
--max-tgt-length=${max_tgt_length} \
--find-unused-parameters \
--freeze-encoder-embedding \
--freeze-decoder-embedding \
--add-type-embedding \
--scale-attn \
--scale-fc \
--scale-heads \
--disable-entangle \
--num-bins=${num_bins} \
--patch-image-size=${patch_image_size} \
--scst \
--scst-cider-cached-tokens=${scst_cider_cached} \
--scst-args='{"beam":5,"max_len_b":16,"no_repeat_ngram_size":3}' \
--memory-efficient-fp16 \
--fp16-scale-window=512 \
--num-workers=0 > ${log_file} 2>&1
done
done
evaluate_caption_base_MuE.sh
#!/usr/bin/env bash
# The port for communication. Note that if you want to run multiple tasks on the same machine,
# you need to specify different port numbers.
export MASTER_PORT=1091
user_dir=../../ofa_module
bpe_dir=../../utils/BPE
data=../../dataset/caption_data/caption_test.tsv
path=2_batch_64_check/3e-5_5/checkpoint_best.pt
result_path=../../results/caption
selected_cols=1,4,2
split='test'
CUDA_VISIBLE_DEVICES=0 python3 -m torch.distributed.launch --nproc_per_node=1 --master_port=${MASTER_PORT} ../../evaluate.py \
${data} \
--path=${path} \
--user-dir=${user_dir} \
--task=caption \
--batch-size=1 \
--log-format=simple --log-interval=10 \
--seed=7 \
--gen-subset=${split} \
--results-path=${result_path} \
--beam=5 \
--max-len-b=16 \
--no-repeat-ngram-size=3 \
--fp16 \
--num-workers=0 \
--img_thres=0.9 \
--txt_thres=0.95 \
--decoder_thres=0.9 \
--model-overrides="{\"data\":\"${data}\",\"bpe_dir\":\"${bpe_dir}\",\"eval_cider\":False,\"selected_cols\":\"${selected_cols}\"}"
python coco_eval.py ../../results/caption/test_predict.json ../../dataset/caption_data/test_caption_coco_format.json
When employing the provided script, stage2 finetuning employs TransformerEncoder, as indicated in the log I shared, and it does not employ early exit during inference. The performance scores are identical regardless of the different thresholds applied.
train_caption_stage2_base.sh with argument "--train_mue" added
#!/usr/bin/env
# The port for communication. Note that if you want to run multiple tasks on the same machine,
# you need to specify different port numbers.
export MASTER_PORT=1062
log_dir=./stg2_bs8_lr1e-5_uf4_ngpu2
save_dir=./stg2_bs8_lr1e-5_uf4_ngpu2_check
mkdir -p $log_dir $save_dir
bpe_dir=../../utils/BPE
user_dir=../../ofa_module
data_dir=../../dataset/caption_data
data=${data_dir}/caption_stage2_train.tsv,${data_dir}/caption_val.tsv
restore_file=stage1/stg1_bs8_lr1e-5_uf4_nmgpu2/stg1_10_1e-5_chk_MuE/best3/checkpoint_best.pt
selected_cols=1,4,2
task=caption
arch=ofa_base
criterion=scst_reward_criterion
label_smoothing=0.1
lr=1e-5
max_epoch=5
warmup_ratio=0.06
batch_size=8
update_freq=4
resnet_drop_path_rate=0.0
encoder_drop_path_rate=0.0
decoder_drop_path_rate=0.0
dropout=0.0
attention_dropout=0.0
max_src_length=80
max_tgt_length=20
num_bins=1000
patch_image_size=480
eval_cider_cached=${data_dir}/cider_cached_tokens/coco-valid-words.p
scst_cider_cached=${data_dir}/cider_cached_tokens/coco-train-words.p
for lr in {1e-5,}; do
echo "lr "${lr}
for max_epoch in {5,}; do
echo "max_epoch "${max_epoch}
log_file=${log_dir}/${lr}"_"${max_epoch}".log"
save_path=${save_dir}/${lr}"_"${max_epoch}
mkdir -p $save_path
CUDA_VISIBLE_DEVICES=0,1 python3 -m torch.distributed.launch --nproc_per_node=2 --master_port=${MASTER_PORT} ../../train.py \
$data \
--selected-cols=${selected_cols} \
--bpe-dir=${bpe_dir} \
--user-dir=${user_dir} \
--restore-file=${restore_file} \
--reset-optimizer --reset-dataloader --reset-meters \
--save-dir=${save_path} \
--task=${task} \
--arch=${arch} \
--criterion=${criterion} \
--batch-size=${batch_size} \
--update-freq=${update_freq} \
--encoder-normalize-before \
--decoder-normalize-before \
--share-decoder-input-output-embed \
--share-all-embeddings \
--layernorm-embedding \
--patch-layernorm-embedding \
--code-layernorm-embedding \
--resnet-drop-path-rate=${resnet_drop_path_rate} \
--encoder-drop-path-rate=${encoder_drop_path_rate} \
--decoder-drop-path-rate=${decoder_drop_path_rate} \
--dropout=${dropout} \
--attention-dropout=${attention_dropout} \
--weight-decay=0.01 --optimizer=adam --adam-betas="(0.9,0.999)" --adam-eps=1e-08 --clip-norm=1.0 \
--lr-scheduler=polynomial_decay --lr=${lr} \
--max-epoch=${max_epoch} --warmup-ratio=${warmup_ratio} \
--log-format=simple --log-interval=10 \
--fixed-validation-seed=7 \
--no-epoch-checkpoints --keep-best-checkpoints=1 \
--save-interval=1 --validate-interval=1 \
--save-interval-updates=500 --validate-interval-updates=500 \
--eval-cider \
--eval-cider-cached-tokens=${eval_cider_cached} \
--eval-args='{"beam":5,"max_len_b":16,"no_repeat_ngram_size":3}' \
--best-checkpoint-metric=cider --maximize-best-checkpoint-metric \
--max-src-length=${max_src_length} \
--max-tgt-length=${max_tgt_length} \
--find-unused-parameters \
--freeze-encoder-embedding \
--freeze-decoder-embedding \
--add-type-embedding \
--scale-attn \
--scale-fc \
--scale-heads \
--disable-entangle \
--num-bins=${num_bins} \
--patch-image-size=${patch_image_size} \
--scst \
--scst-cider-cached-tokens=${scst_cider_cached} \
--scst-args='{"beam":5,"max_len_b":16,"no_repeat_ngram_size":3}' \
--memory-efficient-fp16 \
--fp16-scale-window=512 \
--train_mue \
--num-workers=0 > ${log_file} 2>&1
done
done
log for stage2 finetuning with argument "--train_mue" added
When I add the "--train_mue" argument during stage2 finetuning, it utilizes the Transformer_MuE and effectively performs an early exit during inference. However, the performance score is notably lower than the one reported in the paper.
The text was updated successfully, but these errors were encountered:
Hi @Tangshengku,
I've been trying to obtain the MuE weights following the steps provided, but I seem to be encountering some issues. I would greatly appreciate it if you could take a look at my process and point out any potential problems or share your script, which might be helpful to me.
train_caption_stage1_base_MuE.sh
train_caption_stage2_base.sh
evaluate_caption_base_MuE.sh
training log for finetuning stage2
When employing the provided script, stage2 finetuning employs TransformerEncoder, as indicated in the log I shared, and it does not employ early exit during inference. The performance scores are identical regardless of the different thresholds applied.
train_caption_stage2_base.sh with argument "--train_mue" added
log for stage2 finetuning with argument "--train_mue" added
When I add the "--train_mue" argument during stage2 finetuning, it utilizes the Transformer_MuE and effectively performs an early exit during inference. However, the performance score is notably lower than the one reported in the paper.
The text was updated successfully, but these errors were encountered: