Why does the loss stay at 0? #141

xienan0326 · 2024-04-08T02:10:35Z

Launch script
export WANDB_MODE='offline'

JSON_FOLDER="train_json"
IMAGE_FOLDER="/workspace/vl-data/"
VIDEO_FOLDER="/workspace/vl-data/"

HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 deepspeed videollava/train/train_mem.py \
    --deepspeed ./scripts/zero2.json \
    --model_name_or_path ./vicuna-7b-v1.5 \
    --version v1 \
    --data_path ${JSON_FOLDER}/llava_image_tune_.json ${JSON_FOLDER}/videochatgpt_tune_.json ${JSON_FOLDER}/nlp_tune.json \
    --image_folder ${IMAGE_FOLDER} \
    --image_tower ./LanguageBind/LanguageBind_Image \
    --video_folder ${VIDEO_FOLDER} \
    --video_tower ./LanguageBind/LanguageBind_Video_merge \
    --mm_projector_type mlp2x_gelu \
    --pretrain_mm_mlp_adapter ./checkpoints/videollava-7b-pretrain/mm_projector.bin \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --image_aspect_ratio pad \
    --group_by_modality_length True \
    --bf16 True \
    --output_dir ./checkpoints/videollava-sb \
    --num_train_epochs 1 \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 50000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 --tokenizer_model_max_length 3072 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True \
    --cache_dir "./cache_dir"

Training log
reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
warnings.warn(
[h264 @ 0xa1e6680] mmco: unref short failure
{'loss': 1.8232, 'learning_rate': 1.1111111111111112e-07, 'epoch': 0.0}
{'loss': 1.737, 'learning_rate': 2.2222222222222224e-07, 'epoch': 0.0}
0%| | 2/5979 [00:33<26:09:12, 15.75s/it][h264 @ 0xb311740] mmco: unref short failure
[h264 @ 0xb311740] mmco: unref short failure
[h264 @ 0xaaffd80] mmco: unref short failure
{'loss': 1.7796, 'learning_rate': 3.3333333333333335e-07, 'epoch': 0.0}
{'loss': 1.708, 'learning_rate': 4.444444444444445e-07, 'epoch': 0.0}
0%| | 4/5979 [00:54<20:20:27, 12.26s/it][h264 @ 0xbabe380] mmco: unref short failure
{'loss': 0.0, 'learning_rate': 5.555555555555555e-07, 'epoch': 0.0}
0%| | 5/5979 [01:05<19:08:34, 11.54s/it][h264 @ 0x25e23040] mmco: unref short failure
{'loss': 0.0, 'learning_rate': 6.666666666666667e-07, 'epoch': 0.0}
0%| | 6/5979 [01:15<18:37:10, 11.22s/it][h264 @ 0x1e8e4c80] Missing reference picture, default is 65530
[h264 @ 0x1e7b9780] Missing reference picture, default is 65530
[h264 @ 0x8d7e80] mmco: unref short failure
[h264 @ 0x8d7e80] mmco: unref short failure
{'loss': 0.0, 'learning_rate': 7.777777777777779e-07, 'epoch': 0.0}
0%| | 7/5979 [01:26<18:32:16, 11.17s/it][h264 @ 0x106b5ec00] mmco: unref short failure
[h264 @ 0x106b5ec00] mmco: unref short failure
{'loss': 0.0, 'learning_rate': 8.88888888888889e-07, 'epoch': 0.0}
{'loss': 0.0, 'learning_rate': 1.0000000000000002e-06, 'epoch': 0.0}
{'loss': 0.0, 'learning_rate': 1.111111111111111e-06, 'epoch': 0.0}
..
[20:06:01] /github/workspace/src/video/video_reader.cc:83: ERROR opening: /workspace/vl-data/videochatgpt_tune/v_D2JvqkKa-qM.mp4, Invalid data found when processing input
Error with Error reading /workspace/vl-data/videochatgpt_tune/v_D2JvqkKa-qM.mp4...
{'loss': 0.0, 'learning_rate': 1.99712517503872e-05, 'epoch': 0.05}
5%|▌ | 320/5979 [54:49<16:08:41, 10.27s/it][h264 @ 0xfa115240] mmco: unref short failure
[h264 @ 0xfa115240] mmco: unref short failure
{'loss': 0.0, 'learning_rate': 1.9970839794784918e-05, 'epoch': 0.05}
5%|▌ | 321/5979 [54:59<16:09:54, 10.29s/it][h264 @ 0xabcbf80] mmco: unref short failure
[h264 @ 0xabcbf80] mmco: unref short failure
[h264 @ 0xabcbf80] mmco: unref short failure
[mov,mp4,m4a,3gp,3g2,mj2 @ 0xd33399c0] moov atom not found
[20:06:22] /github/workspace/src/video/video_reader.cc:83: ERROR opening: /workspace/vl-data/videochatgpt_tune/v_Nx4rK_jvvR4.mp4, Invalid data found when processing input
Error with Error reading /workspace/vl-data/videochatgpt_tune/v_Nx4rK_jvvR4.mp4...
[h264 @ 0xd2a44b00] Missing reference picture, default is 65530
[h264 @ 0xaa8b7840] Missing reference picture, default is 65530
[h264 @ 0xb8fa740] mmco: unref short failure
[h264 @ 0xb8fa740] mmco: unref short failure
[h264 @ 0xd2a44b00] mmco: unref short failure
[h264 @ 0xd2a44b00] mmco: unref short failure
[h264 @ 0xd2a44b00] mmco: unref short failure
{'loss': 0.0, 'learning_rate': 1.9970424912839455e-05, 'epoch': 0.05}
{'loss': 0.0, 'learning_rate': 1.997000710467258e-05, 'epoch': 0.05}
{'loss': 0.0, 'learning_rate': 1.9969586370406913e-05, 'epoch': 0.05}
{'loss': 0.0, 'learning_rate': 1.996916271016593e-05, 'epoch': 0.05}
{'loss': 0.0, 'learning_rate': 1.996873612407397e-05, 'epoch': 0.05}
...
[h264 @ 0xd2a43940] mmco: unref short failure
[h264 @ 0xd2a43940] mmco: unref short failure
[h264 @ 0xa2373c0] Missing reference picture, default is 65530
[h264 @ 0xd2a43940] Missing reference picture, default is 65530
[h264 @ 0xb94c21c0] mmco: unref short failure
[h264 @ 0xb94c21c0] mmco: unref short failure
{'loss': 0.0, 'learning_rate': 7.757440216011661e-06, 'epoch': 0.58}
{'loss': 0.0, 'learning_rate': 7.752161053801734e-06, 'epoch': 0.59}
{'loss': 0.0, 'learning_rate': 7.746882551310377e-06, 'epoch': 0.59}
[h264 @ 0x105e9fa40] mmco: unref short failure
[h264 @ 0x105e9fa40] mmco: unref short failure
[h264 @ 0x12a1c2c0] Missing reference picture, default is 65530
[h264 @ 0x105e9fa40] Missing reference picture, default is 65530
[h264 @ 0xca735980] mmco: unref short failure
[h264 @ 0xca735980] mmco: unref short failure
{'loss': 0.0, 'learning_rate': 7.741604710086778e-06, 'epoch': 0.59}
{'loss': 0.0, 'learning_rate': 7.736327531679933e-06, 'epoch': 0.59}
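To make it easier to see where the loss collapses, here is a minimal Python sketch that scans a saved copy of the log above. It is only a diagnostic aid, not part of Video-LLaVA: the function name `summarize_log` is hypothetical, and it assumes the log was captured to text and that every logged step appears in it (the script uses `--logging_steps 1`). It reports the position of the first `{'loss': 0.0, ...}` record and the video paths that the reader failed to open.

```python
import ast
import re

# Matches the per-step metric dicts printed by the trainer,
# e.g. {'loss': 1.8232, 'learning_rate': 1.1e-07, 'epoch': 0.0}
LOSS_RE = re.compile(r"\{'loss':[^}]*\}")
# Matches the video reader failures, e.g. "ERROR opening: /path/x.mp4, Invalid data ..."
VIDEO_ERR_RE = re.compile(r"ERROR opening: (\S+?),")


def summarize_log(text):
    """Return (first_zero_record, bad_videos) for a training log string.

    first_zero_record is the 1-based index of the first loss record whose
    loss equals 0.0, counted among the records present in `text`
    (None if no such record exists). bad_videos is a sorted list of the
    unique paths reported in "ERROR opening" lines.
    """
    first_zero_record = None
    for index, match in enumerate(LOSS_RE.finditer(text), start=1):
        record = ast.literal_eval(match.group(0))
        if record["loss"] == 0.0:
            first_zero_record = index
            break
    bad_videos = sorted(set(VIDEO_ERR_RE.findall(text)))
    return first_zero_record, bad_videos


if __name__ == "__main__":
    # "train.log" is an assumed file name for the captured output.
    with open("train.log", encoding="utf-8", errors="replace") as f:
        print(summarize_log(f.read()))
```

On the excerpt above, the first zero-loss record is the fifth one, which is well before the first `ERROR opening` line shown in the excerpt; whether the unreadable videos are related to the zero loss is not established by the log itself.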
