Why is the loss always 0? #141

Open · xienan0326 opened this issue Apr 8, 2024 · 0 comments
Training script:
export WANDB_MODE='offline'

JSON_FOLDER="train_json"
IMAGE_FOLDER="/workspace/vl-data/"
VIDEO_FOLDER="/workspace/vl-data/"

HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 deepspeed videollava/train/train_mem.py \
    --deepspeed ./scripts/zero2.json \
    --model_name_or_path ./vicuna-7b-v1.5 \
    --version v1 \
    --data_path ${JSON_FOLDER}/llava_image_tune_.json ${JSON_FOLDER}/videochatgpt_tune_.json ${JSON_FOLDER}/nlp_tune.json \
    --image_folder ${IMAGE_FOLDER} \
    --image_tower ./LanguageBind/LanguageBind_Image \
    --video_folder ${VIDEO_FOLDER} \
    --video_tower ./LanguageBind/LanguageBind_Video_merge \
    --mm_projector_type mlp2x_gelu \
    --pretrain_mm_mlp_adapter ./checkpoints/videollava-7b-pretrain/mm_projector.bin \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --image_aspect_ratio pad \
    --group_by_modality_length True \
    --bf16 True \
    --output_dir ./checkpoints/videollava-sb \
    --num_train_epochs 1 \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 50000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --tokenizer_model_max_length 3072 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True \
    --cache_dir "./cache_dir"
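
A sanity check worth running before training (a minimal sketch, not part of the original report): the log below shows several videos that the reader cannot open ("ERROR opening ... Invalid data found"). Pre-scanning the video folder and dropping unreadable files from the tuning JSON rules out corrupt data as a factor. The snippet assumes the videos are read with decord, which is what the video_reader.cc lines in the log suggest; the folder path is taken from the log, everything else is an assumption.

# Sketch: pre-scan videos with decord, the reader behind the
# "video_reader.cc ... ERROR opening" lines in the log below.
import os
from decord import VideoReader

VIDEO_DIR = "/workspace/vl-data/videochatgpt_tune"  # path as it appears in the log

bad = []
for name in sorted(os.listdir(VIDEO_DIR)):
    if not name.endswith(".mp4"):
        continue
    path = os.path.join(VIDEO_DIR, name)
    try:
        VideoReader(path)  # raises if the container/stream is unreadable
    except Exception:
        bad.append(path)

print(f"{len(bad)} unreadable videos")
print("\n".join(bad))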

Training log:
reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
warnings.warn(
[h264 @ 0xa1e6680] mmco: unref short failure
{'loss': 1.8232, 'learning_rate': 1.1111111111111112e-07, 'epoch': 0.0}
{'loss': 1.737, 'learning_rate': 2.2222222222222224e-07, 'epoch': 0.0}
0%| | 2/5979 [00:33<26:09:12, 15.75s/it][h264 @ 0xb311740] mmco: unref short failure
[h264 @ 0xb311740] mmco: unref short failure
[h264 @ 0xaaffd80] mmco: unref short failure
{'loss': 1.7796, 'learning_rate': 3.3333333333333335e-07, 'epoch': 0.0}
{'loss': 1.708, 'learning_rate': 4.444444444444445e-07, 'epoch': 0.0}
0%| | 4/5979 [00:54<20:20:27, 12.26s/it][h264 @ 0xbabe380] mmco: unref short failure
{'loss': 0.0, 'learning_rate': 5.555555555555555e-07, 'epoch': 0.0}
0%| | 5/5979 [01:05<19:08:34, 11.54s/it][h264 @ 0x25e23040] mmco: unref short failure
{'loss': 0.0, 'learning_rate': 6.666666666666667e-07, 'epoch': 0.0}
0%| | 6/5979 [01:15<18:37:10, 11.22s/it][h264 @ 0x1e8e4c80] Missing reference picture, default is 65530
[h264 @ 0x1e7b9780] Missing reference picture, default is 65530
[h264 @ 0x8d7e80] mmco: unref short failure
[h264 @ 0x8d7e80] mmco: unref short failure
{'loss': 0.0, 'learning_rate': 7.777777777777779e-07, 'epoch': 0.0}
0%| | 7/5979 [01:26<18:32:16, 11.17s/it][h264 @ 0x106b5ec00] mmco: unref short failure
[h264 @ 0x106b5ec00] mmco: unref short failure
{'loss': 0.0, 'learning_rate': 8.88888888888889e-07, 'epoch': 0.0}
{'loss': 0.0, 'learning_rate': 1.0000000000000002e-06, 'epoch': 0.0}
{'loss': 0.0, 'learning_rate': 1.111111111111111e-06, 'epoch': 0.0}
..
[20:06:01] /github/workspace/src/video/video_reader.cc:83: ERROR opening: /workspace/vl-data/videochatgpt_tune/v_D2JvqkKa-qM.mp4, Invalid data found when processing input
Error with Error reading /workspace/vl-data/videochatgpt_tune/v_D2JvqkKa-qM.mp4...
{'loss': 0.0, 'learning_rate': 1.99712517503872e-05, 'epoch': 0.05}
5%|▌ | 320/5979 [54:49<16:08:41, 10.27s/it][h264 @ 0xfa115240] mmco: unref short failure
[h264 @ 0xfa115240] mmco: unref short failure
{'loss': 0.0, 'learning_rate': 1.9970839794784918e-05, 'epoch': 0.05}
5%|▌ | 321/5979 [54:59<16:09:54, 10.29s/it][h264 @ 0xabcbf80] mmco: unref short failure
[h264 @ 0xabcbf80] mmco: unref short failure
[h264 @ 0xabcbf80] mmco: unref short failure
[mov,mp4,m4a,3gp,3g2,mj2 @ 0xd33399c0] moov atom not found
[20:06:22] /github/workspace/src/video/video_reader.cc:83: ERROR opening: /workspace/vl-data/videochatgpt_tune/v_Nx4rK_jvvR4.mp4, Invalid data found when processing input
Error with Error reading /workspace/vl-data/videochatgpt_tune/v_Nx4rK_jvvR4.mp4...
[h264 @ 0xd2a44b00] Missing reference picture, default is 65530
[h264 @ 0xaa8b7840] Missing reference picture, default is 65530
[h264 @ 0xb8fa740] mmco: unref short failure
[h264 @ 0xb8fa740] mmco: unref short failure
[h264 @ 0xd2a44b00] mmco: unref short failure
[h264 @ 0xd2a44b00] mmco: unref short failure
[h264 @ 0xd2a44b00] mmco: unref short failure
{'loss': 0.0, 'learning_rate': 1.9970424912839455e-05, 'epoch': 0.05}
{'loss': 0.0, 'learning_rate': 1.997000710467258e-05, 'epoch': 0.05}
{'loss': 0.0, 'learning_rate': 1.9969586370406913e-05, 'epoch': 0.05}
{'loss': 0.0, 'learning_rate': 1.996916271016593e-05, 'epoch': 0.05}
{'loss': 0.0, 'learning_rate': 1.996873612407397e-05, 'epoch': 0.05}
...
[h264 @ 0xd2a43940] mmco: unref short failure
[h264 @ 0xd2a43940] mmco: unref short failure
[h264 @ 0xa2373c0] Missing reference picture, default is 65530
[h264 @ 0xd2a43940] Missing reference picture, default is 65530
[h264 @ 0xb94c21c0] mmco: unref short failure
[h264 @ 0xb94c21c0] mmco: unref short failure
{'loss': 0.0, 'learning_rate': 7.757440216011661e-06, 'epoch': 0.58}
{'loss': 0.0, 'learning_rate': 7.752161053801734e-06, 'epoch': 0.59}
{'loss': 0.0, 'learning_rate': 7.746882551310377e-06, 'epoch': 0.59}
[h264 @ 0x105e9fa40] mmco: unref short failure
[h264 @ 0x105e9fa40] mmco: unref short failure
[h264 @ 0x12a1c2c0] Missing reference picture, default is 65530
[h264 @ 0x105e9fa40] Missing reference picture, default is 65530
[h264 @ 0xca735980] mmco: unref short failure
[h264 @ 0xca735980] mmco: unref short failure
{'loss': 0.0, 'learning_rate': 7.741604710086778e-06, 'epoch': 0.59}
{'loss': 0.0, 'learning_rate': 7.736327531679933e-06, 'epoch': 0.59}
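
One quick way to narrow this down (a standalone PyTorch sketch, assuming LLaVA-style preprocessing masks prompt tokens with the ignore index -100; no Video-LLaVA code is used): check how the token-level cross-entropy behaves under label masking. If every label in a batch were masked away, for example because the answer gets truncated, the mean loss would come out as nan rather than 0.0, which helps separate "no supervised tokens left" from "the model really predicts every target".

# Sketch: behaviour of the causal-LM cross-entropy under -100 label masking.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab, seq = 32000, 8
logits = torch.randn(1, seq, vocab)

# Usual case: prompt tokens masked, answer tokens supervised -> positive loss.
labels = torch.randint(0, vocab, (1, seq))
labels[0, :4] = -100
print(F.cross_entropy(logits.view(-1, vocab), labels.view(-1), ignore_index=-100))

# Degenerate case: every token masked -> zero supervised tokens,
# the mean loss is 0/0 = nan, not 0.
labels_all_masked = torch.full((1, seq), -100)
print(F.cross_entropy(logits.view(-1, vocab), labels_all_masked.view(-1), ignore_index=-100))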
