llamafactory
Training on 8× L40S GPUs. Training command:

DISABLE_VERSION_CHECK=1 llamafactory-cli train examples/train_full/pixtral_full_sft.yaml
The contents of pixtral_full_sft.yaml are as follows:
### model
model_name_or_path: /lustre/S/tianzikang/LLMs/mistral-community-pixtral-12b/
trust_remote_code: true

### method
stage: sft
do_train: true
finetuning_type: full
deepspeed: examples/deepspeed/ds_z3_config.json  # choices: [ds_z0_config.json, ds_z2_config.json, ds_z3_config.json]

### dataset
dataset: spatial_intelligence_sft_generated_1000  # spatial_intelligence_sft_generated
template: pixtral
cutoff_len: 8192  # 2048 and 4096 will make image features and image tokens not match
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: ../workdir/saves/Pixtral-12B/full/sft_generated-1000_10_epoch/
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 2
learning_rate: 1.0e-5
num_train_epochs: 10.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000

### eval
val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 500
After training for a while, the following error occurs:
  1%|▏ | 54/3675 [20:05<22:11:33, 22.06s/it]
  1%|▏ | 55/3675 [20:27<22:09:39, 22.04s/it]
  2%|▏ | 56/3675 [20:49<22:07:20, 22.01s/it]
  2%|▏ | 57/3675 [21:10<22:06:01, 21.99s/it]
W0109 19:12:09.697000 546635 /nfs_global/S/tianzikang/rocky/miniconda3/envs/omnigibson/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 546637 closing signal SIGTERM
W0109 19:12:09.703000 546635 /nfs_global/S/tianzikang/rocky/miniconda3/envs/omnigibson/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 546638 closing signal SIGTERM
W0109 19:12:09.704000 546635 /nfs_global/S/tianzikang/rocky/miniconda3/envs/omnigibson/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 546639 closing signal SIGTERM
W0109 19:12:09.705000 546635 /nfs_global/S/tianzikang/rocky/miniconda3/envs/omnigibson/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 546640 closing signal SIGTERM
W0109 19:12:09.706000 546635 /nfs_global/S/tianzikang/rocky/miniconda3/envs/omnigibson/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 546641 closing signal SIGTERM
W0109 19:12:09.706000 546635 /nfs_global/S/tianzikang/rocky/miniconda3/envs/omnigibson/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 546643 closing signal SIGTERM
W0109 19:12:09.707000 546635 /nfs_global/S/tianzikang/rocky/miniconda3/envs/omnigibson/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 546644 closing signal SIGTERM
E0109 19:12:10.427000 546635 /nfs_global/S/tianzikang/rocky/miniconda3/envs/omnigibson/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: -7) local_rank: 5 (pid: 546642) of binary: /lustre/S/tianzikang/rocky/miniconda3/envs/omnigibson/bin/python
Traceback (most recent call last):
  File "/lustre/S/tianzikang/rocky/miniconda3/envs/omnigibson/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.5.1', 'console_scripts', 'torchrun')())
  File "/lustre/S/tianzikang/rocky/miniconda3/envs/omnigibson/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
  File "/lustre/S/tianzikang/rocky/miniconda3/envs/omnigibson/lib/python3.10/site-packages/torch/distributed/run.py", line 919, in main
    run(args)
  File "/lustre/S/tianzikang/rocky/miniconda3/envs/omnigibson/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run
    elastic_launch(
  File "/lustre/S/tianzikang/rocky/miniconda3/envs/omnigibson/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/lustre/S/tianzikang/rocky/miniconda3/envs/omnigibson/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/nfs_global/S/tianzikang/rocky/projects/spatial_intelligence/LLaMA-Factory/src/llamafactory/launcher.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time       : 2025-01-09_19:12:09
  host       : r8l40s-a05.ib.future.cn
  rank       : 5 (local_rank: 5)
  exitcode   : -7 (pid: 546642)
  error_file : <N/A>
  traceback  : Signal 7 (SIGBUS) received by PID 546642
============================================================
Sometimes the error occurs during ***** Running Evaluation *****:
 66%|██████▌ | 500/760 [3:05:21<1:36:57, 22.37s/it]
[INFO|trainer.py:4128] 2025-01-09 19:38:13,614 >> ***** Running Evaluation *****
[INFO|trainer.py:4130] 2025-01-09 19:38:13,614 >> Num examples = 136
[INFO|trainer.py:4133] 2025-01-09 19:38:13,614 >> Batch size = 1
  0%|          | 0/17 [00:00<?, ?it/s]
 12%|█▏ | 2/17 [00:03<00:26, 1.79s/it]
W0109 19:38:21.682000 1513405 /nfs_global/S/tianzikang/rocky/miniconda3/envs/omnigibson/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1513407 closing signal SIGTERM
W0109 19:38:21.693000 1513405 /nfs_global/S/tianzikang/rocky/miniconda3/envs/omnigibson/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1513408 closing signal SIGTERM
W0109 19:38:21.694000 1513405 /nfs_global/S/tianzikang/rocky/miniconda3/envs/omnigibson/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1513409 closing signal SIGTERM
W0109 19:38:21.695000 1513405 /nfs_global/S/tianzikang/rocky/miniconda3/envs/omnigibson/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1513411 closing signal SIGTERM
W0109 19:38:21.696000 1513405 /nfs_global/S/tianzikang/rocky/miniconda3/envs/omnigibson/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1513412 closing signal SIGTERM
W0109 19:38:21.697000 1513405 /nfs_global/S/tianzikang/rocky/miniconda3/envs/omnigibson/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1513413 closing signal SIGTERM
W0109 19:38:21.698000 1513405 /nfs_global/S/tianzikang/rocky/miniconda3/envs/omnigibson/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1513414 closing signal SIGTERM
E0109 19:38:22.528000 1513405 /nfs_global/S/tianzikang/rocky/miniconda3/envs/omnigibson/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: -7) local_rank: 3 (pid: 1513410) of binary: /lustre/S/tianzikang/rocky/miniconda3/envs/omnigibson/bin/python
Traceback (most recent call last):
  File "/lustre/S/tianzikang/rocky/miniconda3/envs/omnigibson/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.5.1', 'console_scripts', 'torchrun')())
  File "/lustre/S/tianzikang/rocky/miniconda3/envs/omnigibson/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
  File "/lustre/S/tianzikang/rocky/miniconda3/envs/omnigibson/lib/python3.10/site-packages/torch/distributed/run.py", line 919, in main
    run(args)
  File "/lustre/S/tianzikang/rocky/miniconda3/envs/omnigibson/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run
    elastic_launch(
  File "/lustre/S/tianzikang/rocky/miniconda3/envs/omnigibson/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/lustre/S/tianzikang/rocky/miniconda3/envs/omnigibson/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/nfs_global/S/tianzikang/rocky/projects/spatial_intelligence/LLaMA-Factory/src/llamafactory/launcher.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time       : 2025-01-09_19:38:21
  host       : r8l40s-a01
  rank       : 3 (local_rank: 3)
  exitcode   : -7 (pid: 1513410)
  error_file : <N/A>
  traceback  : Signal 7 (SIGBUS) received by PID 1513410
============================================================
This error is not always reproducible; sometimes it occurs and sometimes it doesn't, but it happens fairly often.
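For context: exit code -7 means the worker died from SIGBUS (Signal 7). In distributed PyTorch training this is often (though not always) caused by exhausting shared memory at /dev/shm, which DataLoader workers and NCCL use for inter-process buffers; sporadic failures like the one above fit that pattern. A quick check on each node (assuming a standard Linux host; the 16G remount size below is just an illustrative value):

```shell
# Show the size and current usage of the shared-memory tmpfs;
# a nearly full /dev/shm is a common cause of sporadic SIGBUS crashes.
df -h /dev/shm

# If it is small (e.g. 64M inside a container), it can be remounted larger:
# sudo mount -o remount,size=16G /dev/shm
```

This does not confirm the root cause, but it is cheap to rule out before changing the training config.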
Try removing the eval-related parameters.
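For reference, that suggestion amounts to deleting (or commenting out) the final section of pixtral_full_sft.yaml so that no mid-training evaluation is triggered. This is a sketch of the suggested change, not a verified fix:

```yaml
### eval  # delete or comment out this whole section
# val_size: 0.1
# per_device_eval_batch_size: 1
# eval_strategy: steps
# eval_steps: 500
```

With these removed, the trainer never enters the evaluation loop, which is where the crash sometimes appears.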
What's the reasoning behind that? I'm having a similar problem.
System Info

llamafactory version: 0.9.2.dev0