Multi-GPU train_full of Pixtral-12B fails after some time: torch.distributed.elastic.multiprocessing.errors.ChildFailedError #6590

Felixvillas opened this issue Jan 10, 2025 · 2 comments
Labels: bug (Something isn't working), pending (This problem is yet to be addressed)

Comments


Felixvillas commented Jan 10, 2025

Reminder

  • I have read the above rules and searched the existing issues.

System Info

  • llamafactory version: 0.9.2.dev0
  • Platform: Linux-4.18.0-425.10.1.el8_7.x86_64-x86_64-with-glibc2.28
  • Python version: 3.10.15
  • PyTorch version: 2.5.1 (GPU)
  • Transformers version: 4.46.2
  • Datasets version: 3.1.0
  • Accelerate version: 1.0.1
  • PEFT version: 0.12.0
  • TRL version: 0.9.6
  • GPU type: NVIDIA L40S
  • DeepSpeed version: 0.15.4
  • Bitsandbytes version: 0.45.0
  • vLLM version: 0.6.6.post1

Reproduction

Training on 8× L40S GPUs with the following command:
DISABLE_VERSION_CHECK=1 llamafactory-cli train examples/train_full/pixtral_full_sft.yaml

The content of pixtral_full_sft.yaml is as follows:

### model
model_name_or_path: /lustre/S/tianzikang/LLMs/mistral-community-pixtral-12b/
trust_remote_code: true

### method
stage: sft
do_train: true
finetuning_type: full
deepspeed: examples/deepspeed/ds_z3_config.json  # choices: [ds_z0_config.json, ds_z2_config.json, ds_z3_config.json]

### dataset
dataset: spatial_intelligence_sft_generated_1000 # spatial_intelligence_sft_generated
template: pixtral
cutoff_len: 8192 # 2048 and 4096 cause a mismatch between image features and image tokens
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: ../workdir/saves/Pixtral-12B/full/sft_generated-1000_10_epoch/
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 2
learning_rate: 1.0e-5
num_train_epochs: 10.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000

### eval
val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 500

After training for a while, the following error occurs:

1%|| 54/3675 [20:05<22:11:33, 22.06s/it]
  1%|| 55/3675 [20:27<22:09:39, 22.04s/it]
  2%|| 56/3675 [20:49<22:07:20, 22.01s/it]
  2%|| 57/3675 [21:10<22:06:01, 21.99s/it]
W0109 19:12:09.697000 546635 /nfs_global/S/tianzikang/rocky/miniconda3/envs/omnigibson/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 546637 closing signal SIGTERM
W0109 19:12:09.703000 546635 /nfs_global/S/tianzikang/rocky/miniconda3/envs/omnigibson/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 546638 closing signal SIGTERM
W0109 19:12:09.704000 546635 /nfs_global/S/tianzikang/rocky/miniconda3/envs/omnigibson/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 546639 closing signal SIGTERM
W0109 19:12:09.705000 546635 /nfs_global/S/tianzikang/rocky/miniconda3/envs/omnigibson/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 546640 closing signal SIGTERM
W0109 19:12:09.706000 546635 /nfs_global/S/tianzikang/rocky/miniconda3/envs/omnigibson/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 546641 closing signal SIGTERM
W0109 19:12:09.706000 546635 /nfs_global/S/tianzikang/rocky/miniconda3/envs/omnigibson/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 546643 closing signal SIGTERM
W0109 19:12:09.707000 546635 /nfs_global/S/tianzikang/rocky/miniconda3/envs/omnigibson/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 546644 closing signal SIGTERM
E0109 19:12:10.427000 546635 /nfs_global/S/tianzikang/rocky/miniconda3/envs/omnigibson/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: -7) local_rank: 5 (pid: 546642) of binary: /lustre/S/tianzikang/rocky/miniconda3/envs/omnigibson/bin/python
Traceback (most recent call last):
  File "/lustre/S/tianzikang/rocky/miniconda3/envs/omnigibson/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.5.1', 'console_scripts', 'torchrun')())
  File "/lustre/S/tianzikang/rocky/miniconda3/envs/omnigibson/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
  File "/lustre/S/tianzikang/rocky/miniconda3/envs/omnigibson/lib/python3.10/site-packages/torch/distributed/run.py", line 919, in main
    run(args)
  File "/lustre/S/tianzikang/rocky/miniconda3/envs/omnigibson/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run
    elastic_launch(
  File "/lustre/S/tianzikang/rocky/miniconda3/envs/omnigibson/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/lustre/S/tianzikang/rocky/miniconda3/envs/omnigibson/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/nfs_global/S/tianzikang/rocky/projects/spatial_intelligence/LLaMA-Factory/src/llamafactory/launcher.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-01-09_19:12:09
  host      : r8l40s-a05.ib.future.cn
  rank      : 5 (local_rank: 5)
  exitcode  : -7 (pid: 546642)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 546642
============================================================

Sometimes the error occurs during ***** Running Evaluation *****:

 66%|██████▌   | 500/760 [3:05:21<1:36:57, 22.37s/it]
[INFO|trainer.py:4128] 2025-01-09 19:38:13,614 >> 
***** Running Evaluation *****
[INFO|trainer.py:4130] 2025-01-09 19:38:13,614 >>   Num examples = 136
[INFO|trainer.py:4133] 2025-01-09 19:38:13,614 >>   Batch size = 1


  0%|          | 0/17 [00:00<?, ?it/s]

 12%|█▏        | 2/17 [00:03<00:26,  1.79s/it]
W0109 19:38:21.682000 1513405 /nfs_global/S/tianzikang/rocky/miniconda3/envs/omnigibson/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1513407 closing signal SIGTERM
W0109 19:38:21.693000 1513405 /nfs_global/S/tianzikang/rocky/miniconda3/envs/omnigibson/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1513408 closing signal SIGTERM
W0109 19:38:21.694000 1513405 /nfs_global/S/tianzikang/rocky/miniconda3/envs/omnigibson/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1513409 closing signal SIGTERM
W0109 19:38:21.695000 1513405 /nfs_global/S/tianzikang/rocky/miniconda3/envs/omnigibson/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1513411 closing signal SIGTERM
W0109 19:38:21.696000 1513405 /nfs_global/S/tianzikang/rocky/miniconda3/envs/omnigibson/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1513412 closing signal SIGTERM
W0109 19:38:21.697000 1513405 /nfs_global/S/tianzikang/rocky/miniconda3/envs/omnigibson/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1513413 closing signal SIGTERM
W0109 19:38:21.698000 1513405 /nfs_global/S/tianzikang/rocky/miniconda3/envs/omnigibson/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1513414 closing signal SIGTERM
E0109 19:38:22.528000 1513405 /nfs_global/S/tianzikang/rocky/miniconda3/envs/omnigibson/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: -7) local_rank: 3 (pid: 1513410) of binary: /lustre/S/tianzikang/rocky/miniconda3/envs/omnigibson/bin/python
Traceback (most recent call last):
  File "/lustre/S/tianzikang/rocky/miniconda3/envs/omnigibson/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.5.1', 'console_scripts', 'torchrun')())
  File "/lustre/S/tianzikang/rocky/miniconda3/envs/omnigibson/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
  File "/lustre/S/tianzikang/rocky/miniconda3/envs/omnigibson/lib/python3.10/site-packages/torch/distributed/run.py", line 919, in main
    run(args)
  File "/lustre/S/tianzikang/rocky/miniconda3/envs/omnigibson/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run
    elastic_launch(
  File "/lustre/S/tianzikang/rocky/miniconda3/envs/omnigibson/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/lustre/S/tianzikang/rocky/miniconda3/envs/omnigibson/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/nfs_global/S/tianzikang/rocky/projects/spatial_intelligence/LLaMA-Factory/src/llamafactory/launcher.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-01-09_19:38:21
  host      : r8l40s-a01
  rank      : 3 (local_rank: 3)
  exitcode  : -7 (pid: 1513410)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 1513410
============================================================

This error is not always reproducible: sometimes it appears and sometimes it doesn't, but it happens fairly often.

Others

No response

@Felixvillas added the bug and pending labels on Jan 10, 2025
hiyouga (Owner) commented Jan 10, 2025

Try removing the eval-related parameters.
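
For reference, a minimal sketch of that change (assuming the rest of the pixtral_full_sft.yaml posted above stays as-is) would be to drop the validation split and the whole ### eval block, so no evaluation pass runs during training:

### eval  (remove or comment out all of these lines)
# val_size: 0.1
# per_device_eval_batch_size: 1
# eval_strategy: steps
# eval_steps: 500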

jinzhuoran commented

> Try removing the eval-related parameters.

What is the rationale behind this? I'm having a similar problem as well.
