
In multi-node training, pip downloads during environment initialization run at different speeds on different machines, so some workers time out while waiting for communication after entering the training process. How can the timeout be increased? #6575

Closed
1 task done
Mr-lonely0 opened this issue Jan 9, 2025 · 4 comments
Labels
solved This problem has been already solved

Comments

@Mr-lonely0

Reminder

  • I have read the README and searched the existing issues.

System Info

Not environment-related.

Reproduction

The config file used is as follows:

### model
model_name_or_path: /mnt/nas/user/zhaoxin/llm_pretrain/model_hub/Qwen2.5-72B-Instruct

### method
stage: sft
do_train: true
finetuning_type: lora
lora_target: all
deepspeed: examples/deepspeed/ds_z3_config.json

### dataset
# dataset: local_judge_data # change the training dataset
dataset: local_jinrong_data # change the training dataset
template: qwen
cutoff_len: 8192
# max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: /mnt/nas/user/zhaoxin/llm_pretrain/exp/466735/tmp/v2_exp_low          # change the output path
logging_steps: 10
# save_steps: 20
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-5
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000

### eval
val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 500

The launch command is as follows:

FORCE_TORCHRUN=1 NNODES={nnodes} NODE_RANK=$RANK MASTER_ADDR=$IP_ADDRESS MASTER_PORT=29500 llamafactory-cli train examples/train_lora/qwen25_lora_sft_ds3.yaml

The error message is as follows:

[E108 20:04:34.467642512 socket.cpp:957] [c10d] The client socket has timed out after 900s while trying to connect to (33.145.124.128, 29500).
Traceback (most recent call last):
  File "/opt/conda/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 255, in launch_agent
    result = agent.run()
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py", line 124, in wrapper
    result = f(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 680, in run
    result = self._invoke_run(role)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 829, in _invoke_run
    self._initialize_workers(self._worker_group)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py", line 124, in wrapper
    result = f(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 652, in _initialize_workers
    self._rendezvous(worker_group)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py", line 124, in wrapper
    result = f(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 489, in _rendezvous
    rdzv_info = spec.rdzv_handler.next_rendezvous()
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/static_tcp_rendezvous.py", line 66, in next_rendezvous
    self._store = TCPStore(  # type: ignore[call-arg]
torch.distributed.DistNetworkError: The client socket has timed out after 900s while trying to connect to (33.145.124.128, 29500).

I have read this part of the torch source code, and there is a timeout parameter there, but the key question is how to pass this parameter in from the outside without modifying the source code.
Could the author please help with this?
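
Editor's note: the ddp_timeout in the config above only extends the process-group timeout applied after rendezvous has succeeded (roughly speaking, Transformers forwards it to init_process_group), whereas the 900s failure in the traceback happens earlier, in torchrun's own rendezvous, when a node cannot reach MASTER_ADDR:MASTER_PORT in time. A minimal sketch of where the two timeouts apply, assuming the script is launched by torchrun (illustrative only, not LLaMA-Factory code):

# Illustrative sketch: which timeout applies where, assuming torchrun has already
# set MASTER_ADDR / MASTER_PORT / RANK / WORLD_SIZE in the environment.
from datetime import timedelta

import torch.distributed as dist

# Phase 1: torchrun rendezvous. Every node must connect to MASTER_ADDR:MASTER_PORT
# before the rendezvous timeout expires; this is the 900s limit in the traceback,
# and it is controlled by torchrun itself, not by the training arguments.

# Phase 2: process-group creation. Only here does ddp_timeout from the YAML take
# effect; it is passed along roughly like this:
dist.init_process_group(
    backend="nccl",
    timeout=timedelta(seconds=180000000),  # ddp_timeout from the config above
)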

Others

No response

@github-actions github-actions bot added the pending This problem is yet to be addressed label Jan 9, 2025
@Mr-lonely0
Author

I see that in cli.py the actual launch command is:

process = subprocess.run(
                (
                    "torchrun --nnodes {nnodes} --node_rank {node_rank} --nproc_per_node {nproc_per_node} "
                    "--master_addr {master_addr} --master_port {master_port} {file_name} {args}"
                )
                .format(
                    nnodes=os.getenv("NNODES", "1"),
                    node_rank=os.getenv("NODE_RANK", "0"),
                    nproc_per_node=os.getenv("NPROC_PER_NODE", str(get_device_count())),
                    master_addr=master_addr,
                    master_port=master_port,
                    file_name=launcher.__file__,
                    args=" ".join(sys.argv[1:]),
                )
                .split()
            )

Would adding the --rdzv_timeout 3600 argument to torchrun increase the timeout?
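
Editor's note: torchrun exposes a --rdzv_conf flag for extra rendezvous options, and torch's static TCP rendezvous handler (the static_tcp_rendezvous.py seen in the traceback) appears to read a timeout key from it; the exact key name and behavior should be verified against the installed torch version. A hedged sketch that bypasses llamafactory-cli and calls torchrun on the same launcher module, passing a larger rendezvous timeout:

# A sketch under assumptions: --rdzv_conf is a standard torchrun flag, and torch's
# static TCP rendezvous handler reads a "timeout" key from it (verify against the
# torch version in use). This bypasses `llamafactory-cli train` and invokes torchrun
# on the same launcher module that cli.py points it at.
import os
import subprocess
import sys

from llamafactory import launcher  # the entry file used by llamafactory-cli

command = (
    "torchrun --nnodes {nnodes} --node_rank {node_rank} --nproc_per_node {nproc} "
    "--master_addr {addr} --master_port {port} "
    "--rdzv_conf timeout=3600 "  # enlarge the rendezvous wait (assumed key name)
    "{file_name} {args}"
).format(
    nnodes=os.getenv("NNODES", "2"),
    node_rank=os.getenv("NODE_RANK", "0"),
    nproc=os.getenv("NPROC_PER_NODE", "8"),
    addr=os.getenv("MASTER_ADDR", "127.0.0.1"),
    port=os.getenv("MASTER_PORT", "29500"),
    file_name=launcher.__file__,
    args=" ".join(sys.argv[1:]),  # e.g. examples/train_lora/qwen25_lora_sft_ds3.yaml
)
subprocess.run(command.split(), check=True)

This mirrors the subprocess.run call quoted above, only with the extra --rdzv_conf option added to the command line.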

@hiyouga
Owner

hiyouga commented Jan 9, 2025

You can try that, or use ray: https://github.com/hiyouga/LLaMA-Factory/tree/main/examples#supervised-fine-tuning-with-ray-on-4-gpus

@hiyouga hiyouga closed this as completed Jan 9, 2025
@hiyouga hiyouga added solved This problem has been already solved and removed pending This problem is yet to be addressed labels Jan 9, 2025
@Mr-lonely0
Author

Can ray be used for multi-node, multi-GPU training? Is there a similar config file for it?

@hiyouga
Owner

hiyouga commented Jan 10, 2025

You can check the ray documentation.
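
Editor's note: Ray itself is designed for multi-node, multi-GPU scheduling. Each machine first joins a single Ray cluster (ray start --head on one node, ray start --address=... on the others), and the trainer then simply requests the total number of GPU workers. A generic Ray Train sketch, not LLaMA-Factory's own ray config format (see the linked example and the ray docs for the actual YAML keys):

# Generic Ray Train sketch (not LLaMA-Factory's config format): request 16 GPU
# workers spread across whatever nodes have joined the Ray cluster.
import ray
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_loop_per_worker(config):
    # the per-worker training function would go here
    pass

ray.init(address="auto")  # connect to the running multi-node cluster
trainer = TorchTrainer(
    train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=16, use_gpu=True),
)
trainer.fit()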
