
In multi-node training, pip downloads during environment initialization run at different speeds on different machines, so some workers time out while waiting for communication after entering the training process. How can the timeout be increased? #6575

Closed
1 task done
Mr-lonely0 opened this issue Jan 9, 2025 · 4 comments
Labels
solved This problem has been already solved

Comments

@Mr-lonely0

Reminder

  • I have read the README and searched the existing issues.

System Info

Not environment-related.

Reproduction

The config file used is as follows:

### model
model_name_or_path: /mnt/nas/user/zhaoxin/llm_pretrain/model_hub/Qwen2.5-72B-Instruct

### method
stage: sft
do_train: true
finetuning_type: lora
lora_target: all
deepspeed: examples/deepspeed/ds_z3_config.json

### dataset
# dataset: local_judge_data # change the training dataset
dataset: local_jinrong_data # change the training dataset
template: qwen
cutoff_len: 8192
# max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: /mnt/nas/user/zhaoxin/llm_pretrain/exp/466735/tmp/v2_exp_low          # change the output path
logging_steps: 10
# save_steps: 20
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-5
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000

### eval
val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 500

The launch command is as follows:

FORCE_TORCHRUN=1 NNODES={nnodes} NODE_RANK=$RANK MASTER_ADDR=$IP_ADDRESS MASTER_PORT=29500 llamafactory-cli train examples/train_lora/qwen25_lora_sft_ds3.yaml

The error message is as follows:

[E108 20:04:34.467642512 socket.cpp:957] [c10d] The client socket has timed out after 900s while trying to connect to (33.145.124.128, 29500).
Traceback (most recent call last):
  File "/opt/conda/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 255, in launch_agent
    result = agent.run()
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py", line 124, in wrapper
    result = f(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 680, in run
    result = self._invoke_run(role)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 829, in _invoke_run
    self._initialize_workers(self._worker_group)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py", line 124, in wrapper
    result = f(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 652, in _initialize_workers
    self._rendezvous(worker_group)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py", line 124, in wrapper
    result = f(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 489, in _rendezvous
    rdzv_info = spec.rdzv_handler.next_rendezvous()
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/static_tcp_rendezvous.py", line 66, in next_rendezvous
    self._store = TCPStore(  # type: ignore[call-arg]
torch.distributed.DistNetworkError: The client socket has timed out after 900s while trying to connect to (33.145.124.128, 29500).

I have read this part of the torch source code, and there is a timeout parameter there, but the key question is how to pass this parameter in from the outside without modifying the source code.
Could the author please help with this?
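
Editor's note: the ddp_timeout in the config above only extends the process-group timeout applied after rendezvous has succeeded (roughly speaking, Transformers forwards it to init_process_group), whereas the 900s failure in the traceback happens earlier, in torchrun's own rendezvous, when a node cannot reach MASTER_ADDR:MASTER_PORT in time. A minimal sketch of where the two timeouts apply, assuming the script is launched by torchrun (illustrative only, not LLaMA-Factory code):

# Illustrative sketch: which timeout applies where, assuming torchrun has already
# set MASTER_ADDR / MASTER_PORT / RANK / WORLD_SIZE in the environment.
from datetime import timedelta

import torch.distributed as dist

# Phase 1: torchrun rendezvous. Every node must connect to MASTER_ADDR:MASTER_PORT
# before the rendezvous timeout expires; this is the 900s limit in the traceback,
# and it is controlled by torchrun itself, not by the training arguments.

# Phase 2: process-group creation. Only here does ddp_timeout from the YAML take
# effect; it is passed along roughly like this:
dist.init_process_group(
    backend="nccl",
    timeout=timedelta(seconds=180000000),  # ddp_timeout from the config above
)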

Others

No response

@github-actions github-actions bot added the pending This problem is yet to be addressed label Jan 9, 2025
@Mr-lonely0
Author

I see that in cli.py the actual launch command is:

process = subprocess.run(
                (
                    "torchrun --nnodes {nnodes} --node_rank {node_rank} --nproc_per_node {nproc_per_node} "
                    "--master_addr {master_addr} --master_port {master_port} {file_name} {args}"
                )
                .format(
                    nnodes=os.getenv("NNODES", "1"),
                    node_rank=os.getenv("NODE_RANK", "0"),
                    nproc_per_node=os.getenv("NPROC_PER_NODE", str(get_device_count())),
                    master_addr=master_addr,
                    master_port=master_port,
                    file_name=launcher.__file__,
                    args=" ".join(sys.argv[1:]),
                )
                .split()
            )

Would adding the --rdzv_timeout 3600 argument to torchrun increase the timeout?
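
Editor's note: torchrun exposes a --rdzv_conf flag for extra rendezvous options, and torch's static TCP rendezvous handler (the static_tcp_rendezvous.py seen in the traceback) appears to read a timeout key from it; the exact key name and behavior should be verified against the installed torch version. A hedged sketch that bypasses llamafactory-cli and calls torchrun on the same launcher module, passing a larger rendezvous timeout:

# A sketch under assumptions: --rdzv_conf is a standard torchrun flag, and torch's
# static TCP rendezvous handler reads a "timeout" key from it (verify against the
# torch version in use). This bypasses `llamafactory-cli train` and invokes torchrun
# on the same launcher module that cli.py points it at.
import os
import subprocess
import sys

from llamafactory import launcher  # the entry file used by llamafactory-cli

command = (
    "torchrun --nnodes {nnodes} --node_rank {node_rank} --nproc_per_node {nproc} "
    "--master_addr {addr} --master_port {port} "
    "--rdzv_conf timeout=3600 "  # enlarge the rendezvous wait (assumed key name)
    "{file_name} {args}"
).format(
    nnodes=os.getenv("NNODES", "2"),
    node_rank=os.getenv("NODE_RANK", "0"),
    nproc=os.getenv("NPROC_PER_NODE", "8"),
    addr=os.getenv("MASTER_ADDR", "127.0.0.1"),
    port=os.getenv("MASTER_PORT", "29500"),
    file_name=launcher.__file__,
    args=" ".join(sys.argv[1:]),  # e.g. examples/train_lora/qwen25_lora_sft_ds3.yaml
)
subprocess.run(command.split(), check=True)

This mirrors the subprocess.run call quoted above, only with the extra --rdzv_conf option added to the command line.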

@hiyouga
Owner

hiyouga commented Jan 9, 2025

You can try that, or use ray: https://github.com/hiyouga/LLaMA-Factory/tree/main/examples#supervised-fine-tuning-with-ray-on-4-gpus

@hiyouga hiyouga closed this as completed Jan 9, 2025
@hiyouga hiyouga added solved This problem has been already solved and removed pending This problem is yet to be addressed labels Jan 9, 2025
@Mr-lonely0
Author

Can ray be used for multi-node, multi-GPU training? Is there a similar config file for it?

@hiyouga
Owner

hiyouga commented Jan 10, 2025

You can check the ray documentation.
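
Editor's note: Ray itself is designed for multi-node, multi-GPU scheduling. Each machine first joins a single Ray cluster (ray start --head on one node, ray start --address=... on the others), and the trainer then simply requests the total number of GPU workers. A generic Ray Train sketch, not LLaMA-Factory's own ray config format (see the linked example and the ray docs for the actual YAML keys):

# Generic Ray Train sketch (not LLaMA-Factory's config format): request 16 GPU
# workers spread across whatever nodes have joined the Ray cluster.
import ray
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_loop_per_worker(config):
    # the per-worker training function would go here
    pass

ray.init(address="auto")  # connect to the running multi-node cluster
trainer = TorchTrainer(
    train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=16, use_gpu=True),
)
trainer.fit()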
