Two servers with 4 GPUs each: training fails #115

alexiycv · 2022-08-16T03:29:49Z

server not ready, wait 3 sec to retry...
not ready endpoints:['10.10.11.51:6070', '10.10.11.51:6071', '10.10.11.51:6072', '10.10.11.51:6073']
server not ready, wait 3 sec to retry...
not ready endpoints:['10.10.11.51:6070', '10.10.11.51:6071', '10.10.11.51:6072', '10.10.11.51:6073']
server not ready, wait 3 sec to retry...
not ready endpoints:['10.10.11.51:6070', '10.10.11.51:6071', '10.10.11.51:6072', '10.10.11.51:6073']
server not ready, wait 3 sec to retry...
not ready endpoints:['10.10.11.51:6070', '10.10.11.51:6071', '10.10.11.51:6072', '10.10.11.51:6073']
server not ready, wait 3 sec to retry...
not ready endpoints:['10.10.11.51:6070', '10.10.11.51:6071', '10.10.11.51:6072', '10.10.11.51:6073']
server not ready, wait 3 sec to retry...
not ready endpoints:['10.10.11.51:6070', '10.10.11.51:6071', '10.10.11.51:6072', '10.10.11.51:6073'

GuoxiaWang · 2022-08-16T03:35:46Z

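This is not an error. The launcher is most likely waiting for 10.10.11.51 to respond.

What command did you run on the two machines? Could you paste it?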

alexiycv · 2022-08-16T07:13:40Z

TRAINER_IP_LIST=10.10.11.50,10.10.11.51
CUDA_VISIBLE_DEVICES=0,1,2,3
python -m paddle.distributed.launch --ips=$TRAINER_IP_LIST --gpus=$CUDA_VISIBLE_DEVICES tools/train.py \
    --config_file configs/ms1mv3_r50.py \
    --is_static False \
    --backbone FresResNet50 \
    --classifier LargeScaleClassifier \
    --embedding_size 512 \
    --model_parallel True \
    --dropout 0.0 \
    --sample_ratio 0.1 \
    --loss ArcFace \
    --batch_size 128 \
    --dataset MS1M_v3 \
    --num_classes 93431 \
    --data_dir MS1M_v3/ \
    --label_file MS1M_v3/label.txt \
    --is_bin False \
    --log_interval_step 100 \
    --validation_interval_step 2000 \
    --fp16 True \
    --use_dynamic_loss_scaling True \
    --init_loss_scaling 27648.0 \
    --num_workers 8 \
    --train_unit 'epoch' \
    --warmup_num 0 \
    --train_num 25 \
    --decay_boundaries "10,16,22" \
    --output MS1M_v3_arcface_dynamic_0.1_NHWC_FP16

GuoxiaWang · 2022-08-16T07:39:46Z

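Are these two machines in the same cluster environment? Have you run multi-node training jobs before? The command looks fine. Could it be a network connectivity problem? Are the IP addresses the actual addresses in your environment?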
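
For anyone hitting the same log: a successful ping does not show that the training ports themselves are reachable. Below is a minimal sketch, not part of the project, that probes the endpoints listed in the log over TCP. The helper names `is_port_open` and `not_ready` are hypothetical. Note that a port only accepts connections once a process is listening on it, so a failed probe can mean either a blocked network path or that nothing has been started on that machine yet.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False


def not_ready(endpoints, timeout: float = 1.0):
    """Return the 'ip:port' strings that do not accept a TCP connection."""
    pending = []
    for endpoint in endpoints:
        host, port = endpoint.rsplit(":", 1)
        if not is_port_open(host, int(port), timeout):
            pending.append(endpoint)
    return pending
```

Running `not_ready(['10.10.11.51:6070', '10.10.11.51:6071', '10.10.11.51:6072', '10.10.11.51:6073'])` on 10.10.11.50 would list the endpoints that are still unreachable from that machine.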

alexiycv · 2022-08-16T07:54:50Z

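> Are these two machines in the same cluster environment? Have you run multi-node training jobs before? The command looks fine. Could it be a network connectivity problem? Are the IP addresses the actual addresses in your environment?

The network is reachable. I have not run multi-node training jobs before.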

GuoxiaWang · 2022-08-16T08:26:32Z

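Are you sure you ran the launch command above on each of the two machines separately?

For multi-node training, the launch command has to be executed on every machine.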
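
To illustrate why every machine has to run the launcher: the log lists one endpoint per GPU on the remote node. The sketch below reconstructs that layout. It assumes, based only on the log in this thread, that ports are consecutive and start at 6070; `expected_endpoints` is a hypothetical helper, not a Paddle API.

```python
def expected_endpoints(ips, gpus_per_node, base_port=6070):
    """Map each node IP to its per-GPU 'ip:port' endpoints.

    Assumption: one consecutive port per GPU starting at base_port,
    as seen in the log (6070..6073); not taken from Paddle's docs.
    """
    return {
        ip: [f"{ip}:{base_port + i}" for i in range(gpus_per_node)]
        for ip in ips
    }


layout = expected_endpoints(["10.10.11.50", "10.10.11.51"], gpus_per_node=4)
for ip, endpoints in layout.items():
    print(ip, endpoints)
```

With two IPs and four GPUs each this gives eight endpoints. The four under 10.10.11.51 match the "not ready" list in the log, which is consistent with nothing having been launched on that machine yet.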

alexiycv · 2022-08-16T08:43:15Z

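> Are you sure you ran the launch command above on each of the two machines separately?
>
> For multi-node training, the launch command has to be executed on every machine.

Oh, I see. I'll give it a try.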

GuoxiaWang mentioned this issue Aug 16, 2022

Training error #114

Open
