Two servers with 4 GPUs each: training fails #115

Open
alexiycv opened this issue Aug 16, 2022 · 6 comments

@alexiycv

server not ready, wait 3 sec to retry...
not ready endpoints:['10.10.11.51:6070', '10.10.11.51:6071', '10.10.11.51:6072', '10.10.11.51:6073']
server not ready, wait 3 sec to retry...
not ready endpoints:['10.10.11.51:6070', '10.10.11.51:6071', '10.10.11.51:6072', '10.10.11.51:6073']
server not ready, wait 3 sec to retry...
not ready endpoints:['10.10.11.51:6070', '10.10.11.51:6071', '10.10.11.51:6072', '10.10.11.51:6073']
server not ready, wait 3 sec to retry...
not ready endpoints:['10.10.11.51:6070', '10.10.11.51:6071', '10.10.11.51:6072', '10.10.11.51:6073']
server not ready, wait 3 sec to retry...
not ready endpoints:['10.10.11.51:6070', '10.10.11.51:6071', '10.10.11.51:6072', '10.10.11.51:6073']
server not ready, wait 3 sec to retry...
not ready endpoints:['10.10.11.51:6070', '10.10.11.51:6071', '10.10.11.51:6072', '10.10.11.51:6073'

@GuoxiaWang
Collaborator

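This is not an error. The launcher is most likely still waiting for 10.10.11.51 to respond.

What command did you run on the two machines? Could you paste it here?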
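To see at a glance which machine the launcher is waiting on, the repeated log line can be parsed and grouped by host. The following is a minimal sketch; the helper names `pending_endpoints` and `group_by_host` are hypothetical and are not part of Paddle. It assumes the text after `not ready endpoints:` is a complete Python list literal, as in the first log lines above.

```python
import ast
from collections import defaultdict


def pending_endpoints(log_line):
    """Extract the endpoint list from a 'not ready endpoints:[...]' log line."""
    prefix = "not ready endpoints:"
    if not log_line.startswith(prefix):
        return []
    # The remainder of the line is a Python list literal of 'ip:port' strings.
    return ast.literal_eval(log_line[len(prefix):].strip())


def group_by_host(endpoints):
    """Map each host to the sorted list of its ports that are still pending."""
    hosts = defaultdict(list)
    for endpoint in endpoints:
        host, port = endpoint.rsplit(":", 1)
        hosts[host].append(int(port))
    return {host: sorted(ports) for host, ports in hosts.items()}


line = ("not ready endpoints:['10.10.11.51:6070', '10.10.11.51:6071', "
        "'10.10.11.51:6072', '10.10.11.51:6073']")
print(group_by_host(pending_endpoints(line)))
# prints {'10.10.11.51': [6070, 6071, 6072, 6073]}
```

Every pending endpoint is on 10.10.11.51, which suggests that no trainer process on that machine is listening yet.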

@alexiycv
Author

TRAINER_IP_LIST=10.10.11.50,10.10.11.51
CUDA_VISIBLE_DEVICES=0,1,2,3
python -m paddle.distributed.launch --ips=$TRAINER_IP_LIST --gpus=$CUDA_VISIBLE_DEVICES tools/train.py \
    --config_file configs/ms1mv3_r50.py \
    --is_static False \
    --backbone FresResNet50 \
    --classifier LargeScaleClassifier \
    --embedding_size 512 \
    --model_parallel True \
    --dropout 0.0 \
    --sample_ratio 0.1 \
    --loss ArcFace \
    --batch_size 128 \
    --dataset MS1M_v3 \
    --num_classes 93431 \
    --data_dir MS1M_v3/ \
    --label_file MS1M_v3/label.txt \
    --is_bin False \
    --log_interval_step 100 \
    --validation_interval_step 2000 \
    --fp16 True \
    --use_dynamic_loss_scaling True \
    --init_loss_scaling 27648.0 \
    --num_workers 8 \
    --train_unit 'epoch' \
    --warmup_num 0 \
    --train_num 25 \
    --decay_boundaries "10,16,22" \
    --output MS1M_v3_arcface_dynamic_0.1_NHWC_FP16

@GuoxiaWang
Collaborator

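Are these two machines in the same cluster environment? Have you run multi-node training jobs on them before? The command looks fine to me. Could it be a network connectivity problem? Are those IP addresses the actual addresses in your environment?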

@alexiycv
Author

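> Are these two machines in the same cluster environment? Have you run multi-node training jobs on them before? The command looks fine to me. Could it be a network connectivity problem? Are those IP addresses the actual addresses in your environment?

The network is reachable. I have not run multi-node training jobs before.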
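A host can be reachable on the network without anything listening on the trainer ports. A minimal sketch for probing a specific port over TCP follows; `port_is_open` is a hypothetical helper, not a Paddle API, and the one-second timeout is an arbitrary choice.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable network, and timeouts.
        return False


# Example usage from 10.10.11.50, with the ports taken from the log above:
# for port in (6070, 6071, 6072, 6073):
#     print(port, port_is_open("10.10.11.51", port))
```

If every port reports `False` while the host still answers other traffic, the likely explanation is that no process has been started on that machine, rather than a routing problem. A firewall blocking the ports would look the same from this side.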

@GuoxiaWang
Collaborator

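Are you sure you ran the launch command above on each of the two machines?

For multi-node training, the launch command has to be run on every machine.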
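The reason this matters can be sketched as follows. The sketch assumes one trainer endpoint per GPU, with ports numbered consecutively from 6070 on each node, which is what the log above shows for 10.10.11.51. `trainer_endpoints` is a hypothetical helper for illustration, not Paddle's own code.

```python
def trainer_endpoints(ips, gpus_per_node, start_port=6070):
    """Return {ip: ['ip:port', ...]} with one endpoint per GPU on each node."""
    return {
        ip: [f"{ip}:{start_port + i}" for i in range(gpus_per_node)]
        for ip in ips
    }


layout = trainer_endpoints(["10.10.11.50", "10.10.11.51"], gpus_per_node=4)

# Assumption: the command was run only on this machine.
local_ip = "10.10.11.50"

# Endpoints that only exist once the command is also started on the other node.
remote = [ep for ip, eps in layout.items() if ip != local_ip for ep in eps]
print(remote)
# prints ['10.10.11.51:6070', '10.10.11.51:6071', '10.10.11.51:6072', '10.10.11.51:6073']
```

Under these assumptions the remote list matches the "not ready endpoints" in the log exactly: the processes started on 10.10.11.50 keep waiting for four peers that only come up when the same command is run on 10.10.11.51.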

@alexiycv
Author

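> Are you sure you ran the launch command above on each of the two machines?
>
> For multi-node training, the launch command has to be run on every machine.

Oh, I see. I'll give that a try.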
