about test results #230

Closed
LittlePotatoChip opened this issue Dec 9, 2021 · 19 comments

@LittlePotatoChip

for pp:
python -m torch.distributed.launch --nproc_per_node=2 ./tools/dist_test.py configs/nusc/pp/nusc_centerpoint_pp_02voxel_two_pfn_10sweep.py --work_dir work_dirs/nusc_centerpoint_pp_02voxel_two_pfn_10sweep --checkpoint work_dirs/nusc_centerpoint_pp_02voxel_two_pfn_10sweep/latest.pth
results:
[results screenshot]
for voxel:
python -m torch.distributed.launch --nproc_per_node=2 ./tools/dist_test.py configs/nusc/voxelnet/nusc_centerpoint_voxelnet_01voxel.py --work_dir work_dirs/nusc_centerpoint_voxelnet_01voxel --checkpoint work_dirs/nusc_centerpoint_voxelnet_01voxel/latest.pth
results:
[results screenshot]

Both are a little lower than the numbers in the README.md.

@tianweiy
Owner

tianweiy commented Dec 9, 2021

Something seems wrong with the VoxelNet result. How did you get these two results (is it a pretrained model?)

@LittlePotatoChip
Author

> Something seems wrong with the VoxelNet result. How did you get these two results (is it a pretrained model?)

They were trained on my own devices.

@tianweiy
Owner

tianweiy commented Dec 9, 2021

pp seems reasonable. It seems better than the README result: https://github.com/tianweiy/CenterPoint/tree/master/configs/nusc

VoxelNet is definitely off (even 1 epoch should be better than or close to the current one). Which spconv version? Also, apex or torch nn SyncBN? Is this the epoch 20 checkpoint?
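
For reference, a quick environment dump along these lines answers the spconv / torch questions; this snippet is illustrative and not part of the thread or the repository:

```python
# Illustrative environment check (not from this thread): prints the versions
# that decide which spconv / SyncBN code paths are used.
import sys
import torch

print("python:", sys.version.split()[0])
print("torch:", torch.__version__)
print("cuda (torch build):", torch.version.cuda)
print("cudnn:", torch.backends.cudnn.version())

try:
    import spconv
    # spconv 2.x exposes __version__; some 1.x builds do not.
    print("spconv:", getattr(spconv, "__version__", "unknown (likely 1.x)"))
except ImportError:
    print("spconv: not installed")
```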

@LittlePotatoChip
Author

> pp seems reasonable. It seems better than the README result: https://github.com/tianweiy/CenterPoint/tree/master/configs/nusc
>
> VoxelNet is definitely off (even 1 epoch should be better than or close to the current one). Which spconv version? Also, apex or torch nn SyncBN?

I remember that I trained the voxel model weeks ago with a different environment version and a single GPU (I don't remember some of the details, and it seemed that the environment build failed).
For pp I used spconv 1.2.1 and torch.nn SyncBN without apex; training took less than 4 days.
Maybe I should retrain the voxel model if you think the pp result is reasonable. But why is the pp result lower than the voxel one by about 8 points, haha?

@tianweiy
Owner

tianweiy commented Dec 9, 2021

Yeah, the pp result is good (better than my original result).

> Why is the pp result lower than the voxel one by about 8 points, haha?

I think this is normal for nuScenes and Waymo. PP is worse for small objects compared to VoxelNet.

@LittlePotatoChip
Author

I tried to train the voxel model the same way as pp, without apex, but it printed a similar error:
[E ProcessGroupNCCL.cpp:566] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1807915 milliseconds before timing out.
Traceback (most recent call last):
File "./tools/train.py", line 137, in
main()
File "./tools/train.py", line 132, in main
logger=logger,
File "/home/ruidong/workplace/CenterPoint/det3d/torchie/apis/train.py", line 329, in train_detector
trainer.run(data_loaders, cfg.workflow, cfg.total_epochs, local_rank=cfg.local_rank)
File "/home/ruidong/workplace/CenterPoint/det3d/torchie/trainer/trainer.py", line 543, in run
epoch_runner(data_loaders[i], self.epoch, **kwargs)
File "/home/ruidong/workplace/CenterPoint/det3d/torchie/trainer/trainer.py", line 410, in train
self.model, data_batch, train_mode=True, **kwargs
File "/home/ruidong/workplace/CenterPoint/det3d/torchie/trainer/trainer.py", line 368, in batch_processor_inline
losses = model(example, return_loss=True)
File "/home/ruidong/anaconda3/envs/CenterPoint/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ruidong/anaconda3/envs/CenterPoint/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 799, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/home/ruidong/anaconda3/envs/CenterPoint/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ruidong/workplace/CenterPoint/det3d/models/detectors/voxelnet.py", line 49, in forward
x, _ = self.extract_feat(data)
File "/home/ruidong/workplace/CenterPoint/det3d/models/detectors/voxelnet.py", line 26, in extract_feat
input_features, data["coors"], data["batch_size"], data["input_shape"]
File "/home/ruidong/anaconda3/envs/CenterPoint/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ruidong/workplace/CenterPoint/det3d/models/backbones/scn.py", line 158, in forward
x_conv1 = self.conv1(x)
File "/home/ruidong/anaconda3/envs/CenterPoint/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ruidong/anaconda3/envs/CenterPoint/lib/python3.6/site-packages/spconv/modules.py", line 134, in forward
input = module(input)
File "/home/ruidong/anaconda3/envs/CenterPoint/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ruidong/workplace/CenterPoint/det3d/models/backbones/scn.py", line 68, in forward
out.features = self.bn1(out.features)
File "/home/ruidong/anaconda3/envs/CenterPoint/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ruidong/anaconda3/envs/CenterPoint/lib/python3.6/site-packages/torch/nn/modules/batchnorm.py", line 757, in forward
world_size,
File "/home/ruidong/anaconda3/envs/CenterPoint/lib/python3.6/site-packages/torch/nn/modules/_functions.py", line 35, in forward
dist.all_gather(combined_list, combined, process_group, async_op=False)
File "/home/ruidong/anaconda3/envs/CenterPoint/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 1909, in all_gather
work = group.allgather([tensor_list], [tensor])
RuntimeError: NCCL communicator was aborted on rank 0.
[E ProcessGroupNCCL.cpp:325] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1807915 milliseconds before timing out.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 5899) of binary: /home/ruidong/anaconda3/envs/CenterPoint/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed

Like with apex, it happened when the first epoch completed.
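
As a stopgap (separate from the actual fix that lands later in this thread), the 30-minute NCCL collective timeout seen above can be raised when the process group is initialized; a minimal sketch, assuming the training script calls init_process_group itself:

```python
import datetime
import torch.distributed as dist

# Sketch only: raise the collective timeout from the 30-minute default so a
# slow allgather (e.g. inside SyncBatchNorm) is not killed by the NCCL
# watchdog. This hides the symptom; it is not the repository's fix.
dist.init_process_group(
    backend="nccl",
    init_method="env://",
    timeout=datetime.timedelta(hours=2),
)
```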

@tianweiy
Owner

tianweiy commented Dec 9, 2021

I see. Could you let me know your current torch / CUDA versions again? I will debug this after finishing a few final exams (in one week).

@LittlePotatoChip
Author

> I see. Could you let me know your current torch / CUDA versions again? I will debug this after finishing a few final exams (in one week).

python_version:
3.6.13 |Anaconda, Inc.| (default, Jun 4 2021, 14:25:59)
[GCC 7.5.0]
torch_version:
1.9.0+cu111
torchvision_version:
0.10.0+cu111
cuda_version:
11.1

Thanks very much.

@tianweiy
Owner

tianweiy commented Dec 17, 2021

I will debug this over the weekend.

@tianweiy
Owner

I found the problem. Will push a fix in a few hours.

@tianweiy
Owner

Should be fixed now at e30f768.

Let me know if the problem still exists.

@tianweiy
Owner

You can also use torch SyncBN and spconv 2.x. I can confirm that the results won't change with the new version.
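
For anyone switching as suggested, converting a model to the built-in torch SyncBN is typically a one-line wrap before DDP; a generic sketch (build_model and local_rank are placeholders, not CenterPoint's actual API):

```python
import torch

model = build_model(cfg)  # placeholder for however the detector is constructed
# Replace every BatchNorm*d in the model with torch's native SyncBatchNorm,
# then wrap with DistributedDataParallel as usual.
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
model = torch.nn.parallel.DistributedDataParallel(
    model.cuda(), device_ids=[local_rank], output_device=local_rank
)
```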

@LittlePotatoChip
Author

> Should be fixed now at e30f768.
>
> Let me know if the problem still exists.

OK, thanks. I'm running another job right now, and I'll give it a try once this run is over.

@tianweiy
Owner

Sure, let me know if there are still problems (this one is a little hard to test).

@tianweiy
Owner

Feel free to reopen if there are other issues.

@Devoe-97

> Feel free to reopen if there are other issues.

There are still problems. When I fix the random seed, the timeout still occurs after a certain number of steps.

@Devoe-97

> Should be fixed now at e30f768.
>
> Let me know if the problem still exists.

The problem still exists.

@tianweiy
Owner

See #203

@Devoe-97

> See #203

Thanks! I will try it.
