about test results #230

Closed
LittlePotatoChip opened this issue Dec 9, 2021 · 19 comments

@LittlePotatoChip

for pp:
python -m torch.distributed.launch --nproc_per_node=2 ./tools/dist_test.py configs/nusc/pp/nusc_centerpoint_pp_02voxel_two_pfn_10sweep.py --work_dir work_dirs/nusc_centerpoint_pp_02voxel_two_pfn_10sweep --checkpoint work_dirs/nusc_centerpoint_pp_02voxel_two_pfn_10sweep/latest.pth
results:
[results screenshot]
for voxel:
python -m torch.distributed.launch --nproc_per_node=2 ./tools/dist_test.py configs/nusc/voxelnet/nusc_centerpoint_voxelnet_01voxel.py --work_dir work_dirs/nusc_centerpoint_voxelnet_01voxel --checkpoint work_dirs/nusc_centerpoint_voxelnet_01voxel/latest.pth
results:
[results screenshot]

Both are a little lower than the numbers in the README.md.

@tianweiy
Owner

tianweiy commented Dec 9, 2021

Something seems wrong with the VoxelNet result. How did you get these two results (is it a pretrained model?)

@LittlePotatoChip
Author

> Something seems wrong with the VoxelNet result. How did you get these two results (is it a pretrained model?)

They were trained on my own devices.

@tianweiy
Owner

tianweiy commented Dec 9, 2021

pp seems reasonable. It seems better than the README result: https://github.com/tianweiy/CenterPoint/tree/master/configs/nusc

VoxelNet is definitely off (even 1 epoch should be better than or close to the current one). Which spconv version? Also, apex or torch nn SyncBN? Is this the epoch 20 checkpoint?
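
For reference, a quick environment dump along these lines answers the spconv / torch questions; this snippet is illustrative and not part of the thread or the repository:

```python
# Illustrative environment check (not from this thread): prints the versions
# that decide which spconv / SyncBN code paths are used.
import sys
import torch

print("python:", sys.version.split()[0])
print("torch:", torch.__version__)
print("cuda (torch build):", torch.version.cuda)
print("cudnn:", torch.backends.cudnn.version())

try:
    import spconv
    # spconv 2.x exposes __version__; some 1.x builds do not.
    print("spconv:", getattr(spconv, "__version__", "unknown (likely 1.x)"))
except ImportError:
    print("spconv: not installed")
```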

@LittlePotatoChip
Author

> pp seems reasonable. It seems better than the README result: https://github.com/tianweiy/CenterPoint/tree/master/configs/nusc
>
> VoxelNet is definitely off (even 1 epoch should be better than or close to the current one). Which spconv version? Also, apex or torch nn SyncBN?

I remember that I trained the voxel model weeks ago with a different environment version and a single GPU (I don't remember some of the details, and it seemed that the environment build failed).
For pp I used spconv 1.2.1 and torch.nn SyncBN without apex; training took less than 4 days.
Maybe I should retrain the voxel model if you think the pp result is reasonable. But why is the pp result lower than the voxel one by about 8 points, haha?

@tianweiy
Owner

tianweiy commented Dec 9, 2021

Yeah, the pp result is good (better than my original result).

> Why is the pp result lower than the voxel one by about 8 points, haha?

I think this is normal for nuScenes and Waymo. PP is worse for small objects compared to VoxelNet.

@LittlePotatoChip
Author

I tried to train the voxel model the same way as pp, without apex, but it printed a similar error:
[E ProcessGroupNCCL.cpp:566] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1807915 milliseconds before timing out.
Traceback (most recent call last):
File "./tools/train.py", line 137, in
main()
File "./tools/train.py", line 132, in main
logger=logger,
File "/home/ruidong/workplace/CenterPoint/det3d/torchie/apis/train.py", line 329, in train_detector
trainer.run(data_loaders, cfg.workflow, cfg.total_epochs, local_rank=cfg.local_rank)
File "/home/ruidong/workplace/CenterPoint/det3d/torchie/trainer/trainer.py", line 543, in run
epoch_runner(data_loaders[i], self.epoch, **kwargs)
File "/home/ruidong/workplace/CenterPoint/det3d/torchie/trainer/trainer.py", line 410, in train
self.model, data_batch, train_mode=True, **kwargs
File "/home/ruidong/workplace/CenterPoint/det3d/torchie/trainer/trainer.py", line 368, in batch_processor_inline
losses = model(example, return_loss=True)
File "/home/ruidong/anaconda3/envs/CenterPoint/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ruidong/anaconda3/envs/CenterPoint/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 799, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/home/ruidong/anaconda3/envs/CenterPoint/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ruidong/workplace/CenterPoint/det3d/models/detectors/voxelnet.py", line 49, in forward
x, _ = self.extract_feat(data)
File "/home/ruidong/workplace/CenterPoint/det3d/models/detectors/voxelnet.py", line 26, in extract_feat
input_features, data["coors"], data["batch_size"], data["input_shape"]
File "/home/ruidong/anaconda3/envs/CenterPoint/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ruidong/workplace/CenterPoint/det3d/models/backbones/scn.py", line 158, in forward
x_conv1 = self.conv1(x)
File "/home/ruidong/anaconda3/envs/CenterPoint/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ruidong/anaconda3/envs/CenterPoint/lib/python3.6/site-packages/spconv/modules.py", line 134, in forward
input = module(input)
File "/home/ruidong/anaconda3/envs/CenterPoint/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ruidong/workplace/CenterPoint/det3d/models/backbones/scn.py", line 68, in forward
out.features = self.bn1(out.features)
File "/home/ruidong/anaconda3/envs/CenterPoint/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ruidong/anaconda3/envs/CenterPoint/lib/python3.6/site-packages/torch/nn/modules/batchnorm.py", line 757, in forward
world_size,
File "/home/ruidong/anaconda3/envs/CenterPoint/lib/python3.6/site-packages/torch/nn/modules/_functions.py", line 35, in forward
dist.all_gather(combined_list, combined, process_group, async_op=False)
File "/home/ruidong/anaconda3/envs/CenterPoint/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 1909, in all_gather
work = group.allgather([tensor_list], [tensor])
RuntimeError: NCCL communicator was aborted on rank 0.
[E ProcessGroupNCCL.cpp:325] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1807915 milliseconds before timing out.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 5899) of binary: /home/ruidong/anaconda3/envs/CenterPoint/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed

Like with apex, it happened when the first epoch completed.
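
As a stopgap (separate from the actual fix that lands later in this thread), the 30-minute NCCL collective timeout seen above can be raised when the process group is initialized; a minimal sketch, assuming the training script calls init_process_group itself:

```python
import datetime
import torch.distributed as dist

# Sketch only: raise the collective timeout from the 30-minute default so a
# slow allgather (e.g. inside SyncBatchNorm) is not killed by the NCCL
# watchdog. This hides the symptom; it is not the repository's fix.
dist.init_process_group(
    backend="nccl",
    init_method="env://",
    timeout=datetime.timedelta(hours=2),
)
```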

@tianweiy
Owner

tianweiy commented Dec 9, 2021

I see. Could you let me know your current torch / CUDA versions again? I will debug this after finishing a few final exams (in one week).

@LittlePotatoChip
Author

> I see. Could you let me know your current torch / CUDA versions again? I will debug this after finishing a few final exams (in one week).

python_version:
3.6.13 |Anaconda, Inc.| (default, Jun 4 2021, 14:25:59)
[GCC 7.5.0]
torch_version:
1.9.0+cu111
torchvision_version:
0.10.0+cu111
cuda_version:
11.1

Thanks very much.

@tianweiy
Owner

tianweiy commented Dec 17, 2021

I will debug this over the weekend.

@tianweiy
Owner

I found the problem. Will push a fix in a few hours.

@tianweiy
Owner

Should be fixed now at e30f768.

Let me know if the problem still exists.

@tianweiy
Owner

You can also use torch SyncBN and spconv 2.x. I can confirm that the results won't change with the new version.
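
For anyone switching as suggested, converting a model to the built-in torch SyncBN is typically a one-line wrap before DDP; a generic sketch (build_model and local_rank are placeholders, not CenterPoint's actual API):

```python
import torch

model = build_model(cfg)  # placeholder for however the detector is constructed
# Replace every BatchNorm*d in the model with torch's native SyncBatchNorm,
# then wrap with DistributedDataParallel as usual.
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
model = torch.nn.parallel.DistributedDataParallel(
    model.cuda(), device_ids=[local_rank], output_device=local_rank
)
```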

@LittlePotatoChip
Author

> Should be fixed now at e30f768.
>
> Let me know if the problem still exists.

OK, thanks. I'm running another job right now, and I'll give it a try once this run is over.

@tianweiy
Owner

Sure, let me know if there are still problems (this one is a little hard to test).

@tianweiy
Owner

Feel free to reopen if there are other issues.

@Devoe-97

> Feel free to reopen if there are other issues.

There are still problems. When I fix the random seed, the timeout still occurs after a certain number of steps.

@Devoe-97

> Should be fixed now at e30f768.
>
> Let me know if the problem still exists.

The problem still exists.

@tianweiy
Owner

See #203

@Devoe-97

> See #203

Thanks! I will try it.
