Bad inference result on sample after overfitting on same sample #48

Open
jonasdieker opened this issue Jun 23, 2023 · 15 comments

@jonasdieker

jonasdieker commented Jun 23, 2023

Hi @zhulf0804,

I wanted to ensure the model can memorise a single training example. To do this, I set the __len__() method of the Dataset to return 1. During training I printed the data_dict to confirm that the same sample was used in every iteration. Since the dataset length was 1, each epoch consisted of a single training step.
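
For reference, this is roughly what the setup looks like. In my run I simply edited __len__ of the repo's dataset class; the self-contained wrapper below is an equivalent illustration (the class and argument names are mine, not from the repo):

from torch.utils.data import Dataset

class SingleSampleOverfitDataset(Dataset):
    """Serve the same underlying sample on every iteration (illustrative only)."""

    def __init__(self, base_dataset, index=0):
        self.base_dataset = base_dataset   # e.g. the repo's KITTI dataset
        self.index = index                 # the one sample to memorise, e.g. 000000.bin

    def __len__(self):
        return 1                           # one step per epoch

    def __getitem__(self, _):
        return self.base_dataset[self.index]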

I visualised the training curves in TensorBoard and, as expected, all three losses eventually decreased to 0. Then I wanted to visualise the model's predictions, for which I used the test.py script. However, when running it on the same sample used for training (000000.bin), the model produces zero predictions.

If I set score_thr in pointpillar.py to 0, I get a lot of predictions, but they all obviously have very low confidence.

Any idea where I am going wrong?

@zhulf0804
Owner

Hi @jonasdieker, that's strange. Could you post the visualized predictions when setting score_thr to 0?
By the way, did you load the pretrained weights successfully?

@jonasdieker
Author

jonasdieker commented Jun 23, 2023

Hi, thank you for your very fast reply!

Sorry, maybe I should have made it clearer that I wanted to train from scratch on a single KITTI sample to see if I could get decent predictions by overfitting. Therefore, no pretrained weights were loaded; instead, I loaded the model weights saved from my overfit training run, produced as described above.

The reason: I tried to do the same for NuScenes to test whether the model can memorise the new data when overfitting. In that case the model also predicts nothing; however, I am not able to get the loss to zero even after playing with the parameters, so there is likely more parameter tuning I still need to do ...

Here is the visualisation you asked for. (Note: I am using a different visualisation function because yours did not work for me over SSH.)

White is pedestrian, green is cyclist and blue is car.

[image: visualized predictions with score_thr set to 0]

Here are the confidences:

[0.0112691  0.01061759 0.01054672 0.01012148 0.01011159 0.00997026
 0.00983873 0.00945836 0.00936741 0.00894571 0.00888245 0.00886574
 0.00883586 0.00870235 0.00864896 0.00861476 0.00859446 0.00854981
 0.00853697 0.00851393 0.00847296 0.00834575 0.00832187 0.00829636
 0.00829282 0.00826259 0.00825665 0.00825058 0.00824824 0.00824112
 0.00823086 0.00821262 0.00817523 0.00817244 0.00815322 0.00815221
 0.00809674 0.00809228 0.00809175 0.00807787 0.00805884 0.00801394
 0.00799607 0.00798928 0.00394109 0.00385207 0.00380854 0.00376242
 0.00368402 0.00364244]

And the class counts:

[44, 4, 2]

Hope this is somewhat helpful for you!

@jonasdieker
Author

jonasdieker commented Jun 23, 2023

One more comment worth making: in the KITTI dataloader I actually commented out the data_augment function.

I did this to consistently get the same data for overfitting; I only use point_range_filter, even for split="train".
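
With augmentation disabled, the only preprocessing left is a deterministic range crop. The repo's point_range_filter may differ in its exact signature and details, but it does something along these lines (the limits below are the commonly used KITTI PointPillars range and are an assumption here, not taken from the repo):

import numpy as np

# Hedged sketch of a point-range crop; not the repo's actual implementation.
# limits = (x_min, y_min, z_min, x_max, y_max, z_max)
def crop_points_to_range(points, limits=(0.0, -39.68, -3.0, 69.12, 39.68, 1.0)):
    x_min, y_min, z_min, x_max, y_max, z_max = limits
    mask = (
        (points[:, 0] >= x_min) & (points[:, 0] <= x_max)
        & (points[:, 1] >= y_min) & (points[:, 1] <= y_max)
        & (points[:, 2] >= z_min) & (points[:, 2] <= z_max)
    )
    return points[mask]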

@zhulf0804
Owner

Hello @jonasdieker, did you also visualize the G.T. result and the prediction obtained with the weights provided by this repo on 000000.bin? Are they reasonable?

@jonasdieker
Author

jonasdieker commented Jun 23, 2023

Yes, I did, and they were fine. That is why I am confused by my experiment's outcome!

Edit: I will send a visualisation of that when I have access to the machine again!

@zhulf0804
Owner

Ok. One more thing: could you verify again that the single training example is 000000.bin?

@jonasdieker
Author

jonasdieker commented Jun 26, 2023

So I tried it again and verified that I was overfitting on the same sample I was testing on. I tried it with 000000.bin and then also with 000001.bin individually; both times the loss was practically zero, but the test.py script returned no bounding boxes at all with the default settings defined here:

# val and test
self.nms_pre = 100
self.nms_thr = 0.01
self.score_thr = 0.1
self.max_num = 50

Could you try to repeat this experiment? It should only take a few minutes.

Edit:

When setting the train_dataloader to split="val", still with the dataset length set to 1, I can perform training and validation on the same 000001.bin sample only. The weird thing is that in TensorBoard I get the following plots:

[image: TensorBoard loss curves for train vs. val on the same sample]

So now I am even more confused, but it confirms that val/test performs really badly in this specific scenario. In particular, the class loss actually diverges, which explains why the confidence is so low and all boxes are filtered out by the get_predicted_bboxes_single method with the default params listed above.
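
That filtering is easy to see with the confidences posted above: roughly speaking, the post-processing keeps only boxes whose score exceeds score_thr (the snippet below is an illustration of that step, not the exact code of get_predicted_bboxes_single):

import numpy as np

# With max confidence ~0.011 and score_thr = 0.1, every candidate is rejected,
# which is why test.py returns empty predictions.
scores = np.array([0.0113, 0.0106, 0.0105, 0.0101])  # a few of the values posted above
score_thr = 0.1
keep = scores > score_thr
print(int(keep.sum()))  # 0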

@jonasdieker
Author

@zhulf0804 Ok, I think this is kind of interesting:

The only difference between train and val in train.py is that model.eval() is called (which of course you should be calling). But if I comment out that line, I get the following plots:

[image: TensorBoard loss curves with model.eval() commented out]

Doing the same in test.py I get:

[image: test.py visualization with model.eval() commented out]

which is perfect! So overfitting works exactly as expected with this change. However, I do not understand how this impacts the performance, since switching from train mode to eval mode only does the following:

[image: what switching from train mode to eval mode changes]
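
In code terms, eval() just recursively flips each submodule's training flag, and as far as I can tell the only layers in this model whose forward pass depends on that flag are the BatchNorm layers (I don't see any dropout). A minimal, repo-independent check:

import torch.nn as nn

# model.eval() recursively sets module.training = False. For a Conv+BN+ReLU
# stack like the one used here, only BatchNorm changes behaviour: in eval mode
# it normalizes with its running mean/var instead of the current batch statistics.
model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU())
model.eval()
print(all(not m.training for m in model.modules()))  # True: every submodule is in eval mode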

I think I need to give this some more thought. Let me know if you have an explanation!

@zhulf0804
Owner

Hello @jonasdieker,
Both the validation cls loss and the visualized predictions (using test.py) become good just by removing model.eval(), i.e. the following line?

pointpillars.eval()

@jonasdieker
Author

Hello @zhulf0804, yes that is exactly right!

@zhulf0804
Owner

Ok, I'm also confused by this result. I'll test it when I have access to the machine.
Besides, I'm looking forward to your explanation of this question.
Best.

@mdane-x

mdane-x commented Oct 24, 2023

Do you have any updates on this? @jonasdieker, did you find out what the issue was? I am getting the same problem: when overfitting on one (or a few) samples the loss goes to 0, but then I get 0 predictions using test.py. Even worse, when I run test.py multiple times with NO changes, I get different results (sometimes a few bboxes, most of the time zero: [] [] []).

@jonasdieker
Author

Hi @mdane-x, as far as I remember, overfitting on one (or a few) sample(s) didn't work with model.eval() in place, so I ended up commenting it out. I believe the issue was due to the normalisation (most likely the BatchNorm running statistics not matching the batch statistics of the single sample). If you have a good explanation of what is going on, please add it here!
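
A self-contained sketch of what I suspect is happening (plain PyTorch, nothing from this repo): if a BatchNorm layer only sees one repeated batch for a few steps, its running estimates still lag behind the batch statistics, so the eval-mode output differs from the train-mode output the weights were fitted to:

import torch
import torch.nn as nn

# Standalone illustration of the suspected cause (assumption, not repo code):
# after a short overfit run, BatchNorm's running mean/var have not converged
# to the statistics of the single training batch, so eval() normalizes the
# activations differently from train().
torch.manual_seed(0)
bn = nn.BatchNorm1d(4)
x = torch.randn(8, 4) * 5 + 3        # one fixed batch, reused every "epoch"

bn.train()
for _ in range(10):                  # a few training-mode forward passes
    bn(x)                            # each call nudges the running stats

out_train = bn(x)                    # normalized with the batch's own statistics
bn.eval()
out_eval = bn(x)                     # normalized with the lagging running estimates
print((out_train - out_eval).abs().max().item())  # clearly non-zero

In the real model this mismatch would compound over many Conv+BN blocks, which would explain the collapsed confidences, but that part is my interpretation rather than something I verified.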

@mdane-x

mdane-x commented Oct 31, 2023

Hi @jonasdieker, thanks for the answer. I haven't managed to make it work, even after removing the eval() line. I am getting empty predictions with any model trained on a few samples.

@jonasdieker
Author

@mdane-x, hmmm that is very strange. I am not sure how to help you. In my experience it helps to visualise as much as you can. What does your validation loss look like? Is it also going to zero?
