an error with the training set direction #22

Open
i-ting4931 opened this issue Oct 9, 2024 · 5 comments
@i-ting4931 commented Oct 9, 2024

Hello, I recently installed this model to train a custom dataset. The environment setup is complete, and I first attempted to use the MOT17 dataset to test whether the training process works properly. However, during the training, I encountered some abnormal data, and I was wondering if you could provide any guidance on how to resolve this issue.

Currently, I have downloaded both the Crowdhuman and MOT17 datasets. However, while training, I noticed that all the loss values are zero, which seems to suggest that the data is not being properly loaded. To check the data loading path, I added the following line of code: print(f"Frame path: {frame_path}"). The result shows that the dataset is loading Crowdhuman, but in the command I issued, I set the dataset to MOT17. I'm not entirely sure where the problem lies; could you kindly take a look? Thank you very much.
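For reference, the check I added looks roughly like the following (I wrapped it in a small helper here for clarity; report_frame_source and its placement are only illustrative, since frame_path is simply whatever image path the dataset resolves for the current sample):

# Hypothetical debugging helper (not part of the repository): print the
# resolved image path and flag samples that actually come from CrowdHuman.
def report_frame_source(frame_path: str) -> None:
    print(f"Frame path: {frame_path}")
    if "crowdhuman" in frame_path.lower():
        print("NOTE: this frame was loaded from CrowdHuman, not MOT17.")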

Also, just to mention, my computer has only one GPU: an NVIDIA GeForce RTX 3060. Since my GPU is limited, do I need to modify any lambda functions in the code? I appreciate your help.

If there's anything that I didn't explain clearly, please feel free to let me know, and I will provide any additional details you may need.
Thank you again.

[Screenshots attached]

Here is the related information regarding the failed training.

[Screenshots of the failed training output]

@HELLORPG (Collaborator)
The data loading process is not related to the GPU. Therefore, I don't think you need to modify any data loading functions for your GPU (3060).

I have never seen this issue before. Have you run the data preparation scripts (like ./data/gen_crowdhuman_gts.py) before running the training script?
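If you want to double-check, a quick script along these lines can confirm that the generated GT files actually exist (the directory paths below are only illustrative; substitute the output paths the scripts use in your setup):

import os

# Illustrative paths only; replace them with the GT output directories
# produced by gen_mot17_gts.py and gen_crowdhuman_gts.py in your setup.
gt_dirs = ["./DATADIR/MOT17/gts", "./DATADIR/CrowdHuman/gts"]

for d in gt_dirs:
    if not os.path.isdir(d):
        print(f"Missing GT directory: {d}")
    else:
        n_files = sum(len(files) for _, _, files in os.walk(d))
        print(f"{d}: {n_files} GT files found")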

@i-ting4931 (Author)
Hello, considering that there might have been some mistakes in my previous operations, I deleted the previously cloned repository and cloned it again. This time, I only changed the number of GPUs, downloaded the pre-trained weights and datasets, and placed them in the designated locations. I also used the gen_mot17_gts.py and gen_crowdhuman_gts.py scripts to generate the necessary files.

The training process is now running, but the loss values seem abnormal (most loss values start at 0 during the initial training). I suspect that the images might not have been successfully read by the model for training. Could you kindly advise if I made any mistakes? If the training were running correctly, what would the expected behavior look like? Thank you very much.
[Screenshot of the training log with abnormal loss values]

@HELLORPG (Collaborator)
To be honest, this is really weird. I need to wait until I have a spare GPU server to show what the correct logging looks like.

You could check the content of the data loaded into the training loop, after these lines:

MeMOTR/train_engine.py, lines 192 to 199 in 7de13f4:
tracks = TrackInstances.init_tracks(batch=batch,
                                    hidden_dim=get_model(model).hidden_dim,
                                    num_classes=get_model(model).num_classes,
                                    device=device, use_dab=use_dab)
criterion.init_a_clip(batch=batch,
                      hidden_dim=get_model(model).hidden_dim,
                      num_classes=get_model(model).num_classes,
                      device=device)

You can add code like this:

print(batch["infos"][0])                          # the GTs
print(batch["imgs"][0][0].shape)                  # the image's shape
# Or others

You could analyze the output yourself, or upload it here.
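If it helps, here is a slightly more thorough sketch along the same lines; the keys are only assumptions that follow the prints above, so adjust them if the actual batch structure is different:

# Sketch only: walk over the per-frame GT info for the first sample and
# flag frames with no annotations. Assumes batch["infos"][0] matches the
# structure printed above; adjust the keys if it differs.
infos = batch["infos"][0]
frames = infos if isinstance(infos, (list, tuple)) else [infos]
for frame_idx, info in enumerate(frames):
    labels = info.get("labels", None)
    n_objects = 0 if labels is None else len(labels)
    print(f"frame {frame_idx}: {n_objects} GT objects")
    if n_objects == 0:
        print(f"WARNING: frame {frame_idx} has no ground-truth annotations.")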

@i-ting4931 (Author)
Hello, I followed your suggestion and added the print statements. The current output is shown in the image below (Image 1). The printed tensor, tensor([], dtype=torch.int64), is empty, and ids, areas, and labels all indicate that no objects were loaded. I'm not quite sure why this is happening.

I noticed that in the train_mot17.yaml file, there is a setting "USE_CROWDHUMAN: True". I initially suspected that the issue might be due to training with MOT17 while having Crowdhuman included in the configuration. So, I changed "USE_CROWDHUMAN: True" to false, but this resulted in an error (Image 2).

I also tried some of the commands related to Submit and Evaluation, but I ran into a small issue. When using eval mode, I got the following error (Image 3); I don't have the file it refers to. Could you please advise whether this file is supposed to be generated automatically? If so, did I make a mistake somewhere?

Sorry for the multiple questions, and I truly appreciate your help.

Thank you very much.

(Image 1) [Screenshot of the printed batch output]

(Image 2) [Screenshot of the error after setting USE_CROWDHUMAN to False]

(Image 3) [Screenshot of the error in eval mode]

@HELLORPG (Collaborator)
According to (Image 2), it seems that you did not successfully load any images or annotations from MOT17. You can add some breakpoints during the data loading process to determine where the problem is.

For example, here:

MeMOTR/data/mot17.py, lines 59 to 68 in 7de13f4:
for vid in self.mot17_seq_names:
    mot17_gts_dir = os.path.join(self.mot17_gts_dir, vid, "img1")
    mot17_gt_paths = [os.path.join(mot17_gts_dir, filename) for filename in os.listdir(mot17_gts_dir)]
    for mot17_gt_path in mot17_gt_paths:
        for line in open(mot17_gt_path):
            _, i, x, y, w, h, v = line.strip("\n").split(" ")
            i, x, y, w, h, v = map(float, (i, x, y, w, h, v))
            i, x, y, w, h = map(int, (i, x, y, w, h))
            t = int(mot17_gt_path.split("/")[-1].split(".")[0])
            self.mot17_gts[vid][t].append([i, x, y, w, h])

After this loop runs, all GTs should be loaded.
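For example, right after this loop you could print how many annotations were actually loaded per sequence (a minimal sketch, assuming self.mot17_gts maps sequence name to frame index to a list of boxes, as the loop above suggests):

# Sanity check: count the loaded GT boxes per sequence. If any sequence
# reports 0 frames or 0 boxes, the gts directory (or the output of
# gen_mot17_gts.py) is probably missing or empty.
for vid in self.mot17_seq_names:
    n_boxes = sum(len(boxes) for boxes in self.mot17_gts[vid].values())
    print(f"{vid}: {len(self.mot17_gts[vid])} frames, {n_boxes} GT boxes")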
