Can I train this model? #74

Open
WateverOk opened this issue May 7, 2024 · 6 comments

Comments

@WateverOk

Hello, I have four RTX 3090s with 24 GB of memory each. Can I train this model?

@demoyu123

> Hello, I have four RTX 3090s with 24 GB of memory each. Can I train this model?

It probably won't work. The author trained on 4 A100s, and right now I can't get it running on a single A6000.

@deyang2000

I tried the demo on RTX 3090, and it looks trainable.

One remaining problem: I could train on one GPU, but validation reported errors during multi-GPU training.

@deyang2000

I switched to a new server with 2 RTX 3090s, which, unlike the previous one, runs at full power. Now I can train and validate using two GPUs.

@DianiSirimewan

Can you provide the config file you used for training? I used the demo config and replaced only the data roots, but it gives me this error:
[Screenshot 2024-10-25 130547]

@deyang2000

> Can you provide the config file you used for training? I used the demo config and replaced only the data roots, but it gives me this error: [Screenshot 2024-10-25 130547]

The model you loaded does not seem to match the parameters in the config file. Try switching to a different model; I think you should use the "large" model.
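One way to check a checkpoint/config mismatch like this is to compare parameter shapes by name. The snippet below is only a sketch under stated assumptions: the layer names and sizes are hypothetical (the thread does not show the real ones), and `diff_shapes` is an illustrative helper, not part of this repository.

```python
from typing import Dict, List, Mapping, Tuple

Shape = Tuple[int, ...]


def diff_shapes(
    checkpoint: Mapping[str, Shape], model: Mapping[str, Shape]
) -> Dict[str, List]:
    """Compare checkpoint parameter shapes with the shapes the model expects."""
    # Parameters the model defines but the checkpoint does not contain.
    missing = sorted(k for k in model if k not in checkpoint)
    # Parameters the checkpoint contains but the model does not define.
    unexpected = sorted(k for k in checkpoint if k not in model)
    # Parameters present in both, but with different shapes.
    mismatched = sorted(
        (k, tuple(checkpoint[k]), tuple(model[k]))
        for k in checkpoint
        if k in model and tuple(checkpoint[k]) != tuple(model[k])
    )
    return {"missing": missing, "unexpected": unexpected, "mismatched": mismatched}


# Hypothetical names and sizes, for illustration only.
ckpt_shapes = {"encoder.proj.weight": (256, 64), "head.bias": (10,)}
model_shapes = {
    "encoder.proj.weight": (128, 64),
    "head.bias": (10,),
    "head.weight": (10, 128),
}

report = diff_shapes(ckpt_shapes, model_shapes)
for name, ckpt_shape, model_shape in report["mismatched"]:
    print(f"{name}: checkpoint {ckpt_shape} vs model {model_shape}")
```

With PyTorch, each shape mapping can be built from a state dict with `{k: tuple(v.shape) for k, v in state_dict.items()}`. Some checkpoints nest the state dict under another key, so that step may need adjusting for the file you actually load.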

@RockyChason

> > Can you provide the config file you used for the training? I used the demo file and replaced only the data roots. But it gives me this error. [Screenshot 2024-10-25 130547]
>
> The model you loaded does not seem to match the parameters in the config file. It would help if you switched to a different model. I think that you should switch to the "large" model.

When I use the large model, I encounter a RuntimeError: 'The size of tensor a (128) must match the size of tensor b (256) at non-singleton dimension 1.' How do you suggest I resolve this issue?
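For context, this kind of error generally means two tensors in an elementwise operation disagree in a dimension where neither has size 1, so they cannot be broadcast together. The sketch below mimics that rule on plain shape tuples. The shapes are hypothetical, since the thread does not show which tensors are involved, and `broadcast_mismatches` is an illustrative helper rather than a PyTorch function.

```python
from typing import List, Sequence, Tuple


def broadcast_mismatches(
    a: Sequence[int], b: Sequence[int]
) -> List[Tuple[int, int, int]]:
    """Return (dim, size_a, size_b) for every dimension that cannot broadcast."""
    ndim = max(len(a), len(b))
    # Align shapes on their trailing dimensions by left-padding with 1s.
    padded_a = (1,) * (ndim - len(a)) + tuple(a)
    padded_b = (1,) * (ndim - len(b)) + tuple(b)
    return [
        (dim, x, y)
        for dim, (x, y) in enumerate(zip(padded_a, padded_b))
        # Sizes are compatible when they are equal or when either one is 1.
        if x != y and x != 1 and y != 1
    ]


# Hypothetical feature-map shapes: 128 channels vs. 256 channels.
features_a = (8, 128, 32, 32)
features_b = (8, 256, 32, 32)

problems = broadcast_mismatches(features_a, features_b)
for dim, size_a, size_b in problems:
    print(f"dimension {dim}: {size_a} vs {size_b}")
```

If dimension 1 is the channel dimension, as in this made-up example, a 128 vs 256 mismatch would be consistent with a channel width in the config differing from the one the loaded weights were built with, but that is a guess without the full traceback.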
