
[Minillm] Using the qwen2-72b model as the teacher model for minillm training results in out of memory #281

Open
shhn1 opened this issue Nov 12, 2024 · 1 comment


shhn1 commented Nov 12, 2024

I use Qwen2-72B as the teacher model and Qwen2.5-32B as the student model for training, on 8×80 GB A100 GPUs.

When I load the Qwen2-72B teacher, I find that it is not split across the GPUs; a complete copy of the 72B model is loaded on every GPU, which causes the OOM.

When I test loading the model on its own, Qwen2-72B can be split and loaded across multiple GPUs, so I don't understand why this happens during training. A minimal sketch of the difference I'm seeing is below.
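
For reference, a rough sketch of the two situations (the model path and launch setup are placeholders, not the exact training code):

```python
import os
import torch
from transformers import AutoModelForCausalLM

# Standalone test: a single process, device_map="auto" shards the checkpoint
# across all visible GPUs, so the 72B teacher fits.
teacher = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2-72B",              # placeholder path
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Training launch: under torchrun/deepspeed data parallelism, every rank
# executes the load and puts a full copy of the teacher on its own GPU,
# so each 80 GB card has to hold the whole 72B model -> OOM.
local_rank = int(os.environ.get("LOCAL_RANK", 0))
teacher = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2-72B",
    torch_dtype=torch.bfloat16,
).to(f"cuda:{local_rank}")
```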

Have you tried MiniLLM experiments with two larger models? I see that the largest teacher model in the paper is only 13B.

t1101675 (Contributor) commented

You can try model (tensor) parallelism for a large teacher and student model.
First, you need to change the model parallel size as described here.
Then, you can follow this script to run the Qwen models.
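
To make this concrete, a rough back-of-the-envelope estimate (my own numbers, not from the repo docs) of why the teacher needs to be sharded:

```python
# Rough weight-memory estimate for the bf16 teacher (2 bytes per parameter);
# optimizer states and activations are ignored here.
params = 72e9
full_copy_gib = params * 2 / 1024**3
print(f"full 72B copy: {full_copy_gib:.0f} GiB")  # ~134 GiB > 80 GiB per A100

# With model (tensor) parallel size MP, each GPU holds roughly 1/MP of the weights.
for mp in (2, 4, 8):
    print(f"MP={mp}: ~{full_copy_gib / mp:.0f} GiB of teacher weights per GPU")
# MP=4 or MP=8 keeps the teacher shard well under 80 GiB, leaving room for
# the 32B student, its optimizer states, and activations.
```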
