I use Qwen2-72B as the teacher model and Qwen2.5-32B as the student model, training on 8x 80GB A100 GPUs.
When I load Qwen2-72B, the teacher model is not split across the GPUs; instead, the complete 72B model is loaded onto every GPU, resulting in OOM.
When I test the model loading alone, Qwen2-72B can be split and loaded across multiple GPUs, so I don't understand why this happens during training.
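For reference, the standalone load test I mean is along these lines (a sketch assuming Hugging Face transformers with accelerate installed; the checkpoint path and dtype are illustrative), and it does shard the model across the GPUs:

```python
# Sketch of a standalone sharded-loading test (assumes transformers +
# accelerate are installed; checkpoint name is illustrative).
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2-72B",            # illustrative checkpoint path
    torch_dtype=torch.bfloat16,  # halves memory vs. fp32
    device_map="auto",           # shard layers across all visible GPUs
)

# If sharding worked, parameters live on several different devices.
print({p.device for p in model.parameters()})
```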
Have you tried two larger models in the MiniLLM experiments? I see that the largest teacher model in the paper is only 13B.
You can try model (tensor) parallelism for the large teacher and student models.
First, you need to change the model parallel size as described here.
Then, you can follow this script to run the Qwen models.
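To illustrate what tensor parallelism buys you here (a conceptual sketch, not MiniLLM's actual implementation): each rank holds only a 1/world_size column slice of every weight matrix, so an 8-way split of the 72B teacher keeps roughly 72B/8 ≈ 9B parameters' worth of weights per GPU instead of a full replica. A minimal column-parallel linear layer, with all names and sizes illustrative:

```python
# Sketch of column-parallel tensor parallelism with torch.distributed.
# Each rank owns a (d_in, d_out/world) weight shard; no GPU ever
# materializes the full (d_in, d_out) matrix.
import os
import torch
import torch.distributed as dist

def main():
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    world = dist.get_world_size()
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    d_in, d_out = 1024, 4096  # illustrative layer sizes
    assert d_out % world == 0, "output dim must divide evenly across ranks"
    shard = d_out // world

    torch.manual_seed(rank)   # each rank holds a *different* column slice
    w_shard = torch.randn(d_in, shard, device="cuda") * 0.02

    torch.manual_seed(0)      # every rank sees the *same* input batch
    x = torch.randn(8, d_in, device="cuda")

    y_local = x @ w_shard     # (8, shard): this rank's slice of the output

    # All-gather reassembles the full activation only where it is needed.
    pieces = [torch.empty_like(y_local) for _ in range(world)]
    dist.all_gather(pieces, y_local)
    y = torch.cat(pieces, dim=-1)  # (8, d_out): full output
    print(f"rank {rank}: local {tuple(y_local.shape)} -> full {tuple(y.shape)}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched with e.g. `torchrun --nproc_per_node=8 tp_sketch.py` (the filename is illustrative); torchrun sets the rank environment variables that `init_process_group` reads.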