
[BUG]: ColossalMoE Train: AssertionError: Parameters are expected to have the same dtype torch.bfloat16, but got torch.float32 #5664

Open
Camille7777 opened this issue Apr 26, 2024 · 1 comment
Labels
bug Something isn't working

Comments

@Camille7777
Contributor

🐛 Describe the bug

During booster initialization, some parameters still have dtype torch.float32 even though the precision is set to "bf16", so the optimizer initialization inside the booster fails the parameter-dtype sanity check.

Here is the detailed error info (screenshot "Screenshot 2024-04-26 at 22 04 58"):

AssertionError: Parameters are expected to have the same dtype torch.bfloat16, but got torch.float32

The bug can be traced through: self.plugin.configure -> HybridParallelZeroOptimizer -> LowLevelZeroOptimizer -> _sanity_checks
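
For reference, a minimal sketch (my paraphrase, not the actual ColossalAI source) of the kind of check that _sanity_checks performs, assuming it simply asserts that every working parameter shares the expected dtype:

import torch

def sanity_check_dtypes(params, expected_dtype=torch.bfloat16):
    # Illustrative only: mirrors the assertion message seen in the traceback.
    for p in params:
        assert p.dtype == expected_dtype, (
            f"Parameters are expected to have the same dtype {expected_dtype}, "
            f"but got {p.dtype}"
        )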

The root cause may be in HybridParallelModule or MixtralModelPolicy; the snippet below can help identify which parameters are left in float32.
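
As a quick diagnostic (illustrative sketch; model is assumed to be the Mixtral model built in train.py), the parameters that were never cast to bf16 can be listed right before booster.boost():

import torch

def list_fp32_params(model):
    # Return (name, dtype) for every parameter the plugin/policy left in float32.
    return [
        (name, p.dtype)
        for name, p in model.named_parameters()
        if p.dtype == torch.float32
    ]

# Example call site in train.py, just before booster.boost(...):
# for name, dtype in list_fp32_params(model):
#     print(name, dtype)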

My test shell:

NUM_GPU=2
MODEL="path to Mixtral-tiny model"
SEQ_LENGTH=2048
BATCH_SIZE=1
LR=0.00001

# hybrid
# torchrun --standalone --nproc_per_node $NUM_GPU \
colossalai run --nproc_per_node $NUM_GPU --hostfile "hostfile" \
    train.py \
    --num_epoch 1 \
    --model_name $MODEL \
    --plugin "hybrid" \
    --batch_size $BATCH_SIZE \
    --lr $LR \
    --zero_stage 1 \
    --pp_size 1 \
    --dp_size 1 \
    --ep_size 2 \
    --max_length $SEQ_LENGTH 

Environment

CUDA 12.1
torch 2.1.0
Python 3.10.14
colossalai 0.3.6 (main)
colossal-moe 1.0.0
transformers 4.36.2

Camille7777 added the bug (Something isn't working) label on Apr 26, 2024
@Edenzzzz
Contributor

Both @ver217 and I have seen this bug; it appears when pipeline parallelism (pp) is off. Will dig more into it.
