[BUG] Following the quant_with_alpaca.py example but keep getting "You shouldn't move a model that is dispatched using accelerate hooks." and the model is never saved.
#670
Open
murtaza-nasir opened this issue on May 13, 2024 · 0 comments
Describe the bug
I am using the quant_with_alpaca.py script to quantize MaziyarPanahi/Llama-3-70B-Instruct-32k-v0.1, with the following command:

I have tried running the above without --save_and_reload, and the script quantizes the model and then runs inference, which seems fine. But the model never gets saved anywhere. With the --save_and_reload switch, I get this output:

INFO - Model packed.
2024-05-13 03:45:45 INFO [auto_gptq.modeling._utils] Model packed.
WARNING - using autotune_warmup will move model to GPU, make sure you have enough VRAM to load the whole model.
2024-05-13 03:45:45 WARNING [auto_gptq.modeling._utils] using autotune_warmup will move model to GPU, make sure you have enough VRAM to load the whole model.
2024-05-13 03:45:45 WARNING [accelerate.big_modeling] You shouldn't move a model that is dispatched using accelerate hooks.
After this it crashes because of a CUDA OOM error.
When run without the --save_and_reload switch, the script tests the quant with 4 instructions and then exits without any error (although the inference speed was painfully slow).

Hardware details
I have an EPYC 7532 processor with 256 GB of RAM and 4x 3090s.
Software version
Ubuntu 22.04.4 LTS (6.5.0-28-generic #29~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Apr 4 14:39:20 UTC 2 x86_64 x86_64 x86_64 GNU/Linux)
Python 3.10.14
auto_gptq 0.8.0.dev0+cu121
Torch 2.3.0+cu121
Transformers 4.40.2
Accelerate 0.30.1
To Reproduce
I made one change to one of the files to add a dampening value of 0.1 for quantization.
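For reference, a minimal sketch of what that one-line change might look like, assuming it was made where quant_with_alpaca.py builds its quantization config (the exact surrounding code in the script may differ; `bits` and `group_size` values here are illustrative, not from the issue):

```python
from auto_gptq import BaseQuantizeConfig

# Hypothetical sketch: raise AutoGPTQ's dampening factor from its
# 0.01 default to 0.1, as described above. `damp_percent` is the
# BaseQuantizeConfig parameter that controls dampening.
quantize_config = BaseQuantizeConfig(
    bits=4,          # illustrative value
    group_size=128,  # illustrative value
    damp_percent=0.1,
)
```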
Expected behavior
I was hoping to get a GPTQ quant of the above model.