[BUG] Following the quant_with_alpaca.py example but keep getting "You shouldn't move a model that is dispatched using accelerate hooks." and the model is never saved. #670

Open
murtaza-nasir opened this issue May 13, 2024 · 0 comments
Labels
bug Something isn't working

murtaza-nasir commented May 13, 2024

Describe the bug
I am using the quant_with_alpaca.py script to quantize MaziyarPanahi/Llama-3-70B-Instruct-32k-v0.1, with the following command:

python quant_with_alpaca.py \
--pretrained_model_dir "/home/murtaza/work/ml/text-generation-webui/models/MaziyarPanahi_Llama-3-70B-Instruct-32k-v0.1" \
--quantized_model_dir "/home/murtaza/work/ml/text-generation-webui/models/MurtazaNasir_Llama-3-70B-Instruct-32k-v0.1-GPTQ" \
--per_gpu_max_memory 6 \
--cpu_max_memory 200 \
--quant_batch_size 16 \
--bits 4 --use_triton --save_and_reload
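For reference, my understanding (an assumption from reading the example script, not something I have verified in the source) is that --per_gpu_max_memory and --cpu_max_memory get turned into the max_memory mapping that accelerate uses to dispatch the model, roughly like this:

```python
def build_max_memory(per_gpu_gib: int, cpu_gib: int, n_gpus: int) -> dict:
    """Sketch of how quant_with_alpaca.py appears to build accelerate's
    max_memory mapping (an assumption on my part; budgets are in GiB)."""
    max_memory = {i: f"{per_gpu_gib}GIB" for i in range(n_gpus)}
    max_memory["cpu"] = f"{cpu_gib}GIB"
    return max_memory

# With my flags (--per_gpu_max_memory 6, --cpu_max_memory 200) on 4 GPUs:
# {0: "6GIB", 1: "6GIB", 2: "6GIB", 3: "6GIB", "cpu": "200GIB"}
print(build_max_memory(6, 200, 4))
```

So each 3090 should be capped at 6 GiB, with up to 200 GiB offloaded to CPU RAM.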

I have tried running the above without --save_and_reload; the script quantizes the model and then runs inference, which seems fine, but the model never gets saved anywhere. With the --save_and_reload switch, I get this output:

INFO - Model packed.
2024-05-13 03:45:45 INFO [auto_gptq.modeling._utils] Model packed.
WARNING - using autotune_warmup will move model to GPU, make sure you have enough VRAM to load the whole model.
2024-05-13 03:45:45 WARNING [auto_gptq.modeling._utils] using autotune_warmup will move model to GPU, make sure you have enough VRAM to load the whole model.
2024-05-13 03:45:45 WARNING [accelerate.big_modeling] You shouldn't move a model that is dispatched using accelerate hooks.

After this it crashes because of a CUDA OOM error.

When run without the --save_and_reload switch, the script tests the quant with 4 instructions and then exits without any error (although the inference speed was painfully slow).
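As a possible workaround (a sketch I have not tested; the paths, group_size, calibration sample, and the ensure_out_dir helper below are placeholders of my own, not taken from the example script), calling model.save_quantized() immediately after quantize() and skipping the Triton warmup / --save_and_reload path should write the checkpoint out before anything tries to move the accelerate-dispatched model:

```python
import os

def ensure_out_dir(path: str) -> str:
    """Create the output directory for the quantized checkpoint
    (hypothetical helper, not part of auto_gptq)."""
    os.makedirs(path, exist_ok=True)
    return path

def quantize_and_save(pretrained_dir: str, quantized_dir: str) -> None:
    """Sketch only: requires auto_gptq, transformers, enough GPU/CPU memory,
    and real alpaca calibration data to actually run."""
    from transformers import AutoTokenizer
    from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

    tok = AutoTokenizer.from_pretrained(pretrained_dir)
    # Placeholder calibration example -- substitute the real alpaca samples.
    examples = [tok("auto-gptq is an easy-to-use quantization package.",
                    return_tensors="pt")]

    model = AutoGPTQForCausalLM.from_pretrained(
        pretrained_dir,
        BaseQuantizeConfig(bits=4, group_size=128, damp_percent=0.1),
    )
    model.quantize(examples)
    # Save right away: no warmup_triton, no reload, so nothing ever calls
    # .to() on the accelerate-dispatched model.
    model.save_quantized(ensure_out_dir(quantized_dir), use_safetensors=True)
```

That would at least get a checkpoint on disk, even if the triton warmup path remains broken.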

Hardware details
I have an EPYC 7532 processor with 256 GB of RAM and 4× RTX 3090s.

Software version
Ubuntu 22.04.4 LTS (6.5.0-28-generic #29~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Apr 4 14:39:20 UTC 2 x86_64 x86_64 x86_64 GNU/Linux)
Python 3.10.14
auto_gptq Version: 0.8.0.dev0+cu121
Torch 2.3.0+cu121
Transformers 4.40.2
Accelerate 0.30.1

To Reproduce

  1. Clone repository.
  2. Build.
  3. Go to quantization directory.
  4. Run above command.

The only change I made was to one of the files, adding the damp 0.1 argument for quantization.

Expected behavior
I was hoping to get a GPTQ quant of the above model.

@murtaza-nasir murtaza-nasir added the bug Something isn't working label May 13, 2024