When I use SqueezeLLM to quantize the LLaMA2-13B model and test it, the speed is extremely slow. #71

Open
zhangfzR opened this issue Jul 3, 2024 · 0 comments

zhangfzR commented Jul 3, 2024

I used the following command to run the LLaMA2-13B model:
CUDA_VISIBLE_DEVICES=0 python llama.py /mnt/llama2-13b wikitext2 --wbits 4 --load sq-llama-13b-w4-s0.45.pt --include_sparse --eval
The --load option loads the packed (quantized) checkpoint. I followed the README instructions step by step, but when I ran this evaluation, it was extremely slow. I thought the model might have been corrupted during my own quantization run, so I instead downloaded the pre-quantized checkpoint from Hugging Face as linked in the README (sq-llama-13b-w4-s45), but it is just as slow.

I don’t know why this issue persists. What can I do to resolve it?
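
For reference, here is a minimal sanity check I put together to narrow things down. It is only a sketch: the extension name quant_cuda is my assumption based on the repo's setup_cuda.py build step, and the timing loop is plain PyTorch rather than SqueezeLLM's own inference path.

```python
import time

import torch

# Check whether the custom low-bit CUDA kernels are importable.
# NOTE: `quant_cuda` is an assumed name taken from the repo's
# setup_cuda.py; adjust if the built extension is named differently.
try:
    import quant_cuda  # noqa: F401
    print("quant_cuda kernels: available")
except ImportError:
    print("quant_cuda kernels: NOT installed; inference may fall back to a slow path")

# Confirm the GPU itself is healthy by timing plain fp16 matmuls.
assert torch.cuda.is_available(), "no CUDA device visible to PyTorch"
x = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
torch.cuda.synchronize()
t0 = time.time()
for _ in range(10):
    x @ x
torch.cuda.synchronize()
print(f"10 x 4096^2 fp16 matmuls: {time.time() - t0:.3f} s")
```

If the kernel import fails, rebuilding the extension as described in the README would be my first fix attempt.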
