I used the following command to run the LLaMA2-13B model:

CUDA_VISIBLE_DEVICES=0 python llama.py /mnt/llama2-13b wikitext2 --wbits 4 --load sq-llama-13b-w4-s0.45.pt --include_sparse --eval

The --load option loads the packed model. I followed the README instructions step by step, but when I tested this, inference was extremely slow. I suspected the model had been quantized incorrectly, so instead I downloaded the checkpoint referenced in the README (sq-llama-13b-w4-s45) from Hugging Face, but it is still just as slow.
I don't know why this issue persists. What can I do to resolve it?
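For reference, a common cause of very slow evaluation with quantized checkpoints is that the custom CUDA kernels were never built (so inference silently falls back to a slow path) or that the model ends up on the CPU. Below is a minimal diagnostic sketch for ruling those out. The extension name `quant_cuda` is my assumption for what the repo's setup script builds; adjust it to match your install.

```python
import importlib
import torch

# Confirm the GPU is visible to PyTorch; if this fails, everything
# runs on CPU and 13B inference will be extremely slow.
assert torch.cuda.is_available(), "CUDA not available -- running on CPU"
print("Device:", torch.cuda.get_device_name(0))

# Check that the quantization kernel extension was actually built.
# NOTE: "quant_cuda" is an assumed module name; replace it with
# whatever the repo's CUDA setup script installs.
try:
    quant_cuda = importlib.import_module("quant_cuda")
    print("Quant kernels found:", quant_cuda.__file__)
except ImportError:
    print("Quant kernels NOT built -- run the repo's CUDA setup script")

# Rough latency check on a plain fp16 matmul, to separate a generally
# slow GPU/driver setup from a problem specific to the quantized path.
x = torch.randn(4096, 4096, dtype=torch.float16, device="cuda")
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
torch.cuda.synchronize()
start.record()
for _ in range(10):
    y = x @ x
end.record()
torch.cuda.synchronize()
print(f"fp16 4096x4096 matmul: {start.elapsed_time(end) / 10:.2f} ms/iter")
```

If the extension import fails or the matmul timing is far off what the GPU should deliver, the slowdown is in the environment rather than the checkpoint itself.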