I have successfully quantized the facebook/opt-125m model using the opt.py script with the following command:

CUDA_VISIBLE_DEVICES=0 python opt.py facebook/opt-125m c4 --wbits 4 --quant ldlq --incoh_processing --save quantized_model

This command generates a quantized checkpoint named quantized_model. My question is: should I replace the original weights from https://huggingface.co/facebook/opt-125m/tree/main with the weights in quantized_model to run the quantized model for inference?
yachty66 changed the title from "How to use quantized model for inference" to "How to use quantized model on inference" on Aug 22, 2023.
Can you share an example of how you plan to run the model for inference? If you're using the scripts in this repo to evaluate perplexity / zero-shot accuracy, then you just need to provide the saved file with the --load argument.
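For example, something along these lines (a hedged sketch only: --load is the flag mentioned above, but whether you also need to repeat the quantization flags used at save time is an assumption, so check the repo's README for the exact evaluation invocation):

```
CUDA_VISIBLE_DEVICES=0 python opt.py facebook/opt-125m c4 --load quantized_model
```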
If you're using the Hugging Face from_pretrained() function, then what I've done is put the saved model and the config into the same folder and reference that folder in from_pretrained(). You can copy the config from the Hugging Face hub, for example https://huggingface.co/facebook/opt-125m/tree/main. You'll also need to rename the saved model to one of the filenames from_pretrained() looks for, such as pytorch_model.bin.
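Concretely, the workaround above might look like the sketch below (a minimal sketch, not an official recipe: the local folder name opt125m-quip is arbitrary, and it assumes the file written by --save is a state dict that from_pretrained() can read once it is renamed):

```python
# Minimal sketch of the "hacky" from_pretrained() route described above.
# Assumptions: opt.py's --save wrote a torch-saved state dict to ./quantized_model,
# and "opt125m-quip" is just an arbitrary local folder name.
import os
import shutil

from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

local_dir = "opt125m-quip"
os.makedirs(local_dir, exist_ok=True)

# 1) Copy the original config (and tokenizer files) from the Hugging Face hub.
AutoConfig.from_pretrained("facebook/opt-125m").save_pretrained(local_dir)
AutoTokenizer.from_pretrained("facebook/opt-125m").save_pretrained(local_dir)

# 2) Rename the checkpoint saved by opt.py to a filename from_pretrained() recognizes.
shutil.copy("quantized_model", os.path.join(local_dir, "pytorch_model.bin"))

# 3) Load the model from the local folder instead of the hub.
model = AutoModelForCausalLM.from_pretrained(local_dir)
tokenizer = AutoTokenizer.from_pretrained(local_dir)

# Quick smoke test: confirm the folder loads and the model generates.
inputs = tokenizer("The quantized model says:", return_tensors="pt")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```

The smoke test at the end only checks that the folder loads and generates; perplexity / zero-shot numbers should still come from the repo's own evaluation scripts via --load.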
Sorry this is a bit hacky; we're working on releasing model checkpoints and a better guide.