I have successfully quantized the facebook/opt-125m model using the opt.py script with the following command:

CUDA_VISIBLE_DEVICES=0 python opt.py facebook/opt-125m c4 --wbits 4 --quant ldlq --incoh_processing --save quantized_model

This command generates a quantized checkpoint named quantized_model. My question is: should I replace the original weights from https://huggingface.co/facebook/opt-125m/tree/main with the weights in quantized_model to run the quantized model for inference?
yachty66 changed the title from "How to use quantized model for inference" to "How to use quantized model on inference" on Aug 22, 2023.
Can you share an example of how you plan to run the model for inference? If you're using the scripts in this repo to evaluate perplexity / zero-shot accuracy, then you just need to provide the saved file with the --load argument.
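For example, something along these lines (a hedged sketch only: --load is the flag mentioned above, but whether you also need to repeat the quantization flags used at save time is an assumption, so check the repo's README for the exact evaluation invocation):

```
CUDA_VISIBLE_DEVICES=0 python opt.py facebook/opt-125m c4 --load quantized_model
```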
If you're using the Hugging Face from_pretrained() function, then what I've done is put the saved model and the config into the same folder and reference that folder in from_pretrained(). You can copy the config from the Hugging Face hub, for example https://huggingface.co/facebook/opt-125m/tree/main. You'll also need to rename the saved model to one of the filenames from_pretrained() looks for, such as pytorch_model.bin.
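Concretely, the workaround above might look like the sketch below (a minimal sketch, not an official recipe: the local folder name opt125m-quip is arbitrary, and it assumes the file written by --save is a state dict that from_pretrained() can read once it is renamed):

```python
# Minimal sketch of the "hacky" from_pretrained() route described above.
# Assumptions: opt.py's --save wrote a torch-saved state dict to ./quantized_model,
# and "opt125m-quip" is just an arbitrary local folder name.
import os
import shutil

from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

local_dir = "opt125m-quip"
os.makedirs(local_dir, exist_ok=True)

# 1) Copy the original config (and tokenizer files) from the Hugging Face hub.
AutoConfig.from_pretrained("facebook/opt-125m").save_pretrained(local_dir)
AutoTokenizer.from_pretrained("facebook/opt-125m").save_pretrained(local_dir)

# 2) Rename the checkpoint saved by opt.py to a filename from_pretrained() recognizes.
shutil.copy("quantized_model", os.path.join(local_dir, "pytorch_model.bin"))

# 3) Load the model from the local folder instead of the hub.
model = AutoModelForCausalLM.from_pretrained(local_dir)
tokenizer = AutoTokenizer.from_pretrained(local_dir)

# Quick smoke test: confirm the folder loads and the model generates.
inputs = tokenizer("The quantized model says:", return_tensors="pt")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```

The smoke test at the end only checks that the folder loads and generates; perplexity / zero-shot numbers should still come from the repo's own evaluation scripts via --load.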
Sorry this is a bit hacky; we're working on releasing model checkpoints and a better guide.