Possible to quantize into 4-bit and 8-bit and still use the models #24
Hi, I was wondering if it's possible to do something like GPTQ quantization into 8-bit or 4-bit and still use the embeddings from the models. GPTQ 4-bit models perform quite well compared to fp16 and fp32 in text generation; I wasn't sure whether the same holds for embeddings. Any suggestions?

Comments

I haven't looked into that. Quantization would likely reduce the expressivity of the embeddings, so I would expect somewhat worse results, but they may still be good enough to make the saved compute worth it. In ordinary language modelling the final output vectors are reduced to discrete tokens, so being off by e.g. 0.0001 due to lower precision often does not change the generated token, and the performance impact is small. Embeddings, by contrast, are consumed directly, so precision errors carry straight through to downstream similarity scores.
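Not part of the thread, but one cheap way to answer this empirically is to compare quantized and full-precision embeddings of the same texts directly. The sketch below is a minimal illustration, not a confirmed recipe: it assumes `transformers`, `accelerate`, and `bitsandbytes` are installed, uses bitsandbytes 8-bit loading as a stand-in for GPTQ proper (4-bit GPTQ would typically go through a library such as AutoGPTQ), uses a placeholder checkpoint name, and uses plain mean pooling where your actual model may expect a different pooling scheme.

```python
# Hypothetical sketch: measure how far 8-bit embeddings drift from fp16 ones.
# Assumptions: transformers + accelerate + bitsandbytes installed, a GPU available,
# and "your/embedding-model" replaced with the real checkpoint.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "your/embedding-model"  # placeholder, not a real checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model_fp16 = AutoModel.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)
model_int8 = AutoModel.from_pretrained(
    MODEL_NAME, load_in_8bit=True, device_map="auto"  # bitsandbytes 8-bit
)

def embed(model, texts):
    # Mean-pool the last hidden state over non-padding tokens. This is one
    # common pooling choice; use whatever pooling your model was trained with.
    batch = tokenizer(
        texts, padding=True, truncation=True, return_tensors="pt"
    ).to(model.device)
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state
    mask = batch["attention_mask"].unsqueeze(-1).to(hidden.dtype)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

texts = ["A sample query.", "An unrelated sentence."]
emb_fp16 = embed(model_fp16, texts).float()
emb_int8 = embed(model_int8, texts).float()

# Cosine similarity between each fp16 embedding and its 8-bit counterpart;
# values very close to 1.0 suggest quantization costs little for this model.
print(F.cosine_similarity(emb_fp16, emb_int8, dim=-1))
```

If the per-text similarities stay near 1.0, and more importantly the nearest-neighbour rankings on a small retrieval set are unchanged, the precision loss is probably acceptable for the compute savings; if rankings flip, the expressivity loss the comment above predicts is biting.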