Llama-3 8B Instruct quantized to 8 bit spits out gibberish in transformers model.generate() but works fine in vLLM? #657
Comments
@davidgxue Use the latest 4.40.1 or even the latest release. They just fixed a llama generate regression that I encountered. This bug is specific to transformers and llama.
@Qubitium I tried 4.40.1; it has the same problem. I am also already installing transformers directly from the main branch. And to add to the above: I get gibberish if I use […].

What is very interesting is that I was working on this PR (#651) to extend AutoGPTQ support to Phi 3, and I got asked to post perplexity results. For simplicity, I used AutoGPTQ's benchmarking script (uses […]).

But given that huggingface/transformers#30380 has already been merged into the main branch, […]. And to add to this, when I was working on the PR extending support for Phi 3, I had […].
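As a side note, the perplexity check itself is easy to reproduce outside AutoGPTQ's benchmarking script. A minimal sketch, assuming a placeholder checkpoint path and wikitext-2 as the eval set (neither is necessarily what the script uses):

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "path/to/quantized-checkpoint"  # placeholder
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
model.eval()

# Concatenate the eval split into one long token stream.
text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tok(text, return_tensors="pt").input_ids

# Approximate perplexity: mean NLL over non-overlapping 2048-token windows.
max_len, nlls = 2048, []
for start in range(0, ids.size(1) - 1, max_len):
    chunk = ids[:, start : start + max_len].to(model.device)
    if chunk.size(1) < 2:  # a 1-token tail has no prediction targets
        break
    with torch.no_grad():
        loss = model(chunk, labels=chunk).loss  # mean NLL per predicted token
    nlls.append(loss)

print("perplexity:", torch.exp(torch.stack(nlls).mean()).item())
```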
So I feel like this may not be related to huggingface/transformers#30380, since it's already merged? Or did something else get added that broke things? So far we know transformers 4.38.2, 4.40.1, and 4.41.0.dev0 (the current dev version) are broken... just compiling things for reference. By the way, Phi-3 uses the Llama 2 architecture, so this may still be a llama-family-related problem...
Maybe related: huggingface/transformers#27179
I'll try to have a look if I get the time to.
Yeah, so I looked into it. Notably, in this scenario both dtypes seem to be broken: I get gibberish whether I load in float16 or bfloat16.
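A quick way to check the dtype angle is to run the same greedy prompt under both dtypes; a minimal sketch, with the model ID as a placeholder:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/Meta-Llama-3-8B-Instruct-GPTQ-8bit"  # placeholder, not the actual repo
tok = AutoTokenizer.from_pretrained(model_id)
inputs = tok("What is the capital of France?", return_tensors="pt")

# Load the same checkpoint once per dtype and compare the generations.
for dtype in (torch.float16, torch.bfloat16):
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=dtype, device_map="auto"
    )
    out = model.generate(**inputs.to(model.device), max_new_tokens=32, do_sample=False)
    print(dtype, "->", tok.decode(out[0], skip_special_tokens=True))
    del model
    torch.cuda.empty_cache()
```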
Problem description
Hi friends, hope someone can help out or point me in the right direction here. I feel like this may be an integration thing with `transformers`? I can't understand why this spits out gibberish in transformers while vLLM works just fine. I thought it might be decoding strategy/sampling related, but that doesn't feel right either, considering my following super odd observations (a sketch of the vLLM side follows the list):

- This 8 bit Llama 3 quantized model works perfectly fine with vLLM for inference. No gibberish with the exact same params and prompt. But I get gibberish with both huggingface `transformers`' `model.generate()` and the text generation pipeline.
- I re-quantized it again with different package versions. Same problem: gibberish if using `transformers`, but it works fine with vLLM inference.
- I also made a 4 bit quant model using the same dataset, same environment, same script, same setup, yet that 4 bit model works fine with both vLLM and `transformers`. No gibberish when using transformers, basically no issues at all. I have listed the 4 bit model below as well. This is the part that gets me most confused...
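For concreteness, the vLLM side of the comparison looks roughly like this (a sketch; the model ID and sampling params are placeholders, not the exact ones used):

```python
from vllm import LLM, SamplingParams

# Placeholder model ID; quantization="gptq" tells vLLM to use its GPTQ kernels.
llm = LLM(model="your-org/Meta-Llama-3-8B-Instruct-GPTQ-8bit", quantization="gptq")
params = SamplingParams(temperature=0.0, max_tokens=64)

outputs = llm.generate(["Tell me a short joke."], params)
print(outputs[0].outputs[0].text)  # coherent text here, unlike transformers
```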
8 bit quant config (gibberish 8 bit model): […]

In comparison to the 4 bit model quant config (the model that works fine without gibberish): […]
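As a purely hypothetical illustration of what such a pair could look like (every value below is an assumption, not the actual config), an AutoGPTQ setup where only `bits` differs:

```python
from auto_gptq import BaseQuantizeConfig

# Hypothetical values throughout; only the bits=8 vs bits=4 contrast matters.
config_8bit = BaseQuantizeConfig(
    bits=8,            # the model that produces gibberish under transformers
    group_size=128,
    desc_act=True,
    damp_percent=0.1,
)

config_4bit = BaseQuantizeConfig(
    bits=4,            # the model that works fine in both stacks
    group_size=128,
    desc_act=True,
    damp_percent=0.1,
)
```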
Quantization dataset: […]. `model.generate()` was using the exact same dataset.

Inference script:
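A minimal stand-in for the inference script (model ID, prompt, and generation parameters are placeholders, not the original ones):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/Meta-Llama-3-8B-Instruct-GPTQ-8bit"  # placeholder
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.float16,
)

# Llama-3 Instruct expects the chat template; greedy decoding rules out sampling issues.
messages = [{"role": "user", "content": "Tell me a short joke."}]
input_ids = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

out = model.generate(input_ids, max_new_tokens=64, do_sample=False)
print(tok.decode(out[0][input_ids.shape[-1]:], skip_special_tokens=True))
```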
Output: gibberish.
Also tried using AutoGPTQ to load the model directly; also gibberish.
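Loading through AutoGPTQ's own API rather than the transformers integration looks roughly like this (a sketch; the model ID is a placeholder):

```python
from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer

model_id = "your-org/Meta-Llama-3-8B-Instruct-GPTQ-8bit"  # placeholder
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoGPTQForCausalLM.from_quantized(model_id, device="cuda:0")

inputs = tok("Tell me a short joke.", return_tensors="pt").to("cuda:0")
out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tok.decode(out[0], skip_special_tokens=True))
```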
Software Versions:
transformers: tried both 4.38.2 and 4.40.0dev
auto_gptq: 0.7.1
optimum: 1.19.1