Output gets corrupted when a quantized finetuned model is used with CUDA #2046
I got the same kind of issue with fine-tuned French models, but in my case it also occurred with the non-quantized models, and with both GPU and CPU inference. With long audio files, it works correctly for the first chunks (between 3 and 5 minutes), but at some point the output switches to English (the transcription is somehow still correct, just in the wrong language) and sometimes degrades into nonsense, repeating special tokens, etc. It may also produce a single French chunk before generating garbage again. I observed this with all three fine-tuned models that I converted. I haven't found the cause yet, but it must come (at least in my case) from the convert-h5-to-ggml.py script, which I have not yet looked into. When I tried a pre-converted fine-tuned model, it worked without any issue.
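For reference, the conversion step is roughly the following. Paths here are hypothetical and should be adjusted to your own checkouts; the script lives in the whisper.cpp `models/` directory and, as far as I know, also needs a clone of the original openai/whisper repo for the mel filters and tokenizer files:

```shell
# Hypothetical paths -- adjust to your own setup.
HF_MODEL=./whisper-medium-fr    # fine-tuned Hugging Face checkpoint directory
WHISPER_REPO=./whisper          # clone of the openai/whisper repo
OUT_DIR=./models                # where the converted ggml model is written

# Run the converter only if it is present in this checkout.
if [ -f models/convert-h5-to-ggml.py ]; then
  python models/convert-h5-to-ggml.py "$HF_MODEL" "$WHISPER_REPO" "$OUT_DIR"
fi
```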
I was testing a quantized Whisper Medium model fine-tuned for Portuguese when I noticed the results were odd.
It seems that the transcription gets corrupted for some reason. I tried using the CPU and the output is normal, but when using the GPU it is corrupted. Using Q4_0 or Q5_0 results in corruption too.
I also attempted to use another model, a quantized Whisper Small, also fine-tuned for Portuguese, and the output got corrupted too.
Using the original model doesn't produce any corruption, and neither do quantized versions of the standard Whisper models.
I quantized these models myself, so I know they are up to date with my version of whisper.cpp.
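To make this reproducible, the quantization and transcription steps look roughly like this. Model and audio file names are hypothetical placeholders; the `quantize` and `main` binaries are the ones built from the whisper.cpp repo:

```shell
# Hypothetical model/audio paths -- adjust to your own setup.
QTYPE=q5_0                                   # corruption also reproduced with q4_0
IN=models/ggml-medium-pt.bin                 # converted fine-tuned model
OUT=models/ggml-medium-pt-$QTYPE.bin

# Quantize the converted model (only if the binary exists here).
if [ -x ./quantize ]; then
  ./quantize "$IN" "$OUT" "$QTYPE"
fi

# Transcribe: with a CUDA build the output is corrupted,
# while a CPU-only build transcribes the same file correctly.
if [ -x ./main ]; then
  ./main -m "$OUT" -l pt -f samples/audio.wav
fi
```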
In summary: quantized fine-tuned models produce corrupted output on CUDA but not on CPU, while the original fine-tuned models and quantized standard models work fine either way.
I'm using an RTX 3060 Mobile (6 GB VRAM) with CUDA 11.5 on Ubuntu 22.04.4.