Output gets corrupted when a quantized finetuned model is used with CUDA #2046

PedroVNasc opened this issue Apr 12, 2024 · 1 comment
I was testing a quantized Whisper Medium model fine-tuned for Portuguese when I noticed the results were odd.

!!Estamos aqui para pedir emprestada!!
output_txt: saving output to 'medium_q8_0/common_voice_pt_19273358.wav.txt'

!! Graças a Deus você está aqui!
output_txt: saving output to 'medium_q8_0/common_voice_pt_19273359.wav.txt'

!P!recisamos nos apressar!
output_txt: saving output to 'medium_q8_0/common_voice_pt_19273360.wav.txt'

!A necessidade! é pai! na inovação!
output_txt: saving output to 'medium_q8_0/common_voice_pt_19273362.wav.txt'

!Você poderia ter mor!! depois! que a paz! fosse declarada
output_txt: saving output to 'medium_q8_0/common_voice_pt_19275111.wav.txt'

It seems the transcription gets corrupted for some reason. With the CPU the output is normal, but with the GPU it is corrupted. Using Q4_0 or Q5_0 quantization results in corruption too.

I also tried another model, a quantized Whisper Small, also fine-tuned for Portuguese, and its output got corrupted as well.

Using the original model doesn't produce any corruption, and quantized versions of the standard Whisper models don't produce corruption either.

I quantized these models myself, so I know they are up to date with my version of whisper.cpp.
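For reference, the quantization step looked roughly like this, using the quantize tool shipped with whisper.cpp (model file names and paths here are illustrative):

```shell
# Build the quantize tool that comes with whisper.cpp
make quantize

# Quantize the fine-tuned ggml model to Q8_0 (paths are illustrative)
./quantize models/ggml-medium-pt.bin models/ggml-medium-pt-q8_0.bin q8_0
```

The same command with `q4_0` or `q5_0` as the last argument produces the other quantized variants mentioned above.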

In summary:

  • CPU output is normal for any version of the model;
  • GPU output is normal for the original models;
  • GPU output is normal for the standard models, even when quantized;
  • GPU output is corrupted when using quantized fine-tuned models.

I'm using an RTX 3060 Mobile (6 GB VRAM) with CUDA 11.5 on Ubuntu 22.04.4.
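For completeness, the runs looked roughly like this (audio file names are illustrative); with a CUDA-enabled build, `-ng` / `--no-gpu` forces the CPU path used for comparison:

```shell
# GPU run (whisper.cpp built with CUDA support, e.g. WHISPER_CUDA=1 make)
./main -m models/ggml-medium-pt-q8_0.bin -l pt -otxt -f common_voice_pt_19273358.wav

# CPU-only run for comparison
./main -ng -m models/ggml-medium-pt-q8_0.bin -l pt -otxt -f common_voice_pt_19273358.wav
```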

@pauljouet
I ran into the same kind of issue with fine-tuned French models, but in my case it also occurred with the non-quantized models, and with both GPU and CPU inference. With long audio files, the first chunks (between 3 and 5 minutes) are transcribed correctly, but at some point the output switches to English (the transcription is somehow still correct, just in the wrong language) and sometimes degenerates into nonsense, repeating special tokens, etc. It may also produce a single French chunk before generating garbage again.

I observed this with all three of the fine-tuned models that I converted.

I haven't found the cause yet, but (at least in my case) it must come from the convert-h5-to-ggml.py script, which I haven't looked into yet.
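For reference, my conversion looked roughly like this (paths are illustrative); as I understand it, the script also needs a local checkout of the original openai/whisper repo for its assets:

```shell
# Convert a Hugging Face fine-tuned checkpoint to ggml (paths are illustrative)
python models/convert-h5-to-ggml.py /path/to/hf-finetuned-model /path/to/whisper .
```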

When I tried using a pre-converted finetuned model, it worked without any issue.
