Output gets corrupted when a quantized finetuned model is used with CUDA #2046

PedroVNasc opened this issue Apr 12, 2024 · 1 comment
I was testing a quantized Whisper Medium model fine-tuned for Portuguese when I noticed the results were odd.

!!Estamos aqui para pedir emprestada!!
output_txt: saving output to 'medium_q8_0/common_voice_pt_19273358.wav.txt'

!! Graças a Deus você está aqui!
output_txt: saving output to 'medium_q8_0/common_voice_pt_19273359.wav.txt'

!P!recisamos nos apressar!
output_txt: saving output to 'medium_q8_0/common_voice_pt_19273360.wav.txt'

!A necessidade! é pai! na inovação!
output_txt: saving output to 'medium_q8_0/common_voice_pt_19273362.wav.txt'

!Você poderia ter mor!! depois! que a paz! fosse declarada
output_txt: saving output to 'medium_q8_0/common_voice_pt_19275111.wav.txt'

It seems the transcription gets corrupted for some reason. With the CPU the output is normal, but with the GPU it is corrupted. Using Q4_0 or Q5_0 quantization results in corruption too.

I also tried another model, a quantized Whisper Small, also fine-tuned for Portuguese, and its output got corrupted as well.

Using the original model doesn't produce any corruption, and quantized versions of the standard Whisper models don't produce corruption either.

I quantized these models myself, so I know they are up to date with my version of whisper.cpp.
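For reference, the quantization step looked roughly like this, using the quantize tool shipped with whisper.cpp (model file names and paths here are illustrative):

```shell
# Build the quantize tool that comes with whisper.cpp
make quantize

# Quantize the fine-tuned ggml model to Q8_0 (paths are illustrative)
./quantize models/ggml-medium-pt.bin models/ggml-medium-pt-q8_0.bin q8_0
```

The same command with `q4_0` or `q5_0` as the last argument produces the other quantized variants mentioned above.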

In summary:

  • CPU output is normal for any version of the model;
  • GPU output is normal for the original models;
  • GPU output is normal for the standard models, even when quantized;
  • GPU output is corrupted when using quantized fine-tuned models.

I'm using an RTX 3060 Mobile (6 GB VRAM) with CUDA 11.5 on Ubuntu 22.04.4.
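For completeness, the runs looked roughly like this (audio file names are illustrative); with a CUDA-enabled build, `-ng` / `--no-gpu` forces the CPU path used for comparison:

```shell
# GPU run (whisper.cpp built with CUDA support, e.g. WHISPER_CUDA=1 make)
./main -m models/ggml-medium-pt-q8_0.bin -l pt -otxt -f common_voice_pt_19273358.wav

# CPU-only run for comparison
./main -ng -m models/ggml-medium-pt-q8_0.bin -l pt -otxt -f common_voice_pt_19273358.wav
```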

@pauljouet
I ran into the same kind of issue with fine-tuned French models, but in my case it also occurred with the non-quantized models, and with both GPU and CPU inference. With long audio files, the first chunks (between 3 and 5 minutes) are transcribed correctly, but at some point the output switches to English (the transcription is somehow still correct, just in the wrong language) and sometimes degenerates into nonsense, repeating special tokens, etc. It may also produce a single French chunk before generating garbage again.

I observed this with all three of the fine-tuned models that I converted.

I haven't found the cause yet, but (at least in my case) it must come from the convert-h5-to-ggml.py script, which I haven't looked into yet.
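For reference, my conversion looked roughly like this (paths are illustrative); as I understand it, the script also needs a local checkout of the original openai/whisper repo for its assets:

```shell
# Convert a Hugging Face fine-tuned checkpoint to ggml (paths are illustrative)
python models/convert-h5-to-ggml.py /path/to/hf-finetuned-model /path/to/whisper .
```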

When I tried using a pre-converted finetuned model, it worked without any issue.
