I created this to write down some of my observations on improving the output quality of whisper, especially regarding hallucinations and repetitions.
They come from an attempt at translating an entire foreign-language movie into English.
I hope others will comment and add suggestions before I open a PR.
Several people have complained that whisper easily gets into an endless repetition loop, or hallucinates, especially after silence or noise, and that simply restarting the transcription from a different starting time gives a better result.
This suggests that `main` is doing something suboptimal when processing audio patches.
In general, without modifying anything, you can improve things by doing the following:

- Limit context with `--context 64` or a slightly higher value. You can even try `0` to make sure the prompt doesn't interfere.
- For multilingual / translation work, use the `medium` whisper model.
- Increase the entropy threshold a bit with `--entropy-thold 2.6` to avoid repetition.
- Experiment with `--flash-attn` for improved quality in general.
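For reference, the same knobs are available through the library API. A minimal sketch, assuming a recent whisper.cpp (exact field names may differ between versions):

```cpp
#include "whisper.h"

int main() {
    // --flash-attn
    struct whisper_context_params cparams = whisper_context_default_params();
    cparams.flash_attn = true;

    // the "medium" model for multilingual / translation work
    struct whisper_context * ctx =
        whisper_init_from_file_with_params("models/ggml-medium.bin", cparams);
    if (!ctx) return 1;

    struct whisper_full_params wparams =
        whisper_full_default_params(WHISPER_SAMPLING_GREEDY);
    wparams.translate      = true;  // foreign language -> English
    wparams.n_max_text_ctx = 64;    // --context 64 (0 disables the prompt)
    wparams.entropy_thold  = 2.6f;  // --entropy-thold 2.6

    // ... load 16 kHz mono PCM and call whisper_full(ctx, wparams, pcm, n_samples) ...

    whisper_free(ctx);
    return 0;
}
```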
Here are my observations from changing whisper `main`:

**Do all of the above.**

**Use a higher `--logprob-thold` threshold, `-1.25`.**
**Adjust the seek starting value for the next segment a little.**
The theory is that the generated timestamps land slightly off a word boundary, so simply stepping 20 ms back in time actually improves the output.
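A sketch of the idea, with hypothetical names; in whisper.cpp the seek position is measured in 10 ms mel frames, so stepping back 2 frames rewinds 20 ms:

```cpp
#include <algorithm>

// Hypothetical helper: given where the current window started (seek) and
// where the decoder placed the end of the last segment (seek_end), choose
// the start of the next window. Units are 10 ms mel frames, so rewinding
// 2 frames steps 20 ms back across the suspected word boundary.
int next_seek(int seek, int seek_end) {
    const int rewind_frames = 2;                          // 20 ms
    return std::max(seek + 1, seek_end - rewind_frames);  // always make progress
}
```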
**Improve the entropy algorithm.**
It is currently calculated over timestamp tokens too, and it should use log2.
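A sketch of the adjusted score, assuming the current implementation's frequency-count entropy over the last 32 decoded tokens (illustrative names; `token_beg` is the id of the first timestamp token):

```cpp
#include <algorithm>
#include <cmath>
#include <map>
#include <vector>

// Entropy of the token-id distribution over the last 32 decoded tokens,
// with the two proposed changes: timestamp tokens (id >= token_beg) are
// skipped, and the log is base 2 instead of natural log.
float sequence_entropy(const std::vector<int> & ids, int token_beg) {
    const int i0 = std::max(0, (int) ids.size() - 32);

    std::map<int, int> counts;
    int cnt = 0;
    for (int i = i0; i < (int) ids.size(); ++i) {
        if (ids[i] >= token_beg) continue; // change 1: skip timestamp tokens
        counts[ids[i]]++;
        cnt++;
    }
    if (cnt == 0) return 0.0f;

    float h = 0.0f;
    for (const auto & kv : counts) {
        const float p = float(kv.second) / float(cnt);
        h -= p * std::log2(p);             // change 2: log2, not natural log
    }
    return h;
}
```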
Add a "system prompt" that is kept for all generations
This seems to encourage the model to output "(music)" and "(inaudible)"
non-speech segments. We can use these to adjust our other parameters at
runtime.
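A sketch of what "kept for all generations" means here (hypothetical names): the fixed system tokens are prepended on every decode, and only the rolling history is ever trimmed:

```cpp
#include <algorithm>
#include <vector>

using token_id = int; // stand-in for whisper_token

// Hypothetical prompt builder: the "system prompt" tokens survive every
// generation; only the rolling history is trimmed to fit the context.
std::vector<token_id> build_prompt(const std::vector<token_id> & system_tokens,
                                   const std::vector<token_id> & history,
                                   int n_max_text_ctx) {
    std::vector<token_id> prompt = system_tokens; // never trimmed
    const int budget = std::max(0, n_max_text_ctx - (int) system_tokens.size());
    const int keep   = std::min((int) history.size(), budget);
    prompt.insert(prompt.end(), history.end() - keep, history.end());
    return prompt;
}
```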
**Use `--no-fallback`, or better, add a new `--temperature-max` argument** to limit how much fantasy the model can use to produce output. See the discussion of the fallback strategy below.
**Add several new samplers in addition to the existing `avg_logprobs`:**

- `min-p`, to discard any initial bad choice. It fails the decoder early at the first non-timestamp token with a low p value. This was the most effective at avoiding hallucinations altogether (see the sketch after this list).
- `min-plog-sum`, to discard any initial bad choice. This is experimental and seems to deal with some remaining hallucinations.
- `min-plog-distance`, to discard any initial bad choice. This is experimental: if the distance between the top logprob scores is too big, fail the decoder.
- `min-score`, to disregard an output segment that whisper itself rated low. If we know it's a bad output, why are we using it?
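A sketch of the `min-p` check, the most effective of the four (illustrative names; `p` is the probability assigned to the sampled token):

```cpp
#include <vector>

struct decoded_token {
    int   id;
    float p;  // probability of the sampled token
};

// Hypothetical min-p sampler: as soon as the first non-timestamp token is
// sampled with p below the threshold, fail the decoder instead of letting
// it commit to a bad start and hallucinate the rest of the segment.
bool min_p_ok(const std::vector<decoded_token> & tokens, int token_beg, float min_p) {
    for (const auto & t : tokens) {
        if (t.id >= token_beg) continue;  // ignore timestamp tokens
        return t.p >= min_p;              // judge only the first text token
    }
    return true;                          // no text tokens to judge
}
```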
**Suppress non-speech.**
There is already an option for that, but it is too aggressive and interferes with the ability to output (music) and (inaudible) segments, which I want to keep.
**Suppress the history prompt occasionally.**
It turns out that a quick fix for repetitions and hallucinations is sometimes just to let the decoder work without the previous prompt history. You can do `--context 0`, but the prompt history does improve output under normal conditions.
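A sketch of the idea (hypothetical): drop the history only when the samplers flag trouble, instead of disabling it globally with `--context 0`:

```cpp
#include <vector>

using token_id = int; // stand-in for whisper_token

// Hypothetical hook: when the samplers keep rejecting a segment, clear the
// rolling history once so the next decode runs without the previous prompt,
// then let the history rebuild under normal conditions.
void maybe_drop_history(std::vector<token_id> & prompt_past, bool samplers_rejecting) {
    if (samplers_rejecting) {
        prompt_past.clear();
    }
}
```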
Suppress repeated segments
While controversial since repeated sentences naturally occur in transcriptions
("Hello. Hello. Are you okay? Yes. Nice Weather? Yes."), I found
hallucinations are often repeated segments and causes the model to think it's
doing the right thing by translating everything into Yes.
Unfortunately the model can output hallucinations with high p values and I have
not yet found a reliable way to distinguish them. Once a single one of these
gets into your prompt history, it tends to affect your next output. So ban repeats.
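A sketch of the ban (hypothetical): compare the normalized text of the new segment against the previous one and reject exact repeats before they enter the prompt history:

```cpp
#include <cctype>
#include <string>

// Normalize for comparison: lowercase and strip non-alphanumerics, so
// "Yes." and " yes" count as the same segment. Hypothetical helper.
static std::string normalize(const std::string & s) {
    std::string out;
    for (unsigned char c : s) {
        if (std::isalnum(c)) out += (char) std::tolower(c);
    }
    return out;
}

// Reject a segment whose normalized text repeats the previous segment,
// so a hallucinated repeat never enters the prompt history.
bool is_banned_repeat(const std::string & prev_text, const std::string & new_text) {
    const std::string a = normalize(prev_text);
    return !a.empty() && a == normalize(new_text);
}
```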
**Reinvent the current `main` "fallback" strategy.**
The current strategy is to increase the temperature from 0.0 to 1.0 in increments until the model produces any output, relying solely on the `avg_logprobs` sampler and eventually just taking the worst, temperature-1.0 result as the output! This is great if you want to demonstrate that the model can output something, but less so if you want to keep quality up.
A better fallback strategy is instead to nudge the decoder forward in small time increments, retrying the next audio patch at increasing offsets while our samplers keep rejecting low-quality output.
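A sketch of the proposed fallback (hypothetical names): instead of raising the temperature, retry the same window at increasing time offsets until the samplers accept the output:

```cpp
#include <functional>
#include <vector>

// Hypothetical fallback loop: instead of raising the temperature towards
// 1.0, retry the current window nudged forward in time by growing offsets
// (units: 10 ms mel frames). try_decode stands in for one decoder pass
// plus the sampler checks described above; it returns true on acceptance.
int decode_with_time_fallback(int seek, const std::function<bool(int)> & try_decode) {
    const std::vector<int> offsets = {0, 10, 25, 50, 100}; // 0 ms .. 1 s
    for (int off : offsets) {
        if (try_decode(seek + off)) {
            return seek + off;        // accepted; continue from here
        }
    }
    return seek + offsets.back();     // give up and skip ahead
}
```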
As mentioned, I have already tried all of the above, and the output is significantly better, with hardly any hallucinations.