
Spam Attack #2105

Open
DariusAlexander opened this issue Apr 29, 2024 · 2 comments

@DariusAlexander

Noticed there are prediction outputs that include spam:

start,end,text
0,8640," 6 greens of fresh snow peas, 5 thick slabs of blue cheese and maybe a snack for her brothered"
8640,9000," Bob."
9000,16000," For more information visit www.beadaholique.com to purchase beading supplies and to get design ideas!"
16000,23000," www.beadaholique.com to purchase beading supplies and to get design ideas!"
23000,30000," www.beadaholique.com to purchase beading supplies and to get design ideas!"

The source audio file is 30s long, zero-padded at the end with about 20s of (absolute) silence.
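One workaround that may help, since the repeated segments start in the silent tail, is to trim the trailing silence before transcription. A minimal sketch using ffmpeg's silenceremove filter (the -60dB threshold is a guess to tune, not a value from this repo):

ffmpeg -i myfile.wav -af "areverse,silenceremove=start_periods=1:start_threshold=-60dB,areverse" trimmed.wav
./main -ocsv -f trimmed.wav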

I followed the Quick Start guide:
git clone https://github.com/ggerganov/whisper.cpp.git
cd whisper.cpp
bash ./models/download-ggml-model.sh base.en
make
./main -ocsv -f myfile.wav

I've just started looking at this project, so I don't know the problem deeply, but it seems the model downloaded by ./models/download-ggml-model.sh (from https://huggingface.co/ggerganov/whisper.cpp) might be the issue.
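For anyone else hitting this, main also exposes decoder-fallback thresholds (-et / --entropy-thold and -lpt / --logprob-thold) that can suppress some repetitive output. The values below are experimental guesses to try, not recommended defaults:

./main -ocsv -et 2.0 -lpt -0.5 -f myfile.wav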

@bobqianic
Collaborator

See openai/whisper#1783.

@bobqianic
Collaborator

bobqianic commented Apr 29, 2024

For the past two months, I have been contemplating how to use limited resources (4 * V100, 4 * P100, 2 * 2080ti, and 200 A100 card-hours gifted to me by someone else) to partially solve these issues. Whisper sometimes hallucinates severely; see this paper: https://arxiv.org/pdf/2402.08021. The reason is that Whisper is trained on a weakly labeled dataset with considerable noise, making it prone to learning irrelevant information. My current idea is to distill Whisper Large v2, use it to label datasets, clean those datasets with an LLM and other neural networks, and finally train a new Whisper based on the Mixture of Experts (MoE) architecture. However, I'm not entirely sure this approach will be successful.

Whisper's vocabulary is also still too small, currently only about 60K tokens, which hurts model performance. The context is too small as well, currently only 448 tokens, and needs to be expanded.
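Until the training side improves, a crude stopgap for reports like this one is to flag repeated segment texts in the CSV output, since these hallucinations tend to repeat verbatim. A minimal sketch, assuming -ocsv wrote its output to myfile.wav.csv next to the input:

cut -d, -f3- myfile.wav.csv | sort | uniq -cd

This prints each duplicated text with its count; segments like the repeated www.beadaholique.com line above would show up immediately.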
