[Need help] How to realize Syllable-level Voice Recognition with sherpa-onnx Open Vocabulary Keyword Spotting #920

diyism · 2024-05-25T20:21:02Z

I have long been trying to use Sherpa to implement syllable-level speech recognition, either by
(1) using a few pinyin syllables to detect hotwords directly, or
(2) sending a long pinyin sequence to an LLM (GPT or Claude) to convert it into the most plausible Chinese sentence
(see k2-fsa/sherpa-ncnn#177).

I found that you released sherpa-onnx Open Vocabulary Keyword Spotting in February 2024 (https://k2-fsa.github.io/sherpa/onnx/kws/pretrained_models/index.html).

So I figured I could use it for syllable-level speech recognition.
I changed sherpa-onnx-kws-zipformer-wenetspeech-3.3M-2024-01-01/keywords.txt to:

j iǎng @jiang3
y ǒu @you3
b ó @bo2

h uí @hui2
d á  @da2
q ǐng @qing3
g ài @gai4
g ē  @ge1
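Each line above has the shape "initial final @label". A minimal sketch of generating such lines from tone-marked pinyin is below. The function names, the initials list, and writing a missing tone mark as 5 are illustrative assumptions, not part of sherpa-onnx. Whether every generated token actually exists in the model's tokens.txt still has to be checked.

```python
import unicodedata

# Pinyin initials, two-letter ones first so "zh" wins over "z".
# Treating "y" and "w" as initials follows the "y ǒu" line above.
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w")

# Combining marks that carry tones 1-4 after NFD decomposition.
TONE_MARKS = {"\u0304": 1, "\u0301": 2, "\u030c": 3, "\u0300": 4}


def split_tone(syllable):
    """Return (letters without tone mark, tone number)."""
    tone = 5  # assumption: no tone mark is written as tone 5
    plain = []
    for ch in unicodedata.normalize("NFD", syllable):
        if ch in TONE_MARKS:
            tone = TONE_MARKS[ch]
        else:
            plain.append(ch)
    return unicodedata.normalize("NFC", "".join(plain)), tone


def keyword_line(syllable):
    """Build one keywords.txt line such as 'j iǎng @jiang3'."""
    syllable = unicodedata.normalize("NFC", syllable)
    plain, tone = split_tone(syllable)
    initial = next((i for i in INITIALS if syllable.startswith(i)), "")
    final = syllable[len(initial):]
    tokens = " ".join(t for t in (initial, final) if t)
    return f"{tokens} @{plain}{tone}"


for s in ["jiǎng", "yǒu", "bó"]:
    print(keyword_line(s))
```

Feeding it a full syllable table would produce the complete keywords.txt in one pass.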

I tested it with AHPUymhd's code (#760),
setting sound_files = ["./sherpa-onnx-kws-zipformer-wenetspeech-3.3M-2024-01-01/test_wavs/4.wav"]
(the audio is "jiang3 you3 bo2 bei4 pai1 dao4 ...").
The output:
jiang3/bo2/bo2

I understand that "bo2 bo2" (伯伯, uncle) is a more frequent word than "you3 bo2", but syllable-level voice recognition needs to be future-proof and able to transcribe any new word into pinyin.
So I very carefully split ./sherpa-onnx-kws-zipformer-wenetspeech-3.3M-2024-01-01/test_wavs/4.wav so that each WAV file contains only one syllable:

$ sox 4.wav jiang3.wav trim 0.4 0.33
$ sox 4.wav you3.wav trim 0.77 0.2
$ sox 4.wav bo2.wav trim 1.05 0.25

Now, if I run the Python code with sound_files = ["./sherpa-onnx-kws-zipformer-wenetspeech-3.3M-2024-01-01/test_wavs/jiang3.wav"], it correctly outputs "jiang3";
with "you3.wav" it outputs "you3",
and with "bo2.wav" it outputs "bo2".
It works perfectly, even with other interfering pinyin entries (h uí, d á, ...) in the keywords.txt file (I am dreaming of adding all 1300 pinyin syllables to it).

So I guess sherpa-onnx Open Vocabulary Keyword Spotting is fully capable of recognizing isolated Chinese syllables,
but a method is needed to segment the audio into syllables.
Maybe something like silero-vad can do it.

Any ideas?

@danpovey @csukuangfj @pkufool @marcoyang1998


pkufool · 2024-05-27T03:00:23Z

~~What system do you want, can you clarify it in details.~~ @diyism

After reading the issue you posted earlier, I understand what you want. Actually, I already have a model that uses pinyin as its modeling unit on my machine; I will push it to Hugging Face.

diyism · 2024-05-27T07:21:30Z

Thank you, I am eager to test it.

diyism · 2024-06-17T18:32:55Z

What a pity, I can't find a real-time VAD for Mandarin syllables.
https://github.com/linto-ai/whisper-timestamped
https://github.com/readbeyond/aeneas (needs wav + text)
https://modelscope.cn/models/iic/speech_timestamp_prediction-v1-16k-offline/summary (needs wav + text)
https://github.com/snakers4/silero-vad

If there were a VAD that could detect the onset of every Mandarin syllable (https://courses.washington.edu/chin342/ipa/syllables.html),
I could cut the first 0.2 seconds of each syllable into its own WAV file and send it to sherpa-onnx Open Vocabulary Keyword Spotting, since sherpa-onnx KWS can already recognize single-syllable WAV files accurately.
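The cutting step described here can be sketched with Python's standard wave module, as an alternative to calling "sox ... trim". The function name cut_segment is hypothetical, and the syllable onsets would have to come from the syllable-level VAD that is still missing; this only shows the slicing.

```python
import wave


def cut_segment(src_path, dst_path, start_s, duration_s):
    """Copy duration_s seconds starting at start_s into a new WAV file.

    Returns the number of audio frames written.
    """
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        src.setpos(int(round(start_s * rate)))
        frames = src.readframes(int(round(duration_s * rate)))
    with wave.open(dst_path, "wb") as dst:
        dst.setparams(params)   # frame count in the header is fixed on close
        dst.writeframes(frames)
    return len(frames) // (params.sampwidth * params.nchannels)
```

For example, cut_segment("4.wav", "you3.wav", 0.77, 0.2) is meant to mirror "sox 4.wav you3.wav trim 0.77 0.2" for an uncompressed PCM WAV file.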

pkufool · 2024-06-21T10:56:53Z

@diyism Please follow the progress of this PR: https://github.com/k2-fsa/icefall/pull/1662 . It adds models trained with pinyin.

The recognition results look like this; I think they are what you want.

DEV_T0000000000_S00000: ref=['duì', 'wǒ', 'zuò', 'le', 'jiè', 'shào', 'a', 'nà', 'me', 'wǒ', 'xiǎng', 'shuō', 'de', 'shì', 'ne', 'dà', 'jiā', 'rú', 'guǒ', 'duì', 'wǒ', 'de', 'yán', 'jiū', 'gǎn', 'xìng', 'qù', 'ne', 'ń']
DEV_T0000000000_S00000: hyp=['duì', 'wǒ', 'zuò', 'le', 'jiè', 'shào', 'nà', 'me', 'wǒ', 'xiǎng', 'shuō', 'de', 'shì', 'dà', 'jiā', 'rú', 'guǒ', 'duì', 'wǒ', 'de', 'yán', 'jiū', 'gǎn', 'xìng', 'qù']
DEV_T0000000001_S00000: ref=['zhòng', 'diǎn', 'ne', 'xiǎng', 'kàn', 'sān', 'gè', 'wèn', 'tí', 'shǒu', 'xiān', 'ne', 'jiù', 'shì', 'zhè', 'yī', 'lún', 'quán', 'qiú', 'jīn', 'róng', 'dòng', 'dàng', 'de', 'biǎo', 'xiàn']
DEV_T0000000001_S00000: hyp=['zhòng', 'diǎn', 'xiǎng', 'tán', 'sān', 'gè', 'wèn', 'tí', 'shǒu', 'xiān', 'jiù', 'shì', 'zhè', 'yī', 'lún', 'quán', 'qiú', 'jīn', 'róng', 'dòng', 'dàng', 'de', 'biǎo', 'xiàn']
DEV_T0000000002_S00000: ref=['shēn', 'rù', 'dì', 'fēn', 'xī', 'zhè', 'yī', 'cì', 'quán', 'qiú', 'jīn', 'róng', 'dòng', 'dàng', 'bèi', 'hòu', 'de', 'gēn', 'yuán']
DEV_T0000000002_S00000: hyp=['shēn', 'rù', 'dì', 'fēn', 'xī', 'zhè', 'yī', 'cì', 'quán', 'qiú', 'jīn', 'róng', 'dòng', 'dàng', 'bèi', 'hòu', 'de', 'gēn', 'yuán']
DEV_T0000000003_S00000: ref=['a', 'jiǎng', 'dì', 'yí', 'gè', 'wèn', 'tí', 'hā', 'zěn', 'me', 'lái', 'kàn', 'dài']
DEV_T0000000003_S00000: hyp=['jiǎng', 'dì', 'yí', 'gè', 'wèn', 'tí', 'zěn', 'me', 'lái', 'kàn', 'dài']

pkufool · 2024-06-21T11:06:52Z

> What a pity, I can't find a real-time VAD for Mandarin syllables. https://github.com/linto-ai/whisper-timestamped https://github.com/readbeyond/aeneas (needs wav + text) https://modelscope.cn/models/iic/speech_timestamp_prediction-v1-16k-offline/summary (needs wav + text) https://github.com/snakers4/silero-vad
>
> If there were a VAD that could detect the onset of every Mandarin syllable (https://courses.washington.edu/chin342/ipa/syllables.html), I could cut the first 0.2 seconds of each syllable into its own WAV file and send it to sherpa-onnx Open Vocabulary Keyword Spotting, since sherpa-onnx KWS can already recognize single-syllable WAV files accurately.

Using VAD + KWS is not the usual way to implement this feature. I am very surprised you have not implemented it in a year. Did you try following our suggestion to train a model based on pinyin? Did you have trouble training the model? Anyway, I am adding the recipe now; see the comment above.

diyism · 2024-06-25T18:32:21Z

I guess "VAD splitting" can prevent interference between syllables, such as "jiang3 you3 bo2" being recognized as "jiang3 bo2 bo2" (蒋伯伯), or "gai4 ge2" being recognized as "gai4 kuo4" (概括).

Additionally, VAD can avoid issues caused by continuously sliding time windows to extract segments for recognition, like in whisper-timestamped realtime recognition.

It can also prevent the problem of missing syllables: ['zhòng', 'diǎn', 'ne', 'xiǎng', 'kàn', 'sān', 'gè', 'wèn', 'tí'] missing the middle 'ne'. Human ears and VAD can clearly know that there is a syllable 'ne' in the middle.

It seems that whisperX has the best VAD, but it does not target Mandarin (https://github.com/m-bain/whisperX).

Anyway, I'm looking forward to the progress of decode_pinyin.py: k2-fsa/icefall#1662

diyism · 2024-06-26T06:32:27Z

whisperX's VAD is not a syllable-level VAD either: it aligns at the word level (for example, "That's" in the whisperX README gets a single timestamp). It is also not a real-time VAD; like aeneas (https://github.com/readbeyond/aeneas), it does forced alignment of pre-prepared "wav + text".

Even if whisperX supported Mandarin, "./sherpa-onnx-kws-zipformer-wenetspeech-3.3M-2024-01-01/test_wavs/4.wav" would not be separated syllable by syllable:

The first 3 syllables, "jiang3 you3 bo2", would be separated as "jiang3 you3" and "bo2".
So whisperX's VAD can't prepare syllable-level WAV files for sherpa-onnx-kws the way my manual "sox trim" of 4.wav does.

diyism · 2024-06-27T16:18:35Z

I was wrong: whisperX can separate the first 2 syllables of "./sherpa-onnx-kws-zipformer-wenetspeech-3.3M-2024-01-01/test_wavs/4.wav":

$ sox ../sherpa-onnx/sherpa-onnx-kws-zipformer-wenetspeech-3.3M-2024-01-01/test_wavs/4.wav a.wav trim 0 0.97
$ whisperx --compute_type int8 --language zh a.wav
$ cat a.json
{"segments": [{"start": 0.537, "end": 0.822, "text": "讲有", "words": [{"word": "讲", "start": 0.537, "end": 0.801, "score": 0.923}, {"word": "有", "start": 0.801, "end": 0.822, "score": 0.591}]}], "word_segments": [{"word": "讲", "start": 0.537, "end": 0.801, "score": 0.923}, {"word": "有", "start": 0.801, "end": 0.822, "score": 0.591}], "language": "zh"}
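To read the per-character spans out of a.json, a few lines of Python are enough. This is only a sketch over the "word_segments" part of the JSON shown above; the helper name word_spans is hypothetical.

```python
import json

# The "word_segments" part of a.json, copied from the output above.
a_json = (
    '{"word_segments": ['
    '{"word": "讲", "start": 0.537, "end": 0.801, "score": 0.923}, '
    '{"word": "有", "start": 0.801, "end": 0.822, "score": 0.591}], '
    '"language": "zh"}'
)


def word_spans(doc):
    """Return (word, start, end, duration) for every aligned character."""
    return [
        (w["word"], w["start"], w["end"], round(w["end"] - w["start"], 3))
        for w in doc["word_segments"]
    ]


for word, start, end, dur in word_spans(json.loads(a_json)):
    print(f"{word}: {start:.3f} -> {end:.3f} ({dur:.3f} s)")
```

Note how short the span for "有" is (0.021 s) compared with the 0.2 s used for the hand-cut you3.wav, and how low its score is.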

Maybe I can use whisperX as a "real-time syllable-level speech recognizer", but it is not fast enough:

$ time whisperx --compute_type int8 --language zh a.wav
real	0m16.209s
user	0m21.419s
sys	0m5.020s

sherpa-onnx-kws.py, on the other hand, is very fast, but it can't recognize the second syllable:

$ sox ./sherpa-onnx-kws-zipformer-wenetspeech-3.3M-2024-01-01/test_wavs/4.wav a.wav trim 0 1.0
$ aplay a.wav
$ python sherpa-onnx-kws.py --sound_files ./a.wav
Started!
jiang3 is detected.
Done!
./a.wav
jiang3/
----------
num_threads: 1
Wave duration: 1.000 s
Elapsed time: 0.053 s
Real time factor (RTF): 0.053/1.000 = 0.053

whisperX also has the interference problem between syllables:

$ sox ../sherpa-onnx/sherpa-onnx-kws-zipformer-wenetspeech-3.3M-2024-01-01/test_wavs/4.wav a.wav trim 0 1.2
$ aplay a.wav
$ whisperx --compute_type int8 --language zh a.wav
$ cat a.json
{"segments": [{"start": 0.536, "end": 1.062, "text": "江有博", "words": [{"word": "江", "start": 0.536, "end": 0.799, "score": 0.923}, {"word": "有", "start": 0.799, "end": 1.041, "score": 0.947}, {"word": "博", "start": 1.041, "end": 1.062, "score": 0.257}]}], "word_segments": [{"word": "江", "start": 0.536, "end": 0.799, "score": 0.923}, {"word": "有", "start": 0.799, "end": 1.041, "score": 0.947}, {"word": "博", "start": 1.041, "end": 1.062, "score": 0.257}], "language": "zh"}

The first syllable's tone is wrong.

sherpa-onnx-kws with 3 wav files of single syllable:

$ sox 4.wav jiang3.wav trim 0.4 0.33
$ sox 4.wav you3.wav trim 0.77 0.2
$ sox 4.wav bo2.wav trim 1.05 0.25
$ sox 4.wav jiang3you3bo2.wav trim 0 1.3

$ python sherpa-onnx-kws.py --sound_files ./jiang3.wav
Started!
jiang3 is detected.
$ python sherpa-onnx-kws.py --sound_files ./you3.wav
Started!
you3 is detected.
$ python sherpa-onnx-kws.py --sound_files ./bo2.wav
Started!
bo2 is detected.
$ python sherpa-onnx-kws.py --sound_files ./jiang3you3bo2.wav
Started!
jiang3 is detected.
bo2 is detected.
bo2 is detected.

Current results for the first 3 syllables of ./sherpa-onnx/sherpa-onnx-kws-zipformer-wenetspeech-3.3M-2024-01-01/test_wavs/4.wav:

whisperx with 1 wav file of 3 syllables:               jiang1 you3 bo2    (wrong)
sherpa-onnx-kws with 1 wav file of 3 syllables:        jiang3 bo2  bo2    (wrong)
sherpa-onnx-kws with 3 wav files of single syllable:   jiang3 you3 bo2    (correct)

sherpa-onnx-kws works perfectly with 3 single-syllable WAV files,
but I haven't found a syllable segmentation tool.

diyism · 2024-08-09T09:00:27Z

There was a project on "singing voice phoneme segmentation" (syllable segmentation) 6 years ago:
https://github.com/ronggong/interspeech2018_submission01

diyism · 2024-08-21T02:53:39Z

There's a simple project from 5 years ago that can segment syllables (https://github.com/diyism/thetaOscillator-syllable-segmentation),
but what a pity, it also can't split the first 2 syllables ("jiang3 you3") of ./sherpa-onnx-kws-zipformer-wenetspeech-3.3M-2024-01-01/test_wavs/4.wav.

diyism · 2024-08-23T02:12:28Z

I found a simple command, ~/miniconda3/bin/sherpa-onnx-keyword-spotter, which was installed when I built sherpa-onnx, and I've tested it with:

$ cd sherpa-onnx-kws-zipformer-wenetspeech-3.3M-2024-01-01
$ sherpa-onnx-keyword-spotter --tokens=tokens.txt     --encoder=encoder-epoch-12-avg-2-chunk-16-left-64.onnx     --decoder=decoder-epoch-12-avg-2-chunk-16-left-64.onnx     --joiner=joiner-epoch-12-avg-2-chunk-16-left-64.onnx     --provider=cpu     --num-threads=8     --keywords-file=../keywords.txt --keywords-threshold=0.2 --modeling-unit=cjkchar test_wavs/4.wav

the output:

{"start_time":0.00, "keyword": "jiang3", "timestamps": [0.64, 0.68], "tokens":["j", "iǎng"]}

test_wavs/4.wav
{"start_time":0.00, "keyword": "bo2", "timestamps": [1.12, 1.16], "tokens":["b", "ó"]}

test_wavs/4.wav
{"start_time":0.00, "keyword": "bo2", "timestamps": [1.28, 1.32], "tokens":["b", "ó"]}
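Since each result is printed as one JSON object per line, the output can be parsed directly. A minimal sketch, with the hypothetical helper parse_kws_output, using the three result lines above as sample input:

```python
import json


def parse_kws_output(text):
    """Collect the JSON result lines, skipping file names and blank lines."""
    results = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("{"):
            results.append(json.loads(line))
    return results


sample = """\
{"start_time":0.00, "keyword": "jiang3", "timestamps": [0.64, 0.68], "tokens":["j", "iǎng"]}

test_wavs/4.wav
{"start_time":0.00, "keyword": "bo2", "timestamps": [1.12, 1.16], "tokens":["b", "ó"]}

test_wavs/4.wav
{"start_time":0.00, "keyword": "bo2", "timestamps": [1.28, 1.32], "tokens":["b", "ó"]}
"""

for r in parse_kws_output(sample):
    # Pair every token with the timestamp at the same position.
    print(r["keyword"], list(zip(r["tokens"], r["timestamps"])))
```

In every result here, "timestamps" has exactly as many entries as "tokens", which is why the sketch pairs them position by position.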

The timestamps do not seem correct, judging by the earlier sox commands that produced the 3 WAV files that sherpa-onnx-keyword-spotter recognized perfectly:

$ sox 4.wav jiang3.wav trim 0.4 0.33
$ sox 4.wav you3.wav trim 0.77 0.2
$ sox 4.wav bo2.wav trim 1.05 0.25

The correct timestamps should be:

[0.40, 0.73]
[0.77, 0.97]
[1.05, 1.30]
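The intervals above are just "start" and "start + duration" from the sox trim arguments. The sketch below redoes that arithmetic and compares it with the reported values under one possible reading, which is an assumption and not confirmed in this thread: that each entry in "timestamps" is the time at which the matching token was emitted, not a [start, end] pair.

```python
def trim_to_interval(start, duration):
    """Convert sox 'trim START DURATION' arguments to (start, end) seconds."""
    return (round(start, 2), round(start + duration, 2))


# Hand-cut intervals, from the sox commands above.
hand_cut = {
    "jiang3": trim_to_interval(0.4, 0.33),
    "you3": trim_to_interval(0.77, 0.2),
    "bo2": trim_to_interval(1.05, 0.25),
}

# (keyword, timestamps) as reported by sherpa-onnx-keyword-spotter above.
reported = [
    ("jiang3", [0.64, 0.68]),
    ("bo2", [1.12, 1.16]),
    ("bo2", [1.28, 1.32]),
]


def inside(times, interval):
    """True if every time lies within the closed interval."""
    lo, hi = interval
    return all(lo <= t <= hi for t in times)


for keyword, times in reported:
    print(keyword, times, hand_cut[keyword], inside(times, hand_cut[keyword]))
```

Under that reading, the first two detections fall inside their hand-cut intervals, and only the spurious second "bo2" falls partly outside.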

It seems the allosaurus project (https://github.com/xinjli/allosaurus) can produce correct timestamps, but it missed the syllable "bei" (of the 5 syllables "jiang3 you3 bo2 bei4 pai1"):

$ python -m allosaurus.run -e 1.2 --lang=cmn -i 4_8000hz.wav --timestamp=True
0.510 0.045 ɕ
0.540 0.045 i
0.600 0.045 a
0.630 0.045 ŋ

0.750 0.045 j
0.780 0.045 i
0.810 0.045 o
0.870 0.045 ə

1.020 0.045 p
1.080 0.045 o

1.590 0.045 p
1.650 0.045 a
1.710 0.045 i
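The phone list above can be grouped into syllable candidates by looking at the gaps between consecutive phone start times. This is only a sketch: the function names are hypothetical, and the 0.1 s threshold is an assumption chosen to fit this one example, not a general rule.

```python
def parse_phones(text):
    """Parse 'start duration phone' lines into (start, duration, phone)."""
    phones = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 3:
            phones.append((float(parts[0]), float(parts[1]), parts[2]))
    return phones


def group_by_gap(phones, max_gap=0.1):
    """Start a new group when two phone onsets are more than max_gap apart."""
    groups = []
    prev_start = None
    for start, _duration, phone in phones:
        if prev_start is None or start - prev_start > max_gap:
            groups.append({"start": start, "phones": []})
        groups[-1]["phones"].append(phone)
        prev_start = start
    return groups


output = """\
0.510 0.045 ɕ
0.540 0.045 i
0.600 0.045 a
0.630 0.045 ŋ
0.750 0.045 j
0.780 0.045 i
0.810 0.045 o
0.870 0.045 ə
1.020 0.045 p
1.080 0.045 o
1.590 0.045 p
1.650 0.045 a
1.710 0.045 i
"""

for g in group_by_gap(parse_phones(output)):
    print(g["start"], "".join(g["phones"]))
```

The long gap between 1.080 and 1.590 is where the missing "bei4" would have to be.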

Is there any parameter that can improve the timestamps?
I've already tried "--modeling-unit=cjkchar", "--wenet-ctc-chunk-size=8" and "--wenet-ctc-num-left-chunks=2",
but they don't seem to have much effect.

diyism · 2024-08-23T02:36:14Z

If I keep only the first two syllables (jiang3 and you3) and lower "--keywords-threshold" to 0.08, it can recognize the second syllable:

$ sox test_wavs/4.wav jiang3you3.wav trim 0.0 0.97
$ sherpa-onnx-keyword-spotter --tokens=tokens.txt     --encoder=encoder-epoch-12-avg-2-chunk-16-left-64.onnx     --decoder=decoder-epoch-12-avg-2-chunk-16-left-64.onnx     --joiner=joiner-epoch-12-avg-2-chunk-16-left-64.onnx     --provider=cpu     --num-threads=8     --keywords-file=../keywords.txt --keywords-threshold=0.08 jiang3you3.wav
jiang3you3.wav
{"start_time":0.00, "keyword": "jiang3", "timestamps": [0.64, 0.68], "tokens":["j", "iǎng"]}

jiang3you3.wav
{"start_time":0.00, "keyword": "you3", "timestamps": [0.84, 0.96], "tokens":["y", "ǒu"]}

But the timestamps are still not correct.

And with "--keywords-threshold=0.08" on the whole test_wavs/4.wav file, the result is still the wrong "蒋伯伯":

test_wavs/4.wav
{"start_time":0.00, "keyword": "jiang3", "timestamps": [0.64, 0.68], "tokens":["j", "iǎng"]}

test_wavs/4.wav
{"start_time":0.00, "keyword": "bo2", "timestamps": [1.12, 1.16], "tokens":["b", "ó"]}

test_wavs/4.wav
{"start_time":0.00, "keyword": "bo2", "timestamps": [1.28, 1.32], "tokens":["b", "ó"]}

diyism · 2024-08-24T13:08:27Z

I also tested "sherpa-onnx-keyword-spotter-microphone".
It's amazing: when I say "jiang3 you3 bo2 bei4" into my microphone,
it outputs "jiang3", "you3", "bei4" in real time at the syllable level, but it misses the syllable "bo2".

diyism · 2024-09-13T15:40:10Z

It seems the allosaurus project (https://github.com/xinjli/allosaurus) can produce IPA phones and correct timestamps, but it missed the syllable "bei" (of the 5 syllables "jiang3 you3 bo2 bei4 pai1"):

$ python -m allosaurus.run -e 1.2 --lang=cmn --timestamp=True -i ../sherpa-onnx/sherpa-onnx-kws-zipformer-wenetspeech-3.3M-2024-01-01/test_wavs/4_8000hz.wav
0.510 0.045 ɕ
0.540 0.045 i
0.600 0.045 a
0.630 0.045 ŋ

0.750 0.045 j
0.780 0.045 i
0.810 0.045 o
0.870 0.045 ə

1.020 0.045 p
1.080 0.045 o

1.590 0.045 p
1.650 0.045 a
1.710 0.045 i

diyism mentioned this issue Sep 13, 2024

Add pyannote vad (segmentation) model #1197

Closed

diyism mentioned this issue Sep 23, 2024

How to train or optimize the sherpa-onnx-kws-zipformer-wenetspeech-3.3M-2024-01-01 model for my own voice? #1371

Open

diyism mentioned this issue Nov 3, 2024

[Need Help] segment syllables (mandarin pinyin) for syllable-level voice recognition or syllable-level VAD pengzhendong/pyannote-onnx#9

Open
