Early Stopping of Token Generation in Streaming Model Training #1717
@yaozengwei
Could you show the tensorboard logs?
Could you tell us the scale of the final loss, e.g., 0.5 or 0.05? Also, have you tried decoding some of the training data?
Thanks for your reply! I decoded the entire training data over the weekend. Simply put, the behavior is the same: only a few tokens are generated in the output. Here are some details:
Note: the utterances counted as correct are cases where the original speech is also very short, containing only one or two tokens.
For the loss, which value are you referring to? Here's the training log:
If this is not what you're asking for, please let me know where I can find the specific parameter for you. Thank you so much!
What is the final pruned loss? Could you also upload the text log file?
Thanks for your quick reply! Do you mean the following log:
I can also upload the entire log if it helps. Thank you!
The pruned loss is a bit high.
Have you tried other combinations instead of ...?
Thanks for these great points! Let me check those numbers and I will get back to you very soon!
For our previous experiment with the non-streaming model, the pruned loss was only 0.02325. Here are the details:
I just tried with ...
Do you only change ...?
Yes! I didn't change anything other than this parameter.
Could you use ...?
Sure. Here are the detailed statistics of our 1000h experiment:
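For reference, one way to produce such duration statistics is lhotse's CutSet.describe(); this is a minimal sketch in Python, and the manifest path below is a placeholder, not necessarily the one used in the recipe:

from lhotse import CutSet

# Placeholder path to the training cuts manifest.
cuts = CutSet.from_file("data/fbank/reazonspeech_cuts_train.jsonl.gz")

# Prints the number of cuts, total duration, and duration percentiles.
cuts.describe()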
Could you also try ...?
Thanks for the suggestion!
./zipformer/decode.py \
--epoch 30 \
--avg 15 \
--exp-dir ./zipformer/exp-causal \
--causal 1 \
--chunk-size 32 \
--left-context-frames 128 \
--max-duration 1600 \
--decoding-method greedy_search \
--lang data/lang_char
%WER = 58.15
Errors: 563 insertions, 10658 deletions, 1668 substitutions, over 22164 reference words (9838 correct)
0-0: ref=['こ', 'れ', 'ま', 'た', 'ジ', 'ミ', 'ー', 'さ', 'ん']
0-0: hyp=['こ', 'れ', 'ま', 'で', 'ジ', 'ミ', 'さ', 'ん']
1-1: ref=['今', 'も', '相', '手', 'に', 'ロ', 'ン', 'バ', 'ル', 'ド', 'の', 'ほ', 'う', 'に', '肩', '口', 'で', '握', 'ら', 'れ', 'て', 'も', 'す', 'ぐ', 'さ', 'ま', '流', 'れ', 'を', '切', 'る', '引', 'き', '込', 'み', '返', 'し', 'に', '変', 'え', 'た', 'と']
1-1: hyp=['今', 'も', '相', '手', 'に', 'ロ', 'ン', 'バ', 'ル', 'ト', 'の', 'ほ', 'う', 'に', '貴', '子', 'ら', 'れ', 'て', 'も', 'す', 'ぐ', 'す', 'ぐ', 'さ', 'ま', '流', 'れ', 'を', '切', 'る', '返', 'し', 'に', '切', 'り', '替', 'え', 'た', 'と']
10-10: ref=['予', '定', 'を', '大', '幅', 'に', '狂', 'わ', 'せ', 'る', '交', '通', '機', '関', 'の', '乱', 'れ']
10-10: hyp=['こ']
100-100: ref=['矢', '部', 'さ', 'ん', 'で', 'プ', 'ラ', 'ス', '二', '千', '六', '百', '円', 'で', 'す']
100-100: hyp=['そ']
101-101: ref=['現', '場', 'に', 'お', '任', 'せ', '頂', 'け', 'る', 'と', 'い', 'う', '約', '束', 'で', 'す']
101-101: hyp=['現', '場', 'に', 'お', '任', 'せ', 'い', 'た', 'だ', 'け', 'る', 'と', 'い', 'う', '約', '束', 'で', 'す']
It does generate more tokens this time! According to the documentation, this ...
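As a sanity check on these numbers, plain arithmetic on the error counts quoted above (nothing here is taken from the decoding code itself):

# Error counts quoted from the errs file above.
insertions, deletions, substitutions = 563, 10658, 1668
ref_words = 22164

wer = (insertions + deletions + substitutions) / ref_words
del_rate = deletions / ref_words
print(f"WER = {wer:.2%}, deletion rate = {del_rate:.2%}")
# WER = 58.15%, deletion rate = 48.09%: deletions dominate the errors,
# which matches the early-stopping behavior described in this issue.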
By the way, we happened to decode a longer audio (60 secs) with the aforementioned streaming model. Curiously, it worked this time!
./python-api-examples/online-decode-files.py \
--tokens=./pretrained-models/k2-streaming/1000h/tokens.txt \
--encoder=./pretrained-models/k2-streaming/1000h/encoder-epoch-99-avg-1-chunk-16-left-128.onnx \
--decoder=./pretrained-models/k2-streaming/1000h/decoder-epoch-99-avg-1-chunk-16-left-128.onnx \
--joiner=./pretrained-models/k2-streaming/1000h/joiner-epoch-99-avg-1-chunk-16-left-128.onnx \
/Users/qi_chen/Documents/work/asr/validation/tmp/Akazukinchan-60s.wav
Started!
Done!
/Users/qi_chen/Documents/work/asr/validation/tmp/Akazukinchan-60s.wav
それはだれだってちょいとみたがでもだれよりもカレよりもこの子のおばあさんほどこの子をかわいがっているものはなくこの子を見ると何もかもやりたくてやりたくて一体何をやっているのかわからなくなるくらいでしたえてありましたさあちょいといらっしゃい赤ずきんここにお菓子が一つが一人ありますがこれをあげるときっと元気だ
----------
num_threads: 1
decoding_method: greedy_search
Wave duration: 60.000 s
Elapsed time: 4.128 s
Real time factor (RTF): 4.128/60.000 = 0.069
However, this model still doesn't work with ...
Hope this helps!
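For anyone reproducing this, here is a minimal sketch of chunk-wise streaming decoding with the sherpa-onnx Python API; the file paths and the 0.2 s chunk size are placeholders/assumptions, and it presumes a 16-bit mono wav at the model's sample rate:

import wave

import numpy as np
import sherpa_onnx

# Placeholder paths for the exported streaming model files.
recognizer = sherpa_onnx.OnlineRecognizer.from_transducer(
    tokens="tokens.txt",
    encoder="encoder-epoch-99-avg-1-chunk-16-left-128.onnx",
    decoder="decoder-epoch-99-avg-1-chunk-16-left-128.onnx",
    joiner="joiner-epoch-99-avg-1-chunk-16-left-128.onnx",
    decoding_method="greedy_search",
)

with wave.open("Akazukinchan-60s.wav") as f:
    sample_rate = f.getframerate()
    samples = np.frombuffer(f.readframes(f.getnframes()), dtype=np.int16)
    samples = samples.astype(np.float32) / 32768.0

stream = recognizer.create_stream()
chunk = int(0.2 * sample_rate)  # feed 0.2 s at a time to mimic streaming input
for start in range(0, len(samples), chunk):
    stream.accept_waveform(sample_rate, samples[start:start + chunk])
    while recognizer.is_ready(stream):
        recognizer.decode_stream(stream)

stream.input_finished()
while recognizer.is_ready(stream):
    recognizer.decode_stream(stream)
print(recognizer.get_result(stream))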
Is it possible to share those models?
Could you try different --blank-penalty values?
(You can search for ...)
Sure, and we really appreciate your help! Please find the model and its variants at https://huggingface.co/reazon-research/k2-streaming/tree/main/1000h. Please also let me know when you have finished downloading the model, so I can change the repo back to private mode. Thank you!
Yes, we tried with blank-penalty=10:
1000-0: (ライブ映像です菅総理のコメントがこれから発表されます->そしているのは〈〈〈〈続)
1001-1: (日経平均株価の午前の終値二万八千八十一円五十五銭と七十四円六十六銭->日本でも例えばあったらいただ日本)
1002-2: (来年の大統領選挙を控える中で四件目の起訴を受けたわけですが今回も相変わらず選挙妨害だなどと無実を主張しています->ラ)
1003-3: (膿の除去や歯周病の原因となる歯石の除去などのケアを続けたのです->こ)
1004-4: (まずは東京都心のお天気の変化から見てみましょう->またまたまたまたまた)
1005-5: (ご準備お願いいたします->でもう一度もあっでもしれちゃいま)
1006-6: (だって上いったら筋見えるよ->だったということですねだっててで)
1007-7: (ロシアの潜水艦が日本海でミサイル発射の演習を行いました->ここからここはこここから)
1008-8: (まあまあまあでもさこれもほらあのトカゲが急に敵におそわれたときしっぽちょん切ってにげるみてえな感じだから->まあまりますねもうま>いまあまあま)
Generally speaking, it does generate more tokens; however, most of them are nonsense and not even close to the actual speech...
Thanks! I have downloaded them.
Please try a smaller --blank-penalty, e.g., 0.5. You can try several values, e.g., 0.7, 1.0, 0.1, etc.
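For context, --blank-penalty subtracts a constant from the blank token's logit before the argmax in greedy search, which makes real tokens comparatively more likely to be emitted. A minimal sketch of the idea (illustrative only, assuming blank_id = 0 as in the icefall recipes; this is not the actual decoding code):

import torch

def greedy_step(logits: torch.Tensor, blank_penalty: float, blank_id: int = 0) -> torch.Tensor:
    # logits: (batch, vocab_size) joiner output for the current frame.
    if blank_penalty > 0:
        logits = logits.clone()
        # Penalizing blank counters early stopping caused by an
        # over-confident blank symbol, at the risk of inserting junk
        # tokens when the penalty is too large (e.g., 10).
        logits[:, blank_id] -= blank_penalty
    return logits.argmax(dim=-1)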
Thanks for the suggestions! Yes, we tried smaller values, but it didn't help: either no extra tokens were generated, or only one or two more.
I find that the vocab_size of the model trained on the 1000h of data is 3878, while the non-streaming reazonspeech model's vocab size is 5224. Do you prepare the reazonspeech dataset differently for the non-streaming zipformer and the streaming zipformer?
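One way to double-check this on the uploaded files is to compare the tokens.txt line count with the joiner's output dimension; a sketch, where the paths are placeholders and the one "<token> <id>" pair per line format of tokens.txt is assumed:

import onnxruntime as ort

tokens_path = "tokens.txt"
joiner_path = "joiner-epoch-99-avg-1-chunk-16-left-128.onnx"

# tokens.txt has one "<token> <id>" pair per line, so its line count is the vocab size.
with open(tokens_path, encoding="utf-8") as f:
    num_tokens = sum(1 for line in f if line.strip())

# The joiner's last output dimension is the vocab size the model was trained with.
sess = ort.InferenceSession(joiner_path, providers=["CPUExecutionProvider"])
joiner_vocab = sess.get_outputs()[0].shape[-1]

print("tokens.txt:", num_tokens, "joiner output dim:", joiner_vocab)

If the two numbers disagree, the tokens file and the checkpoint come from different data preparations.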
By the way, see #1724 (comment).
Hi Next-gen Kaldi team,
Thank you once again for your continuous support and patience with our Japanese ASR recipe and model developments.
We're currently training the streaming model based on our existing recipe, ReazonSpeech. Despite experimenting with both the regular zipformer and zipformer-L across different datasets (100h, 1000h, and 5000h), we've encountered a consistent issue where the model tends to generate only the first few tokens.

Current environment:
Our commands and results:
Training command (regular zipformer):
Decoding command:
Some results from errs-test-greedy_search-epoch-30-avg-15-chunk-32-left-context-128-use-averaged-model.txt:

We also exported this model and tested with sherpa-onnx.
exporting command:
Decoding with Python API examples:
The outputs we're seeing from both streaming_decode.py and the sherpa-onnx deployed models are truncated early in the speech, leading to significantly shortened or incomplete transcriptions.

We would greatly appreciate any insights or suggestions on how to address these early stopping issues in token generation. We will also open-source this streaming model as soon as we resolve these challenges.
Thank you!
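As a side note, here is a small sketch for quantifying the truncation from an errs-*.txt file; it assumes the "id: ref=[...]" / "id: hyp=[...]" line format shown earlier in this thread, and the path is a placeholder:

import ast
import re

# Placeholder path; use the errs-*.txt produced by ./zipformer/decode.py.
path = "errs-test-greedy_search-epoch-30-avg-15-chunk-32-left-context-128-use-averaged-model.txt"

refs, hyps = {}, {}
pattern = re.compile(r"^(\S+):\s+(ref|hyp)=(\[.*\])\s*$")
with open(path, encoding="utf-8") as f:
    for line in f:
        m = pattern.match(line)
        if not m:
            continue
        utt, kind, tokens = m.group(1), m.group(2), ast.literal_eval(m.group(3))
        (refs if kind == "ref" else hyps)[utt] = tokens

# Mean hypothesis/reference length ratio: values far below 1.0
# indicate that decoding stops emitting tokens too early.
ratios = [len(hyps[u]) / max(len(refs[u]), 1) for u in refs if u in hyps]
print(f"{len(ratios)} utterances, mean hyp/ref length ratio = {sum(ratios) / len(ratios):.2f}")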