Can the transcript_words.txt file contain only unique and sorted lines? #1615
-
Hi,
May I ask which recipe you are using?
best
jin
… On May 4, 2024, at 08:33, ChrystianKacki ***@***.***> wrote:
In one of the final stages of the egs/.../ASR/prepare.sh scripts, the transcript_words.txt file is created.
After opening this transcript_words.txt file, I see that it contains duplicated lines.
Can transcript_words.txt contain only unique lines?
Can the lines in transcript_words.txt be sorted?
-
I see. Are you using the `jq` command to extract the text lines from the jsonl.gz-format cutset?
best
jin
… On May 4, 2024, at 13:04, ChrystianKacki ***@***.***> wrote:
Hi,
I am using my own recipe based on the Common Voice recipe, combined with MLS and VoxPopuli.
I used egs/commonvoice/ASR/prepare.sh as the basis for my own version.
-
I see. The `transcript_words.txt` file is only used for BPE and n-gram LM training, so I don't think a few duplicated sentences, or sorting the file, would cause a big problem.
I'm a little concerned about the sentences repeated across different speakers. Would you do me a favor and check whether those sentences really exist in the training set, and whether each one has a corresponding audio file?
best
jin
… On May 4, 2024, at 13:56, ChrystianKacki ***@***.***> wrote:
Also, in prepare.sh I set use_validated=true. I checked the transcript file validated.tsv of the newest CV 17.0 dataset for Polish, and it contains sentences repeated by different speakers.
-
Also, are there a lot of duplicates in the `validated.tsv` file?
best
jin
-
I prepared a file with the counts of repeated sentences in the Polish CV validated training set.
-
I see, there really is a lot of duplication in the transcript.
I suppose you can try manually removing all the duplicated sentences; this would at least benefit the performance of your n-gram model.
best
jin
… On May 4, 2024, at 16:09, ChrystianKacki ***@***.***> wrote:
I prepared a file with the counts of repeated sentences in the Polish CV validated training set.
I checked it, and every sentence has its own corresponding, unique audio file.
Please see it: common_voice-pl_cuts_validated_raw-line_count.txt <https://github.com/k2-fsa/icefall/files/15208450/common_voice-pl_cuts_validated_raw-line_count.txt>
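A per-sentence count like the one in the attached file can be produced in a few lines. The following is only a minimal Python sketch, not the script that generated the attachment: the function name `count_repeated_lines` and the sample sentences are hypothetical, and it assumes the input holds one transcript per line.

```python
from collections import Counter


def count_repeated_lines(lines):
    """Return (sentence, count) pairs for sentences seen more than once.

    Pairs are ordered by descending count, then alphabetically.
    """
    # Strip surrounding whitespace/newlines and ignore empty lines.
    counts = Counter(line.strip() for line in lines if line.strip())
    repeated = [(text, n) for text, n in counts.items() if n > 1]
    return sorted(repeated, key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    # Hypothetical sample data standing in for the real transcript lines.
    sample = ["dzień dobry\n", "dziękuję\n", "dzień dobry\n", "do widzenia\n"]
    for text, n in count_repeated_lines(sample):
        print(n, text)
```

To run it on a real transcript, pass an open file object (for example `open(path, encoding="utf-8")`) instead of the sample list.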
-
Yes, sorting doesn't interfere with BPE and n-gram LM training.
best
jin
… On May 4, 2024, at 16:20, ChrystianKacki ***@***.***> wrote:
Great. Can those unique sentences be sorted?
I'm asking because in Bash it's easier to sort a file before removing duplicates.
Best, Chrystian
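For reference, deduplicating and sorting can also be done together outside Bash. The sketch below is a minimal Python illustration, assuming transcript_words.txt holds one sentence per line; the function name `unique_sorted_lines` is hypothetical and not part of any recipe. Note that Python's default string ordering compares Unicode code points, so the resulting order may differ from what a locale-aware `sort` produces.

```python
def unique_sorted_lines(lines):
    """Return the distinct non-empty lines, sorted by Unicode code point."""
    # The set drops duplicates; sorted() then gives a deterministic order.
    return sorted({line.rstrip("\n") for line in lines if line.strip()})


if __name__ == "__main__":
    # Hypothetical sample data standing in for transcript_words.txt.
    sample = ["b c\n", "a b\n", "b c\n", "a b\n", "a a\n"]
    for line in unique_sorted_lines(sample):
        print(line)
```

Because the set removes duplicates regardless of their position, the input does not need to be sorted first.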