
Datasets and Adaptation


Why adaptation?

The existing general acoustic and language models do not perform well, and training another general Greek model is difficult, since only very small Greek speech datasets are available. So, the success of the whole project rests on personalizing the dictation system: the acoustic model will be adapted to the user's dictations and the language model will be enhanced using the user's sent emails. In order to verify that adaptation increases the accuracy of an ASR system, tests were run on the available datasets.

Structure

All datasets have been uploaded to Dropbox. Each one follows the structure below, based on Sphinx requirements (an example layout is sketched after the list):

  • train: Contains the ids, the recordings and the corresponding transcriptions of the train set (usually 70% of the dataset).
  • test: Contains the ids, the recordings and the corresponding transcriptions of the test set (usually 30% of the dataset).
  • hypothesis: Contains the hypothesis for the test set of each model.
  • language-models: Contains all the language models that were created based on the train set.
    • specific: Developed using only the transcriptions of the dataset.
    • merged: Developed using both the transcriptions of the dataset and the default language model.
  • adaptation: Contains all the files used for the acoustic model adaptation (both the MLLR and MAP methods).
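
For example, a dataset following this structure might look like the listing below (the layout follows the Sphinx conventions; the individual file names are illustrative):

radio/
├── train/
│   ├── train.fileids
│   ├── train.transcription
│   └── wav/
├── test/
│   ├── test.fileids
│   ├── test.transcription
│   └── wav/
├── hypothesis/
│   ├── default.hyp
│   └── merged.hyp
├── language-models/
│   ├── specific.lm
│   └── merged.lm
└── adaptation/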

Language model adaptation

The adaptation of the default language model for domain-specific datasets follows the corresponding Sphinx tutorial and uses the SRILM toolkit.

For every dataset, we should create a train-text.txt file that contains only the Greek alphabetic characters of the transcriptions: we remove punctuation, non-alphabetic words and non-Greek words. This procedure is described extensively in the Email Fetching page (a rough sketch of the cleaning step is also shown after the command below). Then, a domain-specific language model is created using the following command:

ngram-count -kndiscount -interpolate -text train-text.txt -lm specific.lm
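
For illustration, the cleaning step described above might look like the following minimal Python sketch (the file names transcriptions.txt and train-text.txt and the exact filtering rules are assumptions; the authoritative procedure is the one in the Email Fetching page):

# clean_transcriptions.py -- minimal sketch of the text cleaning step
import re

# Matches words written only with (possibly accented) Greek letters.
GREEK_WORD = re.compile(r'^[α-ωάέήίόύώϊϋΐΰ]+$')
PUNCTUATION = '.,;:!«»"\'()-'

def clean_line(line):
    words = (w.strip(PUNCTUATION) for w in line.lower().split())
    # Keep only purely Greek words; drop numbers, Latin words and leftovers.
    return ' '.join(w for w in words if GREEK_WORD.match(w))

with open('transcriptions.txt', encoding='utf-8') as src, \
     open('train-text.txt', 'w', encoding='utf-8') as dst:
    for line in src:
        cleaned = clean_line(line)
        if cleaned:
            dst.write(cleaned + '\n')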

Although the domain-specific language model is adapted to our dataset, it will not perform well on its own, since it is built from a relatively small number of words, so there are many out-of-vocabulary words. In order to resolve this, we merge the domain-specific language model with the default one, as follows:

ngram -lm el-gr.lm -mix-lm specific.lm -lambda 0.5 -write-lm merged.lm

where lambda is the interpolation weight: the default model (-lm) gets weight lambda and the domain-specific model (-mix-lm) gets weight 1 - lambda, so 0.5 weighs them equally.
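
Conceptually, the merge is a linear interpolation of the two models: for each n-gram the merged probability is roughly computed as in the sketch below (an illustration only; SRILM also renormalizes backoff weights internally):

# Linear interpolation of two language model probabilities (illustration only).
def merged_prob(p_default, p_specific, lam=0.5):
    # lam weighs the first (-lm) model, 1 - lam weighs the -mix-lm model.
    return lam * p_default + (1 - lam) * p_specific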

Acoustic model adaptation and phonetic dictionary extension

The adaptation of the default acoustic model for domain-specific datasets again follows the corresponding Sphinx tutorial. Useful tools were developed in order to prepare a speech dataset for adaptation.

Convert sound files to Sphinx format (mono .wav files with a 16kHz sample rate).

Usage:

$ python converter.py -h
usage: converter.py [-h] --input INPUT [--output OUTPUT]

Tool for converting sound files in Sphinx format (mono wav files with 16kHz
sample rate)

optional arguments:
  -h, --help       show this help message and exit

required arguments:
  --input INPUT    Input directory

optional arguments:
  --output OUTPUT  Output directory (default: Input directory)
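
converter.py lives in the repository; as a rough idea of what the conversion involves, a minimal sketch that shells out to ffmpeg could look like this (using ffmpeg as the backend and the directory names are assumptions; the actual converter.py may work differently):

# convert_to_sphinx.py -- sketch of converting audio to mono 16 kHz WAV via ffmpeg
import subprocess
from pathlib import Path

def convert(input_dir, output_dir):
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    for src in Path(input_dir).glob('*'):
        if src.suffix.lower() in {'.wav', '.mp3', '.flac', '.ogg'}:
            dst = output_dir / (src.stem + '.wav')
            # -ac 1: mono, -ar 16000: 16 kHz sample rate, as Sphinx expects
            subprocess.run(['ffmpeg', '-y', '-i', str(src),
                            '-ac', '1', '-ar', '16000', str(dst)], check=True)

if __name__ == '__main__':
    convert('recordings', 'recordings-16k')   # hypothetical directory names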

Add out-of-dictionary words

When adapting an acoustic model, some words from the transcriptions may not be included in the default phonetic dictionary. If we simply ignore them, adaptation will be poor and these words will remain unknown to the system.

A tool was developed that first searches the transcriptions for words that are out of the dictionary. Then, Phonetisaurus is used to generate phonemes for the missing words and, finally, each (word, phonemes) pair is added to the default dictionary. In fact, AltFstAligner (an alternative to Phonetisaurus) was used to train the model, because it requires much less memory.
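
The core idea of the first step (finding out-of-dictionary words) can be sketched as follows (a simplified illustration; the file names are hypothetical and Sphinx dictionary details such as alternative pronunciations are ignored):

# find_ood_sketch.py -- sketch of locating out-of-dictionary words
import re

def load_dictionary_words(dict_path):
    # Each line of a Sphinx dictionary starts with the word, then its phonemes.
    with open(dict_path, encoding='utf-8') as f:
        return {line.split()[0] for line in f if line.strip()}

def find_ood(transcription_path, dict_words):
    missing = set()
    with open(transcription_path, encoding='utf-8') as f:
        for line in f:
            # Sphinx transcriptions end with a file id in parentheses; drop it.
            text = re.sub(r'\(\S+\)\s*$', '', line).strip()
            missing.update(w for w in text.split() if w not in dict_words)
    return sorted(missing)

words = load_dictionary_words('el-gr.dic')
for w in find_ood('test.transcription', words):
    print(w)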

Usage:

$ python findOOD.py -h
usage: findOOD.py [-h] --dict DICT --input INPUT --output OUTPUT

Tool that finds out of dictionary words from a given transcription

optional arguments:
  -h, --help       show this help message and exit

required arguments:
  --dict DICT      Path of dictionary
  --input INPUT    Path of input transcription (should be in Sphinx format)
  --output OUTPUT  File to write the missing words
$ python addOOD.py -h
usage: addOOD.py [-h] --model MODEL --input INPUT --dict DICT

Tool that generates phonemes for out of dictionary words and adds them in the
dictionary

optional arguments:
  -h, --help     show this help message and exit

required arguments:
  --model MODEL  Phonetisaurus model
  --input INPUT  Path of missing words file.
  --dict DICT    Path of the dictionary

The model trained on the default dictionary can be found here. An example of how to use the scripts follows.

If we have the following transcription file:

καλησπέρα με λένε γιώργο μπαλαμώτη (test)

The word μπαλαμώτη is not included in the default dictionary, since it is a surname. Let's generate the phonemes of this out-of-dictionary word:

$ python findOOD.py --dict ../../cmusphinx-el-gr-5.2/el-gr.dic --input transcription --output missing --print True
Searching for transcription: (test)
μπαλαμώτη

$ python addOOD.py --model ../../cmusphinx-el-gr-5.2/phonetisaurus/el-gr.o8.fst --input missing --dict ../../cmusphinx-el-gr-5.2/el-gr.dic
Generating phonemes...
Copy generated phonemes to given dictionary...
OK

$ tail -n 1 ../../cmusphinx-el-gr-5.2/el-gr.dic
μπαλαμώτη	b a0 l a0 m o1 t i0

Evaluation

After decoding the sound files to text with the pocketsphinx_batch tool from pocketsphinx, we evaluate a model using the word_align.pl script, which compares the transcriptions of the test file with the hypothesis produced by the model, as follows:

word_align.pl test.transcription test.hyp
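
Under the hood, word_align.pl aligns each hypothesis against its reference transcription and reports word accuracy; the computation is essentially a word-level edit distance, roughly as in the sketch below (a simplified illustration, not the actual script):

# Word-level edit distance between a reference and a hypothesis sentence.
def word_errors(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits needed to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution / match
    return dp[len(ref)][len(hyp)], len(ref)

errors, n = word_errors('καλησπέρα με λένε γιώργο', 'καλησπέρα με λένε γιώργη')
print('accuracy: %.2f%%' % (100.0 * (n - errors) / n))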

Note: Two methods of acoustic model adaptation were tested. MAP adaptation updates each parameter of the model individually, while MLLR adaptation estimates a single generic transform that is applied to the parameters.
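
To make the difference concrete, the two updates can be sketched for a single Gaussian mean as follows (a schematic illustration of the general idea, not the actual Sphinx implementation):

import numpy as np

# MLLR: one shared affine transform (A, b), estimated from the adaptation data,
# is applied to every mean in the model.
def mllr_update(mean, A, b):
    return A @ mean + b

# MAP: each mean moves towards the adaptation frames that aligned to it,
# interpolating between the prior mean and the sample mean (tau weighs the prior).
def map_update(prior_mean, frames, tau=10.0):
    n = len(frames)
    if n == 0:
        return prior_mean          # no data aligned to this Gaussian: unchanged
    sample_mean = np.mean(frames, axis=0)
    return (tau * prior_mean + n * sample_mean) / (tau + n)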

Radio

Multiple Greek speakers from the Department of Journalism reading the news, 1 hour in total (medium size, different speakers).

Link: https://www.dropbox.com/sh/a8dkcgchb3cxgnc/AAA-7uxX8embvJWPOW-yQFTGa?dl=0

Language model   Acoustic model   Accuracy
default          default          53.28%
specific         default          53.92%
merged           default          66.03%
merged           adapted (MLLR)   67.91%
merged           adapted (MAP)    50.03%

Paramythi_horis_onoma

A Greek female speaker reads a fairy tale, 4 hours in total (large size, one speaker).

Link: https://www.dropbox.com/sh/87e87d78ykw96zi/AABoh1oHDjJrhv4BoNiEPs8qa?dl=0

Language model   Acoustic model   Accuracy
default          default          59.55%
specific         default          51.99%
merged           default          65.04%
merged           adapted (MLLR)   66.53%
merged           adapted (MAP)    71.68%

Pda

Recordings of Greek people asking questions about the weather, nearest hospitals, and pharmacies. It was created for the purposes of this diploma thesis (medium size, different speakers, very specific domain).

Link: https://www.dropbox.com/sh/t7uwom0hxp7cehb/AAC5EEB18DSm8qGLXFfobquWa?dl=0

Language model   Acoustic model   Accuracy
default          default          73.11%
specific         default          80.06%
merged           default          83.08%
merged           adapted (MLLR)   84.59%
merged           adapted (MAP)    90.63%

Personal emails

Recordings of my voice dictating 15 emails (small size). This dataset is representative of the data that our system will have to adapt to, but it should be extended, because the test set contains only 4 sentences.

Link: https://www.dropbox.com/sh/oguos83j7938q39/AABEd0I9CkXKfV91NsxZuSTZa?dl=0

Language model   Acoustic model   Accuracy
default          default          75.71%
specific         default          25.71%
merged           default          77.14%
merged           adapted (MLLR)   77.14%
merged           adapted (MAP)    50.00%

Conclusion

  • The merged language model improves accuracy for all types of datasets, since it adapts to the domain-specific data while still containing a large vocabulary.
  • MLLR adaptation performs better when limited data are available or when the speakers differ (personal emails, radio). On the other hand, MAP adaptation can increase accuracy substantially (pda, paramythi_horis_onoma) when more dictations are available.