This project is an attempt to create a pronunciation lexicon covering both English and Chinese words in a unified phoneset for ASR applications.
P.S. "CiDian" means "lexicon" in Chinese.
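Typical use cases in Chinese ASR applications are code-switched utterances such as:
- 你手机上都装了什么 APP ? ("What apps have you installed on your phone?")
- APPLE 的新 MACBOOK PRO 真漂亮 ("Apple's new MacBook Pro is really beautiful")
- 上个月 PRADA 出了款新包包 ("Prada released a new bag last month")
- 手机开了 GPRS 导航 ("GPRS navigation is turned on on the phone")
- 世界杯 H 组小组赛 ("World Cup Group H group-stage match")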
The unified phoneset should be simple and precise, and it should cover both languages. Note that the mappings listed below are heavily based on IPA.
English entries are derived from CMUDict 0.7b, hence we need a mapping from the ARPA phoneset to the target phoneset.
ARPA | Target phoneme (IPA-based) | CMUDict example entry |
---|---|---|
AA0 | a | icon: AY1 K AA0 N |
AA1 | a | heart: HH AA1 R T |
AA2 | a | kmart: K EY1 M AA2 R T |
AE0 | æ | romance: R OW1 M AE0 N S |
AE1 | æ | lambda: L AE1 M D AH0 |
AE2 | æ | setback: S EH1 T B AE2 K |
AH0 | ə | station: S T EY1 SH AH0 N |
AH1 | ʌ | bug: B AH1 G |
AH2 | ʌ | haircut: HH EH1 R K AH2 T |
AO0 | ɔ | hongkong: HH AO1 NG K AO0 NG |
AO1 | ɔ | law: L AO1 |
AO2 | ɔ | layoff: L EY1 AO2 F |
AW0 | au | foundation: F AW0 N D EY1 SH AH0 N |
AW1 | au | founder: F AW1 N D ER0 |
AW2 | au | hometown: HH OW1 M T AW2 N |
AY0 | ai | hypotheses: HH AY0 P AA1 TH AH0 S IY2 Z |
AY1 | ai | ice: AY1 S |
AY2 | ai | highlight: HH AY1 L AY2 T |
B | b | bike: B AY1 K |
CH | ch | chase: CH EY1 S |
D | d | desk: D EH1 S K |
DH | ð | those: DH OW1 Z |
EH0 | e | princess: P R IH1 N S EH0 S |
EH1 | e | professor: P R AH0 F EH1 S ER0 |
EH2 | e | progress: P R AA1 G R EH2 S |
ER0 | ə r | programmer: P R OW1 G R AE2 M ER0 |
ER1 | ə r | purge: P ER1 JH |
ER2 | ə r | showgirl: SH OW1 G ER2 L |
EY0 | ei | eighteen: EY0 T IY1 N |
EY1 | ei | email: IY0 M EY1 L |
EY2 | ei | thursday: TH ER1 Z D EY2 |
F | f | face: F EY1 S |
G | g | give: G IH1 V |
HH | h | hey: HH EY1 |
IH0 | i | facing: F EY1 S IH0 NG |
IH1 | i | fear: F IH1 R |
IH2 | i | fellowship: F EH1 L OW0 SH IH2 P |
IY0 | ii | email: IY0 M EY1 L |
IY1 | ii | prefix: P R IY1 F IH0 K S |
IY2 | ii | increase: IH1 N K R IY2 S |
JH | zh | gesture: JH EH1 S CH ER0 |
K | k | cat: K AE1 T |
L | l | lack: L AE1 K |
M | m | may: M EY1 |
N | n | no: N OW1 |
NG | ŋ | thing: TH IH1 NG |
OW0 | əu | crypto: K R IH1 P T OW0 |
OW1 | əu | token: T OW1 K AH0 N |
OW2 | əu | earphone: IH1 R F OW2 N |
OY0 | ɔi | invoice: IH1 N V OY0 S |
OY1 | ɔi | floyd: F L OY1 D |
OY2 | ɔi | android: AE1 N D R OY2 D |
P | p | pat: P AE1 T |
R | r | risk: R IH1 S K |
S | s | sing: S IH1 NG |
SH | sh | shake: SH EY1 K |
T | t | test: T EH1 S T |
TH | θ | think: TH IH1 NG K |
UH0 | u | fulfill: F UH0 L F IH1 L |
UH1 | u | full: F UH1 L |
UH2 | u | goodbye: G UH2 D B AY1 |
UW0 | uu | rescue: R EH1 S K Y UW0 |
UW1 | uu | fool: F UW1 L |
UW2 | uu | restroom: R EH1 S T R UW2 M |
V | v | very: V EH1 R IY0 |
W | w | west: W EH1 S T |
Y | y | yes: Y EH1 S |
Z | z | zero: Z IY1 R OW0 |
ZH | ʒ | illusion: IH2 L UW1 ZH AH0 N |
Note: if anything in the mapping table doesn't make sense, please let me know. Thanks.
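As a rough illustration of how the table above can be applied, here is a minimal Python sketch that converts one CMUDict-style entry to the target phoneset. The dictionary `ARPA_TO_TARGET` holds only a handful of rows copied from the table, and the function name `convert_entry` is hypothetical; this is not the project's actual conversion script.

```python
import re

# Partial ARPA -> target map, copied from a few rows of the table above.
# Stress digits are dropped, except for AH, where AH0 and AH1/AH2 differ.
ARPA_TO_TARGET = {
    "AH0": "ə", "AH1": "ʌ", "AH2": "ʌ",
    "AY": "ai", "EY": "ei", "IY": "ii", "IH": "i", "EH": "e",
    "ER": "ə r", "OW": "əu",
    "K": "k", "L": "l", "M": "m", "N": "n", "S": "s", "T": "t",
    "SH": "sh", "CH": "ch", "JH": "zh",
}

def convert_entry(line):
    """Convert 'WORD P1 P2 ...' (ARPA) into (word, [target phonemes])."""
    word, *arpa = line.split()
    phones = []
    for sym in arpa:
        # Prefer a stress-specific key (e.g. AH0), else strip the stress digit.
        key = sym if sym in ARPA_TO_TARGET else re.sub(r"\d$", "", sym)
        # A target such as "ə r" expands to two phonemes.
        phones.extend(ARPA_TO_TARGET[key].split())
    return word, phones

print(convert_entry("STATION S T EY1 SH AH0 N"))
print(convert_entry("GESTURE JH EH1 S CH ER0"))
```

Symbols missing from the partial map raise a `KeyError`, which makes gaps in the mapping easy to spot.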
Chinese entries are extracted from the DaCiDian project.
Here is a PinYin-to-IPA mapping from an educational perspective: https://resources.allsetlearning.com/chinese/pronunciation/Pinyin_chart
With a few mapping modifications and symbolic adaptations, here is the final PinYin to target phoneset mapping
The Chinese PinYin system normally has 5 tones, numbered 0 to 4, whereas English has no tone definition. In BigCiDian, Chinese tonal information is retained and merged with untoned English phonemes, so a phoneme in the resulting phoneset may have up to 6 tonal variants (1 from English and 5 from Chinese):
e.g. for phoneme *ai*
1. HI -> h ai
2. 哎 -> ai_0
3. 掰 -> b ai_1
4. 还 -> h ai_2
5. 凯 -> k ai_3
6. 外 -> w ai_4
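The `_N` suffix convention in the examples above can be handled with a small helper. The following is a sketch under the assumption that a toned phoneme is always written as `base_tone` with a single trailing digit; the function names `split_tone` and `group_by_base` are hypothetical.

```python
from collections import defaultdict

def split_tone(phone):
    """Split 'ai_3' into ('ai', '3'); untoned phones return (phone, None)."""
    base, sep, tone = phone.rpartition("_")
    if sep and tone.isdigit():
        return base, tone
    return phone, None

def group_by_base(phones):
    """Group toned and untoned variants under their base phoneme."""
    groups = defaultdict(list)
    for p in phones:
        base, _ = split_tone(p)
        groups[base].append(p)
    return dict(groups)

variants = group_by_base(["h", "ai", "ai_0", "ai_1", "ai_2", "ai_3", "ai_4"])
print(variants["ai"])  # the untoned English variant plus five Chinese tones
```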
The final unified bi-lingual phoneset details are listed below:
phoneme | CN example | EN example |
---|---|---|
a | 把 b a_3 | AACHEN a k ə n |
æ | | CAT k æ t |
ai | 爱 ai_4 | KITE k ai t |
an | 安 an_1 | |
aŋ | 羊 y aŋ_2 | |
au | 老 l au_3 | LOUD l au d |
b | 白 b ai_2 | BUT b ʌ t |
ch | 陈 ch ən_2 | CHEST ch e s t |
d | 大 d a_4 | DAY d ei |
ð | | THIS ð i s |
e | | BED b e d |
ei | 累 l ei_4 | LAKE l ei k |
ə | 鹅 ə_2 | COCA-COLA k əu k ə k əu l a |
ən | 陈 ch ən_2 | |
əŋ | 横 h əŋ_2 | |
ər | 二 ər_4 | |
əu | 欧 əu_1 | BOAT b əu t |
f | 房 f aŋ_2 | FACE f ei s |
g | 刚 g aŋ_1 | GIVE g i v |
h | 海 h ai_3 | HUG h ʌ g |
i | 天 t i an_1 | HIT h i t |
ie | 别 b ie_2 | |
ii | 比 b ii_3 | BEAT b ii t |
iii | 吃 ch iii_1 | |
in | 音 y in_1 | |
iŋ | 听 t iŋ_1 | |
j | 九 j i əu_3 | |
k | 看 k an_4 | CAKE k ei k |
l | 来 l ai_2 | LAKE l ei k |
m | 马 m a_3 | MAKE m ei k |
n | 那 n a_1 | NIKE n ai k ii |
ŋ | | INTERESTING i n t ə r e s t i ŋ |
ɔ | | OFF ɔ f |
ɔi | | JOY zh ɔi |
p | 胖 p aŋ_4 | PACE p ei s |
q | 钱 q i an_2 | |
r | | RISK r i s k |
s | 丝 s iii_1 | SING s i ŋ |
sh | 上 sh aŋ_4 | SHAKE sh ei k |
t | 团 t u an_2 | TIME t ai m |
ts | 才 ts ai_2 | |
u | | BOOK b u k |
uŋ | 从 ts uŋ_2 | |
uɔ | 桌 zh uɔ_1 | |
uu | 不 b uu_4 | TWO t uu |
v | | VICTORY v i k t ə r ii |
ʌ | | CUT k ʌ t |
w | 王 w aŋ_2 | WEST w e s t |
x | 西 x ii_1 | |
y | 言 y an_2 | YES y e s |
yu | 去 q yu_4 | |
yue | 缺 q yue_1 | |
z | 赞 z an_4 | ZOO z uu |
zh | 中 zh uŋ_1 | GESTURE zh e s ch ə r |
ʒ | 让 ʒ aŋ_4 | LEISURE l e ʒ ə r |
θ | | THINK θ i ŋ k |
Overall, there are 56 phonemes in the unified phoneset (not counting tonal variants).
In theory, some phonemes could be split at a finer granularity (e.g. au -> a u, ɔi -> ɔ i, an -> a n ...), making the phoneset even more compact. However, it is common practice in Chinese ASR to use larger acoustic modeling units, which tend to benefit accuracy, and decision-tree based state tying makes the base phoneset size less relevant to the ASR problem.
I may or may not change the unified phoneset in the future. For now, it seems sufficient for my purpose.
Running `sh run.sh` should give you a ready-to-use bi-lingual ASR lexicon (`lexicon.txt`) and a phoneset list (`phones.list`) in the project root directory.
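A quick sanity check on the generated files is to confirm that every phoneme used in the lexicon appears in the phoneset list. The sketch below assumes a Kaldi-style layout that is not confirmed by this README: each `lexicon.txt` line is `WORD p1 p2 ...` and `phones.list` has one phoneme per line. The function name `find_unknown_phones` is hypothetical.

```python
def find_unknown_phones(lexicon_lines, phone_lines):
    """Return (word, [phones not in the phoneset]) for each offending entry."""
    known = {p.strip() for p in phone_lines if p.strip()}
    problems = []
    for line in lexicon_lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        word, phones = fields[0], fields[1:]
        missing = [p for p in phones if p not in known]
        if missing:
            problems.append((word, missing))
    return problems

# In-memory example; with real files, pass the open file objects instead, e.g.
# open("lexicon.txt", encoding="utf-8") and open("phones.list", encoding="utf-8").
lexicon = ["HI h ai", "掰 b ai_1", "ZOO z uu"]
phones = ["h", "ai", "ai_1", "b", "z"]
print(find_unknown_phones(lexicon, phones))
```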
To extend the final lexicon with entries of your own (say "IPHONE" or "华为P30"), you can either:
- add those entries to the upstream sources (CMUDict and DaCiDian)
or:
- maintain a separate extension lexicon and merge it with the main lexicon generated above.
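The second option could be sketched as follows. This assumes both lexicons use one `WORD p1 p2 ...` entry per line, which is not confirmed by this README; the function name `merge_lexicons` and the IPHONE pronunciation are hypothetical and only for illustration.

```python
def merge_lexicons(main_lines, extension_lines):
    """Merge two lexicons, dropping exact duplicates and sorting the result."""
    entries = set()
    for line in list(main_lines) + list(extension_lines):
        fields = line.split()
        if len(fields) >= 2:
            # A word may keep several distinct pronunciations.
            entries.add((fields[0], tuple(fields[1:])))
    return [" ".join((word, *phones)) for word, phones in sorted(entries)]

main = ["HI h ai", "爱 ai_4"]
extension = ["IPHONE ai f əu n", "HI h ai"]  # hypothetical extension entries
for entry in merge_lexicons(main, extension):
    print(entry)
```

Keeping the extension in its own file means the main lexicon can be regenerated at any time without losing custom entries.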
In the AISHELL-2 Mandarin ASR task, the Chinese lexicon (DaCiDian) was replaced with the multilingual CN-EN lexicon (BigCiDian). Details are shown below:
For DaCiDian, system performance:
----- test -----:
%WER 44.39 [ 21986 / 49532, 338 ins, 2085 del, 19563 sub ] exp/mono/decode_test/cer_9_0.0
%WER 24.25 [ 12011 / 49532, 393 ins, 792 del, 10826 sub ] exp/tri1/decode_test/cer_12_0.0
%WER 22.13 [ 10963 / 49532, 396 ins, 644 del, 9923 sub ] exp/tri2/decode_test/cer_12_0.0
%WER 19.29 [ 9555 / 49532, 263 ins, 640 del, 8652 sub ] exp/tri3/decode_test/cer_13_0.5
%WER 8.33 [ 4125 / 49532, 84 ins, 192 del, 3849 sub ] exp/chain/tdnn_1a/decode_test/cer_8_0.5
For BigCiDian, system performance:
%WER 43.92 [ 21754 / 49532, 405 ins, 1574 del, 19775 sub ] exp/mono/decode_test/cer_7_0.0
%WER 22.54 [ 11163 / 49532, 406 ins, 652 del, 10105 sub ] exp/tri1/decode_test/cer_11_0.0
%WER 21.09 [ 10445 / 49532, 377 ins, 609 del, 9459 sub ] exp/tri2/decode_test/cer_12_0.0
%WER 18.47 [ 9148 / 49532, 265 ins, 621 del, 8262 sub ] exp/tri3/decode_test/cer_13_0.5
%WER 8.22 [ 4072 / 49532, 68 ins, 260 del, 3744 sub ] exp/chain/tdnn_1a/decode_test/cer_9_0.5
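To put the gap between the two chain/tdnn_1a systems in perspective, the error counts reported above can be turned into a relative error reduction. This is a small worked calculation on the numbers in the logs, not an additional experiment; the helper name `error_rate` is hypothetical.

```python
def error_rate(errors, total):
    """Error rate in percent, as in the '%WER' lines above."""
    return 100.0 * errors / total

REF_TOKENS = 49532                            # reference tokens in the test set
dacidian = error_rate(4125, REF_TOKENS)       # chain/tdnn_1a with DaCiDian
bigcidian = error_rate(4072, REF_TOKENS)      # chain/tdnn_1a with BigCiDian

# Relative reduction of the error rate when switching lexicons.
relative_reduction = 100.0 * (dacidian - bigcidian) / dacidian

print(f"DaCiDian:  {dacidian:.2f}%")
print(f"BigCiDian: {bigcidian:.2f}%")
print(f"Relative reduction: {relative_reduction:.2f}%")
```

The relative reduction comes out at a little over 1%.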
Conclusion
- BigCiDian gives only slightly better results than DaCiDian.
- More importantly, BigCiDian turns a pure Chinese ASR system into a multilingual system, which matches how Chinese ASR applications are used today.
THE END