Add an segmenter option whether we use dictionary (support u-dx) #4808

makotokato · 2024-04-15T04:30:57Z

This is a low-priority issue from https://bugzilla.mozilla.org/show_bug.cgi?id=1871754. Before adopting ICU4X, Gecko's word segmenter for Chinese and Japanese simply broke the text wherever the character class changed between adjacent characters.
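To make the character-class approach concrete, here is a minimal sketch in Rust. It is not Gecko's actual code: the `CharClass` enum, the `classify` function, and the three Unicode block ranges are simplifying assumptions chosen only for illustration.

```rust
/// Coarse character classes. A simplified stand-in (assumption) for the
/// classes a class-based segmenter would distinguish.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum CharClass {
    Han,
    Hiragana,
    Katakana,
    Other,
}

/// Classify a character by Unicode block (only three blocks are covered).
fn classify(c: char) -> CharClass {
    match c as u32 {
        0x4E00..=0x9FFF => CharClass::Han,      // CJK Unified Ideographs
        0x3040..=0x309F => CharClass::Hiragana, // Hiragana block
        0x30A0..=0x30FF => CharClass::Katakana, // Katakana block
        _ => CharClass::Other,
    }
}

/// Return the byte offsets of segment boundaries: the start of the text,
/// every position where the character class changes, and the end.
fn class_breaks(text: &str) -> Vec<usize> {
    let mut breaks = vec![0];
    let mut prev: Option<CharClass> = None;
    for (i, c) in text.char_indices() {
        let class = classify(c);
        if let Some(p) = prev {
            if p != class {
                breaks.push(i);
            }
        }
        prev = Some(class);
    }
    if !text.is_empty() {
        breaks.push(text.len());
    }
    breaks
}
```

For example, `class_breaks("東京タワーに行く")` splits the text into 東京 / タワー / に / 行 / く. No dictionary is consulted, so the result never goes stale, but it also cannot find word boundaries inside a run of a single class.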

Currently, the word segmenter for Chinese and Japanese is dictionary-based. Because new words are constantly being coined, a dictionary-based implementation may not give good enough quality unless the dictionary is kept up to date.

Although we are considering other approaches, such as machine learning, in the future, it may be better to have a segmenter option that disables the dictionary for specific languages only (for example, no dictionary for Japanese, while other languages can still use it).
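One possible shape for such an option is to read the `-u-dx` locale extension mentioned in the title, which as far as I understand lists script codes to exclude from dictionary-based breaking. The sketch below is not an ICU4X API: `dx_exclusions` and `uses_dictionary` are hypothetical helpers, and the subtag parsing is deliberately simplified (it does not handle private-use `x-` sequences or validate script codes).

```rust
/// Hypothetical helper: collect the script codes listed under the `dx` key
/// of a locale's `-u-` extension, e.g. "ja-u-dx-hani-hira-kana".
fn dx_exclusions(locale: &str) -> Vec<String> {
    let lower = locale.to_ascii_lowercase();
    let mut subtags = lower.split('-');

    // Advance past the "u" singleton; if there is none, nothing is excluded.
    if !subtags.any(|s| s == "u") {
        return Vec::new();
    }

    let mut in_dx = false;
    let mut excluded = Vec::new();
    for s in subtags {
        match s.len() {
            1 => break,                                  // next extension singleton
            2 => in_dx = s == "dx",                      // extension keys are two characters
            _ if in_dx => excluded.push(s.to_string()),  // value subtag of `dx`
            _ => {}                                      // value of some other key
        }
    }
    excluded
}

/// Hypothetical helper: decide whether text in `script` (a lower-case
/// script code such as "hani") should go through the dictionary segmenter.
fn uses_dictionary(script: &str, excluded: &[String]) -> bool {
    !excluded.iter().any(|e| e == script)
}
```

With this sketch, a caller could pass `ja-u-dx-hani-hira-kana` to skip the dictionary for Japanese text, while a locale without `dx` would keep the current dictionary behavior.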

CC: @aethanyc

makotokato added the C-segmentation Component: Segmentation label Apr 15, 2024

makotokato added this to the Priority Backlog ⟨P3⟩ milestone Apr 15, 2024

sffc changed the title ~~Add an segmenter option whether we use dictionary~~ Add an segmenter option whether we use dictionary (support u-dx) Apr 15, 2024
