Speed Comparison

Koichi Akabe edited this page Jun 9, 2022 · 5 revisions

This page compares the analysis speed of Vaporetto with that of other tokenizers and morphological analyzers.

Experimental Setup

We compared the following software:

  • KyTea
  • Vaporetto
  • rust-tinysegmenter
  • MeCab
  • Lindera
  • sudachi.rs

For Vaporetto and KyTea, we used the compact SVM model based on BCCWJ and UniDic, downloaded from the KyTea Models page. For MeCab, we used IPADic and UniDic. For Lindera, we used UniDic. For sudachi.rs, we used sudachi-dictionary-20210802-core, which is based on UniDic.

We tokenized I Am a Cat (by Soseki Natsume), which is available at Aozora Bunko, and measured the elapsed time 100 times for each tool.
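
The measurement procedure can be sketched roughly as follows. This is a minimal illustration, not the script used for this page: the helper names `time_runs` and `summarize` are hypothetical, the timed task is a stand-in for invoking a real tokenizer on the input text, and we assume that the reported figures are the mean and the sample standard deviation of the 100 runs.

```python
import statistics
import time


def time_runs(task, runs=100):
    """Call task() `runs` times and return each elapsed time in milliseconds."""
    elapsed_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        task()
        elapsed_ms.append((time.perf_counter() - start) * 1000.0)
    return elapsed_ms


def summarize(elapsed_ms):
    """Return (mean, sample standard deviation) of the measured times.

    Assumption: the page does not say whether STD is the sample or the
    population standard deviation; the sample version is used here.
    """
    return statistics.mean(elapsed_ms), statistics.stdev(elapsed_ms)


if __name__ == "__main__":
    # Stand-in workload; a real benchmark would run a tokenizer on the text.
    text = "吾輩は猫である。名前はまだ無い。" * 1000
    mean_ms, std_ms = summarize(time_runs(lambda: text.split("。")))
    print(f"mean = {mean_ms:.3f} ms, std = {std_ms:.3f} ms")
```

To time an external command-line tool instead, the task could wrap a `subprocess.run` call; process start-up and model loading would then be included in the elapsed time.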

The machine used for the measurements had the following specifications:

  • CPU: Intel(R) Core(TM) i7-8086K CPU @ 4.00GHz
  • Memory: 64GiB
  • OS: CentOS Linux release 7.5.1804 (Core)
  • Compilers:
    • Rust: 1.60.0
    • GCC: 11.2.0

Results

| Tool Name | Elapsed Time [ms] | STD [ms] |
| --- | ---: | ---: |
| KyTea | 219.6 | 2.9 |
| Vaporetto | 29.0 | 0.6 |
| Vaporetto (charwise) | 25.3 | 0.4 |
| rust-tinysegmenter | 272.7 | 6.1 |
| MeCab (IPADic) | 102.9 | 1.8 |
| MeCab (UniDic) | 255.1 | 3.1 |
| Lindera | 397.1 | 7.1 |
| sudachi.rs | 286.2 | 4.7 |
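
To read the results as relative speeds, each elapsed time can be divided by that of Vaporetto. The short sketch below does this with the figures from the table above; the variable names are illustrative only. By this measure, KyTea takes roughly 7.6 times as long as Vaporetto on this input.

```python
# Elapsed times in milliseconds, copied from the results table.
elapsed_ms = {
    "KyTea": 219.6,
    "Vaporetto": 29.0,
    "Vaporetto (charwise)": 25.3,
    "rust-tinysegmenter": 272.7,
    "MeCab (IPADic)": 102.9,
    "MeCab (UniDic)": 255.1,
    "Lindera": 397.1,
    "sudachi.rs": 286.2,
}

baseline = elapsed_ms["Vaporetto"]

# Ratio > 1 means the tool took longer than Vaporetto; < 1 means it was faster.
relative = {name: ms / baseline for name, ms in elapsed_ms.items()}

for name, ratio in sorted(relative.items(), key=lambda item: item[1]):
    print(f"{name:<22} {ratio:5.2f}x")
```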