ETOS (Ejaan oTOmatiS) is a dataset for automatic spelling correction of formal Indonesian text. It consists of 200 sentences, each containing at least one typo. It has 4,323 tokens, of which 288 are non-word errors.
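To put these counts in perspective, here is a minimal sketch that derives a few ratios from the three figures reported above. The function and variable names are hypothetical and not part of the dataset. The ratios are plain arithmetic on the reported totals, not statistics computed from the data files. They come out to roughly 21.6 tokens per sentence, 1.44 non-word errors per sentence, and about 6.7% of tokens being non-word errors.

```python
# Totals as reported in the dataset description.
NUM_SENTENCES = 200
NUM_TOKENS = 4323
NUM_NONWORD_ERRORS = 288


def derived_stats(sentences: int, tokens: int, errors: int) -> dict:
    """Derive simple ratios from the reported totals (illustrative helper)."""
    return {
        # Average sentence length in tokens.
        "tokens_per_sentence": tokens / sentences,
        # Average number of non-word errors per sentence.
        "errors_per_sentence": errors / sentences,
        # Fraction of all tokens that are non-word errors.
        "error_token_rate": errors / tokens,
    }


stats = derived_stats(NUM_SENTENCES, NUM_TOKENS, NUM_NONWORD_ERRORS)
for name, value in stats.items():
    print(f"{name}: {value:.4f}")
```

Since more than nine in ten tokens are correct, token-level accuracy alone would say little about a corrector's quality on this data.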
Since this dataset is very small, we do not define any split.
- 2022-12-01 v1.0
- Initial dataset
- ETOS v1.0 was built by M. Nirwan Samsuri for his master's thesis at the Faculty of Computer Science, Universitas Indonesia, in 2022.
Please cite the following paper if you use this dataset for your project/publication:
- Mukhlizar Nirwan Samsuri, Arlisa Yuliawati, and Ika Alfina. A Comparison of Distributed, PAM, and Trie Data Structure Dictionaries in Automatic Spelling Correction for Indonesian Formal Text. In Proceedings of the 2022 5th International Seminar on Research of Information Technology and Intelligent Systems (ISRITI) (Accepted).
You can use this dataset for free, and you do not need our permission to use it. Please cite our paper if your publication uses our data. Please note that you are not allowed to copy this dataset and share it publicly in your own repository without our permission.
ika.alfina [at] cs.ui.ac.id