Idn-tagged-corpus-CSUI is a manually tagged Indonesian part-of-speech (POS) corpus consisting of 10,000 sentences.
The corpus is stored as a tab-separated file (.tsv). Each line consists of a token and its part-of-speech tag, separated by a single tab character (\t). Sentences are separated by one empty line.
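The format described above can be read with a few lines of Python. The following is a minimal sketch, not an official loader: the function name `read_tagged_sentences` and the sample sentences are illustrative assumptions, and the code assumes every non-empty line contains exactly one token and one tag separated by a tab.

```python
from typing import Iterable, Iterator, List, Tuple

# A sentence is a list of (token, tag) pairs.
Sentence = List[Tuple[str, str]]


def read_tagged_sentences(lines: Iterable[str]) -> Iterator[Sentence]:
    """Yield sentences from lines in the token<TAB>tag format.

    Hypothetical helper: an empty line ends the current sentence.
    """
    sentence: Sentence = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            # Blank line: close the sentence collected so far, if any.
            if sentence:
                yield sentence
                sentence = []
            continue
        # Split on the first tab only; raises ValueError if no tab is present.
        token, tag = line.split("\t", 1)
        sentence.append((token, tag))
    # The last sentence may not be followed by a blank line.
    if sentence:
        yield sentence


# Illustrative sample written for this sketch, not copied from the corpus.
sample = "Saya\tPRP\nmakan\tVB\nnasi\tNN\n.\tZ\n\nDia\tPRP\ntidur\tVB\n.\tZ\n"
sentences = list(read_tagged_sentences(sample.splitlines()))
```

To read the corpus itself, pass an open file object instead of the sample, for example `open(path, encoding="utf-8")`, where `path` is wherever you saved the .tsv file.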
- Ruli Manurung
- Arawinda Dinakaramani
- Fam Rashel
- Andry Luthfi
@inproceedings{Dinakaramani2014,
author = {Dinakaramani, Arawinda and Rashel, Fam and Luthfi, Andry and Manurung, Ruli},
booktitle = {Proceedings of the International Conference on Asian Language Processing 2014, IALP 2014},
doi = {10.1109/IALP.2014.6973519},
pages = {66--69},
title = {{Designing an Indonesian part of speech tagset and manually tagged Indonesian corpus}},
year = {2014}
}
For more details about this work, please visit http://bahasa.cs.ui.ac.id/postag/corpus
- 2022
  - The dataset was moved to the IR-NLP Lab repository
  - The dataset name was changed from idn-tagged-corpus to idn-tagged-corpus-CSUI
- 2014
  - Initial release at Fam Rashel's repository
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/4.0/.
You can use this dataset for free. You don't need our permission to use it. Please cite our paper if your work uses our data in your publication. Please note that you are not allowed to create a copy of this dataset and share it publicly in your own repository without our permission.
arawinda [at] cs.ui.ac.id