Aksara is an Indonesian morphological analyzer that conforms to the Universal Dependencies (UD) v2 annotation guidelines. Aksara can perform four tasks:
- Word segmentation (tokenization)
- Lemmatization
- POS tagging
- Morphological features analysis
The output is in the CoNLL-U format.
ATTENTION: Aksara v1.1 and later versions are available in this new repository: https://github.com/ir-nlp-csui/aksara
- Clone this repository.
    git clone https://github.com/bahasa-csui/aksara
- Install Foma. On Debian/Ubuntu:
    apt-get install foma-bin
  Make sure you have the privileges to install packages, or use sudo.
- Use the package manager pip to install the required packages.
    foo@bar:~$ cd aksara
    foo@bar:~/aksara$ python3 -m pip install -r requirements.txt
  If pip is not installed, install it first.
    apt-get install python3-pip
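Before running Aksara, it can help to confirm that the external tools from the steps above are reachable. The following is a minimal sketch, not part of Aksara itself. It assumes that the foma-bin package puts an executable named `foma` on the PATH; the helper name `missing_tools` is hypothetical.

```python
import shutil


def missing_tools(tools=("foma", "python3")):
    """Return the names in `tools` that cannot be found on PATH.

    Assumption: the Foma package installs an executable called `foma`.
    """
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required tools were found.")
```

If `foma` is reported as missing, revisit the Foma installation step before installing the Python requirements.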
Run Aksara from the console with main.py:
foo@bar:~/aksara$ python3 main.py -s '“Meski kebanyakan transisi digital yang terjadi di Amerika Serikat belum pernah terjadi sebelumnya, transisi kekuasaan yang damai tidaklah begitu,” tulis asisten khusus Obama, Kori Schulman di sebuah postingan blog pada hari Senin.'
# sent_id = 1
# text = “Meski kebanyakan transisi digital yang terjadi di Amerika Serikat belum pernah terjadi sebelumnya, transisi kekuasaan yang damai tidaklah begitu,” tulis asisten khusus Obama, Kori Schulman di sebuah postingan blog pada hari Senin.
1 “ “ PUNCT _ _ _ _ _ SpaceAfter=No
2 Meski meski SCONJ _ _ _ _ _ _
3 kebanyakan banyak NOUN _ Number=Sing _ _ _ _
4 transisi transisi NOUN _ Number=Sing _ _ _ _
5 digital digital ADJ _ Degree=Pos _ _ _ _
6 yang yang/yang SCONJ/PRON _ (PRON -> PronType=Rel) _ _ _ _
7 terjadi jadi VERB _ Subcat=Tran|Voice=Pass _ _ _ _
8 di di ADP _ _ _ _ _ _
9 Amerika Amerika PROPN _ _ _ _ _ _
10 Serikat Serikat PROPN _ _ _ _ _ _
11 belum belum PART _ Polarity=Neg _ _ _ _
12 pernah pernah ADV _ _ _ _ _ _
13 terjadi jadi VERB _ Subcat=Tran|Voice=Pass _ _ _ _
14 sebelumnya sebelumnya ADV _ _ _ _ _ SpaceAfter=No
15 , , PUNCT _ _ _ _ _ _
16 transisi transisi NOUN _ Number=Sing _ _ _ _
17 kekuasaan kuasa NOUN _ Number=Sing _ _ _ _
18 yang yang/yang SCONJ/PRON _ (PRON -> PronType=Rel) _ _ _ _
19 damai damai ADJ _ Degree=Pos _ _ _ _
20-21 tidaklah _ _ _ _ _ _ _ _
20 tidak tidak PART _ Polarity=Neg _ _ _ _
21 lah lah PART _ PartType=Emp _ _ _ _
22 begitu begitu DET _ _ _ _ _ SpaceAfter=No
23 , , PUNCT _ _ _ _ _ SpaceAfter=No
24 ” ” PUNCT _ _ _ _ _ _
25 tulis tulis VERB _ _ _ _ _ _
26 asisten asisten NOUN _ Number=Sing _ _ _ _
27 khusus khusus ADJ _ Degree=Pos _ _ _ _
28 Obama Obama PROPN _ _ _ _ _ SpaceAfter=No
29 , , PUNCT _ _ _ _ _ _
30 Kori Kori PROPN _ _ _ _ _ _
31 Schulman Schulman PROPN _ _ _ _ _ _
32 di di ADP _ _ _ _ _ _
33 sebuah buah DET _ Number=Sing|PronType=Ind _ _ _ _
34 postingan posting NOUN _ Number=Sing _ _ _ _
35 blog blog NOUN _ Number=Sing _ _ _ _
36 pada pada ADP _ _ _ _ _ _
37 hari hari NOUN _ Number=Sing _ _ _ _
38 Senin Senin PROPN _ _ _ _ _ SpaceAfter=No
39 . . PUNCT _ _ _ _ _ _
foo@bar:~/aksara$
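The output above can be post-processed with a few lines of Python. The sketch below is one possible way to read it and is not part of Aksara. It assumes the ten columns are separated by tabs, as the CoNLL-U format specifies (the listing above shows them with spaces). The names `FIELDS`, `parse_conllu`, and `is_multiword_range` are hypothetical, and the sample rows are copied from tokens 19 to 21 of the output.

```python
# Column names defined by the CoNLL-U format, in order.
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")


def parse_conllu(text):
    """Yield one dict per token line, skipping comments and blank lines."""
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        columns = line.split("\t")
        if len(columns) != len(FIELDS):
            raise ValueError(
                f"expected {len(FIELDS)} tab-separated columns, "
                f"got {len(columns)}: {line!r}"
            )
        yield dict(zip(FIELDS, columns))


def is_multiword_range(token):
    """True for range lines such as 20-21, which cover several words."""
    return "-" in token["id"]


# Sample rows taken from the output above, joined with tabs (assumption).
sample = "\n".join([
    "# sent_id = 1",
    "\t".join(["19", "damai", "damai", "ADJ", "_", "Degree=Pos", "_", "_", "_", "_"]),
    "\t".join(["20-21", "tidaklah", "_", "_", "_", "_", "_", "_", "_", "_"]),
    "\t".join(["20", "tidak", "tidak", "PART", "_", "Polarity=Neg", "_", "_", "_", "_"]),
    "\t".join(["21", "lah", "lah", "PART", "_", "PartType=Emp", "_", "_", "_", "_"]),
])

tokens = list(parse_conllu(sample))
# Keep only the word lines; the 20-21 range line carries just the surface form.
words = [t for t in tokens if not is_multiword_range(t)]

for t in words:
    print(t["form"], t["lemma"], t["upos"], t["feats"])
```

Splitting on tabs rather than on arbitrary whitespace matters here, because an ambiguous analysis such as the one for "yang" contains spaces inside the features column.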
To read the input from a text file and write the output to a file:
foo@bar:~/aksara$ python3 main.py -f "input_example.txt" --output "output_example.conllu"
Processing inputs...
100%|██████████████████████████████████████████████████| 5/5 [00:32<00:00, 6.45s/it]
foo@bar:~/aksara$
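The same file-based run can be launched from a Python script. This is a minimal sketch that only wraps the command shown above and uses the `-f` and `--output` options exactly as they appear there. The function names `build_command` and `run_aksara`, and the default `repo_dir="aksara"`, are assumptions made for illustration.

```python
import subprocess
import sys


def build_command(input_path, output_path, script="main.py"):
    """Build the argument list for the file-to-file invocation shown above."""
    return [sys.executable, script, "-f", input_path, "--output", output_path]


def run_aksara(input_path, output_path, repo_dir="aksara"):
    """Run main.py inside the cloned repository (repo_dir is an assumption).

    Raises subprocess.CalledProcessError if the command exits with an error.
    """
    subprocess.run(build_command(input_path, output_path),
                   cwd=repo_dir, check=True)


# Example call, to be run from the directory that contains the clone:
# run_aksara("input_example.txt", "output_example.conllu")
```

Using `sys.executable` keeps the call on the same interpreter that has the packages from requirements.txt installed. Both paths are resolved relative to `repo_dir`, since that is the working directory of the command.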
- To be added. For now, please use the --help option.
- Aksara v1.0 was built by M. Yudistira Hanifmuti and Ika Alfina as the research project for Yudistira's undergraduate thesis at the Faculty of Computer Science, Universitas Indonesia.
- Aksara v1.0 follows the annotation guidelines for the Indonesian dependency treebank proposed by Alfina et al. (2019) and Alfina et al. (2020).
- M. Yudistira Hanifmuti and Ika Alfina. "Aksara: An Indonesian Morphological Analyzer that Conforms to the UD v2 Annotation Guidelines". In Proceedings of the 2020 International Conference on Asian Language Processing (IALP), Kuala Lumpur, Malaysia, 4-6 December 2020. (accepted)
- Ika Alfina, Daniel Zeman, Arawinda Dinakaramani, Indra Budi, and Heru Suhartanto. "Selecting the UD v2 Morphological Features for Indonesian Dependency Treebank". In Proceedings of the 2020 International Conference on Asian Language Processing (IALP), Kuala Lumpur, Malaysia, 4-6 December 2020. (accepted)
- Ika Alfina, Arawinda Dinakaramani, Mohamad Ivan Fanany, and Heru Suhartanto. "A Gold Standard Dependency Treebank for Indonesian". In Proceedings of the 33rd Pacific Asia Conference on Language, Information and Computation (PACLIC), Hakodate, Japan, 13-15 September 2019.
- 2020-10-27 v1.0
- Initial release.
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.