An automatically generated NER dataset of 48K sentences.
The dataset conforms to the Stanford NER dataset format.
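As a rough illustration of how a program might read a file in this format, the Python sketch below groups token/label lines into sentences and counts the labels. It assumes one token per line followed by whitespace and its label, with a blank line between sentences; please check the dataset file to confirm this layout before relying on it. The helper names (`read_sentences`, `count_labels`) and the sample lines are made up for illustration and are not taken from the dataset; the label names are the ones listed below.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]  # list of (token, label) pairs


def read_sentences(lines: Iterable[str]) -> List[Sentence]:
    """Group 'token<whitespace>label' lines into sentences.

    Assumption: a blank line marks the end of a sentence.
    """
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.strip()
        if not line:
            # Blank line: close the sentence collected so far, if any.
            if current:
                sentences.append(current)
                current = []
            continue
        parts = line.rsplit(None, 1)  # split off the last column as the label
        if len(parts) != 2:
            raise ValueError(f"expected 'token label', got: {raw!r}")
        current.append((parts[0], parts[1]))
    if current:
        sentences.append(current)
    return sentences


def count_labels(sentences: List[Sentence]) -> Counter:
    """Count how often each label occurs across all tokens."""
    return Counter(label for sentence in sentences for _, label in sentence)


# Invented sample lines, only to show the expected shape of the data.
sample = [
    "Ika\tPerson",
    "bekerja\tO",
    "di\tO",
    "Jakarta\tPlace",
    "",
    "Pertamina\tOrganisation",
    "berdiri\tO",
]
sentences = read_sentences(sample)
print(len(sentences), "sentences")
print(count_labels(sentences))
```

To run it on the real data, pass an open file instead of `sample`, for example `read_sentences(open(path, encoding="utf-8"))`, where `path` is wherever you saved the dataset.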
Four named entity classes are used:
"Person" for person names
"Place" for place names
"Organisation" for organization names
"O" for others
The dataset may be used for free, but if you publish a paper that uses it, please cite the following publications:
-
Ika Alfina, Septiviana Savitri, and Mohamad Ivan Fanany, "Modified DBpedia Entities Expansion for Tagging Automatically NER Dataset", in Proceeding of 9th International Conference on Advanced Computer Science and Information Systems 2017 (ICACSIS 2017).
-
Ika Alfina, Ruli Manurung, and Mohamad Ivan Fanany, "DBpedia Entities Expansion in Automatically Building Dataset for Indonesian NER", in Proceeding of 8th International Conference on Advanced Computer Science and Information Systems 2016 (ICACSIS 2016).
We suggest using the Stanford NER library.
The steps to create an NER model with the Stanford NER library are as follows:
-
Download Stanford-NER.
-
Download the dataset and its properties file (the file with the .prop extension).
-
Use the Stanford NER classifier to create the model.
For example:
java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop singgalang.prop
I recommend increasing the Java heap size so that you can train on the dataset on a computer with limited RAM. Add an option such as "-Xmx1024m" to the command, for example:
java -Xmx1024m -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop singgalang.prop
If this still doesn't work, increase the number, for example to "-Xmx8000m". This works for me :)
Let's say this step creates an NER model file named "idner-model-singgalang.ser.gz".
-
Create or use a testing dataset. Let's say the file name is "testing.txt".
-
Evaluate the NER model using the Stanford NER library.
For example:
java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier idner-model-singgalang.ser.gz -testFile testing.txt
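The steps above mention creating a testing dataset without saying how. One possible approach, sketched here in Python, is to hold out a random fraction of the sentences. This is only an illustration and not the procedure used in the cited papers. It assumes sentences in the file are separated by blank lines; the function name `split_dataset` and the 10% default are hypothetical choices.

```python
import random
import re
from typing import List, Tuple


def split_dataset(text: str, test_fraction: float = 0.1, seed: int = 0) -> Tuple[str, str]:
    """Return (training_text, testing_text), keeping each sentence whole."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    # Assumption: sentences are separated by one or more blank lines.
    blocks: List[str] = [b.strip() for b in re.split(r"\n\s*\n", text) if b.strip()]
    if len(blocks) < 2:
        raise ValueError("need at least two sentences to split")
    random.Random(seed).shuffle(blocks)  # fixed seed keeps the split reproducible
    n_test = max(1, int(len(blocks) * test_fraction))
    held_out, remaining = blocks[:n_test], blocks[n_test:]
    return "\n\n".join(remaining) + "\n", "\n\n".join(held_out) + "\n"


# Tiny invented example: four one- or two-token "sentences".
demo = "A\tO\nB\tPerson\n\nC\tO\n\nD\tPlace\n\nE\tO\n"
training_text, testing_text = split_dataset(demo, test_fraction=0.25, seed=0)
print(testing_text)
```

In practice you would read the dataset file into `text`, write `testing_text` to "testing.txt", and write `training_text` to a separate training file. If you train on that reduced file, the properties file would need to point at it; in Stanford NER properties files this is normally the `trainFile` entry, but check the .prop file shipped with the dataset.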
You can use this dataset for free. You don't need our permission to use it. Please cite our paper if your work uses our data in your publication. Please note that you are not allowed to create a copy of this dataset and share it publicly in your own repository without our permission.
Contact: ika.alfina [at] cs.ui.ac.id