This repository is a collection of linguistic models made with the udpipe R package (https://CRAN.R-project.org/package=udpipe).
The models are located in the models folder and allow you to do tokenisation, POS tagging, lemmatisation and dependency parsing for the following languages.
Language | Model name | Size | Data source | License |
---|---|---|---|---|
Afrikaans | afrikaans-ud-2.1-20180111.udpipe | 4.6 MB | UD_Afrikaans | CC BY-SA 4.0 |
Croatian | croatian-ud-2.1-20180111.udpipe | 20.4 MB | UD_Croatian | CC BY-SA 4.0 |
Czech | czech-cac-ud-2.1-20180111.udpipe | 30.0 MB | UD_Czech-CAC | CC BY-SA 4.0 |
Dutch | dutch-ud-2.1-20180111.udpipe | 15.4 MB | UD_Dutch | CC BY-SA 4.0 |
English | english-ud-2.1-20180111.udpipe | 15.6 MB | UD_English | CC BY-SA 4.0 |
Finnish | finnish-ud-2.1-20180111.udpipe | 24.2 MB | UD_Finnish | CC BY-SA 4.0 |
French | french-sequoia-ud-2.1-20180111.udpipe | 5.7 MB | UD_French-Sequoia | LGPL-LR |
Irish | irish-ud-2.1-20180111.udpipe | 2.5 MB | UD_Irish | CC BY-SA 3.0 |
Norwegian | norwegian-bokmaal-ud-2.1-20180111.udpipe | 17.6 MB | UD_Norwegian-Bokmaal | CC BY-SA 4.0 |
Persian | persian-ud-2.1-20180111.udpipe | 11.6 MB | UD_Persian | CC BY-SA 4.0 |
Polish | polish-ud-2.1-20180111.udpipe | 11.9 MB | UD_Polish | GPL-3.0 |
Portuguese | portuguese-ud-2.1-20180111.udpipe | 18.0 MB | UD_Portuguese | CC BY-SA 4.0 |
Romanian | romanian-ud-2.1-20180111.udpipe | 14.6 MB | UD_Romanian | CC BY-SA 4.0 |
Serbian | serbian-ud-2.1-20180111.udpipe | 8.6 MB | UD_Serbian | CC BY-SA 4.0 |
Slovak | slovak-ud-2.1-20180111.udpipe | 17.1 MB | UD_Slovak | CC BY-SA 4.0 |
Spanish | spanish-ancora-ud-2.1-20180111.udpipe | 20.2 MB | UD_Spanish-AnCora | GPL-3.0 |
Swedish | swedish-ud-2.1-20180111.udpipe | 7.8 MB | UD_Swedish | CC BY-SA 4.0 |
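
Each model name in the table corresponds to a file in the models folder. As a minimal sketch, the helper below builds the raw download URL for any of them. The function name `model_url` is hypothetical, and the base URL is an assumption taken from the Polish download example further down in this README.

```r
# Hypothetical helper: build the raw GitHub URL for a model file name from the table.
# The base URL is assumed from the Polish download example in this README.
model_url <- function(model_name) {
  base <- "https://raw.githubusercontent.com/bnosac/udpipe.models.ud/master/models"
  paste(base, model_name, sep = "/")
}

url <- model_url("dutch-ud-2.1-20180111.udpipe")
print(url)

# Uncomment to actually fetch the model (binary mode, as in the Polish example):
# download.file(url, destfile = "ud_dutch.udpipe", mode = "wb")
```

Keeping the base URL in one place means only the file name has to change when you switch language.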
Create an issue if a language from Universal Dependencies (http://universaldependencies.org) that you would like to have included is missing.
If you want to use the models, download the model file from the models folder and proceed as follows to annotate text with udpipe_annotate.
For more documentation on udpipe, see the vignettes at https://CRAN.R-project.org/package=udpipe
library(udpipe)
download.file(url = "https://raw.githubusercontent.com/bnosac/udpipe.models.ud/master/models/polish-ud-2.1-20180111.udpipe",
              destfile = "ud_polish.udpipe", mode = "wb")  # binary mode keeps the model file intact
m <- udpipe_load_model("ud_polish.udpipe")  # load the downloaded model from disk
x <- udpipe_annotate(m, "Budynek otrzymany od parafii wymaga remontu, a placówka nie otrzymała jeszcze żadnej dotacji.")
x <- as.data.frame(x)  # one row per token
x
doc_id paragraph_id sentence_id token_id token lemma upos xpos feats head_token_id dep_rel
doc1 1 1 1 Budynek budynek NOUN subst:sg:acc:m3 Animacy=Inan|Case=Acc|Gender=Masc|Number=Sing 5 obj
doc1 1 1 2 otrzymany otrzymany ADJ ppas:sg:acc:m3:imperf:aff Animacy=Inan|Aspect=Imp|Case=Acc|Gender=Masc|Number=Sing|Polarity=Pos|VerbForm=Part|Voice=Pass 1 amod
doc1 1 1 3 od od ADP prep:gen:nwok AdpType=Prep|Case=Gen|Variant=Short 4 case
doc1 1 1 4 parafii parafia NOUN subst:sg:gen:f Case=Gen|Gender=Fem|Number=Sing 2 obl
doc1 1 1 5 wymaga wymagac VERB fin:sg:ter:imperf Aspect=Imp|Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 0 root
doc1 1 1 6 remontu remont NOUN subst:sg:gen:m3 Animacy=Inan|Case=Gen|Gender=Masc|Number=Sing 5 obj
...
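
Once the annotation is a data frame, ordinary R tools apply. The sketch below does not call udpipe at all: it hand-copies the first six tokens from the output above into a small data frame (a subset of the columns, stored as character values, which is an assumption about the real column types) and then counts POS tags and looks up the head token of each dependency relation.

```r
# Hand-copied subset of the annotation output shown above (first six tokens only).
x <- data.frame(
  token_id      = c("1", "2", "3", "4", "5", "6"),
  token         = c("Budynek", "otrzymany", "od", "parafii", "wymaga", "remontu"),
  lemma         = c("budynek", "otrzymany", "od", "parafia", "wymagac", "remont"),
  upos          = c("NOUN", "ADJ", "ADP", "NOUN", "VERB", "NOUN"),
  head_token_id = c("5", "1", "4", "2", "0", "5"),
  dep_rel       = c("obj", "amod", "case", "obl", "root", "obj"),
  stringsAsFactors = FALSE
)

# Frequency of each universal POS tag.
upos_counts <- table(x$upos)

# Look up the token each word depends on; head_token_id "0" marks the root,
# which has no head, so match() returns NA for it.
x$head_token <- x$token[match(x$head_token_id, x$token_id)]

# Lemmas of all nouns, in sentence order.
nouns <- x$lemma[x$upos == "NOUN"]

print(upos_counts)
print(x[, c("token", "dep_rel", "head_token")])
```

The head lookup here works because the example holds a single sentence. On a full annotation, token ids restart in every sentence, so the lookup would have to be done per doc_id and sentence_id.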
- The models published here are released under licenses which allow commercial usage, in contrast to the models made available at https://github.com/jwijffels/udpipe.models.ud.2.0 (that repository contains many more models for many more languages, but these were released under a non-commercial license).
- Each of the models has its own license terms and you are responsible for complying with the license terms applicable to those parts of the models which you use. If you do not agree with the license terms, you must stop using these models and destroy all copies that you have obtained.
- The license for every model included in this repository is specified in the appropriate src/treebank directory.
- The .R source code which is used to build the models is made available under the Mozilla Public License Version 2.0.
- The data which is used when constructing the models is data from the respective treebanks available at https://github.com/UniversalDependencies and fastText word vectors released at https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md
To reproduce a model, execute the train.R code inside the corresponding src/treebank folder. Each src/treebank folder contains the R code and the log which was used to generate the model. The log also contains accuracy statistics of the model. If you want to contribute, use a similar flow and submit a pull request. Please be clear on the license of your model.
If you want to reproduce model building, install the R packages you need as follows.
## First make sure you have the necessary R packages installed
install.packages("udpipe")
install.packages("devtools")  # needed for install_github below
devtools::install_github("bmschmidt/wordVectors")
Next, you can start building the models. Proceed as follows (example for Dutch) and wait a few hours, depending on your CPU power.
Rscript src/dutch/train.R > src/dutch/train.log
Note that the wordVectors package is used only to train our own word vectors, which is useful if you want to train a model which does dependency parsing. You can also download these word vectors from Facebook or build them with the text2vec R package. We chose the wordVectors package because it was the closest to the original UDPipe paper.
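
For orientation, a training run for one treebank might look roughly like the sketch below. It is not the train.R shipped in this repository: the file names are hypothetical, the default annotation settings are an assumption, and no custom word vectors are passed to the parser. It uses `udpipe_train` from the udpipe package and skips training when the package or the CoNLL-U files are not available.

```r
# Hypothetical paths; the train.R scripts in the src/treebank folders may use other names.
train_file   <- "nl-ud-train.conllu"   # assumed training treebank in CoNLL-U format
holdout_file <- "nl-ud-dev.conllu"     # assumed held-out treebank in CoNLL-U format
model_file   <- "dutch-custom.udpipe"  # where the trained model would be written

# Only train when the udpipe package and both input files are present.
can_train <- requireNamespace("udpipe", quietly = TRUE) &&
  all(file.exists(train_file, holdout_file))

if (can_train) {
  # "default" keeps UDPipe's default settings for each annotation step (an assumption;
  # the settings actually used for the published models are in the repository's train.R files).
  m <- udpipe::udpipe_train(
    file = model_file,
    files_conllu_training = train_file,
    files_conllu_holdout = holdout_file,
    annotation_tokenizer = "default",
    annotation_tagger = "default",
    annotation_parser = "default"
  )
} else {
  message("Skipping training: udpipe package or CoNLL-U files not found.")
}
```

The guard makes the script safe to source on a machine without the treebank files; the actual training step can take hours, as noted above.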
Need support in text mining? Contact BNOSAC: http://www.bnosac.be