This project template lets you train a spaCy pipeline on any Universal Dependencies corpus (v2.5) for benchmarking purposes. The pipeline includes an experimental trainable tokenizer, an experimental edit tree lemmatizer, and the standard spaCy tagger, morphologizer and dependency parser components. The CoNLL 2018 evaluation script is used to evaluate the pipeline.

The template uses the `UD_English-EWT` treebank by default, but you can swap it out for any other available treebank. Just make sure to adjust the `ud_treebank` and `spacy_lang` settings in the config. Use `xx` (multi-language) for `spacy_lang` if a particular language is not supported by spaCy.

The tokenizer in particular is only intended for use in this generic benchmarking setup. It is not optimized for speed and it does not perform particularly well for languages without space-separated tokens. In production, custom rules for spaCy's rule-based tokenizer or a language-specific word segmenter such as jieba for Chinese or sudachipy for Japanese would be recommended instead.
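The fallback rule for `spacy_lang` can be written down as a small helper. This is only an illustrative sketch: the function name `choose_spacy_lang` and the short `SUPPORTED_LANGS` set are hypothetical and are not part of the template. The set is deliberately incomplete, so check spaCy's own language list before relying on it.

```python
# Hypothetical, deliberately incomplete set of language codes that spaCy
# ships language data for. The real list is longer.
SUPPORTED_LANGS = frozenset({"en", "de", "fr", "es", "nl", "ja", "zh"})


def choose_spacy_lang(code, supported=SUPPORTED_LANGS):
    """Return `code` if it is in `supported`, else the multi-language code "xx"."""
    return code if code in supported else "xx"
```

The returned value is what you would put in the `spacy_lang` setting for the treebank you selected.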
The `project.yml` defines the data assets required by the project, as well as the available commands and workflows. For details, see the Weasel documentation.
The following commands are defined by the project. They can be executed using `weasel run [name]`. Commands are only re-run if their inputs have changed.
| Command | Description |
| --- | --- |
| `extract` | Extract the data |
| `convert` | Convert the data to spaCy's format |
| `train-tokenizer` | Train tokenizer |
| `train-transformer` | Train transformer |
| `assemble` | Assemble full pipeline |
| `evaluate` | Evaluate on the test data and save the metrics |
| `evaluate-with-senter` | Evaluate on the test data and save the metrics |
| `package` | Package the trained model so it can be installed |
| `clean` | Remove intermediate files |
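The note that commands are only re-run when their inputs have changed can be illustrated with a checksum comparison. The sketch below is not Weasel's actual implementation. The helper names, the JSON state file and the choice of SHA-256 are all assumptions made for illustration.

```python
import hashlib
import json
from pathlib import Path


def file_checksum(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def _load_state(state_path):
    """Load the recorded checksums, or an empty dict if none exist yet."""
    state_file = Path(state_path)
    return json.loads(state_file.read_text()) if state_file.exists() else {}


def needs_rerun(name, inputs, state_path):
    """Return True if the inputs of command `name` differ from the last recorded run."""
    current = {str(p): file_checksum(p) for p in inputs}
    return _load_state(state_path).get(name) != current


def record_run(name, inputs, state_path):
    """Store the current input checksums for command `name` (hypothetical state file)."""
    state = _load_state(state_path)
    state[name] = {str(p): file_checksum(p) for p in inputs}
    Path(state_path).write_text(json.dumps(state, indent=2))
```

A runner built on this idea would call `needs_rerun` before each command and `record_run` after it succeeds, so an unchanged input leads to the command being skipped.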
The following workflows are defined by the project. They can be executed using `weasel run [name]` and will run the specified commands in order. Commands are only re-run if their inputs have changed.
| Workflow | Steps |
| --- | --- |
| `all` | `extract` → `convert` → `train-tokenizer` → `train-transformer` → `assemble` → `evaluate` → `evaluate-with-senter` → `package` |
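To make the ordering explicit, the `all` workflow can be pictured as the individual `weasel run [name]` invocations issued one after another. The following is a rough sketch of that idea, not a replacement for the workflow itself. The names `workflow_commands` and `run_workflow` are hypothetical, and stopping at the first failing step is a choice made in this sketch.

```python
import subprocess

# Steps of the `all` workflow, in the order listed in the table above.
ALL_WORKFLOW = (
    "extract",
    "convert",
    "train-tokenizer",
    "train-transformer",
    "assemble",
    "evaluate",
    "evaluate-with-senter",
    "package",
)


def workflow_commands(steps=ALL_WORKFLOW):
    """Return one `weasel run <step>` argument list per step, in order."""
    return [["weasel", "run", step] for step in steps]


def run_workflow(steps=ALL_WORKFLOW, project_dir="."):
    """Run each step from `project_dir`, raising on the first failure."""
    for argv in workflow_commands(steps):
        subprocess.run(argv, cwd=project_dir, check=True)
```

Passing a shorter tuple, for example `("extract", "convert")`, runs only the data preparation steps.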
The following assets are defined by the project. They can be fetched by running `weasel assets` in the project directory.
| File | Source | Description |
| --- | --- | --- |
| `assets/ud-treebanks-v2.5.tgz` | URL | |
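Once the archive has been fetched, it can be handy to see which treebanks it contains before changing `ud_treebank`. The sketch below uses only the Python standard library and assumes that treebank directories inside the archive are named with a `UD_` prefix, as in `UD_English-EWT`. The function name `list_treebanks` is hypothetical.

```python
import tarfile
from pathlib import PurePosixPath


def list_treebanks(archive_path):
    """Return the sorted names of `UD_*` directories found in a .tgz archive."""
    names = set()
    with tarfile.open(archive_path, "r:gz") as archive:
        for member in archive.getnames():
            # Take the first path component that looks like a treebank name.
            for part in PurePosixPath(member).parts:
                if part.startswith("UD_"):
                    names.add(part)
                    break
    return sorted(names)
```

Calling `list_treebanks("assets/ud-treebanks-v2.5.tgz")` from the project directory scans the whole archive, so it may take a moment on a large file.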