For the NER model, you can specify two data settings:
- `en_tweetwnut17`: trained on TB2+WNUT17
- `en_tweet`: trained on TB2
# prepare NER data
cd ./twitter-stanza/data/ner
python prepare_ner_data.py
cd ../..
# train the NER model
shorthand=en_tweetwnut17
python stanza/utils/training/run_ner.py ${shorthand} \
--wordvec_file ./data/wordvec/English/en.twitter100d.xz \
--eval_file data/ner/en_tweet.dev.json
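To train on TB2 only instead, swap in the other shorthand; this is a sketch assuming the same word vectors and dev file are used:
# train the NER model on TB2 only
shorthand=en_tweet
python stanza/utils/training/run_ner.py ${shorthand} \
--wordvec_file ./data/wordvec/English/en.twitter100d.xz \
--eval_file data/ner/en_tweet.dev.json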
For the syntactic models below (tokenizer, lemmatizer, POS tagger, dependency parser), you can specify two data settings:
- `en_tweetewt`: trained on TB2+UD-English-EWT
- `en_tweet`: trained on TB2
## assign the shorthand name
shorthand=en_tweet
## Data Preparation
python -m stanza.utils.datasets.prepare_tokenizer_treebank ${shorthand}
## Train
python stanza/utils/training/run_tokenizer.py ${shorthand} --no_use_mwt
## assign the shorthand name
shorthand=en_tweet
## Data Preparation
python -m stanza.utils.datasets.prepare_lemma_treebank ${shorthand}
## Train
python stanza/utils/training/run_lemma.py ${shorthand}
Note that for POS and depparse, if there is a pretrained word2vec file in the target folder, Stanza will prioritize it even if you pass the `--wordvec_file` argument. To avoid accidentally using the wrong word vectors, remember to add `--no_pretrain`.
## assign the shorthand name
shorthand=en_tweetewt
## Data Preparation
python -m stanza.utils.datasets.prepare_pos_treebank ${shorthand}
## Train
python stanza/utils/training/run_pos.py ${shorthand} --wordvec_file ../data/wordvec/English/en.twitter100d.xz --no_pretrain
## We didn't use a pretrained wordvec file in our training process. To specify one, use --wordvec_pretrain_file.
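If you do want to start from a pretrained embedding file, a minimal sketch of the variant (the .pt path is a placeholder; point it at your own pretrain file):
## optional: train from an existing pretrain file instead (placeholder path)
python stanza/utils/training/run_pos.py ${shorthand} --wordvec_pretrain_file /path/to/pretrain.pt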
## assign the shorthand name
shorthand=en_tweetewt
## Data Preparation
## Adding --gold would produce data with gold tags; following convention, we did not use gold tags for our depparse model.
python -m stanza.utils.datasets.prepare_depparse_treebank ${shorthand}
## Train
python stanza/utils/training/run_depparse.py ${shorthand} --wordvec_file ../data/wordvec/English/en.twitter100d.xz --no_pretrain
## We didn't use a pretrained word2vec file in parser training, but the pretrain.pt generated while training the POS tagger can be reused here.
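A sketch of that reuse, assuming the parser accepts the same --wordvec_pretrain_file option mentioned above for POS (the exact pretrain.pt path depends on where your POS run saved it):
## optional: reuse the pretrain file produced by the POS tagger (placeholder path; check your saved_models directory)
python stanza/utils/training/run_depparse.py ${shorthand} --wordvec_pretrain_file saved_models/pos/en_tweetewt_pretrain.pt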
In general, we follow these steps:
- `{shorthand}` should be selected from `en_tweet`, `en_tweetewt`
- `{model}` should be one of `tokenizer`, `lemma`, `pos`, `depparse`
python -m stanza.utils.datasets.prepare_{model}_treebank {shorthand}
python stanza/utils/training/run_{model}.py {shorthand}
Compared to Stanza, we do not include sentence splitting, so we commented out the check in `stanza.utils.conll18_ud_eval` that raises `UDError: There are multiple roots in a sentence`.
python stanza/utils/training/run_{model}.py {shorthand} --mode predict --score_test
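For example, to score the lemmatizer trained on the TB2-only setting, the template expands to:
python stanza/utils/training/run_lemma.py en_tweet --mode predict --score_test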