This is the repository for a pre-trained Italian language model for fast.ai (see the ULMFiT models at http://nlp.fast.ai/), trained on an Italian Wikipedia dump.
Resources available:
- Two parameterized notebooks (tested with fastai v1 rev. 51), one to tokenize the dataset and one to train the model (in this repo).
- The base CSVs with 400M tokens derived from Wikipedia, created with the official fast.ai process (step 0 of https://github.com/fastai/fastai/tree/master/courses/dl2/imdb_scripts): train csv and val csv
- The merged CSV used for language-model training: merged csv
- A serialized loader with a corpus of 100M tokens and a vocabulary of 60,000 words. We downsample the merged CSV in the data block API with `.use_partial_data(p_in_partial_data_pct, seed=42)`, using a pct of 0.25: corpus
- The corresponding itos file (used in steps 2 and 3 of the ULMFiT approach): itos
- The trained model (26.8 perplexity on the validation set): model. With this model we achieved 96.5% accuracy on sentiment classification of restaurant reviews.
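To show how the itos file fits into steps 2 and 3 of the ULMFiT pipeline, here is a minimal, dependency-free sketch of loading an itos (index-to-string) vocabulary and numericalizing tokens. The toy word list and the `itos.pkl` path are illustrative assumptions; the real 60,000-word file is linked above.

```python
import pickle

# Toy itos list standing in for the real 60,000-word vocabulary;
# a token's position in the list is its id. "xxunk" is fastai's unknown token.
itos = ["xxunk", "xxpad", "il", "gatto", "dorme"]

# In practice the vocabulary ships as a pickled list, e.g.:
# with open("itos.pkl", "rb") as f:  # hypothetical path
#     itos = pickle.load(f)

# Build the reverse mapping (string-to-index) for numericalization.
stoi = {token: idx for idx, token in enumerate(itos)}

def numericalize(tokens):
    """Map tokens to ids, falling back to the unknown-token id 0."""
    return [stoi.get(t, 0) for t in tokens]

ids = numericalize(["il", "gatto", "vola"])  # "vola" is out of vocabulary
```

fastai's `Vocab` class wraps exactly this itos/stoi pair, which is why reusing the pre-trained model requires the matching itos file.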
This work is heavily inspired by https://github.com/tchambon/deepfrench and made with ❤️ by the Quantyca Analytics Team.
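A note on the reported metric: fastai's language-model training loop reports per-token cross-entropy loss, and perplexity is its exponential, so the 26.8 validation perplexity above corresponds to a loss of roughly 3.29. A quick conversion:

```python
import math

# Perplexity is the exponential of the per-token cross-entropy loss,
# so the two metrics are interchangeable.
def perplexity(loss):
    return math.exp(loss)

def loss_from_perplexity(ppl):
    return math.log(ppl)

val_loss = loss_from_perplexity(26.8)  # ≈ 3.29
```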