This is the repository for a pre-trained Italian language model for fast.ai (see the ULMFiT models at http://nlp.fast.ai/), trained on an Italian Wikipedia dump.
Resources available:
- Two parameterized notebooks (tested with fastai v1 rev. 51), one to tokenize the dataset and one to train the model (in this repo).
- The base CSVs with 400M tokens derived from Wikipedia, created with the official fast.ai process (step 0 of https://github.com/fastai/fastai/tree/master/courses/dl2/imdb_scripts): train csv and val csv
- The merged CSV used for language-model training: merged csv
- A serialized loader with a corpus of 100M tokens and a vocabulary of 60,000 words. We downsample the merged CSV in the data block API with `.use_partial_data(p_in_partial_data_pct, seed=42)`, using a pct of 0.25: corpus
- The corresponding itos file (used in steps 2 and 3 of the ULMFiT approach): itos
- The trained model (26.8 perplexity on the validation set): model. With this model we achieved 96.5% accuracy on sentiment classification of restaurant reviews.
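To show how the itos file fits into steps 2 and 3 of the ULMFiT pipeline, here is a minimal, dependency-free sketch of loading an itos (index-to-string) vocabulary and numericalizing tokens. The toy word list and the `itos.pkl` path are illustrative assumptions; the real 60,000-word file is linked above.

```python
import pickle

# Toy itos list standing in for the real 60,000-word vocabulary;
# a token's position in the list is its id. "xxunk" is fastai's unknown token.
itos = ["xxunk", "xxpad", "il", "gatto", "dorme"]

# In practice the vocabulary ships as a pickled list, e.g.:
# with open("itos.pkl", "rb") as f:  # hypothetical path
#     itos = pickle.load(f)

# Build the reverse mapping (string-to-index) for numericalization.
stoi = {token: idx for idx, token in enumerate(itos)}

def numericalize(tokens):
    """Map tokens to ids, falling back to the unknown-token id 0."""
    return [stoi.get(t, 0) for t in tokens]

ids = numericalize(["il", "gatto", "vola"])  # "vola" is out of vocabulary
```

fastai's `Vocab` class wraps exactly this itos/stoi pair, which is why reusing the pre-trained model requires the matching itos file.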
This work is heavily inspired by https://github.com/tchambon/deepfrench and made with ❤️ by the Quantyca Analytics Team.
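A note on the reported metric: fastai's language-model training loop reports per-token cross-entropy loss, and perplexity is its exponential, so the 26.8 validation perplexity above corresponds to a loss of roughly 3.29. A quick conversion:

```python
import math

# Perplexity is the exponential of the per-token cross-entropy loss,
# so the two metrics are interchangeable.
def perplexity(loss):
    return math.exp(loss)

def loss_from_perplexity(ppl):
    return math.log(ppl)

val_loss = loss_from_perplexity(26.8)  # ≈ 3.29
```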