This repository includes code to train and evaluate a language identification model, as well as code to launch a small web application for testing the model interactively.
A pre-trained model is available here. It is trained on a balanced dataset of 240k sentences in German 🇩🇪, English 🇬🇧, French 🇫🇷, Italian 🇮🇹, Portuguese 🇵🇹 and Spanish 🇪🇸.
Accuracy 🎯: 98.73%
Confusion matrix 🤯:
Test the model with the demo application. Start the app with
$ streamlit run app.py
Then open http://localhost:8501/ in your browser.
$ conda create -n langidentify python=3.8
$ conda activate langidentify
(Please check https://pytorch.org/get-started/locally/ and select the correct command depending on your CUDA version.)
$ conda install pytorch==1.7.1 torchvision==0.8.2 torchaudio==0.7.2 cudatoolkit=10.1 -c pytorch
$ pip install -r requirements.txt
- Download the dataset here and save the sentences.csv file under data/sentences.csv.
- Filter the data with
$ python filter_dataset.py
This creates a balanced dataset and a train/val/test split of 80/10/10 for the 6 languages.
- Pre-process the dataset with
$ python preprocess_dataset.py
This generates a feature representation (most common trigrams) for the data.
- Run
$ python main.py
with mode set to TRAIN (default), EVAL or TEST. The trained model is saved under checkpoints/model.pth by default.
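To make the "most common trigrams" feature representation more concrete, here is a minimal sketch of how such features could be built. It is not the code in preprocess_dataset.py: the function names, the use of character (rather than word) trigrams, lowercasing, and the `top_k` parameter are all assumptions made for illustration.

```python
from collections import Counter


def char_trigrams(text):
    """Return all overlapping character trigrams of a lowercased sentence."""
    text = text.lower()
    return [text[i:i + 3] for i in range(len(text) - 2)]


def build_vocabulary(sentences, top_k):
    """Keep the top_k most frequent trigrams across the training sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(char_trigrams(sentence))
    return [trigram for trigram, _ in counts.most_common(top_k)]


def featurize(sentence, vocabulary):
    """Map a sentence to a vector of trigram counts, one entry per vocabulary item."""
    counts = Counter(char_trigrams(sentence))
    return [counts[trigram] for trigram in vocabulary]


# Hypothetical toy data; the real dataset has 240k sentences in 6 languages.
train_sentences = ["the cat and the dog", "der hund und die katze"]
vocabulary = build_vocabulary(train_sentences, top_k=5)
features = featurize("the the", vocabulary)
```

In a sketch like this, the vocabulary would be built from the training split only and then reused unchanged for the validation and test splits, so that every sentence maps to a vector of the same length.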
For reproducibility, the random seed is set to 42 in filter_dataset.py and 420 in main.py. Change these values to obtain different results.
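The effect of the seed on the data split can be illustrated with a short sketch. This is an assumption about how a seeded, balanced 80/10/10 split might look, not the contents of filter_dataset.py; `balanced_split`, `per_language`, and the language codes are hypothetical names chosen for the example.

```python
import random
from collections import defaultdict


def balanced_split(samples, per_language, seed=42):
    """Draw per_language sentences for each label, then split 80/10/10."""
    rng = random.Random(seed)  # local generator, leaves the global random state alone
    by_language = defaultdict(list)
    for sentence, language in samples:
        by_language[language].append(sentence)

    balanced = []
    for language in sorted(by_language):  # sorted for a deterministic order
        chosen = rng.sample(by_language[language], per_language)
        balanced.extend((sentence, language) for sentence in chosen)

    rng.shuffle(balanced)
    n_train = len(balanced) * 8 // 10  # integer arithmetic avoids float rounding
    n_val = len(balanced) // 10
    train = balanced[:n_train]
    val = balanced[n_train:n_train + n_val]
    test = balanced[n_train + n_val:]
    return train, val, test


# Hypothetical toy corpus: 50 sentences for each of three languages.
samples = [(f"sentence {i}", lang) for lang in ("deu", "eng", "fra") for i in range(50)]
train, val, test = balanced_split(samples, per_language=40, seed=42)
```

Calling the function twice with the same seed returns identical splits, while a different seed selects and orders the sentences differently, which is why changing the seed changes the reported results.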