State-Of-The-Art Language Models for Indian Languages
| Language | Model Architecture | Perplexity | IndicNLP News (Accuracy/Kappa) | Public Classification Datasets (Accuracy/Kappa) | PyTorch model | Tensorboard Logs |
|---|---|---|---|---|---|---|
| Marathi (mr) | ALBERT-base-v2-550K | 37.06 | 0.96/0.93 | iNLTK Headlines 0.94/0.89 | Checkpoint | Logs |
| | BERT-base | | | | | |
| Hindi (hi) | ALBERT-base-v2 | | | | | |
| | BERT-base | | | | | |
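
For readers unfamiliar with the metric columns, here is a minimal sketch of how such numbers are commonly computed. It assumes that perplexity is the exponential of the mean per-token negative log-likelihood and that "Kappa" means Cohen's kappa; the README does not confirm either, and the function names and toy labels below are illustrative only.

```python
import math
from collections import Counter

def perplexity(token_nlls):
    """exp of the mean per-token negative log-likelihood (natural log)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def cohen_kappa(y_true, y_pred):
    """Agreement between gold and predicted labels, corrected for chance."""
    n = len(y_true)
    p_o = accuracy(y_true, y_pred)  # observed agreement
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Agreement expected if labels were assigned independently
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if p_e == 1.0:
        return 1.0  # degenerate single-class case, defined as 1.0 in this sketch
    return (p_o - p_e) / (1 - p_e)

# Toy labels, not taken from any dataset in the table
gold = [0, 0, 1, 1]
pred = [0, 0, 1, 0]
print(accuracy(gold, pred))     # 0.75
print(cohen_kappa(gold, pred))  # 0.5
```

Kappa is lower than accuracy here because half of the agreement would be expected by chance alone.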
Trained on the IndicNLP Corpus (Max Samples=10M)
Download
Contains tokenizers for the following languages : Marathi, Hindi, Punjabi, Bengali, Odia, Gujarati, Kannada, Telugu, Malayalam, Tamil
SentencePiece
File format : {lang}_{character_coverage}_{model_type}_{vocab_size}_spiece
Added Tokens : [CLS], [SEP], [MASK]
WordPiece
File format : {lang}_{min_frequency}_wordpiece_{vocab_size}
Added Tokens : [PAD], [UNK], [CLS], [SEP], [MASK]
ByteLevelBPE
File format : {lang}_{min_frequency}_bpe_{vocab_size}
Added Tokens : <s>, <pad>, </s>, <unk>, <mask>
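
The naming patterns above can be sketched as small helpers that build the expected file stem for each tokenizer type. This assumes the fields are joined by single underscores with no spaces. The helper names and the example values (`0.9995`, `unigram`, `32000`, `2`) are hypothetical and are not taken from the download. File extensions are not specified here, so none is appended.

```python
def spiece_name(lang, character_coverage, model_type, vocab_size):
    """Stem for a SentencePiece tokenizer file."""
    return f"{lang}_{character_coverage}_{model_type}_{vocab_size}_spiece"

def wordpiece_name(lang, min_frequency, vocab_size):
    """Stem for a WordPiece tokenizer file."""
    return f"{lang}_{min_frequency}_wordpiece_{vocab_size}"

def bpe_name(lang, min_frequency, vocab_size):
    """Stem for a ByteLevelBPE tokenizer file."""
    return f"{lang}_{min_frequency}_bpe_{vocab_size}"

# Hypothetical settings, for illustration only
print(spiece_name("mr", 0.9995, "unigram", 32000))  # mr_0.9995_unigram_32000_spiece
print(wordpiece_name("hi", 2, 32000))               # hi_2_wordpiece_32000
print(bpe_name("ta", 2, 32000))                     # ta_2_bpe_32000
```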
LMs trained using IndicNLP corpus (10% held out for evaluation)
For generating the same splits :
>>> from sklearn.model_selection import train_test_split
>>> # lines (List[str]): list of lines read from the corpus
>>> train, test = train_test_split(lines, shuffle=True, test_size=0.1, random_state=19)
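
As a fuller sketch, the split can be wrapped in a function that also reads the corpus. The README does not say how `lines` was read, so this assumes a UTF-8 text file with one sample per line. The function name `split_corpus` and the path `mr.txt` are hypothetical. The resulting split matches the original only if the lines are read in the same order and with the same content.

```python
from sklearn.model_selection import train_test_split

def split_corpus(path, test_size=0.1, random_state=19):
    """Read a one-sample-per-line corpus and hold out a fixed evaluation set."""
    with open(path, encoding="utf-8") as f:
        lines = f.read().splitlines()  # assumption: one sample per line
    # Same arguments as in the snippet above; a fixed random_state makes the split repeatable
    return train_test_split(
        lines, shuffle=True, test_size=test_size, random_state=random_state
    )

# Usage (hypothetical file name):
# train, test = split_corpus("mr.txt")
```

Calling the function twice on the same file returns identical train and test lists, because the shuffle is seeded by `random_state`.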
Links -