rhn19/indic-LM

indic-LM

State-Of-The-Art Language Models for Indian Languages

Models

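| Language | Model Architecture | Perplexity | IndicNLP News (Accuracy/Kappa) | Public Classification Datasets (Accuracy/Kappa) | PyTorch Model | TensorBoard Logs |
| --- | --- | --- | --- | --- | --- | --- |
| Marathi (mr) | ALBERT-base-v2-550K | 37.06 | 0.96/0.93 | iNLTK Headlines: 0.94/0.89 | Checkpoint | Logs |
| Marathi (mr) | BERT-base | | | | | |
| Hindi (hi) | ALBERT-base-v2 | | | | | |
| Hindi (hi) | BERT-base | | | | | |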
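The table reports perplexity and accuracy/kappa without saying how they were computed. The sketch below shows the usual definitions, under two assumptions: perplexity is the exponential of the mean per-token negative log-likelihood (natural log), and "Kappa" means Cohen's kappa. The function names are illustrative and do not come from this repository.

```python
import math
from collections import Counter


def perplexity(token_nlls):
    """exp(mean negative log-likelihood); assumes natural-log losses per token."""
    return math.exp(sum(token_nlls) / len(token_nlls))


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: agreement corrected for chance agreement (assumed metric)."""
    n = len(y_true)
    p_observed = accuracy(y_true, y_pred)
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Chance agreement: product of the marginal label frequencies, summed over labels.
    p_expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if p_expected == 1.0:
        return 1.0  # degenerate case: only one label on both sides
    return (p_observed - p_expected) / (1.0 - p_expected)


if __name__ == "__main__":
    print(perplexity([math.log(4)] * 3))
    print(cohen_kappa([0, 0, 1, 1], [0, 0, 1, 0]))
```

Kappa is lower than accuracy whenever some of the agreement could be explained by the label distribution alone, which is consistent with the accuracy/kappa pairs in the table.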

Tokenizers

Trained on the IndicNLP Corpus (Max Samples=10M)

Download
Contains tokenizers for the following languages: Marathi, Hindi, Punjabi, Bengali, Odia, Gujarati, Kannada, Telugu, Malayalam, Tamil

SentencePiece
File format: {lang}_{character_coverage}_{model_type}_{vocab_size}_spiece
Added tokens: [CLS], [SEP], [MASK]

WordPiece
File format: {lang}_{min_frequency}_wordpiece_{vocab_size}
Added tokens: [PAD], [UNK], [CLS], [SEP], [MASK]

ByteLevelBPE
File format: {lang}_{min_frequency}_bpe_{vocab_size}
Added tokens: <s>, <pad>, </s>, <unk>, <mask>
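As an illustration of the three naming patterns above, here is a minimal sketch that builds the base file names. It assumes the fields are joined by plain underscores with no spaces, that `lang` is the two-letter code (for example `mr`), and that the extension is left out because the patterns do not specify one. The helper names and the example values are hypothetical.

```python
def sentencepiece_name(lang, character_coverage, model_type, vocab_size):
    """Base name for a SentencePiece tokenizer file (assumed pattern, no extension)."""
    return f"{lang}_{character_coverage}_{model_type}_{vocab_size}_spiece"


def wordpiece_name(lang, min_frequency, vocab_size):
    """Base name for a WordPiece tokenizer file (assumed pattern, no extension)."""
    return f"{lang}_{min_frequency}_wordpiece_{vocab_size}"


def bpe_name(lang, min_frequency, vocab_size):
    """Base name for a ByteLevelBPE tokenizer file (assumed pattern, no extension)."""
    return f"{lang}_{min_frequency}_bpe_{vocab_size}"


if __name__ == "__main__":
    # Example values are made up; check the downloaded archive for the real ones.
    print(sentencepiece_name("mr", 0.9995, "unigram", 30000))
    print(wordpiece_name("hi", 2, 30000))
    print(bpe_name("ta", 2, 52000))
```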

Datasets

The language models are trained on the IndicNLP corpus, with 10% of the lines held out for evaluation.
To generate the same splits:

>>> from sklearn.model_selection import train_test_split
>>> # lines (list[str]): list of lines read from the corpus
>>> train, test = train_test_split(lines, shuffle=True, test_size=0.1, random_state=19)
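The snippet above leaves `lines` undefined. A minimal end-to-end sketch follows, assuming the corpus is a UTF-8 text file with one sample per line. `load_lines` and `split_corpus` are hypothetical helpers, not part of this repository. Reproducing the exact split also depends on reading the lines in the same order and with the same filtering as the original run, which the README does not describe.

```python
from sklearn.model_selection import train_test_split


def load_lines(path):
    """Read a corpus file into a list of lines (assumes UTF-8, one sample per line)."""
    with open(path, encoding="utf-8") as f:
        return f.read().splitlines()


def split_corpus(lines, seed=19):
    """90/10 train/test split with the fixed seed quoted in this README."""
    return train_test_split(lines, shuffle=True, test_size=0.1, random_state=seed)


if __name__ == "__main__":
    # Toy stand-in for the corpus; replace with load_lines("<path to corpus>").
    toy_lines = [f"sentence {i}" for i in range(20)]
    train, test = split_corpus(toy_lines)
    print(len(train), len(test))
```

Because `random_state` is fixed, calling `split_corpus` twice on the same list returns identical train and test lists.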

Links -
