Fine-tuned Multilingual BERT and Multilingual USE for sentiment analysis in Russian. RuReviews, RuSentiment, Kaggle Russian News Dataset, LINIS Crowd, and RuTweetCorp were utilized as training data.

sismetanin/sentiment-analysis-in-russian

Sentiment Analysis in Russian

This repository contains links to models for sentiment analysis of Russian-language texts. The models were trained for two articles: Evaluation of Pre-Trained Transformers for Sentiment Analysis of Texts in Russian, and Deep Transfer Learning Baselines for Sentiment Analysis in Russian.

Evaluation of Pre-Trained Transformers for Sentiment Analysis of Texts in Russian

The TC and Banks columns belong to SentiRuEval-2016. KRND stands for Kaggle Russian News Dataset.

| Model | Score | Rank | TC micro F1 | TC macro F1 | TC F1 | Banks micro F1 | Banks macro F1 | Banks F1 | RuSentiment weighted F1 | RuSentiment F1 | KRND F1 | LINIS Crowd F1 | RuTweetCorp F1 | RuReviews F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SOTA | n/s | | 76.71 | 66.40 | 70.68 | 67.51 | 69.53 | 74.06 | 78.50 | n/s | 73.63 | 60.51 | 83.68 | 77.44 |
| XLM-RoBERTa-Large | 76.37 | 1 | 82.26 | 76.36 | 79.42 | 76.35 | 76.08 | 80.89 | 78.31 | 75.27 | 75.17 | 60.03 | 88.91 | 78.81 |
| SBERT-Large | 75.43 | 2 | 78.40 | 71.36 | 75.14 | 72.39 | 71.87 | 77.72 | 78.58 | 75.85 | 74.20 | 60.64 | 88.66 | 77.41 |
| MBARTRuSumGazeta | 74.70 | 3 | 76.06 | 68.95 | 73.04 | 72.34 | 71.93 | 77.83 | 76.71 | 73.56 | 74.18 | 60.54 | 87.22 | 77.51 |
| Conversational RuBERT | 74.44 | 4 | 76.69 | 69.09 | 73.11 | 69.44 | 68.68 | 75.56 | 77.31 | 74.40 | 73.10 | 59.95 | 87.86 | 77.78 |
| LaBSE | 74.11 | 5 | 77.00 | 69.19 | 73.55 | 70.34 | 69.83 | 76.38 | 74.94 | 70.84 | 73.20 | 59.52 | 87.89 | 78.47 |
| XLM-RoBERTa-Base | 73.60 | 6 | 76.35 | 69.37 | 73.42 | 68.45 | 67.45 | 74.05 | 74.26 | 70.44 | 71.40 | 60.19 | 87.90 | 78.28 |
| RuBERT | 73.45 | 7 | 74.03 | 66.14 | 70.75 | 66.46 | 66.40 | 73.37 | 75.49 | 71.86 | 72.15 | 60.55 | 86.99 | 77.41 |
| MBART-50-Large-Many-to-Many | 73.15 | 8 | 75.38 | 67.81 | 72.26 | 67.13 | 66.97 | 73.85 | 74.78 | 70.98 | 71.98 | 59.20 | 87.05 | 77.24 |
| SlavicBERT | 71.96 | 9 | 71.45 | 63.03 | 68.44 | 64.32 | 63.99 | 71.31 | 72.13 | 67.57 | 72.54 | 58.70 | 86.43 | 77.16 |
| EnRuDR-BERT | 71.51 | 10 | 72.56 | 64.74 | 69.07 | 61.44 | 60.21 | 68.34 | 74.19 | 69.94 | 69.33 | 56.55 | 87.12 | 77.95 |
| RuDR-BERT | 71.14 | 11 | 72.79 | 64.23 | 68.36 | 61.86 | 60.92 | 68.48 | 74.65 | 70.63 | 68.74 | 54.45 | 87.04 | 77.91 |
| MBART-50-Large | 69.46 | 12 | 70.91 | 62.67 | 67.24 | 61.12 | 60.25 | 68.41 | 72.88 | 68.63 | 70.52 | 46.39 | 86.48 | 77.52 |

Deep Transfer Learning Baselines for Sentiment Analysis in Russian

This repository contains the fine-tuned Multilingual Bidirectional Encoder Representations from Transformers (M-BERT), RuBERT, and two versions of Multilingual Universal Sentence Encoder (M-USE) for sentiment classification in Russian referenced in Deep Transfer Learning Baselines for Sentiment Analysis in Russian.

| Dataset | Measure | Current SOTA | M-BERT | RuBERT | M-USE-CNN | M-USE-Trans |
|---|---|---|---|---|---|---|
| SentiRuEval-2016 TC | F1 | 68.42 | 66.29 | 70.68 | 63.64 | 68.27 |
| | macro F1PN | 66.07 | 61.78 | 66.40 | 58.97 | 62.77 |
| | micro F1PN | 74.11 | 72.45 | 76.71 | 71.31 | 75.00 |
| SentiRuEval-2016 Banks | F1 | 74.06 | 65.31 | 72.83 | 66.71 | 72.40 |
| | macro F1PN | 69.53 | 58.00 | 65.89 | 58.73 | 65.04 |
| | micro F1PN | 71.76 | 60.52 | 68.43 | 62.41 | 68.21 |
| SentiRuEval-2016 TC | F1 | 68.54 | 60.47 | 64.39 | 60.57 | 64.28 |
| | macro F1PN | 63.47 | 53.16 | 57.76 | 52.37 | 57.60 |
| | micro F1PN | 67.51 | 57.03 | 61.38 | 57.76 | 61.18 |
| SentiRuEval-2016 Banks | F1 | 79.51 | 67.65 | 70.58 | 66.32 | 69.62 |
| | macro F1PN | 67.44 | 56.97 | 60.95 | 54.74 | 59.12 |
| | micro F1PN | 70.09 | 59.32 | 63.33 | 57.61 | 62.17 |
| RuSentiment | F1 | n/s | 71.37 | 72.03 | 66.27 | 68.60 |
| | weighted F1 | 78.50 | 75.13 | 75.71 | 71.05 | 73.42 |
| Kaggle Russian News Dataset | F1 | 70.00 | 71.36 | 73.63 | 71.27 | 72.66 |
| LINIS Crowd | F1 | 37.29 | 42.73 | 60.51 | 56.34 | 56.95 |
| RuTweetCorp (binary) | F1 | 75.95 | 83.04 | 83.69 | 81.34 | 83.17 |
| RuTweetCorp (trinary) | F1 | 78.1 | 80.10 | 80.79 | 78.39 | 79.69 |
| RuReviews | F1 | 75.45 | 77.31 | 77.44 | 76.63 | 76.94 |

The SOTA approaches for RuReviews, RuSentiment, Kaggle Russian News Dataset, and RuTweetCorp were described in (Smetanin and Komarov, 2019), (Baymurzina et al., 2019), (Shalkarbayuli et al., 2018), and (Rubtsova, 2018), respectively. The SOTA approach for LINIS Crowd was implemented based on (Koltsova et al., 2016).
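The tables above mix several F1 variants (macro, micro, weighted, and the "PN" versions). The sketch below shows how these averages differ on a toy prediction set. It is not the evaluation code used in the articles: the function names are hypothetical, the label strings are made up, and reading F1PN as F1 computed only over the positive and negative classes is an assumption.

```python
from collections import Counter


def per_class_counts(y_true, y_pred):
    """Count true positives, false positives and false negatives per class."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gets a false positive
            fn[truth] += 1  # true class gets a false negative
    return tp, fp, fn


def f1_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the class never occurs."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def f1_report(y_true, y_pred, pn_labels=("positive", "negative")):
    """Return macro, weighted, and PN-restricted F1 scores (hypothetical helper)."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = per_class_counts(y_true, y_pred)
    per_class = {c: f1_from_counts(tp[c], fp[c], fn[c]) for c in labels}
    support = Counter(y_true)  # number of true samples per class
    return {
        # Unweighted mean of per-class F1 over all classes.
        "macro_f1": sum(per_class.values()) / len(labels),
        # Per-class F1 weighted by class frequency in y_true.
        "weighted_f1": sum(per_class[c] * support[c] for c in labels) / len(y_true),
        # Assumption: "PN" means only positive and negative classes are scored.
        "macro_f1_pn": sum(per_class.get(c, 0.0) for c in pn_labels) / len(pn_labels),
        # Micro variant: pool the counts of the PN classes, then compute one F1.
        "micro_f1_pn": f1_from_counts(
            sum(tp[c] for c in pn_labels),
            sum(fp[c] for c in pn_labels),
            sum(fn[c] for c in pn_labels),
        ),
    }


# Toy data, invented for illustration only.
y_true = ["positive", "positive", "negative", "negative", "neutral", "neutral"]
y_pred = ["positive", "neutral", "negative", "negative", "neutral", "positive"]
report = f1_report(y_true, y_pred)
for name, value in report.items():
    print(f"{name}: {value:.4f}")
```

Because the PN variants ignore the neutral class when averaging, they can differ noticeably from the all-class scores on the same predictions, which is worth keeping in mind when comparing rows across the two tables.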

Sentiment Datasets in Russian

Although Russian is one of the most common languages on the World Wide Web, it is generally not as well-resourced as English, especially in the field of sentiment analysis. Even though many studies address sentiment classification, only a few of them make their datasets publicly available to the research community.

| Dataset | Classes | Average length | Max length | Train Samples | Test Samples | Overall Samples | Download Link |
|---|---|---|---|---|---|---|---|
| SentiRuEval-2016 (Loukachevitch and Rubtsova, 2016) | 3 | 87.0928 | 172 | 18,035 | 5,560 | 23,595 | Project page |
| SentiRuEval-2015 Subtask (Loukachevitch et al., 2015) | 3 | 81.4986 | 172 | 8,580 | 7,738 | 16,318 | Project page |
| RuTweetCorp (Rubtsova, 2013) | 3 | 89.1725 | 189 | n/a | n/a | 334,836 | Project page |
| LINIS Crowd (Koltsova et al., 2016) | 5 | n/a | n/a | n/a | n/a | n/a | Project page |
| RuSentiment (Rogers et al., 2018) | 5 | 82.0279 | 800 | 28,218 | 2,967 | 31,185 | Project page |
| Kaggle Russian News Dataset | 3 | 3911.8501 | 381,498 | n/a | n/a | 8,263 | Kaggle page |
| RuReviews (Smetanin and Komarov, 2019) | 3 | 130.0693 | 1,007 | n/a | n/a | 90,000 | GitHub page |
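The columns of the dataset table can be recomputed for any labelled corpus with a few lines of Python. The following is a minimal sketch, not the script used to build the table: `dataset_stats` is a hypothetical helper, the sample texts are invented, and measuring length in characters (rather than tokens) is an assumption, since the table does not state its unit.

```python
from statistics import mean


def dataset_stats(samples):
    """Summarize a list of (text, label) pairs.

    Lengths are measured in characters (Unicode code points); this unit
    is an assumption, not something stated in the table.
    """
    if not samples:
        raise ValueError("samples must not be empty")
    lengths = [len(text) for text, _ in samples]
    return {
        "classes": len({label for _, label in samples}),
        "average_length": round(mean(lengths), 4),
        "max_length": max(lengths),
        "overall_samples": len(samples),
    }


# Invented toy samples, for illustration only.
toy_samples = [
    ("Отличный товар", "positive"),
    ("Ужасное качество", "negative"),
    ("Обычная доставка", "neutral"),
]
print(dataset_stats(toy_samples))
```

Running the same helper separately on the train and test splits would give the per-split sample counts.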

Fine-Tuned Models

To download fine-tuned models for Russian, please follow the link https://yadi.sk/d/Xp5vLG_5xCQL-Q.

Citation

@article{Smetanin2020Deep,
  title = {Deep transfer learning baselines for sentiment analysis in Russian},
  author = {Sergey Smetanin and Mikhail Komarov},
  journal = {Information Processing \& Management},
  volume = {58},
  number = {3},
  pages = {102484},
  year = {2021},
  issn = {0306-4573},
  doi = {10.1016/j.ipm.2020.102484},
  url = {https://www.sciencedirect.com/science/article/pii/S0306457320309730}
}

License

See LICENSE.
