Kyrgyz language processing software, models and datasets.

alexeyev/awesome-kyrgyz-nlp

Awesome Kyrgyz NLP

A curated list of awesome Kyrgyz language processing software, models and datasets. Inspired by awesome-ML.

The main focus is on open source tools, downloadable data and research papers with code.

If you want to contribute to this list (please do), send me a pull request. Also, a listed repository should be tagged as deprecated if:

  • The repository's owners explicitly say that "this library is not maintained".
  • The repository has had no commits for a long time (2-3 years).

Table of Contents

Datasets

Corpora

  • Manas-UdS: 1.2M words, 84 literary texts, 5 genres: novel, novelette, epic, minor epic, and fairy tale; lemmata, PoS tags, rich per-text metadata.
  • kyWaC: Kyrgyz corpus from the web, 19M words, Jan 2012 [not open]
  • Kyrgyz in Leipzig Corpora Collection: Community data / Newscrawl (1M sentences) / Wikipedia (300K sentences)
  • TilCorpusu: Kyrgyz corpus, 100M words, news+fiction, made public in July 2023 (just the News part due to legal restrictions)
  • TurkLang-7: parallel corpora mentioned in the 2020 work 'First Results of the "TurkLang-7" Project: Creating Russian-Turkic Parallel Corpora and MT Systems' by Khusainov, A., Suleymanov, D., Gilmullin, R., Minsafina, A., Kubedinova, L., Abdurakhmonova, N. [status?]

Character recognition

Raw text

  • kloop corpus: 16'826 articles (sqlite3 DB file) + crawler code
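
Since the kloop corpus ships as a sqlite3 DB file, a first step is usually to find out which tables and columns it contains. The sketch below shows one way to do that with Python's standard `sqlite3` module. The helper name `describe_sqlite` and the demo table `articles` with its columns are hypothetical and are not taken from the corpus itself; the real schema may look quite different.

```python
import sqlite3


def describe_sqlite(conn):
    """Return {table_name: [column names]} for every table in the database."""
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    schema = {}
    for (name,) in tables:
        # Quote the identifier so unusual table names do not break the PRAGMA.
        quoted = '"' + name.replace('"', '""') + '"'
        # PRAGMA table_info yields one row per column; index 1 is the column name.
        columns = conn.execute(f"PRAGMA table_info({quoted})").fetchall()
        schema[name] = [col[1] for col in columns]
    return schema


# Stand-in in-memory database so the sketch runs without the real file.
# The table and column names below are assumptions made for illustration only.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT, body TEXT)")
demo.execute("INSERT INTO articles (title, body) VALUES (?, ?)", ("Мисал", "Текст"))
demo.commit()

print(describe_sqlite(demo))
```

To inspect the actual corpus, one would pass `sqlite3.connect(...)` the path of the downloaded DB file instead of `":memory:"` and then query whichever tables the helper reports.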

Morphology & Syntax

Named Entity Recognition

Text Classification

Word Similarity Data

Instructions

Machine-readable dictionaries

Pretrained models

Methods/Software

  • spaCy basic support: tokenization, stopwords, like_num
  • stanza-ky pipeline called 'ktmu'; use with care, as its handling of brackets looks very suspicious
  • kyrgyz-nlp/disambiguator project studies the ability of popular embedding models to select word senses based on the word hints (anchor words)

Morphology

Hate Speech detection

Other

Online Demos

Miscellaneous

Contributions to this list

@golden-ratio