Polyglot is a language identifier that detects documents containing text
written in more than one language and identifies the languages present.
It is an experimental project. For monolingual language detection, langid.py [1]
is a proven off-the-shelf solution.
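To make the distinction between monolingual and multilingual detection
concrete, here is a toy Python sketch. It is NOT polyglot's model and does not
use langid.py: the stopword lists, the function names (languages_present,
is_multilingual) and the min_hits threshold are all hypothetical, chosen only
to show the shape of the task (a document may yield more than one language).

```python
# Toy illustration only: this is NOT polyglot's algorithm or langid.py's API.
# Tiny, hypothetical stopword lists stand in for a real language model.
STOPWORDS = {
    "en": {"the", "is", "and", "of", "in", "this", "a"},
    "de": {"der", "die", "das", "und", "ist", "ein", "nicht"},
    "fr": {"le", "la", "les", "et", "est", "un", "une"},
}


def languages_present(text, min_hits=2):
    """Return sorted language codes with at least min_hits stopword matches."""
    tokens = text.lower().split()  # naive whitespace tokenisation
    hits = {
        lang: sum(tok in words for tok in tokens)
        for lang, words in STOPWORDS.items()
    }
    return sorted(lang for lang, n in hits.items() if n >= min_hits)


def is_multilingual(text, min_hits=2):
    """True if more than one language passes the stopword threshold."""
    return len(languages_present(text, min_hits)) > 1


if __name__ == "__main__":
    mixed = "this is the house and das ist nicht ein haus"
    print(languages_present(mixed))
    print(is_multilingual("the cat is in the garden"))
```

A monolingual identifier would return a single label for the mixed document;
the multilingual setting instead asks for the set of languages present.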
The theoretical motivation behind polyglot is described in Marco Lui, Jey Han
Lau and Timothy Baldwin, "Automatic Detection and Language Identification of
Multilingual Documents", TACL Vol 2 (2014) [2].
To re-train polyglot on custom data, use the training tools for langid.py [1]
to build a model, and convert it to polyglot's format using the script in
./polyglot/convert.py
Marco Lui <[email protected]>,
November 2013
[1] https://github.com/saffsd/langid.py
[2] https://tacl2013.cs.columbia.edu/ojs/index.php/tacl/article/view/86