Use language model perplexity to augment a small domain-specific corpus by selecting 'similar' sentences from an unlabeled corpus (e.g. web-crawled data).
Based on Ramaswamy, Printz, and Gopalakrishnan, "A Bootstrap Technique for Building Domain-Dependent Language Models", available here: http://mirlab.org/conference_papers/International_Conference/ICSLP%201998/PDF/SCAN/SL980611.PDF
Requirements:
- nltk
- kenlm (an LM toolkit in C++; install the Python extension with setup.py)
Build a seed corpus of in-domain data, then iterate:
- build a language model on the current corpus
- evaluate the perplexity of each unlabeled sentence under this model
- add up to n sentences below the perplexity threshold to the corpus

Terminate when no new sentences fall below the threshold.
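The loop above can be sketched in Python. This is a minimal illustration only: it substitutes a tiny add-one-smoothed unigram model for KenLM, and all function names and parameters (`bootstrap`, `threshold`, `n`) are illustrative rather than part of the repo:

```python
import math
from collections import Counter

def train_unigram(sentences):
    # Add-one-smoothed unigram model (stand-in for a real KenLM n-gram model).
    counts = Counter(w for s in sentences for w in s.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 slot for unseen words
    return lambda w: (counts[w] + 1) / (total + vocab)

def perplexity(prob, sentence):
    # Per-word perplexity of a sentence under the unigram model.
    words = sentence.split()
    logp = sum(math.log2(prob(w)) for w in words)
    return 2 ** (-logp / len(words))

def bootstrap(seed, unlabeled, threshold, n=2):
    corpus, pool = list(seed), list(unlabeled)
    while pool:
        prob = train_unigram(corpus)                     # build language model
        scored = sorted((perplexity(prob, s), s) for s in pool)
        # add up to n sentences whose perplexity is below the threshold
        selected = [s for ppl, s in scored[:n] if ppl < threshold]
        if not selected:                                 # no sentence qualifies: stop
            break
        corpus.extend(selected)
        pool = [s for s in pool if s not in selected]
    return corpus
```

In-domain sentences share vocabulary with the seed, so they score low perplexity and get absorbed; out-of-domain sentences stay above the threshold and are left in the pool.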
See the Jupyter notebooks for demos of selecting Jane Austen sentences from a mixture of sentences by Austen, Lewis Carroll, and Herman Melville.
For KenLM: