Domain Specific Language Models
Before clustering, we have to represent each email as a vector in such a way that similar emails have similar vector representations (semantic similarity). A simple way to achieve this is a Bag of Words representation, but the results are not satisfying. A more effective approach for text-related problems, which outperforms Bag of Words, is to use Greek word embeddings, which are available through spaCy.
It should be noted that the Greek word embeddings used here were trained as part of a 2018 Google Summer of Code project, which is available here. As we can see, their performance is really good:
```python
>>> import spacy
>>> nlp = spacy.load('el_core_news_md')
>>> doc1 = nlp('Καλησπέρα τι κάνεις')
>>> doc2 = nlp('Γειά τι γίνεται')
>>> doc3 = nlp('Ευχαριστώ πολύ για την αποδοχή')
>>> doc1.similarity(doc2)
0.9827883400844197
>>> doc1.similarity(doc3)
0.18061352570108277
>>> doc2.similarity(doc3)
0.1755354704127123
```
On the other hand, the fact that a sentence embedding is equal to the mean of its word embeddings causes some problems:
```python
>>> doc1 = nlp('Καλή συνέχεια Πάνος')
>>> doc2 = nlp('Καλή συνέχεια π')
>>> doc1.similarity(doc2)
0.4125659884568417
```
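The averaging effect can be reproduced with a small numpy sketch. The vectors below are made up for illustration (not real embeddings): two sentences share all but one token, yet that single differing token is enough to pull the averaged sentence vectors apart.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy word vectors (hypothetical, for illustration only): two shared words
# and two different final tokens, e.g. a full name vs. a single initial.
shared = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])]
name_vec = np.array([0.0, 0.0, 1.0])      # hypothetical vector for the name
initial_vec = np.array([0.5, 0.5, -1.0])  # hypothetical vector for the initial

# Mean pooling, as spaCy does for Doc.vector.
sent1 = np.mean(shared + [name_vec], axis=0)
sent2 = np.mean(shared + [initial_vec], axis=0)

# The two sentences differ in one token only, yet the averaged
# representations diverge sharply.
print(cosine(sent1, sent2))
```

This is why a single badly placed word vector (such as a rare name or a lone character) can drag the similarity of two nearly identical sentences down.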
That's why we can train our own embeddings on the user's emails. Four techniques are provided: CBOW and skip-gram (via fastText), word2vec, and doc2vec.
Since the number of clusters in the k-means algorithm is predefined, the method for selecting it is really important. Two methods are supported:
- Elbow method
- Silhouette analysis

As seen in the evaluation, silhouette analysis provided better results.
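As a rough sketch of how silhouette-based selection of the number of clusters might look (this uses scikit-learn, which is an assumption for illustration; the repository's own implementation may differ):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def choose_n_clusters(X, min_cl=2, max_cl=10):
    """Pick the k in [min_cl, max_cl] with the highest silhouette score."""
    best_k, best_score = min_cl, -1.0
    for k in range(min_cl, min(max_cl, len(X) - 1) + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        score = silhouette_score(X, labels)
        if score > best_score:
            best_k, best_score = k, score
    return best_k

# Three well-separated blobs of toy "email vectors".
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.1, size=(20, 5)) for c in (0.0, 5.0, 10.0)])
print(choose_n_clusters(X))  # should recover 3 clusters
```

The elbow method works the same way structurally, except it inspects the curve of the sum of squared errors instead of maximizing a score.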
As discussed in Datasets and Adaptation, the SRILM toolkit is used for creating the language models.
Classification can be done using either the Euclidean distance or the cosine similarity. The latter is usually more effective, since the magnitude of the vectors does not matter.
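The difference is easy to see with a small numpy example: scaling a vector leaves its cosine similarity unchanged but blows up its Euclidean distance.

```python
import numpy as np

def euclidean(a, b):
    """Euclidean distance between two vectors."""
    return float(np.linalg.norm(a - b))

def cosine_sim(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

v = np.array([1.0, 2.0, 3.0])
w = 10.0 * v  # same direction, much larger magnitude

print(euclidean(v, w))   # large: magnitude dominates
print(cosine_sim(v, w))  # ~1.0: direction is identical
```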
- The `clustering.py` tool is used to cluster emails using the k-means algorithm.
Usage:
```
$ python clustering.py -h
usage: clustering.py [-h] --input INPUT --output OUTPUT
                     [--metric {euclidean,cosine}]
                     [--vector_type {spacy,cbow,skipgram,word2vec,doc2vec}]
                     [--vector_path VECTOR_PATH] [--n_clusters N_CLUSTERS]
                     [--plot] [--method {elbow,silhouette}] [--min_cl MIN_CL]
                     [--max_cl MAX_CL] [--samples] [--keywords] [--sentence]

Tool for clustering emails using k-means algorithm. It supports: a) Various
word vectors, such as spacy, tfidf, cbow, skip-gram, word2vec and doc2vec. b)
Automatic selection of number of clusters using either silhouette or elbow
method.

optional arguments:
  -h, --help            show this help message and exit

required arguments:
  --input INPUT         Input directory
  --output OUTPUT       Output directory

optional arguments:
  --metric {euclidean,cosine}
                        Metric to be used for distance between points
  --vector_type {spacy,cbow,skipgram,word2vec,doc2vec}
                        Vector representation to be used
  --vector_path VECTOR_PATH
                        If cbow, fasttext, word2vec or doc2vec is selected,
                        give the path of the trained embeddings
  --n_clusters N_CLUSTERS
                        Number of clusters to be used (if not set,
                        automatically choose one)
  --plot                Plot sum of squared errors and silhouette scores (only
                        if n_clusters is not defined)
  --method {elbow,silhouette}
                        Method for choosing optimal number of clusters
  --min_cl MIN_CL       Minimum number of clusters (only if n_clusters is not
                        defined)
  --max_cl MAX_CL       Maximum number of clusters (only if n_clusters is not
                        defined)
  --samples             If set, a file that contains a representative email
                        for each cluster is saved
  --keywords            If set, print some keywords for each cluster
  --sentence            If set, clustering is done using the sentences of the
                        emails instead of the entire emails
```
Using the `--sentence` argument, clustering is performed on the sentences of the emails (which is more efficient) rather than on the entire emails. Also, the `--samples` and `--keywords` arguments are useful for identifying what each cluster represents.
- The `cluster2lm.py` tool is used to create a language model for each cluster.
```
$ python cluster2lm.py -h
usage: cluster2lm.py [-h] --input INPUT [--mix MIX]

Tool for converting text to language model using srilm toolkit

optional arguments:
  -h, --help     show this help message and exit

required arguments:
  --input INPUT  Input directory that contains the clusters

optional arguments:
  --mix MIX      If set, create a merged lm with the given model
```
- The `classify.py` tool is used to classify a new text into existing clusters.
```
$ python classify.py -h
usage: classify.py [-h] --input INPUT --centers CENTERS --ids IDS
                   [--metric {euclidean,cosine}]
                   [--vector_type {spacy,cbow,skipgram,word2vec,doc2vec}]
                   [--vector_path VECTOR_PATH] [--has_id] [--save]
                   [--output OUTPUT]

Tool for classifying new emails into precomputed clusters

optional arguments:
  -h, --help            show this help message and exit

required arguments:
  --input INPUT         Input transcriptions (one per line)
  --centers CENTERS     Pickle file that contains the centers of the clusters
  --ids IDS             File that contains the ids of the transcriptions

optional arguments:
  --metric {euclidean,cosine}
                        Metric to be used for finding closest cluster
  --vector_type {spacy,cbow,skipgram,word2vec,doc2vec}
                        Vector representation to be used
  --vector_path VECTOR_PATH
                        If cbow, fasttext, word2vec or doc2vec is selected,
                        specify the path of the trained embeddings
  --has_id              If set, each email contains its id at the end (Sphinx
                        format)
  --save                If set, save labels in pickle format
  --output OUTPUT       If set, name of the pickle output
```
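Conceptually, this classification reduces to a nearest-center assignment. A minimal sketch with toy 2-D vectors (the real centers would come from the pickle file produced by `clustering.py`, and the email vectors from the chosen embedding model):

```python
import numpy as np

def classify(vectors, centers, metric="cosine"):
    """Assign each vector to its nearest cluster center."""
    labels = []
    for v in vectors:
        if metric == "cosine":
            sims = [np.dot(v, c) / (np.linalg.norm(v) * np.linalg.norm(c))
                    for c in centers]
            labels.append(int(np.argmax(sims)))   # most similar center
        else:  # euclidean
            dists = [np.linalg.norm(v - c) for c in centers]
            labels.append(int(np.argmin(dists)))  # closest center
    return labels

# Toy centers and "email vectors" for illustration.
centers = np.array([[1.0, 0.0], [0.0, 1.0]])
emails = np.array([[0.9, 0.1], [0.2, 5.0]])
print(classify(emails, centers))  # → [0, 1]
```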
- The `train_vec.py` tool is used to train word vectors on a given email corpus.
```
$ python train_vec.py -h
usage: train_vec.py [-h] --input INPUT --size SIZE --type {fasttext,word2vec}
                    [--algorithm {skipgram,cbow}] --output OUTPUT

Tool for training different kinds of embeddings on email corpus

optional arguments:
  -h, --help            show this help message and exit

required arguments:
  --input INPUT         Input directory that contains the emails
  --size SIZE           Dimensionality of word vectors
  --type {fasttext,word2vec}
                        Choose between FastText and Word2Vec
  --output OUTPUT       Output directory

optional arguments:
  --algorithm {skipgram,cbow}
                        Training algorithm to be used when choosing fasttext
                        embeddings
```
- The `train_doc.py` tool is used to train document vectors on a given email corpus.
```
$ python train_doc.py -h
usage: train_doc.py [-h] --input INPUT --size SIZE --output OUTPUT

Tool for training doc2vec embeddings on email corpus

optional arguments:
  -h, --help       show this help message and exit

required arguments:
  --input INPUT    Input directory that contains the emails
  --size SIZE      Dimensionality of vectors
  --output OUTPUT  Output directory
```