
Domain Specific Language Models

Panagiotis Antoniadis edited this page Jul 21, 2019 · 2 revisions

Procedure

Vector representation

Before clustering, we have to represent each email as a vector in such a way that semantically similar emails get similar vector representations. A simple way to achieve this is a Bag-of-Words representation, but the results are not satisfying. A more effective approach for text-related problems, which outperforms Bag-of-Words, is to use Greek word embeddings, which are available through spaCy.

It should be noted that the Greek word embeddings used here were trained as part of a 2018 Google Summer of Code project that is available here. As we can see, their performance is quite good:

>>> import spacy
>>> nlp = spacy.load('el_core_news_md')
>>> doc1 = nlp('Καλησπέρα τι κάνεις')
>>> doc2 = nlp('Γειά τι γίνεται')
>>> doc3 = nlp('Ευχαριστώ πολύ για την αποδοχή')
>>> doc1.similarity(doc2)
0.9827883400844197
>>> doc1.similarity(doc3)
0.18061352570108277
>>> doc2.similarity(doc3)
0.1755354704127123

On the other hand, the fact that a sentence embedding is equal to the mean of its word embeddings causes some problems:

>>> doc1 = nlp('Καλή συνέχεια Πάνος')
>>> doc2 = nlp('Καλή συνέχεια π')
>>> doc1.similarity(doc2)
0.4125659884568417
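The effect above can be reproduced with plain NumPy: when a sentence vector is the mean of its token vectors, a single unusual token (here a rare name) can dominate the average. The toy vectors below are hypothetical illustrations, not the actual spaCy embeddings; the large magnitudes of the name vectors stand in for the idiosyncratic directions such tokens often have.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical word vectors: the two common words are close to each
# other, while the two name tokens point in very different directions.
vectors = {
    "καλή":     np.array([1.0, 0.1, 0.0]),
    "συνέχεια": np.array([0.9, 0.2, 0.0]),
    "Πάνος":    np.array([0.0, 5.0, 0.0]),
    "π":        np.array([0.0, 0.0, 5.0]),
}

def sentence_vector(tokens):
    # spaCy-style sentence embedding: the mean of the token vectors
    return np.mean([vectors[t] for t in tokens], axis=0)

v1 = sentence_vector(["καλή", "συνέχεια", "Πάνος"])
v2 = sentence_vector(["καλή", "συνέχεια", "π"])
# Despite sharing 2 of 3 words, the averaged vectors drift far apart.
print(cosine(v1, v2))  # low similarity, well below 0.5
```

Training embeddings on the user's own emails mitigates this, since the names and abbreviations that appear in the corpus get meaningful vectors instead of noisy ones.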

That is why we can train our own embeddings on the emails of the user. Four techniques are provided: FastText with CBOW, FastText with skip-gram, Word2Vec, and Doc2Vec.

Clustering

Since the number of clusters is predefined in the k-means algorithm, the method for specifying it is really important. Two methods are supported:

  • Elbow method
  • Silhouette analysis

As seen in the evaluation, silhouette analysis provided better results.
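The silhouette-based selection can be sketched with scikit-learn: fit k-means for each candidate k and keep the k with the highest mean silhouette score. The synthetic blobs below are a stand-in for the email vectors; the range of candidate values mirrors the tool's --min_cl/--max_cl options.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the email vectors (3 well-separated groups).
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=42)

def pick_n_clusters(X, min_cl=2, max_cl=8):
    # Fit k-means for each candidate k and return the k that
    # maximizes the mean silhouette score over all points.
    scores = {}
    for k in range(min_cl, max_cl + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return max(scores, key=scores.get)

print(pick_n_clusters(X))  # 3 for this toy data
```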

Create a language model for each cluster

As discussed in Datasets and Adaptation, the SRILM toolkit is used for creating the language models.
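For reference, building an n-gram model with SRILM boils down to an ngram-count invocation per cluster. The sketch below only constructs the command; the smoothing flags (-interpolate, -kndiscount) are an illustrative choice, not necessarily the options the tool uses.

```python
def srilm_command(text_path, lm_path, order=3):
    # Build the SRILM ngram-count invocation that turns a cluster's
    # text into an ARPA-format language model.
    return [
        "ngram-count",
        "-text", text_path,    # training text, one sentence per line
        "-lm", lm_path,        # output ARPA model
        "-order", str(order),  # n-gram order
        "-interpolate", "-kndiscount",  # smoothing (illustrative choice)
    ]

cmd = srilm_command("cluster_0.txt", "cluster_0.lm")
print(" ".join(cmd))
```

Running the command (e.g. via subprocess.run) requires SRILM to be installed and on the PATH.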

Classify new data in existing clusters

Classification can be done using either the Euclidean distance or the cosine similarity. The latter is usually more effective, since the magnitude of the vectors does not matter.
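The difference is easy to see in NumPy: two vectors pointing in the same direction but with different lengths are far apart under the Euclidean distance, yet maximally similar under cosine similarity.

```python
import numpy as np

def euclidean(a, b):
    # Straight-line distance: sensitive to vector magnitude
    return float(np.linalg.norm(a - b))

def cosine_sim(a, b):
    # Angle-based similarity: insensitive to vector magnitude
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
b = 10 * a  # same direction, ten times the magnitude

print(euclidean(a, b))   # large: 9 * ||a||
print(cosine_sim(a, b))  # 1.0: the direction is identical
```

Since averaged email vectors can differ in length simply because of email length or vocabulary, cosine similarity is the safer default for assigning a new email to its closest cluster center.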

Implemented Tools

  • The clustering.py tool is used to cluster emails using the k-means algorithm.

Usage:

$ python clustering.py -h
usage: clustering.py [-h] --input INPUT --output OUTPUT
                     [--metric {euclidean,cosine}]
                     [--vector_type {spacy,cbow,skipgram,word2vec,doc2vec}]
                     [--vector_path VECTOR_PATH] [--n_clusters N_CLUSTERS]
                     [--plot] [--method {elbow,silhouette}] [--min_cl MIN_CL]
                     [--max_cl MAX_CL] [--samples] [--keywords] [--sentence]

Tool for clustering emails using k-means algorithm. It supports: a) Various
word vectors, such as spacy, tfidf, cbow, skip-gram, word2vec and doc2vec. b)
Automatic selection of number of clusters using either silhouette or elbow
method.

optional arguments:
  -h, --help            show this help message and exit

required arguments:
  --input INPUT         Input directory
  --output OUTPUT       Output directory

optional arguments:
  --metric {euclidean,cosine}
                        Metric to be used for distance between points
  --vector_type {spacy,cbow,skipgram,word2vec,doc2vec}
                        Vector representation to be used
  --vector_path VECTOR_PATH
                        If cbow, fasttext, word2vec or doc2vec is selected,
                        give the path of the trained embeddings
  --n_clusters N_CLUSTERS
                        Number of clusters to be used (if not set,
                        automatically choose one)
  --plot                Plot sum of squared errors and silhouette scores (only
                        if n_clusters is not defined)
  --method {elbow,silhouette}
                        Method for choosing optimal number of clusters
  --min_cl MIN_CL       Minimum number of clusters (only if n_clusters is not
                        defined)
  --max_cl MAX_CL       Maximum number of clusters (only if n_clusters is not
                        defined)
  --samples             If set, a file that contains a representative email
                        for each cluster is saved
  --keywords            If set, print some keywords for each cluster
  --sentence            If set, clustering is done using the sentences of the
                        emails instead of the entire emails

Using the --sentence argument, clustering is performed on the sentences of the emails rather than on the emails as a whole, which is more efficient. Also, the --samples and --keywords arguments are useful for identifying the topic of each cluster.

  • The cluster2lm.py tool is used to create a language model for each cluster.
$ python cluster2lm.py -h
usage: cluster2lm.py [-h] --input INPUT [--mix MIX]

Tool for converting text to language model using srilm toolkit

optional arguments:
  -h, --help     show this help message and exit

required arguments:
  --input INPUT  Input directory that contains the clusters

optional arguments:
  --mix MIX      If set, create a merged lm with the given model

  • The classify.py tool is used to classify new text into existing clusters.
$ python classify.py -h
usage: classify.py [-h] --input INPUT --centers CENTERS --ids IDS
                   [--metric {euclidean,cosine}]
                   [--vector_type {spacy,cbow,skipgram,word2vec,doc2vec}]
                   [--vector_path VECTOR_PATH] [--has_id] [--save]
                   [--output OUTPUT]

Tool for classifying new emails in precomputed clusters

optional arguments:
  -h, --help            show this help message and exit

required arguments:
  --input INPUT         Input transcriptions (one per line)
  --centers CENTERS     Pickle file that contains the centers of the clusters
  --ids IDS             File that contains the ids of the transcriptions

optional arguments:
  --metric {euclidean,cosine}
                        Metric to be used for finding closest cluster
  --vector_type {spacy,cbow,skipgram,word2vec,doc2vec}
                        Vector representation to be used
  --vector_path VECTOR_PATH
                        If cbow, fasttext, word2vec or doc2vec is selected,
                        specify the path of the trained embeddings
  --has_id              If set, each email contains its id at the end (Sphinx
                        format)
  --save                If set, save labels in pickle format
  --output OUTPUT       If set, name of the pickle output

  • The train_vec.py tool is used to train word vectors on a given email corpus.
$ python train_vec.py -h
usage: train_vec.py [-h] --input INPUT --size SIZE --type {fasttext,word2vec}
                    [--algorithm {skipgram,cbow}] --output OUTPUT

Tool for training different kind of embeddings on email corpus

optional arguments:
  -h, --help            show this help message and exit

required arguments:
  --input INPUT         Input directory that contains the emails
  --size SIZE           Dimensionality of word vectors
  --type {fasttext,word2vec}
                        Choose between FastText and Word2Vec
  --output OUTPUT       Output directory

optional arguments:
  --algorithm {skipgram,cbow}
                        Training algorithm to be used when choosing fasttext
                        embeddings
  • The train_doc.py tool is used to train document vectors on a given email corpus.
$ python train_doc.py -h
usage: train_doc.py [-h] --input INPUT --size SIZE --output OUTPUT

Tool for training doc2vec embeddings on email corpus

optional arguments:
  -h, --help       show this help message and exit

required arguments:
  --input INPUT    Input directory that contains the emails
  --size SIZE      Dimensionality of vectors
  --output OUTPUT  Output directory