Abstract
A local library wants to implement a categorization system that groups books by the similarity of their genres and authors. This text clustering model will enable the librarian to assign books to groups and subgroups. The unsupervised learning model applies several transformation and clustering techniques and selects the algorithm that assigns the most accurate labels to the books. Five books, written by different authors and drawn from different genres, were extracted from the Project Gutenberg website. The data was preprocessed and partitioned into 200 partitions of 150 words each. Several text-processing and dimensionality-reduction techniques were used to transform the data, and unsupervised clustering algorithms were run on every transformation to determine the best-performing combination.
Design
The purpose of the model is to use text clustering techniques to assign books to similar groups, or clusters. The model uses data obtained from the Project Gutenberg website. The data was preprocessed with the NLTK library by removing stop words and lemmatizing every word to its base form. BOW, TF-IDF, and LDA techniques were used to transform the data, and K-means and hierarchical clustering algorithms were run on each of the three transformations. The V-measure score (computed against the true author labels) and the silhouette score were used to evaluate clustering performance. Finally, silhouette scores for different numbers of clusters were compared on the LDA-transformed data to find the optimum K for the K-means algorithm.
Data
The data was obtained from the Project Gutenberg website (https://www.gutenberg.org). The following books were extracted from this website using the Python requests library (a sketch of this step follows the list):
- Chaldea
- A Book About Lawyers
- Darwinism
- The Vicomte de Bragelonne
- A Popular History of Astronomy During the Nineteenth Century
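A minimal sketch of the extraction step is shown below. The plain-text URL pattern is the usual gutenberg.org layout, and the book IDs are hypothetical placeholders rather than the IDs actually used in the project:

    import requests

    # Hypothetical Gutenberg IDs for books a-e; the real IDs used in the
    # project are not given in the report.
    book_ids = {"a": 1001, "b": 1002, "c": 1003, "d": 1004, "e": 1005}

    raw_texts = {}
    for label, gid in book_ids.items():
        url = f"https://www.gutenberg.org/files/{gid}/{gid}-0.txt"
        response = requests.get(url)
        response.raise_for_status()  # fail loudly on a bad download
        raw_texts[label] = response.text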
All the books are by different authors and from different genres. The data was preprocessed with NLTK: stop words and garbage characters were removed, all words were converted to lower case, and lemmatization was performed to reduce every word to its base form. The books were randomly partitioned into 200 partitions of 150 words each, and the books were labelled [a, b, c, d, e]. A sketch of this step follows.
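A minimal sketch of the preprocessing and partitioning, assuming 200 partitions are drawn per book and reusing the raw_texts dictionary from the extraction sketch above:

    import random
    import re

    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer

    nltk.download("stopwords")
    nltk.download("wordnet")

    stop_words = set(stopwords.words("english"))
    lemmatizer = WordNetLemmatizer()

    def preprocess(text):
        # Lower-case, keep alphabetic tokens only, drop stop words, lemmatize.
        tokens = re.findall(r"[a-z]+", text.lower())
        return [lemmatizer.lemmatize(t) for t in tokens if t not in stop_words]

    def make_partitions(tokens, n_parts=200, part_len=150):
        # Split into consecutive 150-word chunks, then keep 200 chosen at
        # random (assumes each book yields at least 200 such chunks).
        chunks = [tokens[i:i + part_len]
                  for i in range(0, len(tokens) - part_len + 1, part_len)]
        return [" ".join(c) for c in random.sample(chunks, n_parts)]

    documents, labels = [], []
    for label, text in raw_texts.items():
        for part in make_partitions(preprocess(text)):
            documents.append(part)
            labels.append(label)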
Algorithms
• Data preprocessing:
o Removed stop words and garbage characters, converted all words to lower case, and performed lemmatization
o Unigrams/bigrams and a word cloud were used to find the most frequently used words in all 5 books
• Feature Engineering:
o Bag of Words (BOW)
o TF-IDF
o LDA
• Clustering Algorithms:
o K-means
o Hierarchical clustering
• Evaluation scores:
o V-measure
o Silhouette score
Tools
• Requests: for extracting books from their URLs.
• Pandas and NumPy: for data cleaning and preprocessing.
• NLTK, gensim, scikit-learn: for text processing, topic modeling, clustering, and evaluation.
• Matplotlib, Seaborn, Wordcloud, pyLDAvis, SciPy: for visualizing the data.
Communication
Unigram and bigram frequencies were computed to surface the most frequently used words in each of the five books.
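A minimal sketch of this frequency analysis, reusing the preprocess helper from the Data section sketch:

    from collections import Counter

    from nltk import bigrams

    # Top-10 unigrams and bigrams per book, from the preprocessed tokens.
    for label, text in raw_texts.items():
        tokens = preprocess(text)
        print(label, Counter(tokens).most_common(10))
        print(label, Counter(bigrams(tokens)).most_common(10))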
A word cloud was used to show the 50 most frequently used words in each of the five books.
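A sketch of the word cloud step; the book label "a" is illustrative:

    import matplotlib.pyplot as plt
    from wordcloud import WordCloud

    # Build a 50-word cloud from one book's preprocessed tokens.
    cloud = WordCloud(max_words=50, background_color="white")
    cloud.generate(" ".join(preprocess(raw_texts["a"])))
    plt.imshow(cloud, interpolation="bilinear")
    plt.axis("off")
    plt.show()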
Feature engineering was performed using the following techniques (a combined sketch follows the list):
- BOW transformation
- TF-IDF transformation
- LDA transformation
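A combined sketch of the three transformations. scikit-learn is used throughout, including its LatentDirichletAllocation; the tool list mentions gensim for topic modeling, so the LDA implementation below is an assumption:

    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

    # BOW counts and TF-IDF weights over the partitions.
    X_bow = CountVectorizer().fit_transform(documents)
    X_tfidf = TfidfVectorizer().fit_transform(documents)

    # LDA topic proportions, with one topic per book as a starting point.
    lda = LatentDirichletAllocation(n_components=5, random_state=42)
    X_lda = lda.fit_transform(X_bow)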
Both clustering algorithms were run on each of the three transformations (a combined sketch follows the list):
1. BOW and K-means algorithm
2. BOW and hierarchical clustering algorithm
3. TF-IDF and K-means algorithm
4. TF-IDF and hierarchical clustering algorithm
5. LDA and K-means algorithm
6. LDA and hierarchical clustering algorithm
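A sketch of running both algorithms on every transformation, with five clusters (one per book):

    from sklearn.cluster import AgglomerativeClustering, KMeans

    transformations = {"BOW": X_bow, "TF-IDF": X_tfidf, "LDA": X_lda}

    predictions = {}
    for name, X in transformations.items():
        # AgglomerativeClustering needs a dense array; LDA output already is.
        dense = X if name == "LDA" else X.toarray()
        predictions[(name, "K-means")] = (
            KMeans(n_clusters=5, random_state=42).fit_predict(dense))
        predictions[(name, "Hierarchical")] = (
            AgglomerativeClustering(n_clusters=5).fit_predict(dense))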
The V-measure score was used to evaluate clustering performance against the true author labels. The silhouette score was used to evaluate how cohesive and well separated the resulting clusters are in the feature space.
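A sketch of both evaluations using scikit-learn's v_measure_score and silhouette_score:

    from sklearn.metrics import silhouette_score, v_measure_score

    for (name, algo), pred in predictions.items():
        X = transformations[name]
        dense = X if name == "LDA" else X.toarray()
        print(f"{name} + {algo}: "
              f"V-measure={v_measure_score(labels, pred):.3f}, "
              f"silhouette={silhouette_score(dense, pred):.3f}")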
Silhouette scores were then compared for different values of K on the LDA-transformed data using the K-means algorithm.
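A sketch of this comparison; the exact range of K values tried in the project is not stated, so the range below is illustrative:

    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    # Average silhouette score for a range of candidate K values.
    for k in range(2, 9):
        pred = KMeans(n_clusters=k, random_state=42).fit_predict(X_lda)
        print(f"K={k}: silhouette={silhouette_score(X_lda, pred):.3f}")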
K = 5 was identified as the optimum number of clusters, since it gives the highest average silhouette score and produces evenly sized, clearly separated clusters.
Hence, the K-means algorithm on the LDA-transformed data with five clusters is the best algorithm for this text clustering project.