Abstract
A local library wants to implement a categorization system that groups books by the similarity of their genres and authors. This text clustering model will enable the librarian to assign books to groups and subgroups. The unsupervised learning model applies several transformation and clustering techniques and selects the algorithm that assigns the most accurate labels to the books. Five books, written by different authors and drawn from different genres, were extracted from the Project Gutenberg website. The data was preprocessed and partitioned into 200 partitions of 150 words each. Several text-processing and dimensionality-reduction techniques were used to transform the data, and unsupervised clustering algorithms were run on every transformation to determine the best-performing combination.
Design
The purpose of the model is to use text clustering techniques to assign books to similar groups, or clusters. The model uses data obtained from the Project Gutenberg website. The data was preprocessed with the NLTK library by removing stop words and lemmatizing every word to its base form. BOW, TF-IDF, and LDA techniques were used to transform the data, and K-means and hierarchical clustering algorithms were run on each of the three transformations. The V-measure score (computed against the true author labels) and the silhouette score were used to evaluate clustering performance. Finally, silhouette scores for different numbers of clusters were compared on the LDA-transformed data to find the optimum K for the K-means algorithm.
Data
The data was obtained from the Project Gutenberg website (https://www.gutenberg.org). The following books were extracted from this website using the Python requests library (a sketch of this step follows the list):
- Chaldea
- A Book About Lawyers
- Darwinism
- The Vicomte de Bragelonne
- A Popular History of Astronomy During the Nineteenth Century
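A minimal sketch of the extraction step is shown below. The plain-text URL pattern is the usual gutenberg.org layout, and the book IDs are hypothetical placeholders rather than the IDs actually used in the project:

    import requests

    # Hypothetical Gutenberg IDs for books a-e; the real IDs used in the
    # project are not given in the report.
    book_ids = {"a": 1001, "b": 1002, "c": 1003, "d": 1004, "e": 1005}

    raw_texts = {}
    for label, gid in book_ids.items():
        url = f"https://www.gutenberg.org/files/{gid}/{gid}-0.txt"
        response = requests.get(url)
        response.raise_for_status()  # fail loudly on a bad download
        raw_texts[label] = response.text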
All the books are by different authors and from different genres. The data was preprocessed with NLTK: stop words and garbage characters were removed, all words were converted to lower case, and lemmatization was performed to reduce every word to its base form. The books were randomly partitioned into 200 partitions of 150 words each, and the books were labelled [a, b, c, d, e]. A sketch of this step follows.
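A minimal sketch of the preprocessing and partitioning, assuming 200 partitions are drawn per book and reusing the raw_texts dictionary from the extraction sketch above:

    import random
    import re

    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer

    nltk.download("stopwords")
    nltk.download("wordnet")

    stop_words = set(stopwords.words("english"))
    lemmatizer = WordNetLemmatizer()

    def preprocess(text):
        # Lower-case, keep alphabetic tokens only, drop stop words, lemmatize.
        tokens = re.findall(r"[a-z]+", text.lower())
        return [lemmatizer.lemmatize(t) for t in tokens if t not in stop_words]

    def make_partitions(tokens, n_parts=200, part_len=150):
        # Split into consecutive 150-word chunks, then keep 200 chosen at
        # random (assumes each book yields at least 200 such chunks).
        chunks = [tokens[i:i + part_len]
                  for i in range(0, len(tokens) - part_len + 1, part_len)]
        return [" ".join(c) for c in random.sample(chunks, n_parts)]

    documents, labels = [], []
    for label, text in raw_texts.items():
        for part in make_partitions(preprocess(text)):
            documents.append(part)
            labels.append(label)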
Algorithms
• Data preprocessing:
o Removed stop words and garbage characters, converted all words to lower case, and performed lemmatization
o Unigrams/bigrams and a word cloud were used to find the most frequently used words in all 5 books
• Feature Engineering:
o Bag of Words (BOW)
o TF-IDF
o LDA
• Clustering Algorithms:
o K-means
o Hierarchical clustering
• Evaluation scores:
o V-measure
o Silhouette score
Tools
• Requests: for extracting books from their URLs.
• Pandas and NumPy: for data cleaning and preprocessing.
• NLTK, gensim, scikit-learn: for text processing, topic modeling, clustering, and evaluation.
• Matplotlib, Seaborn, Wordcloud, pyLDAvis, SciPy: for visualizing the data.
Communication
Unigram and bigram frequencies were computed to surface the most frequently used words in each of the five books.
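A minimal sketch of this frequency analysis, reusing the preprocess helper from the Data section sketch:

    from collections import Counter

    from nltk import bigrams

    # Top-10 unigrams and bigrams per book, from the preprocessed tokens.
    for label, text in raw_texts.items():
        tokens = preprocess(text)
        print(label, Counter(tokens).most_common(10))
        print(label, Counter(bigrams(tokens)).most_common(10))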
A word cloud was used to show the 50 most frequently used words in each of the five books.
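A sketch of the word cloud step; the book label "a" is illustrative:

    import matplotlib.pyplot as plt
    from wordcloud import WordCloud

    # Build a 50-word cloud from one book's preprocessed tokens.
    cloud = WordCloud(max_words=50, background_color="white")
    cloud.generate(" ".join(preprocess(raw_texts["a"])))
    plt.imshow(cloud, interpolation="bilinear")
    plt.axis("off")
    plt.show()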
Feature engineering was performed using the following techniques (a combined sketch follows the list):
- BOW transformation
- TF-IDF transformation
- LDA transformation
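A combined sketch of the three transformations. scikit-learn is used throughout, including its LatentDirichletAllocation; the tool list mentions gensim for topic modeling, so the LDA implementation below is an assumption:

    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

    # BOW counts and TF-IDF weights over the partitions.
    X_bow = CountVectorizer().fit_transform(documents)
    X_tfidf = TfidfVectorizer().fit_transform(documents)

    # LDA topic proportions, with one topic per book as a starting point.
    lda = LatentDirichletAllocation(n_components=5, random_state=42)
    X_lda = lda.fit_transform(X_bow)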
Both clustering algorithms were run on each of the three transformations (a combined sketch follows the list):
1. BOW and K-means algorithm
2. BOW and hierarchical clustering algorithm
3. TF-IDF and K-means algorithm
4. TF-IDF and hierarchical clustering algorithm
5. LDA and K-means algorithm
6. LDA and hierarchical clustering algorithm
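A sketch of running both algorithms on every transformation, with five clusters (one per book):

    from sklearn.cluster import AgglomerativeClustering, KMeans

    transformations = {"BOW": X_bow, "TF-IDF": X_tfidf, "LDA": X_lda}

    predictions = {}
    for name, X in transformations.items():
        # AgglomerativeClustering needs a dense array; LDA output already is.
        dense = X if name == "LDA" else X.toarray()
        predictions[(name, "K-means")] = (
            KMeans(n_clusters=5, random_state=42).fit_predict(dense))
        predictions[(name, "Hierarchical")] = (
            AgglomerativeClustering(n_clusters=5).fit_predict(dense))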
The V-measure score was used to evaluate clustering performance against the true author labels. The silhouette score was used to evaluate how cohesive and well separated the resulting clusters are in the feature space.
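A sketch of both evaluations using scikit-learn's v_measure_score and silhouette_score:

    from sklearn.metrics import silhouette_score, v_measure_score

    for (name, algo), pred in predictions.items():
        X = transformations[name]
        dense = X if name == "LDA" else X.toarray()
        print(f"{name} + {algo}: "
              f"V-measure={v_measure_score(labels, pred):.3f}, "
              f"silhouette={silhouette_score(dense, pred):.3f}")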
Silhouette scores were then compared for different values of K on the LDA-transformed data using the K-means algorithm.
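A sketch of this comparison; the exact range of K values tried in the project is not stated, so the range below is illustrative:

    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    # Average silhouette score for a range of candidate K values.
    for k in range(2, 9):
        pred = KMeans(n_clusters=k, random_state=42).fit_predict(X_lda)
        print(f"K={k}: silhouette={silhouette_score(X_lda, pred):.3f}")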
K = 5 was identified as the optimum number of clusters, since it gives the highest average silhouette score and produces evenly sized, clearly separated clusters.
Hence, the K-means algorithm on the LDA-transformed data with five clusters is the best algorithm for this text clustering project.