Find similar documents using LSI and cosine similarity matrix

This program uses Gensim to find documents similar to a given query document. It can serve as the basis for a recommendation system, knowledge-base automation, and similar applications.

Latent Semantic Indexing (LSI) builds on the assumption that words that are used in the same contexts tend to have similar meanings. It can extract the conceptual content of a body of text by establishing associations between terms that occur in similar contexts.

In this experiment, we take a subset of Wikipedia pages (music artist pages), save them into a MySQL database, and build a similarity matrix based on an LSI transformation of the documents. We can then use this index to suggest "similar" artists, or more precisely, artists whose Wikipedia pages are similar.
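As an illustration of what a cosine similarity matrix contains, the following sketch computes one in plain Python over a few dense vectors. The artist names and vector values are hypothetical, and the helper functions are written for this example only; the project itself relies on Gensim for this step.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


def similarity_matrix(vectors):
    """Pairwise cosine similarities; entry [i][j] compares vectors i and j."""
    return [[cosine_similarity(a, b) for b in vectors] for a in vectors]


# Hypothetical two-topic LSI vectors for three artist pages.
artist_vectors = {
    "artist_a": [0.9, 0.1],
    "artist_b": [0.8, 0.2],
    "artist_c": [0.1, 0.9],
}
names = list(artist_vectors)
matrix = similarity_matrix([artist_vectors[name] for name in names])


def most_similar(name):
    """Other artists ranked by similarity to `name`, best match first."""
    row = matrix[names.index(name)]
    ranked = sorted(zip(names, row), key=lambda pair: pair[1], reverse=True)
    return [(other, score) for other, score in ranked if other != name]


print(most_similar("artist_a"))
```

Looking up similar artists is then a matter of reading one row of the matrix and sorting it, which is why the index can answer queries quickly once it has been built.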

Run

The preparation steps can take many hours, depending on your hardware.

1. Download a Wikipedia archive, enwiki-latest-pages-articles.xml.bz2 (~13 GB).
2. Create your database; see db_connect.py.
3. Run db-import-music-pages.py.
4. Run make_wikicorpus.py to prepare the corpus.
5. Run lsi_similarities.py to get similarities for a document.

On the first run, lsi_similarities.py creates the index and saves it to disk.
Later runs load the saved index and return results quickly.
