Google News and Leo Tolstoy: Visualizing Word2Vec Word Embeddings using t-SNE

This repository contains the source code for visualizing high-dimensional Word2Vec word embeddings using t-SNE. The visualization helps show how Word2Vec works and how to interpret the relations between vectors learned from your texts before you use them in neural networks or other machine learning algorithms. As training data, we use articles from Google News and classical literary works by Leo Tolstoy, the Russian writer regarded as one of the greatest authors of all time.
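The projection step described above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: it assumes scikit-learn and NumPy are installed, uses random 300-dimensional vectors as a stand-in for real embeddings, and the helper name `project_to_2d` and the perplexity value are hypothetical choices.

```python
import numpy as np
from sklearn.manifold import TSNE


def project_to_2d(vectors, perplexity=5.0, seed=0):
    """Reduce an (n_words, dim) array to (n_words, 2) with t-SNE.

    perplexity must be smaller than the number of rows in `vectors`.
    """
    tsne = TSNE(n_components=2, perplexity=perplexity, init="pca", random_state=seed)
    return tsne.fit_transform(vectors)


# Toy stand-in for real embeddings (assumption): two groups of random
# 300-dimensional vectors centred on different points.
rng = np.random.default_rng(0)
words = [f"a{i}" for i in range(10)] + [f"b{i}" for i in range(10)]
vectors = np.vstack([
    rng.normal(0.0, 0.1, size=(10, 300)),
    rng.normal(1.0, 0.1, size=(10, 300)),
])

coords = project_to_2d(vectors)

# Each word now has an (x, y) position that can be drawn as a scatter plot.
for word, (x, y) in zip(words[:3], coords[:3]):
    print(f"{word}: ({x:.2f}, {y:.2f})")
```

With real embeddings, `vectors` would hold the rows for the words you want to inspect, and `coords` could be passed to a plotting library to label each point with its word.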

Data

The pre-trained model, trained on part of the Google News dataset (about 100 billion words), is available at https://code.google.com/archive/p/word2vec/ and is described by Mikolov et al. [2]. The model contains 300-dimensional vectors for 3 million words and phrases.
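A model of this size is large in memory, so it can help to load only part of it. The sketch below estimates the footprint from the figures above (assuming 4-byte floats) and shows one way to load the vectors with Gensim's `KeyedVectors.load_word2vec_format`. The file name `GoogleNews-vectors-negative300.bin`, the helper names, and the `limit` value are assumptions for illustration; the load only runs if the file is present locally.

```python
import os


def estimate_size_bytes(n_vectors, dim, bytes_per_value=4):
    """Rough size of a dense embedding matrix (4 bytes per float32 value)."""
    return n_vectors * dim * bytes_per_value


def load_google_news(path, limit=None):
    """Load word2vec-format binary vectors; `limit` caps how many are read."""
    # Imported lazily so the size estimate above works without Gensim installed.
    from gensim.models import KeyedVectors

    return KeyedVectors.load_word2vec_format(path, binary=True, limit=limit)


# 3 million vectors x 300 dimensions x 4 bytes
full_size = estimate_size_bytes(3_000_000, 300)
print(f"{full_size / 1e9:.1f} GB")  # prints 3.6 GB

# Assumed local file name; adjust to wherever the model was unpacked.
MODEL_PATH = "GoogleNews-vectors-negative300.bin"
if os.path.exists(MODEL_PATH):
    # Read only the first 200,000 entries to keep memory use modest.
    model = load_google_news(MODEL_PATH, limit=200_000)
    print(model.most_similar("king", topn=3))
```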

Tolstoy's novels in Russian are available at https://www.litres.ru/lev-tolstoy.
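Training Word2Vec on the novels requires the text as a list of tokenized sentences. The following is a rough sketch under stated assumptions: the regex-based splitter is a deliberately simple stand-in for a proper Russian tokenizer, the helper names and hyperparameters are hypothetical, and `train_word2vec` assumes Gensim 4 (where the dimensionality parameter is `vector_size`).

```python
import re


def split_sentences(text):
    """Split raw Russian text into lists of lowercase Cyrillic tokens."""
    sentences = []
    for chunk in re.split(r"[.!?…]+", text):
        # Keep runs of Cyrillic letters only; hyphenated words are split in two.
        tokens = re.findall(r"[а-яё]+", chunk.lower())
        if tokens:
            sentences.append(tokens)
    return sentences


def train_word2vec(sentences, dim=200):
    """Train a Word2Vec model on tokenized sentences (assumes Gensim 4)."""
    # Imported lazily so the tokenizer can be used without Gensim installed.
    from gensim.models import Word2Vec

    return Word2Vec(sentences=sentences, vector_size=dim, window=5,
                    min_count=5, workers=4)


sample = ("Все счастливые семьи похожи друг на друга. "
          "Каждая несчастливая семья несчастлива по-своему.")
sentences = split_sentences(sample)
print(sentences[0])
```

On a full novel, `train_word2vec(split_sentences(text))` would return a model whose `wv` attribute holds the learned vectors; the two-sentence sample here is far too small to train on with `min_count=5`.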

References

[1] L. van der Maaten and G. Hinton, "Visualizing Data using t-SNE", Journal of Machine Learning Research, vol. 9, pp. 2579-2605, 2008.
[2] T. Mikolov, I. Sutskever, K. Chen, G. Corrado and J. Dean, "Distributed Representations of Words and Phrases and their Compositionality", Advances in Neural Information Processing Systems, pp. 3111-3119, 2013.
[3] R. Řehůřek and P. Sojka, "Software Framework for Topic Modelling with Large Corpora", Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, 2010.

Documentation and How to report bugs

Gensim documentation: https://radimrehurek.com/gensim/.
Scikit-learn documentation: http://scikit-learn.org/stable/documentation.html.
If you find any issues, please open a bug here on GitHub.

License

See LICENSE.
