You can use the word2vec tool released by Google to train models (word vectors) for use in any word2vec-based application.
In this project, we provide code for each specific data source, such as the Wikipedia dump or umbc_corpus. We do not include the data itself, but we list below where you can get it.
- datasets: contains a folder for each dataset and the code to train word2vec on it.
- word2vec-app: the C code from Google.
- wikiextractor-app: a Python script that converts Wikipedia XML dumps to plain text files.
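
Once the C code in word2vec-app has been compiled, training boils down to pointing the binary at a plain-text, tokenized corpus. Below is a minimal sketch of such a call from Python; the binary path, corpus file name, and hyperparameter values are illustrative assumptions, while the flags themselves are the standard ones from Google's word2vec demo scripts.

```python
# Minimal sketch: calling the compiled word2vec binary from Python.
# Assumes the C code in word2vec-app/ has been built and that a plain-text,
# whitespace-tokenized training file exists. Paths and hyperparameters below
# are illustrative, not fixed by this repository.
import subprocess

subprocess.run(
    [
        "./word2vec-app/word2vec",    # assumed location of the compiled binary
        "-train", "data/corpus.txt",  # one sentence per line, tokenized text
        "-output", "vectors.bin",     # where the trained word vectors are written
        "-size", "200",               # embedding dimensionality
        "-window", "5",               # context window size
        "-negative", "5",             # number of negative samples
        "-cbow", "1",                 # 1 = CBOW, 0 = skip-gram
        "-min-count", "5",            # drop words seen fewer than 5 times
        "-threads", "8",
        "-binary", "1",               # write vectors in binary format
    ],
    check=True,
)
```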
The quality of the word vectors increases significantly with the amount of training data. For research purposes, consider using datasets that are available online:
- enwik9: First billion characters from Wikipedia, here.
- Wikipedia: Latest Wikipedia dump; should contain more than 3 billion words, here.
- 1-billion-word: Dataset from the "One Billion Word Language Modeling Benchmark"; almost 1B words of already pre-processed text, here.
- umbc: UMBC WebBase corpus; around 3 billion words, more info here. Needs further processing (mainly tokenization; see the sketch after this list), here.
- English corpus list
- European Language Newspaper Text
- The New York Times Annotated Corpus
- British National Corpus, XML edition
- merge multiple corpora, then train word2vec (see the sketch after this list)
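
Several of the corpora above (UMBC in particular) need tokenization before training, and you may want to merge several of them into a single training file. The sketch below shows one simple way to do both in plain Python; the input file names and the regex tokenizer are assumptions for illustration, not part of this repository, and a proper tokenizer may be preferable for real use.

```python
# Minimal sketch: lowercase and tokenize several raw corpora, then
# concatenate them into one training file suitable for word2vec.
# File names are placeholders; swap in the datasets you actually downloaded.
import re

SOURCES = ["data/umbc_raw.txt", "data/nyt_raw.txt"]  # hypothetical input files
MERGED = "data/merged_corpus.txt"                    # single file fed to word2vec

token_re = re.compile(r"[a-z']+")                    # very simple word pattern

with open(MERGED, "w", encoding="utf-8") as out:
    for path in SOURCES:
        with open(path, encoding="utf-8") as src:
            for line in src:
                tokens = token_re.findall(line.lower())
                if tokens:
                    out.write(" ".join(tokens) + "\n")
```

The merged file can then be passed to the word2vec binary via the -train flag, as in the training sketch above.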
Having trouble with this project? Please contact me.