Welcome to Word2Vec Model Generation

This project uses the word2vec tool proposed by Google to train models (word vectors) that can then be used in any word2vec application.
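
As an example of such an application, here is a minimal sketch of loading a trained vectors file for querying. It assumes the third-party gensim library (not part of this repository) and a hypothetical output file `vectors.bin` written with binary output enabled:

```python
# A minimal sketch of consuming a trained model in a downstream
# application, assuming the third-party gensim library (not part of
# this repository) and a hypothetical vectors file "vectors.bin"
# written by the word2vec tool with -binary 1.
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

# Nearest neighbours of a word in the embedding space.
print(kv.most_similar("computer", topn=5))

# Cosine similarity between two words.
print(kv.similarity("king", "queen"))
```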

In this project, we provide code for each specific data source, such as the Wikipedia dump or the umbc_corpus. We do not include the data itself, but we list below where you can get it.

Project structure

  • datasets: contains a folder for each dataset and the code to train word2vec on it.
  • word2vec-app: the original C code from Google; one way to invoke it is sketched after this list.
  • wikiextractor-app: a Python tool that transforms Wikipedia XML dump files into plain text files.
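
For orientation, here is a minimal sketch of driving the compiled training binary from Python. The flags are the ones used in Google's demo scripts; the binary path and file names are hypothetical and depend on where you build the tool and place your data:

```python
# A minimal sketch of driving the compiled word2vec binary from Python.
# The flags are the ones used in Google's demo scripts; the binary path
# and file names are hypothetical placeholders.
import subprocess

subprocess.run(
    [
        "./word2vec-app/word2vec",   # path to the compiled binary (assumed)
        "-train", "corpus.txt",      # one pre-tokenized sentence per line
        "-output", "vectors.bin",    # where the trained vectors are written
        "-size", "300",              # embedding dimensionality
        "-window", "5",              # context window size
        "-negative", "5",            # number of negative samples
        "-cbow", "1",                # 1 = CBOW, 0 = skip-gram
        "-min-count", "5",           # drop words rarer than this
        "-threads", "8",
        "-binary", "1",              # binary output format
    ],
    check=True,
)
```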

Where to obtain the training data

The quality of the word vectors increases significantly with the amount of training data. For research purposes, you can consider using data sets that are available online:

  • enwik9: the first billion characters from Wikipedia, here.
  • Wikipedia: the latest Wikipedia dump; should be more than 3 billion words, here.
  • 1-billion-word: dataset from the "One Billion Word Language Modeling Benchmark"; almost 1B words of already pre-processed text, here.
  • umbc: the UMBC WebBase corpus; around 3 billion words, more info here. Needs further processing (mainly tokenization; see the sketch after this list), here.
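
A rough sketch of the kind of processing meant for the UMBC corpus: lowercase each line and keep only alphanumeric tokens. The file names are hypothetical, and a real pipeline may prefer a proper tokenizer over this simple regex:

```python
# A rough sketch of further processing: lowercase each line and keep
# only alphanumeric tokens. File names are hypothetical, and a real
# pipeline may prefer a proper tokenizer over this simple regex.
import re

TOKEN_RE = re.compile(r"[a-z0-9]+")

with open("umbc_raw.txt", encoding="utf-8", errors="ignore") as src, \
     open("umbc_tokenized.txt", "w", encoding="utf-8") as dst:
    for line in src:
        tokens = TOKEN_RE.findall(line.lower())
        if tokens:
            dst.write(" ".join(tokens) + "\n")
```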

Future textual corpora

Future objectives

  • merge multiple corpora, then train word2vec on the combined text (see the sketch below)
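
A minimal sketch of that objective: stream several pre-processed text files into one training file so word2vec sees them as a single corpus. The file names are hypothetical placeholders:

```python
# A minimal sketch of merging corpora: stream several pre-processed
# text files into one training file so word2vec sees a single corpus.
# The file names are hypothetical placeholders.
import shutil

corpora = ["enwik9.txt", "wikipedia.txt", "umbc_tokenized.txt"]

with open("merged_corpus.txt", "wb") as merged:
    for path in corpora:
        with open(path, "rb") as part:
            shutil.copyfileobj(part, merged)  # copy in chunks, not all in RAM
```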

Support or Contact

Having trouble with this project? Please contact me.
