You can use the word2vec tool released by Google to train models (word vectors) for use in any word2vec-based application.
In this project, we provide code for each specific data source, such as the Wikipedia dump or umbc_corpus. We do not include the data itself, but we list below where you can get it.
- datasets: contains a folder for each dataset and the code to train word2vec on it.
- word2vec-app: the C code from Google.
- wikiextractor-app: a Python script that converts Wikipedia XML dumps to plain text files.
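
Once the C code in word2vec-app has been compiled, training boils down to pointing the binary at a plain-text, tokenized corpus. Below is a minimal sketch of such a call from Python; the binary path, corpus file name, and hyperparameter values are illustrative assumptions, while the flags themselves are the standard ones from Google's word2vec demo scripts.

```python
# Minimal sketch: calling the compiled word2vec binary from Python.
# Assumes the C code in word2vec-app/ has been built and that a plain-text,
# whitespace-tokenized training file exists. Paths and hyperparameters below
# are illustrative, not fixed by this repository.
import subprocess

subprocess.run(
    [
        "./word2vec-app/word2vec",    # assumed location of the compiled binary
        "-train", "data/corpus.txt",  # one sentence per line, tokenized text
        "-output", "vectors.bin",     # where the trained word vectors are written
        "-size", "200",               # embedding dimensionality
        "-window", "5",               # context window size
        "-negative", "5",             # number of negative samples
        "-cbow", "1",                 # 1 = CBOW, 0 = skip-gram
        "-min-count", "5",            # drop words seen fewer than 5 times
        "-threads", "8",
        "-binary", "1",               # write vectors in binary format
    ],
    check=True,
)
```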
The quality of the word vectors increases significantly with the amount of training data. For research purposes, consider using datasets that are available online:
- enwik9: First billion characters from Wikipedia, here.
- Wikipedia: Latest Wikipedia dump; should contain more than 3 billion words, here.
- 1-billion-word: Dataset from the "One Billion Word Language Modeling Benchmark"; almost 1B words of already pre-processed text, here.
- umbc: UMBC WebBase corpus; around 3 billion words, more info here. Needs further processing (mainly tokenization; see the sketch after this list), here.
- English corpus list
- European Language Newspaper Text
- The New York Times Annotated Corpus
- British National Corpus, XML edition
- merge multiple corpora, then train word2vec (see the sketch after this list)
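
Several of the corpora above (UMBC in particular) need tokenization before training, and you may want to merge several of them into a single training file. The sketch below shows one simple way to do both in plain Python; the input file names and the regex tokenizer are assumptions for illustration, not part of this repository, and a proper tokenizer may be preferable for real use.

```python
# Minimal sketch: lowercase and tokenize several raw corpora, then
# concatenate them into one training file suitable for word2vec.
# File names are placeholders; swap in the datasets you actually downloaded.
import re

SOURCES = ["data/umbc_raw.txt", "data/nyt_raw.txt"]  # hypothetical input files
MERGED = "data/merged_corpus.txt"                    # single file fed to word2vec

token_re = re.compile(r"[a-z']+")                    # very simple word pattern

with open(MERGED, "w", encoding="utf-8") as out:
    for path in SOURCES:
        with open(path, encoding="utf-8") as src:
            for line in src:
                tokens = token_re.findall(line.lower())
                if tokens:
                    out.write(" ".join(tokens) + "\n")
```

The merged file can then be passed to the word2vec binary via the -train flag, as in the training sketch above.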
Having trouble with this project? Please contact me.