## Files

| File | Description |
| --- | --- |
| `process_wiki.py` | Convert the Wikipedia XML dump to plain text |
| `train_word2vec_model.py` | Train the word2vec model on the pt-br Wikipedia text |
| `WikipediaWord2Vec.ipynb` | Sample notebook |
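
The processing script itself is not reproduced here, but the standard way to do this conversion is with gensim's `WikiCorpus`, which streams articles straight out of the compressed dump. A minimal sketch of what `process_wiki.py` might contain, assuming that approach (the actual implementation may differ):

```python
import sys

from gensim.corpora import WikiCorpus

# Usage: python process_wiki.py <dump.xml.bz2> <output.txt>
inp, outp = sys.argv[1], sys.argv[2]

# dictionary={} skips building a vocabulary; we only need the raw text
wiki = WikiCorpus(inp, dictionary={})

with open(outp, 'w', encoding='utf-8') as out:
    # get_texts() yields each article as a list of lowercase tokens
    for i, tokens in enumerate(wiki.get_texts(), start=1):
        out.write(' '.join(tokens) + '\n')
        if i % 10000 == 0:
            print(f'Saved {i} articles')
```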
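The training script likewise follows a common gensim recipe: stream the one-article-per-line text file with `LineSentence` and fit a `Word2Vec` model, so the full corpus never has to fit in memory. A sketch under that assumption (parameter names use the gensim 4 API, e.g. `vector_size` rather than the older `size`; the hyperparameter values here are illustrative, not the repository's):

```python
import multiprocessing
import sys

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# Usage: python train_word2vec_model.py <wiki.text> <model.out>
inp, outp = sys.argv[1], sys.argv[2]

model = Word2Vec(
    LineSentence(inp),                    # stream the corpus line by line
    vector_size=300,                      # dimensionality of the word vectors
    window=5,                             # context window size
    min_count=5,                          # ignore words rarer than this
    workers=multiprocessing.cpu_count(),  # parallelize training
)
model.save(outp)
```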

## Instructions

1. Build the Docker image:

   ```sh
   docker-compose build
   ```

2. Download the Wikipedia pt-br dump:

   ```sh
   curl https://dumps.wikimedia.org/ptwiki/latest/ptwiki-latest-pages-articles.xml.bz2 --create-dirs -o data/ptwiki-latest-pages-articles.xml.bz2
   ```

3. Process the Wikipedia dump:

   ```sh
   docker-compose run jupyter python src/process_wiki.py data/ptwiki-latest-pages-articles.xml.bz2 data/wiki.pt-br.text
   ```

4. Train the model:

   ```sh
   docker-compose run jupyter python src/train_word2vec_model.py data/wiki.pt-br.text data/wiki.pt-br.word2vec.model
   ```

5. Run the notebook:

   ```sh
   docker-compose up -d
   ```

Then access the notebook at http://localhost:8888.
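
Once trained, the model can be loaded and queried from the notebook. A small illustrative example (the query word `brasil` is an assumption, not taken from the sample notebook; note that `WikiCorpus` lowercases tokens, so queries should be lowercase):

```python
from gensim.models import Word2Vec

# Load the model produced by the training step
model = Word2Vec.load('data/wiki.pt-br.word2vec.model')

# Find the words whose vectors are closest to the query word
print(model.wv.most_similar('brasil', topn=5))
```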

## Reference

- [Training Word2Vec Model on English Wikipedia by Gensim](http://textminingonline.com/training-word2vec-model-on-english-wikipedia-by-gensim)