Files | Description |
---|---|
process_wiki.py | Convert the Wikipedia XML dump to plain text |
train_word2vec_model.py | Train a word2vec model on the pt-br Wikipedia text |
WikipediaWord2Vec.ipynb | Sample notebook |
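
The scripts themselves are not reproduced in this README. As a rough sketch only, assuming they follow the gensim approach from the tutorial linked at the end of this README and that gensim 4.x is installed, the two stages might look like the code below. The function names (`write_articles`, `process_wiki`, `train_word2vec`) and the hyperparameter values are illustrative assumptions, not taken from the repository.

```python
import multiprocessing


def write_articles(articles, text_path):
    """Write one article per line, tokens separated by single spaces.

    `articles` is any iterable of token lists. Returns the number of lines written.
    """
    count = 0
    with open(text_path, "w", encoding="utf-8") as out:
        for tokens in articles:
            out.write(" ".join(tokens) + "\n")
            count += 1
    return count


def process_wiki(dump_path, text_path):
    """Extract plain text from a .xml.bz2 Wikipedia dump (hypothetical helper)."""
    # Imported here so the module loads even when gensim is not installed.
    from gensim.corpora import WikiCorpus

    # dictionary={} skips the vocabulary scan, which is not needed for text extraction.
    wiki = WikiCorpus(dump_path, dictionary={})
    return write_articles(wiki.get_texts(), text_path)


def train_word2vec(text_path, model_path, vector_size=300):
    """Train and save a word2vec model (hyperparameters are assumed values)."""
    from gensim.models import Word2Vec
    from gensim.models.word2vec import LineSentence

    model = Word2Vec(
        LineSentence(text_path),  # streams the file, one sentence per line
        vector_size=vector_size,  # gensim 4.x name; gensim 3.x called this `size`
        window=5,
        min_count=5,
        workers=multiprocessing.cpu_count(),
    )
    model.save(model_path)
    return model
```

Writing one article per line keeps the intermediate file compatible with `LineSentence`, so training can stream the corpus from disk instead of loading it into memory.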
- Build Docker image
docker-compose build
- Download Wikipedia pt-br dump
curl https://dumps.wikimedia.org/ptwiki/latest/ptwiki-latest-pages-articles.xml.bz2 --create-dirs -o data/ptwiki-latest-pages-articles.xml.bz2
- Process Wikipedia dump
docker-compose run jupyter python src/process_wiki.py data/ptwiki-latest-pages-articles.xml.bz2 data/wiki.pt-br.text
- Train Model
docker-compose run jupyter python src/train_word2vec_model.py data/wiki.pt-br.text data/wiki.pt-br.word2vec.model
- Run notebook
docker-compose up -d
Access the notebook at http://localhost:8888
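
Once the model is trained, a notebook cell along these lines could be used to query it. This is a minimal sketch assuming gensim 4.x and the model path from the training step; the helper names `load_model` and `nearest_words` and the query word are hypothetical, and the contents of the sample notebook may differ.

```python
def load_model(model_path="data/wiki.pt-br.word2vec.model"):
    """Load the model saved by the training step."""
    # Imported here so the helper below can be used without gensim installed.
    from gensim.models import Word2Vec

    return Word2Vec.load(model_path)


def nearest_words(model, word, topn=5):
    """Return up to `topn` (word, similarity) pairs, or [] if the word is unknown."""
    # key_to_index is the gensim 4.x vocabulary mapping on model.wv.
    if word not in model.wv.key_to_index:
        return []
    return model.wv.most_similar(word, topn=topn)


if __name__ == "__main__":
    model = load_model()
    # "brasil" is only an example query; the extracted corpus is assumed to be lowercase.
    for neighbor, similarity in nearest_words(model, "brasil"):
        print(f"{neighbor}\t{similarity:.3f}")
```

Checking the vocabulary first avoids the `KeyError` that `most_similar` raises for words filtered out by `min_count` during training.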
Reference: http://textminingonline.com/training-word2vec-model-on-english-wikipedia-by-gensim