Word2Vec Wikipedia Corpus 2017 no vocabulary #2737
Comments
Thanks for posting the log. That […] Can you try:
from itertools import islice

# Print the trained model's vocabulary (gensim 3.x exposes it as the `wv.vocab` dict).
print(list(model_with_wikipedia.wv.vocab.keys()))
# Print the first two documents. For word2vec, each document should be a list of
# words (strings).
print(list(islice(corpus, 2)))
I'm not sure that […] What exactly does the string '[…] I'm not sure it's documented anywhere, and it's inherently hard to analyze given the sketchy behavior of the downloading API: it potentially downloads source code that is not in the main project and executes it, so the return value could be literally anything. (For example, that […]
@gojomo Your aversion to […] The documentation is clear though: https://github.com/RaRe-Technologies/gensim-data/releases/tag/wiki-english-20171001
Once @JackStillwell tries my two commands above, the issue will become apparent :) Namely, that the iteration yields dicts, and the model "vocabulary" is the three dict keys like […]
@JackStillwell word2vec needs each document as a list of words, not a dict. So you want an extra step that takes the titles and texts from each Wikipedia article and presents them as a list of strings to word2vec. Check out Generators, iterators, iterables if unsure how streaming works in Python.
I guess it even makes sense to send each article section as a separate document to word2vec (so each Wikipedia article becomes several documents). But try it and see what works better for you.
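The extra step described above could look roughly like the sketch below. It assumes each article is a dict with `section_titles` and `section_texts` keys (as described on the linked gensim-data release page) and that the corpus object can be iterated more than once. The names `SectionSentences` and `tokenize` are hypothetical, and the regex tokenizer is only a stand-in for whatever preprocessing you prefer.

```python
import re

# Hypothetical, deliberately crude tokenizer: lowercase alphabetic runs only.
TOKEN_RE = re.compile(r"[a-z]+")


def tokenize(text):
    """Split text into lowercase alphabetic tokens."""
    return TOKEN_RE.findall(text.lower())


class SectionSentences:
    """Restartable iterable that yields one list of tokens per article section.

    Assumes every article is a dict with parallel 'section_titles' and
    'section_texts' lists. A class with __iter__ is used instead of a bare
    generator because word2vec training passes over the corpus several times.
    """

    def __init__(self, articles):
        self.articles = articles

    def __iter__(self):
        for article in self.articles:
            sections = zip(article["section_titles"], article["section_texts"])
            for heading, text in sections:
                tokens = tokenize(heading) + tokenize(text)
                if tokens:  # skip sections that contain no usable tokens
                    yield tokens
```

An instance such as `SectionSentences(corpus)` can then be passed as the `sentences` argument of `Word2Vec` in place of the raw corpus, so that the vocabulary is built from words rather than from dict keys.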
That page is helpful about the data contents of 'wiki-english-20171001', but I haven't noticed that page linked from docs/examples that encourage use of […]
And, the page (like gensim docs of […]
That […]
Who are this source code's authors, and what is its history? (Impossible to tell for such 'assets', using a GitHub feature that's not designed for source code.) If it were buggy, how would someone contribute a fix? (I see no way to open a reviewable PR against it.) If it were maliciously changed, who would even notice? (AFAICT, changes to the 'assets' area of an existing GitHub project release can happen anytime, and generate no public logs/notifications. Perhaps it's visible to maintainers?)
Problem description
When following the example code here I receive a "Word not in vocabulary" error. Opening at the request of Radim: https://groups.google.com/forum/#!topic/gensim/ULW_OKrPtqE
Steps/code/corpus to reproduce
Logging output (truncated to non-repeat / progress):
Versions
Linux-4.15.0-74-generic-x86_64-with-debian-buster-sid
Python 3.7.4 (default, Sep 5 2019, 19:15:53)
[GCC 7.4.0]
NumPy 1.18.1
SciPy 1.4.1
gensim 3.8.1
FAST_VERSION 1