Recipes & FAQ
This page collects code snippets and recipes for common Gensim-related questions. Technical questions only; no open-ended questions or discussions here.
- LSI, LsiModel, Latent Semantic Indexing:
- General I/O:
- Word2vec, Doc2vec, Fasttext
- Q7: I have many text files under a directory, each file is a single document. How do I create a word2vec model from that?
- Q10: Loading a word2vec model fails with UnicodeDecodeError: 'utf-8' codec can't decode bytes in position …
- Q11: I've trained my Word2Vec/Doc2Vec/etc model repeatedly using the exact same text corpus, but the vectors are different each time. Is there a bug or have I made a mistake?
- Q12: I've used Doc2Vec infer_vector() on a single text, but the resulting vector is different each time. Is there a bug or have I made a mistake?
- Q13: How do I export a trained word2vec model to Keras?
Q1: How many times does a feature with id 123 appear in a corpus?
Answer: total_sum = sum(dict(doc).get(123, 0) for doc in corpus)
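For example, with a tiny in-memory bag-of-words corpus (a minimal sketch; the documents and the token id 123 are placeholders):

corpus = [
    [(0, 1.0), (123, 2.0)],    # document 0: token 123 appears twice
    [(5, 1.0)],                # document 1: token 123 is absent
    [(123, 3.0), (200, 1.0)],  # document 2: token 123 appears three times
]
total_sum = sum(dict(doc).get(123, 0) for doc in corpus)  # dict(doc) maps token id -> weight
print(total_sum)  # 6.0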
Q2: How do I calculate the length of a vector?
Answer: (note that "vector length" only makes sense for non-zero vectors):
- If the input vector vec is in gensim sparse format (a list of 2-tuples): length = math.sqrt(sum(val**2 for _, val in vec)), or use length = gensim.matutils.veclen(vec).
- If the input vector is a numpy array: length = gensim.matutils.blas_nrm2(vec)
- If the input vector is in a scipy.sparse format: length = numpy.sqrt(numpy.sum(vec.tocsr().data**2))
Also note that if you want the length just to normalize a vector to unit length, you might as well call gensim.matutils.unitvec(vec), which accepts any of these three formats as input.
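For example, on a toy vector in each of the three formats (a minimal sketch; the example values are arbitrary):

import math
import numpy
from scipy import sparse
from gensim import matutils

sparse_vec = [(0, 3.0), (2, 4.0)]  # gensim sparse format: list of (id, weight) 2-tuples
print(math.sqrt(sum(val**2 for _, val in sparse_vec)))  # 5.0
print(matutils.veclen(sparse_vec))                      # 5.0, same thing

dense_vec = numpy.array([3.0, 0.0, 4.0])
print(matutils.blas_nrm2(dense_vec))                    # 5.0

scipy_vec = sparse.csr_matrix(dense_vec)
print(numpy.sqrt(numpy.sum(scipy_vec.tocsr().data**2)))  # 5.0

print(matutils.unitvec(sparse_vec))  # [(0, 0.6), (2, 0.8)] -- normalized to unit length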
Q3: How do I get the document-topic matrix V out of an LSI model?
Answer: Given a model lsi = LsiModel(X, ...), with the truncated singular value decomposition of your corpus X being X = U*S*V^T, doing lsi[X] computes U^-1*X, which equals V*S (basic linear algebra). So if you want V, divide lsi[X] by S:
V = gensim.matutils.corpus2dense(lsi[X], len(lsi.projection.s)).T / lsi.projection.s
to get V as a 2d numpy array.
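For example, on a toy corpus, just to make the shapes concrete (a minimal sketch; the texts and the num_topics value are placeholders):

from gensim import corpora, models, matutils

texts = [["human", "computer", "interaction"], ["graph", "trees"], ["graph", "minors", "trees"]]
dictionary = corpora.Dictionary(texts)
X = [dictionary.doc2bow(text) for text in texts]  # the corpus, as bag-of-words vectors

lsi = models.LsiModel(X, id2word=dictionary, num_topics=2)
V = matutils.corpus2dense(lsi[X], len(lsi.projection.s)).T / lsi.projection.s
print(V.shape)  # (num_docs, num_topics) = (3, 2)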
Q4: Where are the U, S, V matrices of the LSI decomposition stored?
Answer: After creating the LSI model lsi = models.LsiModel(corpus, ...), the U and S matrices are in lsi.projection.u and lsi.projection.s. The V (or V^T) matrix is not stored explicitly, because it may not fit in memory (its shape is num_docs * num_topics). If you need V, you can compute it with an extra pass over corpus, using gensim's streaming lsi[corpus] API (see Q3 above).
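For instance, reusing the lsi model and bag-of-words corpus X from the Q3 sketch above:

print(lsi.projection.u.shape)  # U: num_terms x num_topics
print(lsi.projection.s.shape)  # S: (num_topics,) array of singular values
# V is not stored; stream the corpus through the model to recover it, as in Q3:
V = matutils.corpus2dense(lsi[X], len(lsi.projection.s)).T / lsi.projection.s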
Q5: I am getting out of memory errors with LSI. How much memory do I need?
Answer: The final model is stored as a matrix of num_terms x num_topics numbers. With 8 bytes per number (double precision), that's 8 * num_terms * num_topics bytes, i.e. for 100k terms in the dictionary and 500 topics, the model will be 8 * 100,000 * 500 = 400MB.
That's just the output -- during the actual computation of this model, temporary copies are needed, so in practice, you'll need about 3x that amount. For the 100k dictionary and 500 topics example, you'll actually need ~1.2GB to create the LSI model.
When out of memory, you'll have to either reduce the dictionary size or the number of topics (or add RAM!). The memory footprint is not affected by the number of training documents, though.
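As a quick back-of-the-envelope check (the 3x factor is the rough rule of thumb mentioned above):

num_terms, num_topics = 100000, 500
model_bytes = 8 * num_terms * num_topics  # 400,000,000 bytes, i.e. ~400MB for the final model
peak_bytes = 3 * model_bytes              # ~1.2GB needed while the model is being computed
print(model_bytes / 1e6, "MB final;", peak_bytes / 1e9, "GB peak")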
Q6: I have many text files under a directory, each file is a single document. How do I create a corpus from that?
Answer: See http://radimrehurek.com/gensim/tut1.html#corpus-streaming-one-document-at-a-time . If you're having trouble going through the files, have a look at the following snippet (it accepts all .txt files, even in nested subdirectories):
import os
import gensim

def iter_documents(top_directory):
    """Iterate over all documents, yielding a document (=list of utf8 tokens) at a time."""
    for root, dirs, files in os.walk(top_directory):
        for file in filter(lambda file: file.endswith('.txt'), files):
            document = open(os.path.join(root, file)).read()  # read the entire document, as one big string
            yield gensim.utils.tokenize(document, lower=True)  # or whatever tokenization suits you

class MyCorpus(object):
    def __init__(self, top_dir):
        self.top_dir = top_dir
        self.dictionary = gensim.corpora.Dictionary(iter_documents(top_dir))
        self.dictionary.filter_extremes(no_below=1, keep_n=30000)  # check API docs for pruning params

    def __iter__(self):
        for tokens in iter_documents(self.top_dir):
            yield self.dictionary.doc2bow(tokens)

corpus = MyCorpus('/tmp/test')  # create a dictionary
for vector in corpus:  # convert each document to a bag-of-words vector
    print(vector)
...
Q7: I have many text files under a directory, each file is a single document. How do I create a word2vec model from that?
Answer: (by Christian Ledermann)
This code makes the simplifying assumption that sentence-ending punctuation should be excluded from the text and that . and : always end a sentence. The text-sentence text tokenizer and sentence splitter may be a better alternative.
import os
import re
import gensim

class DirOfPlainTextCorpus(object):
    """Iterate over sentences of all plaintext files in a directory."""
    SPLIT_SENTENCES = re.compile(r"[.!?:]\s+")  # split sentences on these characters

    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for fn in os.listdir(self.dirname):
            text = open(os.path.join(self.dirname, fn)).read()
            for sentence in self.SPLIT_SENTENCES.split(text):
                yield gensim.utils.simple_preprocess(sentence, deacc=True)

model = gensim.models.Word2Vec(DirOfPlainTextCorpus('/path/to/dir'), vector_size=200, min_count=5, workers=2)
Q8: How can I filter a saved corpus and its corresponding dictionary?
Answer: (by Yaser Martinez)
The function dictionary.filter_extremes changes the original IDs, so we need to reread and (optionally) rewrite the old corpus using a transformation:
import copy
from gensim import corpora
from gensim.models import VocabTransform

# filter the dictionary
old_dict = corpora.Dictionary.load('old.dict')
new_dict = copy.deepcopy(old_dict)
new_dict.filter_extremes(keep_n=100000)
new_dict.save('filtered.dict')

# now transform the corpus
corpus = corpora.MmCorpus('corpus.mm')
old2new = {old_dict.token2id[token]: new_id for new_id, token in new_dict.items()}
vt = VocabTransform(old2new)
corpora.MmCorpus.serialize('filtered_corpus.mm', vt[corpus], id2word=new_dict)
Q9: How do I load a model saved under Python 2 in Python 3?
Answer: This was resolved in the 0.13.4 release. If you're using Gensim 0.13.4 or later, such loading should work out of the box. If you are using an earlier version, read the solution below.
(by Matti Lyra)
Python pickling is not backward compatible. There are two things standing in your way:
- Python dictionary pickling (LdaModel.id2word)
- Small NumPy arrays
LdaModel.save already saves large NumPy arrays (> 10MB) to separate files using NumPy's own IO functionality. Smaller arrays however - resulting either from small training corpora or from using alpha='auto' - are pickled along with the rest of the LdaModel object.
Trying to load a model in Python 3 will result in an error like the following
UnicodeDecodeError: 'ascii' codec can't decode byte 0xba in position 6: ordinal not in range(128)
To get around this you need to remove the id2word dictionary from the LdaModel before saving, and ensure that all the NumPy arrays, regardless of size, are saved to separate files.
# Python 2
import json
from gensim.models.ldamodel import LdaModel
id2word = {k:v for k, v in lda.id2word.items()}
lda.id2word = None
# save the expElogbeta and state.sstats separately using numpy not pickle,
# if you're using alpha=auto or you've set alpha or eta to some array yourself you
# should add 'alpha', and 'eta' to the 'separately' list
lda.save('~/Desktop/temp/migrate.2to3.gensim', separately=['expElogbeta', 'sstats'])
lda.id2word = id2word # restore the dictionary
with open('~/Desktop/temp/migrate.2to3.id2word.json', 'wb') as out:
json.dump(id2word, out)
You can then load the model normally in Python 3, but have to remember to also load the dictionary:
# Python 3
import json
from gensim.models.ldamodel import LdaModel
with open('~/Desktop/temp/migrate.2to3.id2word.json') as fh:
id2word = json.load(fh)
id2word = {int(k):v for k, v in id2word.items()}
# load the model and replace the separately stored id2word dictionary
lda = LdaModel.load('~/Desktop/temp/migrate.2to3.gensim')
lda.id2word = id2word
Q10: Loading a word2vec model fails with UnicodeDecodeError: 'utf-8' codec can't decode bytes in position ...
Answer: The strings (words) stored in your model are not valid utf8. By default, gensim decodes the words using the strict encoding settings, which results in the above exception whenever an invalid utf8 sequence is encountered.
The fix is on your side and it is to either:
a) Store your model using a program that understands unicode and utf8 (such as gensim). Some C and Java word2vec tools are known to truncate the strings at byte boundaries, which can result in cutting a multi-byte utf8 character in half, making it non-valid utf8, leading to this error.
b) Set the unicode_errors flag when calling load_word2vec_format, e.g. load_word2vec_format(..., unicode_errors='ignore'). Note that this silences the error, but the utf8 problem is still there -- invalid utf8 characters will just be ignored in this case.
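For example, when loading vectors in the original word2vec binary format (a minimal sketch; the file name is a placeholder):

from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format('vectors.bin', binary=True, unicode_errors='ignore')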
Q11: I've trained my Word2Vec / Doc2Vec / etc model repeatedly using the exact same text corpus, but the vectors are different each time. Is there a bug or have I made a mistake? (*2vec training non-determinism)
Answer: The *2vec models (word2vec, fasttext, doc2vec…) begin with random initialization, then most modes use additional randomization during training. (For example, the training windows are randomly truncated as an efficient way of weighting nearer words higher. The negative examples in the default negative-sampling mode are chosen randomly. And the downsampling of highly-frequent words, as controlled by the sample parameter, is driven by random choices. These behaviors were all defined in the original Word2Vec paper's algorithm description.)
Even when all this randomness comes from a pseudorandom-number-generator that's been seeded to give a reproducible stream of random numbers (which gensim does by default), the usual case of multi-threaded training can further change the exact training-order of text examples, and thus the final model state. (Further, in Python 3.x, the hashing of strings is randomized each re-launch of the Python interpreter - changing the iteration ordering of vocabulary dicts from run to run, and thus making even the same string-of-random-number-draws pick different words in different launches.)
So, it is to be expected that models vary from run to run, even trained on the same data. There's no single "right place" for any word-vector or doc-vector to wind up: just positions that are at progressively more-useful distances & directions from other vectors co-trained inside the same model. (In general, only vectors that were trained together in an interleaved session of contrasting uses become comparable in their coordinates.)
Suitable training parameters should yield models that are roughly as useful, from run-to-run, as each other. Testing and evaluation processes should be tolerant of any shifts in vector positions, and of small "jitter" in the overall utility of models, that arises from the inherent algorithm randomness. (If the observed quality from run-to-run varies a lot, there may be other problems: too little data, poorly-tuned parameters, or errors/weaknesses in the evaluation method.)
You can try to force determinism by using workers=1 to limit training to a single thread – and, if in Python 3.x, using the PYTHONHASHSEED environment variable to disable its usual string hash randomization. But training will be much slower than with more threads. And, you'd be obscuring the inherent randomness/approximateness of the underlying algorithms, in a way that might make results more fragile and dependent on the luck of a particular setup. It's better to tolerate a little jitter, and use excessive jitter as an indicator of problems elsewhere in the data or model setup – rather than impose a superficial determinism.
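If you do want to try it anyway, a minimal sketch (not a recommendation; the corpus and parameter values are placeholders):

# launch the interpreter with a fixed hash seed, e.g.:
#   PYTHONHASHSEED=42 python train_word2vec.py
from gensim.models import Word2Vec

sentences = [["hello", "world"], ["machine", "learning", "is", "fun"]]  # placeholder corpus
model = Word2Vec(
    sentences,
    vector_size=100,
    min_count=1,
    workers=1,  # a single worker thread, so the order of training examples is reproducible
    seed=42,    # fixed seed for the model's own random number generator
)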
Q12: I've used Doc2Vec infer_vector() on a single text, but the resulting vector is different each time. Is there a bug or have I made a mistake? (doc2vec inference non-determinism)
Answer: Inference is just a constrained form of training, and thus the answer above also applies here: it is normal for subsequent runs on the same text to give different results. However, the results from subsequent runs should be of similar quality – and in the specific constrained case of inferring a single text's vector against an otherwise frozen Doc2Vec model, that means those resulting vectors should generally be fairly close to one another.
You may be able to achieve 'tighter' results from infer_vector() by supplying a larger-than-default epochs parameter. If this does not help, there may be other problems. Your model may be underpowered/overfit, so a text's vector isn't forced to a single place, but could fit equally well many places. The text might be too short, especially after out-of-vocabulary words are ignored. You might be supplying a string, rather than a list-of-tokens, and a plain string will just be seen as a list-of-single-letter tokens (of which very few will be known words).
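For example (a minimal sketch; the model path and the epochs value are placeholders):

from gensim.models.doc2vec import Doc2Vec
from gensim.utils import simple_preprocess

model = Doc2Vec.load('my_doc2vec.model')  # an already-trained Doc2Vec model
tokens = simple_preprocess("a new document to infer a vector for")  # a list of tokens, not a raw string
vec = model.infer_vector(tokens, epochs=100)  # more epochs than the default usually gives more stable results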
(There are potential ways to force determinism, discussed in issue #447, but as above, this is not recommended. When such variance from run-to-run is small, that's a good sign and downstream tests should tolerate it. If such variance is large, other issues with model or data quality should be fixed, rather than forcing a superficial stability.)
Q13: How do I export a trained word2vec model to Keras?
Answer:
See the wiki page at https://github.com/RaRe-Technologies/gensim/wiki/Using-Gensim-Embeddings-with-Keras-and-Tensorflow
Not seeing your question here? Try asking on the Gensim mailing list. If enough people share the same problem, not only will the Gensim developers add it to this FAQ, but they may even fix the code/documentation ;)