I have a large corpus (30M docs) and a pretrained, inference-only tomotopy model. I want to find the argmax topic for each doc in the corpus, and benchmarking (see script here) shows that list-based inference is about 2x faster than corpus-based inference. Even so, with default settings on a 40-core machine, inference is projected to take 125 days. This seems extremely slow given that training the model took 3 hours on a 10M-document corpus.
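For reference, the two call styles I compared look roughly like this (a minimal sketch; the model path and sample documents are placeholders, and I'm going from my reading of the docs for the Corpus-based path):

import tomotopy as tp

lda = tp.LDAModel.load('model.bin')  # placeholder path
docs = ["first tokenized document", "second tokenized document"]  # illustrative docs

# list-based inference: wrap each doc with make_doc and pass the list to infer
doc_objs = [lda.make_doc(d.split()) for d in docs]
topic_dists, _ = lda.infer(doc_objs)  # one topic distribution per doc

# corpus-based inference: build a tomotopy.utils.Corpus and infer over it
corpus = tp.utils.Corpus()
for d in docs:
    corpus.add_doc(words=d.split())
inferred, _ = lda.infer(corpus)
topic_dists_corpus = [doc.get_topic_dist() for doc in inferred]

In my benchmarks the list-based path is the faster one, which is why the script below uses make_doc.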
My inference script is as follows:
import numpy as np
import tomotopy as tp
from math import ceil
from functools import partial
from tqdm import tqdm

# infer topic distributions for a batch of raw docs and return the
# top-word string of each doc's argmax topic
def get_highest_lda(model, topic_words, docs):
    corpus = [model.make_doc(doc.split()) for doc in docs]
    topic_dist, _ = model.infer(corpus)
    k = np.argmax(topic_dist, axis=1)
    return [topic_words[k_] for k_ in k]

# split a list into consecutive batches of size n
def chunk(l, n):
    for i in range(0, len(l), n):
        yield l[i:i + n]

if __name__ == "__main__":
    docs = [line.strip() for line in open('corpus.txt')]
    lda = tp.LDAModel.load('model.bin')

    # get top 100 words from each topic
    N = 100
    topic_words = [" ".join(word for word, _ in lda.get_topic_words(i, top_n=N)) for i in range(lda.k)]

    # batch size for batched inference
    chunk_size = 512
    map_fn = partial(get_highest_lda, lda, topic_words)
    results = tqdm(map(map_fn, chunk(docs, chunk_size)), total=ceil(len(docs) / chunk_size))
    for chunk in results:
        for doc in chunk:
            print(doc)
Looking at a flamegraph of inference, it seems like a large portion of the inference time is spent here. I'm trying to track this down now, but it looks like there's a lot of waiting.
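To narrow down where that waiting comes from, I'm planning to sweep infer()'s workers and parallel arguments on a small slice of the corpus. This is just a rough timing sketch (the worker counts, scheme choices, and sample size are arbitrary):

import time
from itertools import islice
import tomotopy as tp

lda = tp.LDAModel.load('model.bin')

# a small slice of the corpus should be enough to compare configurations
with open('corpus.txt') as f:
    sample = [lda.make_doc(line.split()) for line in islice(f, 10000)]

# time infer() under different worker counts and parallelization schemes
for workers in (1, 8, 40):
    for scheme in (tp.ParallelScheme.COPY_MERGE, tp.ParallelScheme.PARTITION):
        start = time.perf_counter()
        lda.infer(sample, workers=workers, parallel=scheme)
        print(f"workers={workers}, scheme={scheme}: {time.perf_counter() - start:.2f}s")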