
Inference is extremely slow #182

Open
erip opened this issue Aug 1, 2022 · 4 comments
Labels: bug (Something isn't working)

Comments

erip commented Aug 1, 2022

I have a large corpus (30M docs) and a pretrained, inference-only tomotopy model. I want to find the argmax topic for each doc in the corpus, and benchmarking (see script here) shows that list-based inference is faster than corpus-based inference by a factor of ~2. With default settings on a 40-core machine, inference is projected to take about 125 days. This seems extremely slow considering that training the model took ~3 hours on a 10M-document corpus.
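For context, the benchmark was roughly the following kind of comparison (a condensed sketch, not the exact linked script; 'sample.txt' is a placeholder for a small slice of the corpus, and timings depend on model size and hardware):

import time
import tomotopy as tp

lda = tp.LDAModel.load('model.bin')
# small sample of whitespace-tokenized docs; 'sample.txt' is a stand-in path
sample = [line.strip().split() for line in open('sample.txt')]

# list-based path: wrap each doc with make_doc and pass the list to infer
start = time.time()
doc_objs = [lda.make_doc(words) for words in sample]
dists, _ = lda.infer(doc_objs)
print('list-based:', time.time() - start)

# corpus-based path: build a tomotopy Corpus and infer over it
start = time.time()
corpus = tp.utils.Corpus()
for words in sample:
    corpus.add_doc(words)
inferred_corpus, _ = lda.infer(corpus)
print('corpus-based:', time.time() - start)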

My inference script is as follows:

import numpy as np
import tomotopy as tp

from math import ceil
from functools import partial

from tqdm import tqdm

def get_highest_lda(model, topic_words, docs):
    # wrap each raw doc as a tomotopy Document, infer topic distributions,
    # and return the top-words string of the most probable topic per doc
    corpus = [model.make_doc(doc.split()) for doc in docs]
    topic_dist, _ = model.infer(corpus)
    k = np.argmax(topic_dist, axis=1)
    return [topic_words[k_] for k_ in k]

def chunk(l, n):
    for i in range(0, len(l), n):
        yield l[i:i+n]

if __name__ == "__main__":
    with open('corpus.txt') as f:
        docs = [line.strip() for line in f]
    lda = tp.LDAModel.load('model.bin')
    # get top 100 words from each topic
    N = 100
    topic_words = [" ".join(word for word, _ in lda.get_topic_words(i, top_n=N)) for i in range(lda.k)]
    # batch size for batched inference
    chunk_size = 512
    map_fn = partial(get_highest_lda, lda, topic_words)
    results = tqdm(map(map_fn, chunk(docs, chunk_size)), total=ceil(len(docs) / chunk_size))
    for batch in results:  # renamed from `chunk` to avoid shadowing the helper above
        for top_topic_words in batch:
            print(top_topic_words)
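For scale: 30M docs at the projected 125 days works out to roughly 30,000,000 / (125 × 86,400) ≈ 2.8 docs/s with this script.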
bab2min added the bug label on Aug 7, 2022
@wangyi888

Hi, do you have any solution for this bug?

@xiaohuzi1996

I'm hitting the same problem: ~100 GB of memory in use with 40 cores, and inference on texts under 5,000 characters runs at about 2 docs/s.

@narayanacharya6

Any luck on this?


erip (author) commented Sep 12, 2023

Looking at a flamegraph of inference (footnote 1), a large portion of the inference time appears to be spent here. I'm still trying to track this down, but it looks like a lot of the time goes to threads waiting.
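If the waiting is thread contention, one quick check is to time the same batch at a few explicit worker counts via infer's workers argument (a rough sketch; 'sample.txt' is again a placeholder for a small slice of the corpus):

import time
import tomotopy as tp

lda = tp.LDAModel.load('model.bin')
docs = [lda.make_doc(line.strip().split()) for line in open('sample.txt')]

# workers=0 means "use all available cores"; compare against explicit values
for workers in (1, 4, 8, 40):
    start = time.time()
    lda.infer(docs, workers=workers)
    print('workers=%d: %.2fs' % (workers, time.time() - start))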

Footnotes

  1. [flamegraph screenshot]
