[BUG]: Chat with doc (RAG) is 100x slower than chat with same base model #1418
Comments
How many words are in the documents? You would be surprised how much even a bit more context in the window impacts time to first token — and that includes the snippets injected into the prompt. Also, how are you running AnythingLLM? If this is on CPU and you are not on an M1, M2, or M3 Mac, then this is not unexpected.
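To illustrate the point above about injected snippets, here is a minimal sketch of how a RAG prompt grows relative to a plain chat prompt. The names (`build_prompt`, the snippet contents) are illustrative, not AnythingLLM's actual code; prompt (prefill) processing time scales roughly with this length.

```python
# Hypothetical sketch: why a RAG prompt is larger than a plain chat prompt.
# build_prompt and its arguments are illustrative, not AnythingLLM internals.

def build_prompt(system: str, history: list[str],
                 snippets: list[str], question: str) -> str:
    """Assemble a chat prompt; RAG injects retrieved snippets as extra context."""
    parts = [system]
    if snippets:
        parts.append("Context:\n" + "\n\n".join(snippets))
    parts.extend(history)
    parts.append("User: " + question)
    return "\n\n".join(parts)

plain = build_prompt("You are helpful.", [], [], "Summarize X.")
rag = build_prompt("You are helpful.", [],
                   ["snippet " * 500] * 4,  # four retrieved chunks
                   "Summarize X.")

# A rough word count stands in for tokens here; the RAG prompt is far
# larger, so the model spends far longer before emitting the first token.
print(len(plain.split()), len(rag.split()))
```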
I'm using the LM Studio backend, and in the AnythingLLM UI agents don't seem to stream: their responses show up as a full text blob. But in the LM Studio logs I can see that text generation has begun. This makes it appear slower than it actually is...
@frost19k We don't stream agent responses (because of tool calling), but we will be resolving that soon. The "latency" may very well just be the model generating the full response. We do, of course, stream regular responses.
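The distinction in the comments above — generation has started, but the UI only shows the finished blob — comes down to time-to-first-token versus total generation time. A minimal sketch of measuring both, using a fake generator in place of a real streamed LLM response (with a real backend you would iterate over its streamed chunks instead):

```python
# Sketch: separate time-to-first-token (ttft) from total generation time.
# fake_stream is a stand-in for a streaming LLM response, not a real API.
import time

def fake_stream():
    time.sleep(0.05)          # stands in for prompt processing (prefill)
    for tok in ["Hello", ",", " world"]:
        time.sleep(0.01)      # stands in for per-token decode time
        yield tok

def measure(stream):
    start = time.perf_counter()
    first = None
    tokens = []
    for tok in stream:
        if first is None:
            first = time.perf_counter() - start  # time to first token
        tokens.append(tok)
    total = time.perf_counter() - start
    return first, total, "".join(tokens)

ttft, total, text = measure(fake_stream())
# If a UI buffers the whole response (as agent replies do here), the user
# perceives `total` as the latency even though output began at `ttft`.
print(f"ttft={ttft:.3f}s total={total:.3f}s text={text!r}")
```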
How are you running AnythingLLM?
AnythingLLM Docker (local)
LM Studio (local)
Model: Qwen1.5 7B Chat, q8 GGUF
GPU: NVIDIA RTX4000 SFF Ada 20GB
What happened?
When I am in a workspace without documents, streaming starts after about one second.
When I am in a workspace with documents, it takes about 80 seconds to start streaming.
My documents are not complex or large: two files, one .doc and one .txt.
Are there known steps to reproduce?
No response