Reset kv cache after each query and infinite inference features #2560
This PR introduces two enhancements to the voice assistant functionality in talk-llama.cpp, aimed at improving the user experience during conversations with the assistant:
Previously, the voice assistant was limited by the context length of the KV cache. For example, with a context length of 16 tokens, once the model generated enough tokens to fill the available context space, it would exit and return a "failed to decode" message. This exit and message come from the function "llama_kv_cache_find_slot" in llama.cpp, around line 3592 at the time of writing.
Say you have the prompt 1 2 3 4 and a context length of 16. Initially the context looks like this:
|................|
Then the prompt gets fed to the model:
|1234............|
Then the model starts generating tokens:
|1234ABCD........|
until it reaches the end of available context space:
|1234ABCDEFGHIJKL|
When the cache reached its limit, the assistant would terminate the conversation and return the "failed to decode" message (from llama.cpp).
With the new -inf flag, the voice assistant can now dynamically manage the KV cache, enabling seamless, effectively infinite conversations. Even after reaching the context limit, the assistant handles the cache overflow and continues generating responses. This is done by preserving the original prompt (k_prompt_llama) and then shifting half of the remaining tokens so they sit directly after the original prompt. Once the cache is full and has been adjusted, it would look something like this:
|1234EFGH........|
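
For reference, the sketch below shows how such a shift can be expressed with the llama.cpp KV-cache API. It illustrates the idea rather than the exact PR diff; the variable names (n_past, n_keep, n_discard) are placeholders, and the cache-manipulation functions (llama_kv_cache_seq_rm, llama_kv_cache_seq_add) may be named differently depending on the llama.cpp version bundled with whisper.cpp.

```cpp
// Illustrative context-shift logic, assuming the llama.cpp KV-cache API.
//   n_past : number of tokens currently in the cache
//   n_ctx  : total context size
//   n_keep : length of the original prompt (k_prompt_llama) to preserve
if (n_past + 1 > n_ctx) {
    const int n_left    = n_past - n_keep;   // tokens that follow the preserved prompt
    const int n_discard = n_left / 2;        // drop the older half of them

    // remove the discarded tokens from the cache ...
    llama_kv_cache_seq_rm (ctx, 0, n_keep, n_keep + n_discard);
    // ... and slide the remaining tokens left so they sit right after the prompt
    llama_kv_cache_seq_add(ctx, 0, n_keep + n_discard, n_past, -n_discard);

    n_past -= n_discard;  // generation continues from the compacted cache
}
```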
For users who prefer resetting the context after each query, the -reset flag offers a practical solution. When enabled, the KV-cache clears automatically after every user question. This allows the model to process each query as an independent request, ideal for use cases where maintaining conversational history isn’t necessary.
Note:
While this feature improves memory management, it comes with a trade-off: the assistant won't be able to refer back to previous questions or answers because the cache is cleared, which is why it is best suited to use cases where maintaining conversational history isn't necessary.
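
A minimal sketch of the reset behaviour is shown below. It assumes the llama_kv_cache_clear function from the llama.cpp API (the name may vary between versions) and a hypothetical n_past counter tracking how many tokens are currently in the cache; it illustrates the idea rather than reproducing the exact PR code.

```cpp
// Illustrative per-query reset, assuming the llama.cpp KV-cache API.
static void reset_context_for_next_query(llama_context * ctx, int & n_past) {
    llama_kv_cache_clear(ctx);  // wipe the entire KV cache
    n_past = 0;                 // the next query starts from an empty context,
                                // so the base prompt must be decoded again
}
```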