Add chat summary memory buffer #13155
Conversation
Limitations:
Future improvements:
This is great. Chat memory classes are arguably lacking in LlamaIndex, and implementing various chat memory approaches will add value and flexibility for users, including for testing their applications.
SUMMARIZE_PROMPT = "The following is a conversation between the user and assistant. Write a concise summary about the contents of this conversation."
# TODO: Add option for last N user/assistant history interactions instead of token limit |
Side note, slightly off topic: does N messages make sense? What if a single user or assistant message is huge? Another consideration: for function calls, OpenAI (and others) require pairs of assistant and tool messages to be in memory.
If a single user or assistant message is huge, then indeed, this is not such a great idea. On the other hand, the user has knowledge about the prompts and the expected length of the responses (perhaps something along the lines of "Answer in a single sentence: what is LlamaIndex?") and it could be a use case to consider only the last 3 or 5 "interactions", say.
The reasons I considered it are to ensure that USER/ASSISTANT messages are always considered in pairs, and to give some more flexibility to the end user. But it's more of a nice-to-have for the future, if there is appetite for that flexibility.
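A rough sketch of what a last-N-interactions option could look like (a hypothetical helper, not part of this PR; it assumes the llama_index ChatMessage/MessageRole types):

```python
from typing import List

from llama_index.core.llms import ChatMessage, MessageRole


def last_n_interactions(messages: List[ChatMessage], n: int) -> List[ChatMessage]:
    """Keep only the last n USER turns and everything after each of them,
    so USER/ASSISTANT messages are always retained in pairs."""
    user_positions = [i for i, m in enumerate(messages) if m.role == MessageRole.USER]
    if len(user_positions) <= n:
        return list(messages)
    return list(messages[user_positions[-n]:])
```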
def from_dict(
    cls, data: Dict[str, Any], **kwargs: Any
) -> "ChatSummaryMemoryBuffer":
    raise NotImplementedError("This is not yet supported.")
from_dict() / from_string() could be implemented by just passing in the LLM when calling it?
I'm not sure if I understood you correctly, but I added a basic implementation. If this is not what you are looking for, could you make the code change?
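For context, a minimal sketch of the idea under discussion: since the LLM itself is not serialized, from_dict() can merge whatever the caller passes in at load time. The toy class below is illustrative only, not the PR's implementation.

```python
from typing import Any, Dict


class SummaryBufferSketch:
    """Toy stand-in for ChatSummaryMemoryBuffer; only shows the kwargs merge."""

    def __init__(self, llm: Any = None, token_limit: int = 3000, **_: Any) -> None:
        self.llm = llm
        self.token_limit = token_limit

    @classmethod
    def from_dict(cls, data: Dict[str, Any], **kwargs: Any) -> "SummaryBufferSketch":
        # The serialized dict contains everything except the LLM; the caller
        # supplies it when loading, e.g. from_dict(saved_state, llm=my_llm).
        return cls(**{**data, **kwargs})
```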
<= self.token_limit_full_text
):
# traverse the history in reverse order; when the token limit is about to be
# exceeded, we stop, so remaining messages are summarized
We need to add some special handling for TOOL and ASSISTANT messages -- see the get() method of the ChatMemoryBuffer:
- we can't start the chat history with an assistant message (this breaks several LLM APIs)
- we need to maintain the TOOL + ASSISTANT message pairs, so as not to break OpenAI, Mistral, etc.
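A minimal sketch of the kind of handling described above (not the actual get() logic of ChatMemoryBuffer; the function name and shape are illustrative):

```python
from typing import List

from llama_index.core.llms import ChatMessage, MessageRole


def first_safe_index(messages: List[ChatMessage], start: int) -> int:
    """Advance the cutoff so the retained window never begins with an
    ASSISTANT or TOOL message; this keeps assistant/tool pairs intact and
    avoids sending a history that starts with an assistant turn."""
    while start < len(messages) and messages[start].role in (
        MessageRole.ASSISTANT,
        MessageRole.TOOL,
    ):
        start += 1
    return start
```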
I implemented the changes you suggested. I have a few remarks. It may not make sense to summarize tool/function messages, so it would probably be helpful to think about what should happen with such messages in general (remove them completely, include them as full text, summarize them, or make it configurable for the user?). Perhaps this is sufficient for now and we can expand on it later.
Also, when a summary is generated (the LLM is not None and the messages do not fit in the buffer), the first message will always be a SYSTEM message, followed by potentially an ASSISTANT message. Does that also break the APIs you mentioned? If so, we need this handling when a summary is made, which the current implementation does. However, if SYSTEM followed by ASSISTANT does not break the APIs, the special handling could be limited to the case when a summary is not generated (when the LLM is None or the buffer is not full).
@logan-markewich I don't really understand why my tests are giving the error:
self = MockSummarizerLLM(), name = '_i', value = 0
E ValueError: "MockSummarizerLLM" object has no field "_i"
The classes MockChatLLM and MockStreamChatLLM seem to be using the exact same mechanism, and my tests are passing locally. Any ideas about what is going wrong (I'm not so versed in Pydantic)?
@shadyelgewily-slimstock running
@shadyelgewily-slimstock fixed, it was just an issue with not using the compatibility layer on pydantic.
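For readers hitting the same error, a minimal sketch of the underlying pydantic issue, using plain pydantic (the actual fix went through llama-index's compatibility layer): underscore attributes on a BaseModel must be declared as private attributes before they can be assigned. The class below is a simplified stand-in for the test's MockSummarizerLLM.

```python
from pydantic import BaseModel, PrivateAttr


class MockSummarizerLLM(BaseModel):
    # Without this PrivateAttr declaration, assigning self._i raises
    # ValueError: "MockSummarizerLLM" object has no field "_i"
    _i: int = PrivateAttr(default=0)

    def bump(self) -> int:
        self._i += 1
        return self._i
```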
Thanks for your help @logan-markewich, let me know if there's anything I can do to get this merged. I haven't bumped the core version (maybe I should).
Description
This PR introduces a new ChatSummaryMemoryBuffer to limit the chat history to a certain token length, and iteratively summarize all messages that do not fit in the memory buffer. This can be useful if you want to limit costs and latency (assuming the summarization prompt uses and generates fewer tokens than including the entire history). The original ChatMemoryBuffer gives you the option to truncate the history after a certain number of tokens, which is useful to limit costs and latency but also removes potentially relevant information from the chat history. The new ChatSummaryMemoryBuffer aims to make this a bit more flexible, so the user has more control over which chat_history is retained.
Fixes #12845
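A rough usage sketch, assuming the new class follows the from_defaults() convention of the existing ChatMemoryBuffer (exact parameter names may differ in the merged version):

```python
from llama_index.core.llms import ChatMessage, MessageRole
from llama_index.core.memory import ChatSummaryMemoryBuffer
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-3.5-turbo")

# Recent messages are kept verbatim up to token_limit; anything older is
# condensed into a summary by the provided LLM instead of being dropped.
memory = ChatSummaryMemoryBuffer.from_defaults(llm=llm, token_limit=256)

memory.put(ChatMessage(role=MessageRole.USER, content="What is LlamaIndex?"))
memory.put(ChatMessage(role=MessageRole.ASSISTANT, content="A data framework for LLM apps."))

chat_history = memory.get()  # recent turns verbatim, older turns summarized
```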
New Package?
Did I fill in the tool.llamahub section in the pyproject.toml and provide a detailed README.md for my new integration or package?
Version Bump?
Did I bump the version in the pyproject.toml file of the package I am updating? (Except for the llama-index-core package)
Type of Change
Please delete options that are not relevant.
How Has This Been Tested?
Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration.
Suggested Checklist:
make format; make lint to appease the lint gods