
feat: Add token usage to Bedrock Claude + Migrated chain for this model #564

Merged
charles-marion merged 3 commits into aws-samples:main from usage on Sep 16, 2024

Conversation

charles-marion
Collaborator

Issue #, if available:
#502 #495 #230

Description of changes:

To add usage tracking to Bedrock models, I migrated the LangChain chain for the Claude model (ConversationChain is deprecated).

Instead, it uses RunnableWithMessageHistory with ChatBedrockConverse, which relies on the Bedrock Converse API; that API is consistent across models and provides the usage in the response.
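A minimal sketch of that wiring (illustrative only, not the exact code in this PR; the model id, prompt text, and in-memory session store are assumptions, and the project uses a DynamoDB-backed history instead):

# Minimal sketch, assuming the langchain-aws and langchain-core packages.
from langchain_aws import ChatBedrockConverse
from langchain_core.chat_history import InMemoryChatMessageHistory
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.runnables.history import RunnableWithMessageHistory

llm = ChatBedrockConverse(model="anthropic.claude-3-sonnet-20240229-v1:0")

prompt = ChatPromptTemplate.from_messages([
    ("system", "The following is a friendly conversation between a human and an AI."),
    MessagesPlaceholder(variable_name="chat_history"),
    ("human", "{input}"),
])

# In-memory store standing in for the project's DynamoDB-backed history.
store = {}

def get_history(session_id: str) -> InMemoryChatMessageHistory:
    if session_id not in store:
        store[session_id] = InMemoryChatMessageHistory()
    return store[session_id]

conversation = RunnableWithMessageHistory(
    prompt | llm,
    get_history,
    input_messages_key="input",
    history_messages_key="chat_history",
)

response = conversation.invoke(
    {"input": "test"},
    config={"configurable": {"session_id": "demo-session"}},
)

# The Converse API reports token usage on the returned AIMessage.
print(response.usage_metadata)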

Changes

  • Migrate chain for Claude Bedrock
  • Add usage to metadata
  • Add usage JSON log (see the logging sketch after this list)
  • Add CloudWatch filter parsing the log and generating a metric (with a CLI config option due to the added cost)
  • Add metric to dashboard.
  • Added npm run vet-all to quickly verify formatting and tests
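
A rough sketch of the JSON usage log idea (the helper and field names below are illustrative, not the PR's exact schema; a CloudWatch metric filter would then match on these JSON fields):

# Hypothetical helper: emits one JSON log line per invocation so a CloudWatch
# metric filter can extract the token counts.
import json
import logging

logger = logging.getLogger(__name__)

def log_usage(response, model_id: str, session_id: str) -> None:
    usage = getattr(response, "usage_metadata", None) or {}
    logger.info(json.dumps({
        "metric_type": "token_usage",
        "model": model_id,
        "session_id": session_id,
        "input_tokens": usage.get("input_tokens", 0),
        "output_tokens": usage.get("output_tokens", 0),
        "total_tokens": usage.get("total_tokens", 0),
    }))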

Testing

  • Verified with and without RAG and Streaming
  • Ran integ tests

Future Changes

  • Migrate all models since ConversationChain is deprecated
  • Add usage tracking to Sagemaker endpoint (if possible)

Note: This change modifies the prompts to match the new LangChain patterns. For example:
Before

The following is a friendly conversation between a human and an AI. If the AI does not know the answer to a question, it truthfully says it does not know.

Current conversation:
Human: test
AI: I'm afraid I don't have enough context to answer that question. Could you please provide more details?
Human: test
AI: I apologize,...

After

System: The following is a friendly conversation between a human and an AI. If the AI does not know the answer to a question, it truthfully says it does not know.
Human: test
AI: I'm afraid I don't have enough context to answer your question. Could you please provide more details?
Human: test
AI: I don't have enough information to answer your question. The context provided mentions an Integ Test flower that is yellow, but does not include a direct question.

Image with metadata usage
(screenshot)
Dashboard
(screenshot)

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@gbone-restore

With the older ConversationRetrievalChain, I limited how much history I would pass into a model. In my organization, we are seeing chat history messages grow across a variety of topics and it can cause inaccurate rephrasing of questions.

I subclassed ConversationBufferMemory to give a rolling window of conversation history that is a smaller subset of the entire history.

e.g.:

from langchain.memory import ConversationBufferMemory
from typing import Dict, List, Any
from pydantic import Field

class WindowedConversationBufferMemory(ConversationBufferMemory):
    k: int = Field(default=2, description="Number of recent conversations to keep")

    def __init__(self, **kwargs):
        super().__init__(**kwargs)

    def save_context(self, inputs: Dict[str, Any], outputs: Dict[str, str]) -> None:
        # Save the full context to the underlying storage (DynamoDB)
        super().save_context(inputs, outputs)

    def load_memory_variables(self, inputs: Dict[str, Any]) -> Dict[str, Any]:
        # Load the full history from the underlying storage
        result = super().load_memory_variables(inputs)

        # If there's no history, return an empty list or dict
        if self.memory_key not in result or not result[self.memory_key]:
            return {self.memory_key: [] if self.return_messages else ""}

        # Windowing: Only return the last k conversations
        if self.return_messages:
            result[self.memory_key] = result[self.memory_key][-2*self.k:]
        else:
            conversations = result[self.memory_key].split('\n\nHuman: ')
            recent_conversations = conversations[-min(self.k, len(conversations)):]
            result[self.memory_key] = '\n\nHuman: '.join(recent_conversations).strip()

        return result

I want to do something similar with RunnableWithMessageHistory but I'm still getting up to speed on this new API. Do you think that limiting the message history to a smaller slice of data is an important feature?

@charles-marion
Collaborator Author

The memory used by RunnableWithMessageHistory in this change is this class.

To implement it, I would just add a max-messages-returned parameter (you still want to store and return the full history to view the session). This would make it independent of the chain; a rough sketch of the idea is below.
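
Something along these lines (a rough sketch with hypothetical names, shown against the community DynamoDB history as a stand-in for the repo's own history class):

# Rough sketch: the full history is still written to DynamoDB,
# but only the last `max_messages` entries are returned to the chain.
from langchain_community.chat_message_histories import DynamoDBChatMessageHistory
from langchain_core.messages import BaseMessage

class WindowedDynamoDBChatMessageHistory(DynamoDBChatMessageHistory):
    def __init__(self, *args, max_messages: int = 6, **kwargs):
        super().__init__(*args, **kwargs)
        self._max_messages = max_messages

    @property
    def messages(self) -> list[BaseMessage]:
        # Reads the complete history from DynamoDB, exposes only the tail.
        return super().messages[-self._max_messages:]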

Do you think that limiting the message history to a smaller slice of data is an important feature?
I do agree, since it would reduce the number of tokens used, but it would need to be configurable somewhere.

@charles-marion charles-marion merged commit 6242c59 into aws-samples:main Sep 16, 2024
1 check passed
@charles-marion charles-marion deleted the usage branch September 16, 2024 14:22
lloydclowes pushed a commit to lloydclowes/gen-ai-playground that referenced this pull request Oct 5, 2024