
Lack of Knowledge Database Support in Vision Mode for Bedrock Claude 3 #436

Open
nhat-tranvan opened this issue Mar 29, 2024 · 2 comments
Labels: enhancement (New feature or request)


@nhat-tranvan

As you're aware, Bedrock Claude 3 is designed to support multi-modal capabilities, including Vision mode. However, during testing of the latest version, it appears that the system does not currently support accessing the knowledge database when operating in Vision mode (see attached image).

Many use cases involve customers uploading images and seeking solutions, with the expectation that the system can retrieve relevant documents from the internal knowledge base and provide appropriate responses based on the visual input and accompanying query.

Suggested Next Steps:

  1. Explore Multi-Modal Document Embeddings: https://blog.langchain.dev/semi-structured-multi-modal-rag/
  2. Implement a mechanism to generate semantic queries or searches based on the uploaded image and the user's question (a rough sketch follows below).
  3. Develop a multi-modal response generation capability to provide solutions that integrate information from both the knowledge base and the visual input.

By addressing these points, we can enhance the functionality of Bedrock Claude 3 in Vision mode, enabling it to leverage the knowledge database effectively when processing visual inputs and queries from customers.
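As a rough illustration of step 2, something along these lines could turn the uploaded image plus the user's question into a plain-text query that the existing text-only retrieval can handle. This is only a sketch: it assumes boto3 access to Bedrock, the Claude 3 Sonnet model ID is just one example of a vision-capable model, and the `generate_search_query` helper and prompt wording are hypothetical.

```python
import base64
import json

import boto3

bedrock = boto3.client("bedrock-runtime")

# Example model ID; any Claude 3 model with vision support could be used.
CLAUDE_3_SONNET = "anthropic.claude-3-sonnet-20240229-v1:0"


def generate_search_query(image_bytes: bytes, question: str,
                          media_type: str = "image/png") -> str:
    """Hypothetical helper: ask Claude 3 to condense the screenshot and the
    user's question into a short text query for the existing knowledge base."""
    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 256,
        "messages": [{
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": media_type,
                        "data": base64.b64encode(image_bytes).decode("utf-8"),
                    },
                },
                {
                    "type": "text",
                    "text": (
                        "Based on this screenshot and the question below, write a short "
                        "search query for our internal knowledge base. Return only the query.\n\n"
                        f"Question: {question}"
                    ),
                },
            ],
        }],
    }
    response = bedrock.invoke_model(modelId=CLAUDE_3_SONNET, body=json.dumps(body))
    result = json.loads(response["body"].read())
    return result["content"][0]["text"].strip()
```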

@bigadsoleiman added the enhancement (New feature or request) label Mar 29, 2024
@ystoneman

Thank you for spotlighting the absence of knowledge database support in Bedrock Claude 3's Vision mode. I think many will want this!

The LangChain blog post suggests three approaches for implementing multi-modal RAG:

  1. Using multi-modal embeddings (e.g., the Amazon Titan Multimodal Embeddings model) to generate joint image & text embeddings and search over those embeddings
  2. Using a multi-modal LLM (Claude 3 Vision) to generate text summaries of images, then embedding and searching those summaries alongside the text/table content (a rough sketch follows at the end of this comment)
  3. Combining (2) with raw image retrieval, so a multi-modal LLM can use the image directly when generating the response

Which approach best fits your Claude 3 Vision use case? Did you have option 1, 2, or 3 in mind, or another variant?
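For reference, here is a minimal sketch of the ingestion side of option (2): summarize each image with Claude 3 Vision and embed the summary with a text embedding model, so it can be indexed next to the ordinary document chunks. The helper names are hypothetical, and the Claude 3 Sonnet and Titan Text Embeddings model IDs are assumptions about which Bedrock models you would use.

```python
import base64
import json

import boto3

bedrock = boto3.client("bedrock-runtime")


def summarize_image(image_bytes: bytes, media_type: str = "image/png") -> str:
    """Use Claude 3 Vision to produce a text summary of an image at ingestion time."""
    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 512,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image",
                 "source": {"type": "base64", "media_type": media_type,
                            "data": base64.b64encode(image_bytes).decode("utf-8")}},
                {"type": "text",
                 "text": "Summarize this image so the summary can be retrieved later by a text search."},
            ],
        }],
    }
    response = bedrock.invoke_model(
        modelId="anthropic.claude-3-sonnet-20240229-v1:0", body=json.dumps(body))
    return json.loads(response["body"].read())["content"][0]["text"]


def embed_summary(text: str) -> list[float]:
    """Embed the summary with a text embedding model so it can be indexed
    alongside the ordinary document chunks."""
    response = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v1", body=json.dumps({"inputText": text}))
    return json.loads(response["body"].read())["embedding"]
```

At query time the summaries are retrieved like any other text chunk, and for option (3) the raw image can additionally be passed to the model when generating the final answer.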

@nhat-tranvan (Author)

I'm handling the case of an IT support system where users upload an image and ask how to fix their issue. Option 3 is the best choice at this time, given the complexity and cost of multi-modal embeddings.

But for retrieval, I will generate a semantic query with an LLM from the conversation history + the user's question + the image.
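Roughly, the answer step could look like the sketch below. It is only an illustration: the `answer_from_image_and_docs` helper, the prompt wording, and the Claude 3 Sonnet model ID are placeholders, and the semantic-query generation and retrieval happen before this call.

```python
import base64
import json

import boto3

bedrock = boto3.client("bedrock-runtime")


def answer_from_image_and_docs(image_bytes: bytes, question: str,
                               passages: list[str],
                               media_type: str = "image/png") -> str:
    """Option 3 answer step: give Claude 3 the raw screenshot plus the passages
    retrieved with the LLM-generated semantic query, and ask for a fix."""
    context = "\n---\n".join(passages)
    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 1024,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image",
                 "source": {"type": "base64", "media_type": media_type,
                            "data": base64.b64encode(image_bytes).decode("utf-8")}},
                {"type": "text",
                 "text": ("Knowledge base excerpts:\n" + context +
                          f"\n\nUser question: {question}\n"
                          "Answer using the excerpts and the screenshot.")},
            ],
        }],
    }
    response = bedrock.invoke_model(
        modelId="anthropic.claude-3-sonnet-20240229-v1:0",
        body=json.dumps(body),
    )
    return json.loads(response["body"].read())["content"][0]["text"]
```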

Thanks for the advice.
