
Lack of Knowledge Database Support in Vision Mode for Bedrock Claude 3 #436

Open
nhat-tranvan opened this issue Mar 29, 2024 · 2 comments
Labels: enhancement (New feature or request)


@nhat-tranvan

As you're aware, Bedrock Claude 3 is designed to support multi-modal capabilities, including Vision mode. However, during testing of the latest version, it appears that the system does not currently support accessing the knowledge database when operating in Vision mode (see attached image).

Many use cases involve customers uploading images and seeking solutions, with the expectation that the system can retrieve relevant documents from the internal knowledge base and provide appropriate responses based on the visual input and accompanying query.

Suggested Next Steps:

  1. Explore Multi-Modal Document Embeddings: https://blog.langchain.dev/semi-structured-multi-modal-rag/
  2. Implement a mechanism to generate semantic queries or searches based on the uploaded image and the user's question (a rough sketch follows below).
  3. Develop a multi-modal response generation capability to provide solutions that integrate information from both the knowledge base and the visual input.

By addressing these points, we can enhance the functionality of Bedrock Claude 3 in Vision mode, enabling it to leverage the knowledge database effectively when processing visual inputs and queries from customers.
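As a rough illustration of step 2, something along these lines could turn the uploaded image plus the user's question into a plain-text query that the existing text-only retrieval can handle. This is only a sketch: it assumes boto3 access to Bedrock, the Claude 3 Sonnet model ID is just one example of a vision-capable model, and the `generate_search_query` helper and prompt wording are hypothetical.

```python
import base64
import json

import boto3

bedrock = boto3.client("bedrock-runtime")

# Example model ID; any Claude 3 model with vision support could be used.
CLAUDE_3_SONNET = "anthropic.claude-3-sonnet-20240229-v1:0"


def generate_search_query(image_bytes: bytes, question: str,
                          media_type: str = "image/png") -> str:
    """Hypothetical helper: ask Claude 3 to condense the screenshot and the
    user's question into a short text query for the existing knowledge base."""
    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 256,
        "messages": [{
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": media_type,
                        "data": base64.b64encode(image_bytes).decode("utf-8"),
                    },
                },
                {
                    "type": "text",
                    "text": (
                        "Based on this screenshot and the question below, write a short "
                        "search query for our internal knowledge base. Return only the query.\n\n"
                        f"Question: {question}"
                    ),
                },
            ],
        }],
    }
    response = bedrock.invoke_model(modelId=CLAUDE_3_SONNET, body=json.dumps(body))
    result = json.loads(response["body"].read())
    return result["content"][0]["text"].strip()
```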

@bigadsoleiman added the enhancement (New feature or request) label Mar 29, 2024
@ystoneman

Thank you for spotlighting the absence of knowledge database support in Bedrock Claude 3's Vision mode. I think many will want this!

The LangChain blog post suggests three approaches for implementing multi-modal RAG:

  1. Using multi-modal embeddings (e.g., the Amazon Titan Multimodal Embeddings model) to generate joint image & text embeddings and search over those embeddings
  2. Using a multi-modal LLM (Claude 3 Vision) to generate text summaries of images, then embedding and searching those summaries alongside the text/table content (a rough sketch follows at the end of this comment)
  3. Combining (2) with raw image retrieval, so a multi-modal LLM can use the image directly when generating the response

Which approach best fits your Claude 3 Vision use case? Did you have option 1, 2, or 3 in mind, or another variant?
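For reference, here is a minimal sketch of the ingestion side of option (2): summarize each image with Claude 3 Vision and embed the summary with a text embedding model, so it can be indexed next to the ordinary document chunks. The helper names are hypothetical, and the Claude 3 Sonnet and Titan Text Embeddings model IDs are assumptions about which Bedrock models you would use.

```python
import base64
import json

import boto3

bedrock = boto3.client("bedrock-runtime")


def summarize_image(image_bytes: bytes, media_type: str = "image/png") -> str:
    """Use Claude 3 Vision to produce a text summary of an image at ingestion time."""
    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 512,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image",
                 "source": {"type": "base64", "media_type": media_type,
                            "data": base64.b64encode(image_bytes).decode("utf-8")}},
                {"type": "text",
                 "text": "Summarize this image so the summary can be retrieved later by a text search."},
            ],
        }],
    }
    response = bedrock.invoke_model(
        modelId="anthropic.claude-3-sonnet-20240229-v1:0", body=json.dumps(body))
    return json.loads(response["body"].read())["content"][0]["text"]


def embed_summary(text: str) -> list[float]:
    """Embed the summary with a text embedding model so it can be indexed
    alongside the ordinary document chunks."""
    response = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v1", body=json.dumps({"inputText": text}))
    return json.loads(response["body"].read())["embedding"]
```

At query time the summaries are retrieved like any other text chunk, and for option (3) the raw image can additionally be passed to the model when generating the final answer.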

@nhat-tranvan (Author)

I'm handling the case of an IT support system where users upload an image and ask how to fix their issue. Option 3 is the best choice at this time, given the complexity and cost of multi-modal embeddings.

But for retrieval, I will generate a semantic query with an LLM from the conversation history + the user's question + the image.
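Roughly, the answer step could look like the sketch below. It is only an illustration: the `answer_from_image_and_docs` helper, the prompt wording, and the Claude 3 Sonnet model ID are placeholders, and the semantic-query generation and retrieval happen before this call.

```python
import base64
import json

import boto3

bedrock = boto3.client("bedrock-runtime")


def answer_from_image_and_docs(image_bytes: bytes, question: str,
                               passages: list[str],
                               media_type: str = "image/png") -> str:
    """Option 3 answer step: give Claude 3 the raw screenshot plus the passages
    retrieved with the LLM-generated semantic query, and ask for a fix."""
    context = "\n---\n".join(passages)
    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 1024,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image",
                 "source": {"type": "base64", "media_type": media_type,
                            "data": base64.b64encode(image_bytes).decode("utf-8")}},
                {"type": "text",
                 "text": ("Knowledge base excerpts:\n" + context +
                          f"\n\nUser question: {question}\n"
                          "Answer using the excerpts and the screenshot.")},
            ],
        }],
    }
    response = bedrock.invoke_model(
        modelId="anthropic.claude-3-sonnet-20240229-v1:0",
        body=json.dumps(body),
    )
    return json.loads(response["body"].read())["content"][0]["text"]
```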

Thanks for the advice.
