
Question: What is the best way to use website crawler in a workspace? #605

Open
azaylamba opened this issue Nov 15, 2024 · 3 comments

@azaylamba
Contributor

The website crawler is a great feature for ingesting webpages into a workspace. What is the best way to update the workspace when some of those webpages change after the initial crawl is done?
I believe we would need to crawl the website again; would that result in duplicate documents in the workspace and the vector database?
How can we avoid duplication and update the workspace with the updated webpages?
Should we create a new workspace and crawl the website again? That doesn't seem scalable when the website content is updated frequently.

What is the best approach in this situation?

@charles-marion
Collaborator

Bedrock Knowledge Bases support crawling websites and provide an API to sync the data:
https://docs.aws.amazon.com/bedrock/latest/userguide/kb-data-source-sync-ingest.html

You might be able to set up EventBridge to periodically call the Bedrock StartIngestionJob API.
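
For illustration, here is a minimal sketch of a Lambda handler that an EventBridge schedule could invoke to trigger that sync. It assumes a Bedrock Knowledge Base with a web crawler data source already exists; the knowledge base and data source IDs are placeholders, not values from this project.

```python
# Minimal sketch: re-sync a Bedrock Knowledge Base data source on a schedule.
# KNOWLEDGE_BASE_ID and DATA_SOURCE_ID are placeholder environment variables.
import os
import boto3

bedrock_agent = boto3.client("bedrock-agent")

def handler(event, context):
    # Start a new ingestion job so the data source is re-crawled and re-indexed.
    response = bedrock_agent.start_ingestion_job(
        knowledgeBaseId=os.environ["KNOWLEDGE_BASE_ID"],
        dataSourceId=os.environ["DATA_SOURCE_ID"],
        description="Scheduled re-sync triggered by EventBridge",
    )
    # The returned job id can be polled with get_ingestion_job to check completion.
    return response["ingestionJob"]["ingestionJobId"]
```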

An alternative is to periodically remove the website from the workspace and add it back. The integration test has an example where it adds an RSS feed and removes it (as a document):
https://github.com/aws-samples/aws-genai-llm-chatbot/blob/main/integtests/chatbot-api/aurora_workspace_test.py#L62
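
A rough sketch of that remove-and-re-add flow is below. The `api` helper methods (`find_document`, `delete_document`, `get_document`, `add_website`) are hypothetical wrappers, not this project's actual client; see the linked integration test for the real workspace/document calls.

```python
# Rough sketch of a periodic "remove then re-add" re-sync for a crawled website.
# All helper methods on `api` are hypothetical placeholders for the chatbot API.
import time

WORKSPACE_ID = "my-workspace-id"       # placeholder
SITE_URL = "https://example.com/docs"  # placeholder

def resync_website(api):
    # Locate the previously crawled website document in the workspace.
    doc = api.find_document(WORKSPACE_ID, document_type="website", path=SITE_URL)
    if doc is not None:
        api.delete_document(WORKSPACE_ID, doc["id"])
        # Deletion is asynchronous: wait until the document is fully removed
        # (S3 objects, OpenSearch vectors, DynamoDB metadata) before re-adding.
        while api.get_document(WORKSPACE_ID, doc["id"]) is not None:
            time.sleep(30)
    # Re-add the website so it is crawled, chunked, and embedded again.
    api.add_website(WORKSPACE_ID, address=SITE_URL, follow_links=True, limit=500)
```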

@azaylamba
Contributor Author

@charles-marion Thanks for the links, I will have a look.

@azaylamba
Contributor Author

@charles-marion Currently I am using OpenSearch vector storage and primarily uploading PDF documents to the workspace via the file upload option. I am thinking of using the website crawler so that I don't have to upload the documents manually, since they are also published as web pages on the website.
Setting up a Bedrock knowledge base would require a whole new setup around the workspace and vector storage, so I am exploring whether website crawling can be used with the existing workspace and OpenSearch vector storage without additional setup.
The second approach you suggested, periodically removing and re-adding the website to the workspace, is worth exploring. My main concerns are the complexity and the workspace downtime during that window. We would probably need to make sure the documents are deleted from everywhere, including S3, OpenSearch, and DynamoDB, which adds complexity to setting up the periodic removal and re-addition.
