webLoader loads a previously registered URL only #153

leVoT8 · 2024-11-02T21:46:36Z

🐛 Describe the bug

My original code included:

await ragApplication.addLoader(new WebLoader({ urlOrContent: 'https://www.forbes.com/profile/elon-musk' }));

That works fine. But I then changed to a new URL to ingest:

await ragApplication.addLoader(new WebLoader({ urlOrContent: 'https://platform.openai.com/docs/guides/prompt-engineering' }));

The output I receive still references the original WebLoader source, https://www.forbes.com/profile/elon-musk, as shown in the output of running a new prompt:

    ...{
      source: 'https://www.forbes.com/profile/elon-musk',
      loaderId: 'WebLoader_8cf46026cabf9b05394a2658bd1fe890'
    }...

What should happen instead is that a new WebLoader is instantiated with a reference to the new URL, https://platform.openai.com/docs/guides/prompt-engineering.

adhityan · 2024-11-03T14:46:23Z

Loaders are cumulative. When a query is made, the knowledge from all previously loaded data is searched for what is most relevant to the query (using embeddings and vector databases). This data is sent to the LLM, and the source of that data is sent back in the sources array. Adding a new loader does not delete data from other loaders. You can have several hundred loaders. For example, if you were to build an Elon Musk app, you would add multiple sources of data from different web pages and YouTube videos about Elon Musk.

If you want to clear all loaded data, then you can invoke the delete methods available within the RagApplication API.
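
To make the cumulative behaviour concrete, here is a minimal in-memory sketch. It is not the library's implementation: `ToyStore` and its `addLoader`, `deleteLoader`, and `query` methods are hypothetical names, and word overlap stands in for embedding similarity. It assumes a plain top-k search with no relevance threshold, in which case the nearest stored chunk is returned even when it has little to do with the question.

```typescript
// Toy model of cumulative loaders. All names here are hypothetical and are
// not part of the RagApplication API.
type Chunk = { loaderId: string; source: string; text: string };

class ToyStore {
    private chunks: Chunk[] = [];

    // Adding a loader appends its chunks; nothing already stored is removed.
    addLoader(loaderId: string, source: string, texts: string[]): void {
        for (const text of texts) {
            this.chunks.push({ loaderId, source, text });
        }
    }

    // Removes every chunk that came from one loader.
    deleteLoader(loaderId: string): void {
        this.chunks = this.chunks.filter((c) => c.loaderId !== loaderId);
    }

    // Ranks all stored chunks by word overlap with the question (a stand-in
    // for embedding similarity) and returns the distinct sources of the top k.
    query(question: string, k = 1): string[] {
        const words = question.toLowerCase().split(/\W+/).filter(Boolean);
        const score = (c: Chunk): number =>
            c.text.toLowerCase().split(/\W+/).filter((w) => words.indexOf(w) !== -1).length;
        const top = this.chunks.slice().sort((a, b) => score(b) - score(a)).slice(0, k);
        const sources = top.map((c) => c.source);
        return sources.filter((s, i, all) => all.indexOf(s) === i);
    }
}

const store = new ToyStore();
store.addLoader('loader-a', 'https://example.com/a', ['Profile of a car company founder']);
// This loader contributes no chunks, as if its page could not be fetched.
store.addLoader('loader-b', 'https://example.com/b', []);

// The only stored chunk comes from loader-a, so it is returned regardless of relevance.
const sourcesBefore = store.query('How do I write a prompt?');
store.deleteLoader('loader-a');
const sourcesAfter = store.query('How do I write a prompt?');
```

In this sketch the second loader adds nothing to the store, so a query about prompts can only ever cite the first loader's source until that loader's data is deleted.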

leVoT8 · 2024-11-04T00:25:25Z

Understood. Thank you for the clarification. However, something is not working correctly: see my code and output below:

const ragApplication = await new RAGApplicationBuilder()
    .setModel(new OpenAi({ model: "gpt-4o-mini" }))
    .setEmbeddingModel(new OpenAiEmbeddings())
    .setVectorDb(new PineconeDb({
        projectName: 'medicalinfo',
        namespace: 'ns1',
        indexSpec: {
            serverless: {
                cloud: 'aws',
                environment: 'us-east-1'
            },
        },
    }))
    .setQueryTemplate("Only include information provided to you, do not make up answers. If the information is not available, state that you do not know.")
    .build();

await ragApplication.addLoader(new WebLoader({ urlOrContent: 'https://platform.openai.com/docs/guides/prompt-engineering' }));

const res = await ragApplication.query('How do I write a GPT Prompt?')

The output I receive is:

{
  id: 'f8333958-9e9e-469d-a856-87753184f977',
  timestamp: 2024-11-04T00:20:06.024Z,
  content: 'I do not know.',
  actor: 'AI',
  sources: [
    {
      source: 'https://www.forbes.com/profile/elon-musk',
      loaderId: 'WebLoader_8cf46026cabf9b05394a2658bd1fe890'
    }
  ],
  tokenUse: { inputTokens: 2012, outputTokens: 5 }
}

- Why does it reference only https://www.forbes.com/profile/elon-musk as the source? Nothing in my query has to do with Elon Musk.
- Why can't it answer my question, "How do I write a GPT Prompt?", when I provided it a guide via WebLoader?

leVoT8 · 2024-11-04T00:51:43Z

I just double-checked my Pinecone DB, and it appears the application has stopped uploading vectors to the DB. This is likely the issue.

adhityan · 2024-11-04T15:10:30Z

I will take a look at it. This should work from the sample code you posted.

adhityan · 2024-11-04T21:53:26Z

So, I took a look at it. This is not an error on the application's end. The URL https://platform.openai.com/docs/guides/prompt-engineering is protected by a CDN that uses some form of TLS fingerprinting to block programmatic access. When fetching the URL via Node.js, you get a 403 error. This is not tied to just user agents: the latest version of the app already spoofs the default user agent string of a Chrome browser on Windows 10.

Getting around TLS fingerprinting and blocking is possible but not easy. I will explore if this needs to be supported in a later version. For now, the application gets a 403 error and skips recording the page. You can see more details on what the application is doing internally by enabling the debug logs.
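
Until then, one way to notice this case early is to request the URL yourself before handing it to a loader. The sketch below is an illustration only: `classifyStatus`, `preflight`, and the `Preflight` and `StatusFetcher` types are hypothetical helpers, not part of the application. The fetcher is passed in as a parameter, so nothing here assumes how the application itself performs its requests.

```typescript
// Hypothetical pre-check helpers; none of these names come from the application.
type Preflight = { ok: boolean; reason: string };

// Minimal shape of a fetch-like function: resolves to an object with a status code.
type StatusFetcher = (url: string) => Promise<{ status: number }>;

// Maps an HTTP status code to a decision about whether the page is worth ingesting.
function classifyStatus(status: number): Preflight {
    if (status >= 200 && status < 300) {
        return { ok: true, reason: 'fetched' };
    }
    if (status === 403) {
        return { ok: false, reason: 'forbidden (possibly bot protection)' };
    }
    return { ok: false, reason: `unexpected status ${status}` };
}

// Requests the URL once and reports whether a loader is likely to receive content.
async function preflight(url: string, fetchImpl: StatusFetcher): Promise<Preflight> {
    try {
        const response = await fetchImpl(url);
        return classifyStatus(response.status);
    } catch (error) {
        return { ok: false, reason: `network error: ${String(error)}` };
    }
}
```

On Node.js 18 or later, the built-in `fetch` can be supplied as the fetcher, for example `await preflight(url, (u) => fetch(u))`. A result with `ok: false` for a URL that opens fine in a browser is a hint that the site is blocking programmatic access rather than that the loader is misbehaving.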

leVoT8 · 2024-11-05T02:03:06Z

Thank you for the answer to this question. That makes perfect sense.

github-actions bot assigned adhityan Nov 2, 2024

adhityan added the "invalid" label Nov 3, 2024

adhityan added the "bug" label and removed the "invalid" label Nov 4, 2024

adhityan added the "question" label and removed the "bug" label Nov 4, 2024

leVoT8 closed this as completed Nov 5, 2024
