webLoader loads a previously registered URL only #153

Closed
leVoT8 opened this issue Nov 2, 2024 · 6 comments
Labels
question Further information is requested

Comments

@leVoT8

leVoT8 commented Nov 2, 2024

🐛 Describe the bug

My original code included:

await ragApplication.addLoader(new WebLoader({ urlOrContent: 'https://www.forbes.com/profile/elon-musk' }));

That works fine. But then I changed to a new URL to ingest:

await ragApplication.addLoader(new WebLoader({ urlOrContent: 'https://platform.openai.com/docs/guides/prompt-engineering' }));

The output I receive is still referencing the old, original WebLoader source: https://www.forbes.com/profile/elon-musk as indicated in the output of running a new prompt:

    ...{
      source: 'https://www.forbes.com/profile/elon-musk',
      loaderId: 'WebLoader_8cf46026cabf9b05394a2658bd1fe890'
    }...

What should happen instead is that a new WebLoader is instantiated, with a reference to the new URL https://platform.openai.com/docs/guides/prompt-engineering.

@adhityan
Collaborator

adhityan commented Nov 3, 2024

Loaders are cumulative. When a query is made, the knowledge from all previously loaded data is searched for what is most relevant to the query (using embeddings and vector databases). This data is sent to the LLM, and the source of that data is sent back in the sources array. Adding a new loader does not delete data from other loaders. You can have several hundred loaders. For example, if you were to build an Elon Musk app, you would add multiple sources of data from different web pages and YouTube videos about Elon Musk.

If you want to clear all loaded data, then you can invoke the delete methods available within the RagApplication API.
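The cumulative behavior described above can be sketched with a toy in-memory store. This is only an illustration under stated assumptions, not the library's implementation: `ToyStore`, its `addLoader`, `query`, and `clear` methods, the example.com URLs, and the two-dimensional vectors are all hypothetical stand-ins for a real vector database and real embeddings.

```typescript
// Toy model of cumulative loaders. NOT the library's implementation:
// ToyStore, addLoader, query and clear are hypothetical names.
type Chunk = { source: string; loaderId: string; vector: number[] };

// Cosine similarity between two equal-length vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

class ToyStore {
  private chunks: Chunk[] = [];

  // Adding a loader appends its chunks; chunks from earlier loaders stay.
  addLoader(loaderId: string, source: string, vectors: number[][]): void {
    for (const vector of vectors) {
      this.chunks.push({ source, loaderId, vector });
    }
  }

  // Return the k stored chunks most similar to the query vector.
  query(queryVector: number[], k = 1): Chunk[] {
    return [...this.chunks]
      .sort((x, y) => cosine(queryVector, y.vector) - cosine(queryVector, x.vector))
      .slice(0, k);
  }

  // Explicitly remove everything that was loaded.
  clear(): void {
    this.chunks = [];
  }
}

const store = new ToyStore();
store.addLoader("loader-a", "https://example.com/a", [[1, 0]]);
store.addLoader("loader-b", "https://example.com/b", [[0, 1]]);
```

In this sketch, `store.query([0.9, 0.1])` returns the chunk from the first loader and `store.query([0.1, 0.9])` returns the chunk from the second, because both remain searchable until `clear()` is called. One consequence of the model: if a loader contributes no chunks at all, every query can only return sources from the loaders that did.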

@adhityan adhityan added the "invalid" label (This doesn't seem right) Nov 3, 2024
@leVoT8
Author

leVoT8 commented Nov 4, 2024

Understood. Thank you for the clarification. However, something is still not working correctly. See my code and output below:

const ragApplication = await new RAGApplicationBuilder()
    .setModel(new OpenAi({ model: 'gpt-4o-mini' }))
    .setEmbeddingModel(new OpenAiEmbeddings())
    .setVectorDb(new PineconeDb({
        projectName: 'medicalinfo',
        namespace: 'ns1',
        indexSpec: {
            serverless: {
                cloud: 'aws',
                environment: 'us-east-1',
            },
        },
    }))
    .setQueryTemplate('Only include information provided to you, do not make up answers. If the information is not available, state that you do not know.')
    .build();

await ragApplication.addLoader(new WebLoader({ urlOrContent: 'https://platform.openai.com/docs/guides/prompt-engineering' }));

const res = await ragApplication.query('How do I write a GPT Prompt?')

The output I receive is:

{
  id: 'f8333958-9e9e-469d-a856-87753184f977',
  timestamp: 2024-11-04T00:20:06.024Z,
  content: 'I do not know.',
  actor: 'AI',
  sources: [
    {
      source: 'https://www.forbes.com/profile/elon-musk',
      loaderId: 'WebLoader_8cf46026cabf9b05394a2658bd1fe890'
    }
  ],
  tokenUse: { inputTokens: 2012, outputTokens: 5 }
}

- Why is it referencing only the source https://www.forbes.com/profile/elon-musk? Nothing in my query has to do with Elon Musk.
- Why isn't it able to answer my question, "How do I write a GPT Prompt?", when I provided it a guide via WebLoader?

@leVoT8
Author

leVoT8 commented Nov 4, 2024

I just double-checked my Pinecone DB, and it appears the application has stopped uploading vectors to the DB. This is likely the issue.

@adhityan
Collaborator

adhityan commented Nov 4, 2024

I will take a look at it. This should work from the sample code you posted.

@adhityan adhityan added the "bug" label (Something isn't working) and removed the "invalid" label (This doesn't seem right) Nov 4, 2024
@adhityan
Collaborator

adhityan commented Nov 4, 2024

So, I took a look at it. This is not an error on the application's end. The URL https://platform.openai.com/docs/guides/prompt-engineering is protected by a CDN that uses some form of TLS fingerprinting to block programmatic access to the URL. When fetching the URL via Node.js, you get a 403 error. This is not tied to just user agents. The latest version of the app spoofs the default user agent string of a Chrome browser on Windows 10.

Getting around TLS fingerprinting and blocking is possible but not easy. I will explore whether this needs to be supported in a later version. For now, the application gets a 403 error and skips recording the page. You can see more details on what the application is doing internally by enabling the debug logs.
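Since a blocked page is skipped rather than recorded, one way to notice this earlier is to probe the URL before handing it to a loader. The following is a minimal sketch under assumptions, not something the library provides: `probeUrl`, `Fetcher`, `MinimalResponse`, and `ProbeResult` are hypothetical names, and the HTTP client is injected so the check can be exercised without network access.

```typescript
// Hypothetical pre-flight check; not part of the library discussed here.
type MinimalResponse = { status: number; ok: boolean };
type Fetcher = (url: string) => Promise<MinimalResponse>;

type ProbeResult =
  | { reachable: true; status: number }
  | { reachable: false; status: number | null; reason: string };

// Request the URL once and report whether a programmatic client can read it.
async function probeUrl(url: string, fetcher: Fetcher): Promise<ProbeResult> {
  try {
    const res = await fetcher(url);
    if (res.ok) {
      return { reachable: true, status: res.status };
    }
    const reason =
      res.status === 403
        ? "forbidden: the server refused this client (possibly bot protection)"
        : `HTTP ${res.status}`;
    return { reachable: false, status: res.status, reason };
  } catch (err) {
    // DNS failures, timeouts and similar errors surface here.
    return { reachable: false, status: null, reason: `network error: ${String(err)}` };
  }
}

// Example wiring on Node.js 18+, where a global fetch is available:
//   const result = await probeUrl(someUrl, (u) => fetch(u));
//   if (!result.reachable) console.warn(result.reason);
```

A passing probe is only a hint. Because TLS fingerprinting depends on the client making the request, the loader's own HTTP client may still be refused even when this check succeeds, and the reverse can also happen.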

@adhityan adhityan added the "question" label (Further information is requested) and removed the "bug" label (Something isn't working) Nov 4, 2024
@leVoT8
Author

leVoT8 commented Nov 5, 2024

Thank you for the answer to this question. That makes perfect sense.

@leVoT8 leVoT8 closed this as completed Nov 5, 2024

2 participants