add HTML page Loader #54

Open
converseKarl opened this issue May 14, 2024 · 4 comments
Assignees
Labels
enhancement New feature or request

Comments

@converseKarl

This is not the same as add WebLoader, but LangChain has a hook for passing in HTML page content along with some other settings. Under certain conditions this could be very useful for customizing or pre-processing the HTML content before passing it to the loader.

@adhityan
Collaborator

Not sure I understand. What is the use case?

@converseKarl
Author

I was hoping there was an HTML text extraction method so that metadata can be added before passing the content into the splitter service. I might have been thinking this was all baked into LangChain. If I'm wrong, by all means close this. The idea was to add supporting data / metadata to the HTML doc that's been beautifully souped, and then pass it in to be split after pre-processing via one of my own custom services.
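
As a rough illustration of that pre-processing idea, here is a minimal sketch that prepends metadata to already-extracted page text before it goes to a splitter. The function name `withMetadata` and the "key: value" header layout are assumptions made for this example only; nothing in this thread confirms how the library or a custom splitter service would expect metadata to be attached.

```javascript
// Hypothetical helper: prepend caller-supplied metadata to extracted text.
// The "key: value" header layout is an assumption made for this sketch.
function withMetadata(text, meta) {
  const header = Object.entries(meta)
    .map(([key, value]) => `${key}: ${value}`)
    .join("\n");
  // With no metadata, return the text unchanged.
  return header ? `${header}\n\n${text}` : text;
}

// Example input; the URL and field names are placeholders.
const doc = withMetadata("Pricing starts at 10 USD per seat.", {
  source: "https://example.com/pricing",
  section: "pricing",
});
```

The resulting string could then be handed to whatever splitting step comes next, so the metadata travels with the text rather than being lost at extraction time.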

@adhityan adhityan added the enhancement New feature or request label May 19, 2024
@converseKarl
Author

converseKarl commented May 24, 2024

OK, I have a use case for this. There are two types of website. For static HTML, you can add web URLs and loaders; all fine. Some dynamic pages have AJAX or scripts that run to generate content, or require login. The second type doesn't render pages immediately, and quite a few websites out there work this way (CRM backends, Salesforce, AJAX, etc.). So the page shows, but there is a delay in rendering the content in that page.

Point is this

  1. Static pages work fine using add WebLoader
  2. Dynamic pages will not (to my knowledge)
  3. I know I need to use a headless crawler
  4. But I need a method, once the page is rendered, to get the page content from the headless crawler into the RAG. TextLoader? HTML Loader (not web) or some other loader type? There are more loader types in LangChain. Maybe getting the text from the headless crawler and passing it in via a text loader is the right method?
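
The last option in the list above (take the rendered content from the headless crawler and pass it in as text) could look roughly like the sketch below. It makes several assumptions: `renderPage` is a hypothetical callback that resolves to the page's HTML after its scripts have run (for example, a thin wrapper around a headless browser), and `htmlToText` is a deliberately naive regex-based converter written for this example. A real HTML parser would be more robust, and whether the library's text loader accepts the result is not confirmed in this thread.

```javascript
// Naive HTML-to-text conversion, for illustration only. Regexes do not
// handle malformed markup well; a real HTML parser is the safer choice.
function htmlToText(html) {
  return html
    .replace(/<(script|style|noscript)\b[^>]*>[\s\S]*?<\/\1>/gi, " ") // drop non-content blocks
    .replace(/<[^>]+>/g, " ") // drop remaining tags
    .replace(/&nbsp;/g, " ")
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&amp;/g, "&") // decoded last to avoid double-decoding
    .replace(/\s+/g, " ")
    .trim();
}

// `renderPage` is a hypothetical callback supplied by the caller: given a
// URL, it resolves to the HTML of the page after dynamic content has loaded.
async function loadRenderedPageText(url, renderPage) {
  const html = await renderPage(url);
  return { source: url, text: htmlToText(html) };
}
```

Keeping the rendering step behind a callback means the same conversion works whether the HTML comes from a headless browser, a logged-in session, or a saved file; only the text and its source URL are passed on to the loader.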

Just sharing my thoughts with everyone!

@converseKarl
Author

In LangChain:
// Initialize the HTMLLoader with the HTML content
const loader = new HTMLLoader(htmlContent);

// Load the document (parsing the HTML)
const document = await loader.load();

This would cover dynamically rendered pages, rather than assuming all pages are static, as the WebLoader does.

Many thanks

2 participants