add HTML page Loader #54

converseKarl · 2024-05-14T11:26:40Z

Not the same as add WebLoader, but LangChain has a hook for passing in an HTML page's content along with some other settings. Under certain conditions this could be very useful for customizing or pre-processing the HTML content before passing it to the loader.

adhityan · 2024-05-14T15:29:35Z

Not sure I understand. What is the use case?

converseKarl · 2024-05-16T16:53:16Z

I was hoping there was an HTML text extraction method so metadata can be added before passing it into the splitter service. I might have been thinking this was all baked into LangChain. If I'm wrong, by all means close this. The idea was to add supporting data / metadata to the HTML doc that's been beautifully souped, and then pass it in to be split after pre-processing via one of my own custom services.
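
A rough sketch of that pre-processing step, assuming plain TypeScript with no library at all. `PreparedDocument`, `extractText` and `prepareDocument` are hypothetical names made up for illustration, and the regex-based tag stripping is only an approximation of what a real HTML parser would do:

```typescript
// Hypothetical shape of a document handed to a splitter; not taken from any library.
interface PreparedDocument {
  pageContent: string;
  metadata: Record<string, string>;
}

// Reduce raw HTML to plain text with simple regular expressions.
// This is a rough approximation; a real HTML parser handles edge cases better.
function extractText(html: string): string {
  return html
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1>/gi, " ") // drop script/style blocks
    .replace(/<[^>]+>/g, " ") // drop the remaining tags
    .replace(/&nbsp;/g, " ") // decode two common entities only
    .replace(/&amp;/g, "&")
    .replace(/\s+/g, " ") // collapse runs of whitespace
    .trim();
}

// Attach caller-supplied metadata to the extracted text before splitting.
function prepareDocument(html: string, metadata: Record<string, string>): PreparedDocument {
  return { pageContent: extractText(html), metadata: { ...metadata } };
}

const doc = prepareDocument(
  "<html><head><style>p{color:red}</style></head><body><h1>Title</h1><p>Hello &amp; welcome</p><script>var x = 1;</script></body></html>",
  { source: "https://example.com/page", section: "hypothetical" },
);
console.log(doc.pageContent); // prints "Title Hello & welcome"
```

The resulting object could then be handed to whatever splitter or custom service comes next.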

converseKarl · 2024-05-24T12:37:44Z

OK, I have a use case for this. There are two types of website. For static HTML, you can add web URLs and loaders. All fine. Some dynamic pages have AJAX or scripts that run to generate content, or require login. This second type doesn't render pages immediately, and quite a few websites out there work this way (CRM backends, Salesforce, AJAX, etc.). So the pages show, but there is a delay in rendering the content in that page.

Point is this:

Static pages work fine using add WebLoader.
Dynamic pages will not (to my knowledge).
I know I need to use a headless crawler.
But I need a method, once the page is rendered, to get the page content from the headless crawler into the RAG. TextLoader? HTML Loader (not web) or some other loader type? There are more loader types in LangChain. Maybe getting the text from the headless crawler and passing it in via a text loader is the right method?
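
To make the idea concrete, here is a minimal sketch of that flow, under stated assumptions. `RenderPage` and `AddText` are hypothetical function types invented for this example: the first stands in for a headless browser that returns the HTML after scripts have run, the second for whatever text-style loader ends up receiving the content. Neither is an API of this project or of LangChain:

```typescript
// Both function types are hypothetical stand-ins, not APIs from any library.
type RenderPage = (url: string) => Promise<string>; // returns HTML after scripts have run
type AddText = (text: string, source: string) => Promise<void>; // e.g. a wrapper around a text loader

// Render a dynamic page, reduce it to text, and hand the text to a loader.
async function ingestRenderedPage(
  url: string,
  renderPage: RenderPage,
  addText: AddText,
): Promise<number> {
  const html = await renderPage(url);
  // Crude tag stripping; a real HTML parser would be more robust.
  const text = html.replace(/<[^>]+>/g, " ").replace(/\s+/g, " ").trim();
  if (text.length === 0) {
    throw new Error(`No text rendered for ${url}`);
  }
  await addText(text, url);
  return text.length;
}

// Stubs so the sketch runs without a browser or a vector store.
const stored: { text: string; source: string }[] = [];
const fakeRender: RenderPage = async () => '<div id="app"><p>Loaded after AJAX</p></div>';
const fakeAddText: AddText = async (text, source) => {
  stored.push({ text, source });
};

const done = ingestRenderedPage("https://example.com/dashboard", fakeRender, fakeAddText);
done.then((length) => console.log(length, stored[0].text));
```

In a real setup the `renderPage` stub would be replaced by a headless browser call that waits for the page to finish rendering, for example Puppeteer's `page.goto` followed by `page.content()`.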

Just sharing my thoughts with everyone!

converseKarl · 2024-05-29T09:58:12Z

In LangChain:
// Initialize the HTMLLoader with the HTML content
const loader = new HTMLLoader(htmlContent);

// Load the document (parsing the HTML)
const document = await loader.load();

This would cover the case of dynamically rendered pages, rather than assuming all pages are static, which the WebLoader does.

Many thanks

github-actions bot assigned adhityan May 14, 2024

adhityan added the enhancement New feature or request label May 19, 2024
