Maintaining Semantic Search Quality, Build as you add Information Resources (to Loaders) #32

converseKarl · 2024-04-24T12:13:16Z

What is the best practice for initializing the ragApplicationBuilder/ragApplication vars and adding loaders/building into Lance?

I create a new RAG builder, and a ragApplication var from it, which is public, as part of server initialization.
I can add additional loaders via an API endpoint, set Lance, and build each time.
Should I be setting the vector DB once, then just calling build after adding a loader each time?

What is the best way to build up data incrementally in the session so that questions are answered reliably across a mix of web pages and PDFs?

Problem

When I add URLs it seems to work. When I re-add them they more or less work and answer questions, but I notice it becomes much less reliable. I tried adding two URLs, then a PDF. Initially the URLs worked and answered questions. When I added the PDF, that worked and answered questions too, but questions about the original two web pages started getting a lot more "I don't know the answer to the question" responses, as if it had become much dumber.

Is this because of the way I'm adding/building in Lance below, or is it because Lance perhaps isn't the best choice for semantic search in RAG as more data is added?

Code

Initialization

let ragApplication;
let ragApplicationBuilder;

router.get('/test', async (req, res) => {
    // Create the shared builder once; /addURL adds loaders to it and builds the application.
    ragApplicationBuilder = new RAGApplicationBuilder()
        .setTemperature(0);
    res.status(200).send("test Done");
});

Adding pages

router.post('/addURL', async (req, res) => { // Changed to POST
    const { url } = req.body; // Extract URL from request body
    if (!url) {
        return res.status(400).send('URL is required');
    }
    try {
        ragApplicationBuilder
            .addLoader(new WebLoader({ url: url }));
        ragApplication = await ragApplicationBuilder
            .setVectorDb(new LanceDb({ path: './db' }))
            .build();
        res.status(200).send("updated web page");
    } catch (error) {
        console.error('Error adding URL:', error);
        return res.status(500).send('Internal Server Error');
    }
});

adhityan · 2024-04-24T14:50:29Z

I think this is what you should be doing -

As for your specific question, I think it's a combination of multiple things. I am going to assume there is a cache, so the quality of the response depends on the embedding model and LLM choice. The vectorDB choice itself is only a little relevant as most vectorDBs have similar performance these days.
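On the question of whether to set the vector DB once and then build on every added loader, one possible arrangement is to build the application a single time and reuse it for each new source. The sketch below is only an illustration of that idea, not the library's documented pattern: `createAppHolder` and `buildApp` are hypothetical names, `buildApp` stands in for the real builder chain (for example the `setVectorDb(...).build()` calls from the question), and the built application is assumed to expose an async `addLoader(loader)` method.

```javascript
// Sketch: build the application once, then reuse it for every new loader.
// `buildApp` is a hypothetical callback wrapping the real builder chain.
// The built app is ASSUMED to expose an async addLoader(loader) method.
function createAppHolder(buildApp) {
    let appPromise = null; // memoized so concurrent requests share one build

    // Returns the single shared application, building it on first use.
    function getApp() {
        if (!appPromise) {
            appPromise = Promise.resolve()
                .then(buildApp)
                .catch((err) => {
                    appPromise = null; // allow a retry after a failed build
                    throw err;
                });
        }
        return appPromise;
    }

    // Adds one loader to the already-built application.
    async function addSource(loader) {
        const app = await getApp();
        return app.addLoader(loader);
    }

    return { getApp, addSource };
}
```

In a route like `/addURL`, the handler would then call something like `await holder.addSource(new WebLoader({ url }))` instead of rebuilding, so the vector DB is configured in exactly one place.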

One thing you should look at is the sources field in the query response. It contains references to the loaded pieces of information that were used to form the response (essentially the items that were retrieved from the vectorDB and sent to the LLM).
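To make the sources check concrete, here is a small helper that tallies how many retrieved items came from each source, which makes it easy to see whether the PDF is crowding out the web pages. It is a sketch under assumptions: `countSources` is a hypothetical helper, and the response is assumed to carry a `sources` array whose entries are either plain strings or objects with a `source` field. The exact shape may differ between library versions, so adjust it to what your query actually returns.

```javascript
// Counts how many retrieved items came from each source.
// ASSUMPTION: response.sources is an array of strings, or of objects
// that carry the origin in a `source` field.
function countSources(response) {
    const counts = new Map();
    for (const entry of response.sources ?? []) {
        const key = typeof entry === 'string' ? entry : entry.source;
        counts.set(key, (counts.get(key) ?? 0) + 1);
    }
    return Object.fromEntries(counts);
}

// Hand-written sample response for illustration (not real library output).
const sample = {
    result: "I don't know the answer to the question",
    sources: ['https://example.com/a', 'report.pdf', 'report.pdf'],
};
console.log(countSources(sample));
```

If a question about one of the web pages returns only PDF entries here, the problem is in retrieval (embeddings, chunking, or what is stored) rather than in the LLM.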

Another tool that can help is the logs. The library emits a lot of debug logs, which are hidden by default. These logs can give you a lot of info. To see them, set the environment variable DEBUG=embedjs:*
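If setting the variable in the shell is inconvenient, it can also be set from code. This is a minimal sketch that assumes the logs go through the `debug` npm package, which reads `DEBUG` when it is first loaded. Under that assumption, the assignment has to run before the library is required.

```javascript
// Enable the library's debug output from code instead of the shell.
// ASSUMPTION: logging uses the `debug` npm package, which reads DEBUG at
// load time, so this line must run before the library is first required.
process.env.DEBUG = 'embedjs:*';
```

Place it at the very top of the server's entry file, above any require or import of the library.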

Let me know what you discover.

converseKarl · 2024-04-24T17:54:46Z

Thanks. I can investigate this, but the two web pages I originally loaded seemed to become less reliable once I loaded a PDF. I will experiment and follow your diagram. Thanks for your detailed response.

I really appreciate this, and I would buy you a pizza or a round of coffees anytime.

I like the way you have structured this project. It is very cool. We all need to get behind you and support it. Regards.

adhityan · 2024-05-11T14:35:01Z

Thank you for the kind words. Were you able to investigate this further?

github-actions bot assigned adhityan Apr 24, 2024

adhityan added the "question" (Further information is requested) label Apr 24, 2024

adhityan closed this as completed May 19, 2024
