-
Ripgrep is not appropriate for >10 GB archives; you definitely want sonic. What errors did you see with the docker-compose.yml setup, exactly? It shouldn't be too difficult to get it running based on the instructions in […]. As for the high-level design direction: I don't want to invest in making ripgrep faster, I'd rather just make the sonic setup easier.
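For anyone else hitting errors with the sonic container, here is a minimal sketch of what a docker-compose setup pairing ArchiveBox with sonic can look like. The image tags, env var names, config path, and port are my assumptions from memory of the ArchiveBox and sonic docs and may differ in your version — treat this as a starting point, not the official file:

```yaml
# Hypothetical sketch, not the official ArchiveBox docker-compose.yml.
# Check your release's compose file for the real service definitions.
services:
  archivebox:
    image: archivebox/archivebox
    environment:
      - SEARCH_BACKEND_ENGINE=sonic          # switch from the ripgrep default
      - SEARCH_BACKEND_HOST_NAME=sonic       # service name below
      - SEARCH_BACKEND_PASSWORD=SecretPassword
    volumes:
      - ./data:/data
    ports:
      - "8000:8000"

  sonic:
    image: valeriansaliou/sonic:latest
    volumes:
      # sonic reads its config from /etc/sonic.cfg inside the container;
      # the auth password in this file must match the one above
      - ./sonic.cfg:/etc/sonic.cfg:ro
    expose:
      - 1491                                  # sonic's default port
```

A common failure mode with setups like this is the sonic container exiting immediately because the mounted sonic.cfg is missing or its password doesn't match the ArchiveBox side, so that is worth checking first.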
-
TL;DR: Is the sonic search backend working? Would it be a good idea to have both a shallow and a deep search?
I have used ArchiveBox for ~2 years now to archive web content for myself and to use it as a bookmark collection. During this time I used the ripgrep search backend, mostly because I did not know the sonic backend existed until last week.
The ripgrep backend never worked for me in a satisfying way: it always timed out, so I basically relied on the index data in the SQLite DB for searching. I am not surprised that the rg backend always times out. My archive is 76 GB in size and sits on really slow storage (sshfs backed by ZFS). Last week I finally wanted to play around a bit with making it searchable in a feasible manner, so I ran a couple of benchmarks. Even searching directly on the ZFS backend, a full ripgrep run takes ~15 min. During the search run the ArchiveBox web UI is unresponsive and shows no intermediate results. Waiting 15 minutes every time I search for a bookmark is not feasible for me. Moving the storage to a local SSD would improve this, but not dramatically, and it increases the cost of storage enormously. Especially with a big archive (hundreds of gigabytes), that doesn't seem satisfying either.
By reading the code I stumbled on the sonic backend, which is not documented anywhere outside of the code. From sonic's own repo and the ArchiveBox code, I gather that it creates a full-text index of parts of the archived data and puts it into a searchable database. This would be interesting for me, but I have not gotten it running so far. When I tried the docker-compose setup, it threw some errors in the sonic container. I'm not very familiar with Docker, so I haven't debugged it yet.
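To make concrete what "index parts of the archived data" buys you, here is a small conceptual sketch (this is not sonic's actual wire protocol or ArchiveBox's code — the size cap and function names are made up for illustration). Each snapshot's text is truncated to a cap, tokenized, and stored in an inverted index, so a query only touches the index instead of re-reading the whole archive:

```python
# Conceptual sketch of inverted-index search, not sonic's real protocol.
MAX_INDEXED_CHARS = 1000  # hypothetical cap: index only an excerpt per page

def tokenize(text: str) -> set[str]:
    """Lowercase whitespace tokenization, keeping alphanumeric tokens only."""
    return {t for t in text.lower().split() if t.isalnum()}

def build_index(snapshots: dict[str, str]) -> dict[str, set[str]]:
    """Map each token to the set of snapshot ids whose excerpt contains it."""
    index: dict[str, set[str]] = {}
    for snapshot_id, text in snapshots.items():
        for token in tokenize(text[:MAX_INDEXED_CHARS]):
            index.setdefault(token, set()).add(snapshot_id)
    return index

def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Return snapshot ids matching all query terms (AND semantics)."""
    results = None
    for token in tokenize(query):
        hits = index.get(token, set())
        results = hits if results is None else results & hits
    return results or set()

snapshots = {
    "snap1": "ArchiveBox saved this page about ripgrep performance",
    "snap2": "a bookmarked article about zfs and sshfs tuning",
}
index = build_index(snapshots)
print(search(index, "ripgrep performance"))  # → {'snap1'}
```

The trade-off the truncation cap illustrates: query time no longer scales with archive size, but anything past the cap (or not extracted as text) is invisible to the index — which is exactly why a separate exhaustive deep search still has a place.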
So my questions to other users are:
Nevertheless, I like the idea of a ripgrep-like deep search because of its unmatched coverage. But in my opinion this would only be practical if the following conditions are met.
What do you think about splitting the search function into a deep and a shallow search? Is this something you worry about? Has this been discussed before? Have I overlooked something that makes it unnecessary?
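The split I have in mind could look roughly like this (the schema and function names are hypothetical, not ArchiveBox's actual ones): the shallow pass queries only the small metadata index in SQLite and returns immediately, while the deep pass scans full page texts and would only run on demand:

```python
# Sketch of a shallow/deep search split; hypothetical schema for illustration.
import sqlite3

def make_db() -> sqlite3.Connection:
    """Build a tiny in-memory stand-in for the snapshot metadata index."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE snapshot (id TEXT, url TEXT, title TEXT)")
    db.executemany(
        "INSERT INTO snapshot VALUES (?, ?, ?)",
        [
            ("s1", "https://example.com/rg", "ripgrep benchmarks"),
            ("s2", "https://example.com/fs", "sshfs tuning notes"),
        ],
    )
    return db

def shallow_search(db: sqlite3.Connection, term: str) -> list[str]:
    """Fast: match only indexed metadata (title/url), never touching files."""
    rows = db.execute(
        "SELECT id FROM snapshot WHERE title LIKE ? OR url LIKE ?",
        (f"%{term}%", f"%{term}%"),
    )
    return [r[0] for r in rows]

def deep_search(texts: dict[str, str], term: str) -> list[str]:
    """Slow but complete: scan the full archived text of every snapshot."""
    return [sid for sid, text in texts.items() if term in text]

db = make_db()
texts = {
    "s1": "full page text mentioning zfs only in the body",
    "s2": "full page text about sshfs over zfs",
}
print(shallow_search(db, "zfs"))  # [] — 'zfs' appears in no title or URL
print(deep_search(texts, "zfs"))  # ['s1', 's2'] — both bodies mention it
```

The example shows the gap the deep pass covers: a term that appears only in page bodies is invisible to the shallow pass, which is why I would want both, with the deep pass clearly labeled as slow and run asynchronously.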