-
Ripgrep is not appropriate for >10 GB archives; you definitely want sonic. What errors did you see with the docker-compose.yml setup, exactly? It shouldn't be too difficult to get it running based on the instructions in […]. As for the high-level design direction: I don't want to invest in making ripgrep faster, I'd rather just make the sonic setup easier.
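For anyone else hitting errors with the sonic container, here is a minimal sketch of what a docker-compose setup pairing ArchiveBox with sonic can look like. The image tags, env var names, config path, and port are my assumptions from memory of the ArchiveBox and sonic docs and may differ in your version — treat this as a starting point, not the official file:

```yaml
# Hypothetical sketch, not the official ArchiveBox docker-compose.yml.
# Check your release's compose file for the real service definitions.
services:
  archivebox:
    image: archivebox/archivebox
    environment:
      - SEARCH_BACKEND_ENGINE=sonic          # switch from the ripgrep default
      - SEARCH_BACKEND_HOST_NAME=sonic       # service name below
      - SEARCH_BACKEND_PASSWORD=SecretPassword
    volumes:
      - ./data:/data
    ports:
      - "8000:8000"

  sonic:
    image: valeriansaliou/sonic:latest
    volumes:
      # sonic reads its config from /etc/sonic.cfg inside the container;
      # the auth password in this file must match the one above
      - ./sonic.cfg:/etc/sonic.cfg:ro
    expose:
      - 1491                                  # sonic's default port
```

A common failure mode with setups like this is the sonic container exiting immediately because the mounted sonic.cfg is missing or its password doesn't match the ArchiveBox side, so that is worth checking first.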
-
TL;DR: Is the sonic search backend working? Would it be a good idea to have both a shallow and a deep search?
I have used ArchiveBox for ~2 years now to archive web content for myself and to use it as a bookmark collection. During this time I used the ripgrep search backend, mostly because I did not know the sonic backend existed until last week.
The ripgrep backend never worked for me in a satisfying way: it always timed out, so I basically relied on the index data in the SQLite DB for searching. I am not surprised that the rg backend always times out. My archive is 76 GB in size and sits on really slow storage (sshfs backed by ZFS). Last week I finally wanted to play around a bit with making it searchable in a feasible manner, so I ran a couple of benchmarks. Even searching directly on the ZFS backend, a full ripgrep run takes ~15 min. During the search run the ArchiveBox web UI is unresponsive and shows no intermediate results. Waiting 15 minutes every time I search for a bookmark is not feasible for me. Moving the storage to a local SSD would improve this, but not dramatically, and it increases the cost of storage enormously. Especially with a big archive (hundreds of gigabytes), that doesn't seem satisfying either.
By reading the code I stumbled on the sonic backend, which is not documented anywhere outside of the code. From sonic's own repo and the ArchiveBox code, I gather that it creates a full-text index of parts of the archived data and puts it into a searchable database. This would be interesting for me, but I have not gotten it running so far. When I tried the docker-compose setup, it threw some errors in the sonic container. I'm not very familiar with Docker, so I haven't debugged it yet.
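To make concrete what "index parts of the archived data" buys you, here is a small conceptual sketch (this is not sonic's actual wire protocol or ArchiveBox's code — the size cap and function names are made up for illustration). Each snapshot's text is truncated to a cap, tokenized, and stored in an inverted index, so a query only touches the index instead of re-reading the whole archive:

```python
# Conceptual sketch of inverted-index search, not sonic's real protocol.
MAX_INDEXED_CHARS = 1000  # hypothetical cap: index only an excerpt per page

def tokenize(text: str) -> set[str]:
    """Lowercase whitespace tokenization, keeping alphanumeric tokens only."""
    return {t for t in text.lower().split() if t.isalnum()}

def build_index(snapshots: dict[str, str]) -> dict[str, set[str]]:
    """Map each token to the set of snapshot ids whose excerpt contains it."""
    index: dict[str, set[str]] = {}
    for snapshot_id, text in snapshots.items():
        for token in tokenize(text[:MAX_INDEXED_CHARS]):
            index.setdefault(token, set()).add(snapshot_id)
    return index

def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Return snapshot ids matching all query terms (AND semantics)."""
    results = None
    for token in tokenize(query):
        hits = index.get(token, set())
        results = hits if results is None else results & hits
    return results or set()

snapshots = {
    "snap1": "ArchiveBox saved this page about ripgrep performance",
    "snap2": "a bookmarked article about zfs and sshfs tuning",
}
index = build_index(snapshots)
print(search(index, "ripgrep performance"))  # → {'snap1'}
```

The trade-off the truncation cap illustrates: query time no longer scales with archive size, but anything past the cap (or not extracted as text) is invisible to the index — which is exactly why a separate exhaustive deep search still has a place.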
So my questions to other users are:
Nevertheless, I like the idea of a ripgrep-like deep search because of its unmatched coverage. But in my opinion this would only be practical if the following conditions are met.
What do you think about splitting the search function into a deep and a shallow search? Is this something you worry about? Has this been discussed before? Have I overlooked something that makes it unnecessary?
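The split I have in mind could look roughly like this (the schema and function names are hypothetical, not ArchiveBox's actual ones): the shallow pass queries only the small metadata index in SQLite and returns immediately, while the deep pass scans full page texts and would only run on demand:

```python
# Sketch of a shallow/deep search split; hypothetical schema for illustration.
import sqlite3

def make_db() -> sqlite3.Connection:
    """Build a tiny in-memory stand-in for the snapshot metadata index."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE snapshot (id TEXT, url TEXT, title TEXT)")
    db.executemany(
        "INSERT INTO snapshot VALUES (?, ?, ?)",
        [
            ("s1", "https://example.com/rg", "ripgrep benchmarks"),
            ("s2", "https://example.com/fs", "sshfs tuning notes"),
        ],
    )
    return db

def shallow_search(db: sqlite3.Connection, term: str) -> list[str]:
    """Fast: match only indexed metadata (title/url), never touching files."""
    rows = db.execute(
        "SELECT id FROM snapshot WHERE title LIKE ? OR url LIKE ?",
        (f"%{term}%", f"%{term}%"),
    )
    return [r[0] for r in rows]

def deep_search(texts: dict[str, str], term: str) -> list[str]:
    """Slow but complete: scan the full archived text of every snapshot."""
    return [sid for sid, text in texts.items() if term in text]

db = make_db()
texts = {
    "s1": "full page text mentioning zfs only in the body",
    "s2": "full page text about sshfs over zfs",
}
print(shallow_search(db, "zfs"))  # [] — 'zfs' appears in no title or URL
print(deep_search(texts, "zfs"))  # ['s1', 's2'] — both bodies mention it
```

The example shows the gap the deep pass covers: a term that appears only in page bodies is invisible to the shallow pass, which is why I would want both, with the deep pass clearly labeled as slow and run asynchronously.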