Pluggable cache system? #103
Another option is to pass/return …
@jimsmart thanks for this great idea and for the detailed and clear specification. I'm currently working on a storage interface (04d96ab) which follows pretty much the same concept. It allows us to use any kind of backend to store visited URLs and cookies (http://go-colly.org/docs/best_practices/storage/). I think a similar concept could work with the 2nd cache interface from your specification. What do you think? Also, I'd suggest an additional …
Yes. That's what I was thinking, similar to package … And have a … I don't need … Your …

Slight aside: I see you have a lot of stutter in your interface names. It might be better to move (or perhaps copy) them up to the parent package, then you'd have …
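(For readers catching up: the `storage.Storage` interface under discussion looks roughly like the sketch below, paraphrased from the Colly storage package, so exact signatures may differ between versions. The "stutter" remark refers to names like `storage.Storage` and `storage.InMemoryStorage` repeating the package name, versus a plain `colly.Storage` if the interface were moved up a level.)

```go
// Rough shape of the storage.Storage interface referenced above; paraphrased,
// not copied verbatim, so check the actual package for the current signatures.
package storage

import "net/url"

type Storage interface {
	// Init initializes the storage backend.
	Init() error
	// Visited records that a request ID has been visited.
	Visited(requestID uint64) error
	// IsVisited reports whether a request ID has already been visited.
	IsVisited(requestID uint64) (bool, error)
	// Cookies returns the stored cookies for a host.
	Cookies(u *url.URL) string
	// SetCookies stores cookies for a host.
	SetCookies(u *url.URL, cookies string)
}
```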
Yes.
You're right, I've removed it from Storage too, thanks.
I agree, it is a bit redundant, but the package doesn't use internals of the …
Of course not! :) I wasn't suggesting it would be, I was suggesting that specific implementations of …
@jimsmart what exactly are you get/putting in the cache? My imagination is having a hard time going beyond image and CSS files, which aren't downloaded through Colly in any case. If it's just HTML, isn't the …
I have several hundred thousand web pages I am processing to extract data. Currently, if the task fails part way through, I just resume later, and rely on the cached data (in the current file system cache) to get reprocessed, and then it simply resumes from where it got to. But there are so many files, and most are hundreds of KB each. The cache folder is >6 GB after a crawl. Compressed and in SQLite, that same data is a single file of <1.5 GB. It's fast. It's easy to manage. I've pushed two repos with working prototype code, run … https://github.com/jimsmart/collysqlite - implements a cache using SQLite. Both of these use …
Basically, if I'm going to invest the time and resources to download that much data, I'd most certainly like to keep it around, for later reruns.
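As a rough sketch of the approach described above (this is not the actual collysqlite code; the type, schema, and method names here are purely illustrative), a SQLite-backed key-value cache in Go could look something like this, using the mattn/go-sqlite3 driver:

```go
package cache

import (
	"database/sql"

	_ "github.com/mattn/go-sqlite3" // registers the "sqlite3" driver
)

// SQLiteCache stores response bodies in a single SQLite file, keyed by URL.
// Schema and method names are illustrative only, not the prototype's API.
type SQLiteCache struct {
	db *sql.DB
}

func NewSQLiteCache(path string) (*SQLiteCache, error) {
	db, err := sql.Open("sqlite3", path)
	if err != nil {
		return nil, err
	}
	if _, err := db.Exec(`CREATE TABLE IF NOT EXISTS cache (
		key  TEXT PRIMARY KEY,
		body BLOB
	)`); err != nil {
		return nil, err
	}
	return &SQLiteCache{db: db}, nil
}

// Put stores (or replaces) the body for a key.
func (c *SQLiteCache) Put(key string, body []byte) error {
	_, err := c.db.Exec(`INSERT OR REPLACE INTO cache (key, body) VALUES (?, ?)`, key, body)
	return err
}

// Get returns the stored body, or (nil, nil) on a cache miss.
func (c *SQLiteCache) Get(key string) ([]byte, error) {
	var body []byte
	err := c.db.QueryRow(`SELECT body FROM cache WHERE key = ?`, key).Scan(&body)
	if err == sql.ErrNoRows {
		return nil, nil
	}
	return body, err
}

// Remove deletes an entry, e.g. a poisoned/maintenance page.
func (c *SQLiteCache) Remove(key string) error {
	_, err := c.db.Exec(`DELETE FROM cache WHERE key = ?`, key)
	return err
}
```

A single database file like this is also far easier to back up or copy around than hundreds of thousands of small cache files.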
Scrapy, the leading Python web scraper, has pluggable caches. Maybe take a look?
FWIW: some log output from collyzstandard tests showing compression stats...
@jimsmart impressive! Could you please open a PR?
Sure, I'll get something together ASAP, I'm just on the tail-end of a bit of a coding marathon, so it won't be today. I don't think my code will integrate cleanly yet, it needs some more work. But I'm on with that already. I'm also concerned about 'breaking changes' / whether you have any API compatibility guarantees, or anything similar?
In my local repo I also have an implementation of 99% of the 'storage' API... for SQLite.
Awesome, there is no rush. Thanks!
Does this change introduce any backward incompatibility?
Um, maybe. But I want to minimise that, obviously. In an ideal world there'd be no breaking changes. That's partly why the code needs more work. I'm new to the party, so I don't know much about any current plans for features/improvements/etc. So I think a slow and gentle approach to this is probably best.
Perfect!
I wonder if a good solution is to use an interface like that in go-cloud: https://github.com/google/go-cloud/tree/master/blob. Check out https://github.com/google/go-cloud/blob/master/blob/example_test.go for a "local/file" cache. Similarly, check out the GCS or S3 tests. In this way, I could pass my own bucket to colly and it would work with non-local file systems.
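To make that suggestion concrete, here is a minimal sketch (the cache type and its methods are hypothetical, not an existing Colly API) of a cache backed by a gocloud.dev/blob bucket, so the same code works against local files, GCS, or S3 depending on the bucket URL:

```go
package cache

import (
	"context"

	"gocloud.dev/blob"
	_ "gocloud.dev/blob/fileblob" // enables file:// bucket URLs; swap in gcsblob or s3blob as needed
)

// BlobCache stores cached page bodies in any go-cloud blob bucket.
type BlobCache struct {
	bucket *blob.Bucket
}

// NewBlobCache opens a bucket from a URL such as "file:///tmp/colly-cache"
// or "s3://my-bucket?region=us-east-1".
func NewBlobCache(ctx context.Context, bucketURL string) (*BlobCache, error) {
	b, err := blob.OpenBucket(ctx, bucketURL)
	if err != nil {
		return nil, err
	}
	return &BlobCache{bucket: b}, nil
}

// Put writes a body under the given key.
func (c *BlobCache) Put(ctx context.Context, key string, body []byte) error {
	return c.bucket.WriteAll(ctx, key, body, nil)
}

// Get reads back the body stored under the given key.
func (c *BlobCache) Get(ctx context.Context, key string) ([]byte, error) {
	return c.bucket.ReadAll(ctx, key)
}

// Remove deletes the entry stored under the given key.
func (c *BlobCache) Remove(ctx context.Context, key string) error {
	return c.bucket.Delete(ctx, key)
}
```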
@jonathaningram I think the interface used in …
FYI: my use case.
I just wanted to drop this here https://www.nayuki.io/page/self-encrypted-cache-structure |
Hey, I am thinking of picking this up, but just wondering where this is at? I should probably describe my use case. I ran into a blocker as we use Heroku and their dynos' file systems are ephemeral. If you're not using a cloud storage mechanism (which you can, but it is quite limited under the free tier) then you'll lose it and will only have a day's worth of caching at most...
Hi, the repos mentioned in #103 (comment) (above) contain my prototype code. Feel free to open issues on those repos if you have any questions — happy to help, as best I can. I recall there were also changes needed in Colly to make these work, but I don't believe I ever pushed those, sorry. But IIRC the interface was fairly straightforward.
Hi, just want to catch up on this issue. @jimsmart I don't see any advantage to using SQLite as a cache here; you could even tar files on top of the filesystem, which could deliver similar performance to SQLite. @asciimoo If you are seeking API-backward-compatible changes, maybe you could replace the filesystem access (currently using pkg "os") with https://github.com/spf13/afero, so we could easily switch to a memory fs with tar/gzip functionality on top of it. Any thoughts?
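A sketch of that afero idea, assuming the existing filesystem cache were rewritten against the afero.Fs interface instead of calling package os directly (the type and function names below are hypothetical, not Colly's current code):

```go
package cache

import (
	"os"

	"github.com/spf13/afero"
)

// fsCache writes cached bodies through an afero.Fs, so the same code can
// target the real OS filesystem or an in-memory filesystem for tests.
type fsCache struct {
	fs  afero.Fs
	dir string
}

func newOSCache(dir string) *fsCache  { return &fsCache{fs: afero.NewOsFs(), dir: dir} }
func newMemCache(dir string) *fsCache { return &fsCache{fs: afero.NewMemMapFs(), dir: dir} }

// Put writes a body to <dir>/<name>, creating the directory if needed.
func (c *fsCache) Put(name string, body []byte) error {
	if err := c.fs.MkdirAll(c.dir, 0o750); err != nil {
		return err
	}
	return afero.WriteFile(c.fs, c.dir+"/"+name, body, 0o640)
}

// Get reads a body back, treating a missing file as a cache miss.
func (c *fsCache) Get(name string) ([]byte, error) {
	data, err := afero.ReadFile(c.fs, c.dir+"/"+name)
	if os.IsNotExist(err) {
		return nil, nil
	}
	return data, err
}
```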
@hsinhoyeh
For our use case, scraping many hundreds of thousands of HTML files (with a big variety in sizes), there is certainly a benefit to storing the files in SQLite: it’s actually faster than using raw files on the file system, and also takes much less disk space. There are docs on the SQLite website to support this, and our own benchmarks and metrics confirm this, without doubt.
I’m not sure why you say you don’t see any advantage here. Have you actually tried using SQLite as a storage system in such cases, and compared the metrics? It’s easy to flippantly dismiss something as a bad idea if one has not tried it, but as the old adage goes, “to measure is to know”. Perhaps your dismissal is better reasoned than it first appears and you’ve made some decent measurements/comparisons, in which case I would certainly be interested to see your benchmarks/stats, and to know some details about the system they were produced on (e.g. the underlying file system used, etc.).
Furthermore we also compress each page body individually before storing it into SQLite - we tried quite a few compression algorithms, including tar/gzip, but the benchmarks/metrics showed that Zstandard was in fact best for our use case, both speed-wise and size-wise. Naturally, this works very well on top of the aforementioned SQLite storage layer, but also produced improvements when used on top of the native file system.
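For illustration, per-body Zstandard compression of the kind described can be done with the klauspost/compress/zstd package (a sketch only; not necessarily the exact library or settings used in the prototypes):

```go
package cache

import "github.com/klauspost/compress/zstd"

// Compressor compresses and decompresses individual page bodies with Zstandard.
type Compressor struct {
	enc *zstd.Encoder
	dec *zstd.Decoder
}

func NewCompressor() (*Compressor, error) {
	// Passing nil writers/readers lets us use the stateless EncodeAll/DecodeAll helpers.
	enc, err := zstd.NewWriter(nil)
	if err != nil {
		return nil, err
	}
	dec, err := zstd.NewReader(nil)
	if err != nil {
		return nil, err
	}
	return &Compressor{enc: enc, dec: dec}, nil
}

// Compress returns the Zstandard-compressed form of body.
func (c *Compressor) Compress(body []byte) []byte {
	return c.enc.EncodeAll(body, nil)
}

// Decompress reverses Compress.
func (c *Compressor) Decompress(blob []byte) ([]byte, error) {
	return c.dec.DecodeAll(blob, nil)
}
```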
Unfortunately, we no longer use Colly to build our scrapers, ours now have a completely different architecture, for a variety of reasons, and Colly was no longer the best fit for us, although I still think it is a great project. But because of this, I’ve not pursued any further development of these storage-related plugins, although the original prototype code is still available in my GitHub repos.
Hope that helps.
I really like the colly API for scraping, but having just the filesystem for storage is not nice. I came back to this issue a few times over the years and always went back to using my own simple crawler that basically just uses an HTTP client and goquery. That's why I would like to bring this topic up again. Is it worth investigating this issue further? I think @jimsmart did some nice prototyping. Is there any chance to get this included if we have a complete pull request? @hsinhoyeh …
Bumping up. I'd love to see a more pluggable caching/storage system, but I don't believe my go-fu is strong enough to attempt the refactoring. I may need to anyway, because I would really use it, so I wonder if there are people at least willing to do some solid peer review.
Create an interface `Cache`, similar to `storage.Storage`, that will allow for pluggable caching extensions. The caching subsystem is created with full access to Request and Response objects, to potentially make more complex caching decisions based on both request and response headers. Because of the need to access those objects, `Cache` is not created in its own package like `storage`, but in root `colly`. Closes gocolly#103
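A rough illustration of the interface described in that commit message (the actual signatures in the referenced change may differ; this is only a sketch of the idea):

```go
package colly

// Cache is a sketch of a pluggable caching interface along the lines of the
// commit message above; the real signatures may differ. It lives in the root
// colly package so implementations can inspect both the Request and the
// Response when making caching decisions.
type Cache interface {
	// Init prepares the cache backend (files, DB connections, etc.).
	Init() error
	// Get returns the cached response for a request, or (nil, nil) on a miss.
	Get(req *Request) (*Response, error)
	// Put stores a response; implementations may consult request and response
	// headers to decide whether to cache it at all.
	Put(req *Request, resp *Response) error
	// Remove evicts a cached entry.
	Remove(req *Request) error
}
```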
Hi,
Thanks for Colly! :)
I have a task with hundreds of thousands of pages, so obviously I am using Colly's caching, but it's basically 'too much' for the filesystem. (Wasted disk space, a pain to manage, slow to back up, etc.)
I'd like to propose a pluggable cache system, similar to how you've made other Colly components.
Perhaps with an API like this:-
...or...
The first one won't be possible if you then wish to implement FileSystemCache in a subpackage of Colly though.
The reason I also need a Remove method is because one project has a site that sometimes serves maintenance pages, and whilst I can detect these, Colly currently has no method of saying stop, or of rejecting a page after processing. Obviously, the last thing I want part way through a crawl is to have my cache poisoned. But that's probably a separate issue, that I can live with if I can do the removal of bad pages myself.
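(The code snippets from the original proposal are not preserved above. Purely as a hypothetical illustration of the kind of API being suggested, covering the Get/Put and Remove operations just described, and simpler than the Request/Response-aware sketch shown earlier:)

```go
package colly

// Cache is an illustrative sketch of a pluggable cache keyed by URL;
// it is not the original proposal's exact API.
type Cache interface {
	// Get returns the cached response body for a URL, or (nil, nil) on a miss.
	Get(url string) ([]byte, error)
	// Put stores a response body for a URL.
	Put(url string, body []byte) error
	// Remove deletes a cached entry, e.g. a maintenance page detected after the fact.
	Remove(url string) error
}
```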
If pluggable caches were to be implemented, I have existing code from another project that has a cache built using SQLite as a key-value store, compressing/decompressing the data with Zstandard (it's both surprisingly quick and super efficient on disk space), that I would happily port over. This can either become part of Colly, or a separate thing on my own Github.
I did start implementing this myself, but ran into a problem with how I went about it. (I followed the pattern you have established of having the separate components as subpackages, but then got bitten because my FileSystemCache couldn't easily reach into the Collector to get the CacheDir, as I was trying to preserve existing behaviour / API compatibility. Maybe that's not an issue. Maybe these bits shouldn't be subpackages. Obviously, once I started thinking things like that, I figured it was best to discuss before/instead of proceeding any further.)
— Your thoughts?