Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor existing caching mechanism into pluggable system #766

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

unjello
Copy link

@unjello unjello commented Apr 29, 2023

Create an interface Cache, similiar to storage.Storage that will allow for pluggable caching extensions. Caching subsystem is created with full access to Request and Response objects, to potentialy make more complex caching decisions based on both request and response headers. Because of the need to access those objects, Cache is not created in its own package like storage, but in root colly.

Closes #103

unjello added 2 commits April 29, 2023 16:26
Create an interface `Cache`, similiar to `storage.Storage` that
will allow for pluggable caching extensions. Caching subsystem is
created with full access to Request and Response objects, to potentialy
make more complex caching decisions based on both request and response
headers. Because of the need to access those objects, `Cache` is not
created in its own package like `storage`, but in root `colly`.

Closes gocolly#103
@unjello
Copy link
Author

unjello commented Apr 29, 2023

@asciimoo would you happen to have a second to review this?

Comment on lines +35 to +36
// Init initializes the caching backend
Init() error
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need Init() as a part of the interface? This is not typical in Go. Can't we just assume that whateve Cache implementation is passed to colly, it's already initialized?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main reason I've decided to go with Init is consistency - it may not be idiomatic for Go, but it is used by colly all over the place... Collector, httpBackend, etc. My other reason was the way I was thinking about backwards compatibility - i.e. passing SetCollector via NewCollector, but after some thinking, I may call NewFileSystemCache just as well. If you'd like me to remove Init just let me know.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may call NewFileSystemCache just as well. If you'd like me to remove Init just let me know.

Yeah, that sounds much better. Please do.

Comment on lines +41 to +42
// Close finalizes caching backend
Close() error
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the expected lifecycle of the Cache implementation? Is it tightly coupled to the Collector's one, or it can be shared between several collectors, and something else must close it?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a very good question. The backwards-compatible FileSystemCache is tightly coupled with Collector, but the way to use other caches would be via Collector.SetCache only. Unfortunately in current form, where it's .SetCache who's calling cache.Init it kind of defeats the purpose. I will remove .Init from the .SetCache and document that this method should pass an initialised object, and a caller is responsible for closing. For the backwards-compatibility scenario, it should not matter too much. Collector does not have a .Close method, and FileSystemCache did not do any closing either.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about this a bit.

I believe if cache is set through the environment variable, collector's Close should call the cache's Close as well. If cache is initialized through SetCache, then it should not be closed. Moreover, if there's a cache previously initialized through the environment variable when SetCache is called, it should be closed before replacing it with the new one.

Do you agree?

Comment on lines +102 to +104
if request.Method != "GET" || request.Header.Get("Cache-Control") == "no-cache" {
return nil, ErrRequestNoCache
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this check has to be inside Cache implementation rather than somewhere in its caller? I think cacheability of a HTTP request doesn't really depend on where you store its response.

Copy link
Author

@unjello unjello Jun 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main reason I put that inside a Cache is for each caching to make their own decisions - i.e. override Cache-Control and cache everything if they want. It did not feel right to have a decision made in two places. OTOH, maybe it'd be good to refactor this to a method like Cache.Filter or maybe to make a CacheFilter class with a default implementation, such that other Cache implementations do not have to reinvent the wheel if they just want to take a default behaviour.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, at least let's move this check into unexported function for the time being.

@WGH-
Copy link
Collaborator

WGH- commented Jun 20, 2023

I certainly approve the efforts do decouple caching from the guts of colly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Pluggable cache system?
2 participants