Skipping downloads when they're not necessary to execute parsers #10
Conversation
This is done by inspecting parsers and injectable dependencies. We search for type annotations and the use of DummyRequests.
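For illustration, a minimal sketch of what such an annotated callback looks like (the import path for DummyResponse is an assumption; adjust it to the actual module):

import scrapy
from scrapy_po.utils import DummyResponse  # assumed import path

class ExampleSpider(scrapy.Spider):
    name = "example"

    def parse(self, response: DummyResponse):
        # The DummyResponse annotation signals that this callback never
        # touches the response body, so the download can be skipped.
        yield {"url": response.url}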
Codecov Report
@@            Coverage Diff             @@
##           master      #10      +/-   ##
==========================================
+ Coverage   98.01%   98.68%   +0.66%
==========================================
  Files           6        6
  Lines         101      152      +51
==========================================
+ Hits           99      150      +51
  Misses          2        2
Continue to review full report at Codecov.
scrapy_po/middleware.py
Outdated
def is_response_going_to_be_used(self, request, spider):
    """Check whether the request's response is going to be used."""
    callback = get_callback(request, spider)
    plan, _ = build_plan(callback, {})
👍 for making it work end-to-end, but it seems the current implementation is not complete.
There are two things to consider:
- Has the user annotated response as DummyResponse in the callback?
- Does a dependency of any of the arguments need access to the Scrapy Response (e.g. a page object asks for ResponseData)?
Then I think the actions should be these (sketched in code below):
- The user hasn't annotated response as DummyResponse: always download the response and pass it to the callback.
- There is response: DummyResponse, and page objects don't need the response: don't download; pass a DummyResponse instance to the callback.
- There is response: DummyResponse, but page objects need a Response: download a response, but pass a DummyResponse instance to the callback, since that is what the user asked for.
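A rough sketch of how those three cases could map onto the middleware decision (get_callback and build_plan are from this PR; the other helper names are hypothetical):

def should_download(request, spider):
    callback = get_callback(request, spider)
    if not callback_expects_dummy_response(callback):  # hypothetical helper
        return True   # case 1: always download, pass the real response
    if any_provider_needs_response(callback):          # hypothetical helper
        return True   # case 3: download, but still pass a DummyResponse
    return False      # case 2: skip the download entirely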
The next question is how to detect:
- whether response is annotated as DummyResponse, and
- whether a page object needs a response.
Currently this is done by checking whether there is an argument named response. I think this is not correct, both for the callback and for page objects, but for different reasons.
Callback: Scrapy passes the response as a positional argument, not as a keyword argument. That means the name of the argument doesn't matter: given def parse(self, resp, page: ProductPage, response: DummyResponse):, the response should be downloaded.
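A sketch of that first-argument check, under the assumption that standard typing introspection is enough (the DummyResponse import path is assumed):

import inspect
from typing import get_type_hints

from scrapy_po.utils import DummyResponse  # assumed import path

def callback_expects_dummy_response(callback) -> bool:
    """True if the callback's first parameter is annotated as DummyResponse."""
    # On a bound method, inspect.signature() already excludes `self`,
    # so the first parameter is the one Scrapy fills positionally.
    first = next(iter(inspect.signature(callback).parameters))
    annotation = get_type_hints(callback).get(first)
    return isinstance(annotation, type) and issubclass(annotation, DummyResponse)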
Page objects / injectables: argument names and types don't matter; what matters is the implementation of the providers for the types the page object uses. Examples:
- There can be a page object which receives a response argument of type SplashResponse.
- There can be a page object which receives a resp argument of type ResponseData, and the provider for ResponseData needs access to the response.
- There can be a page object which receives a response argument of type ResponseData, but the provider currently registered for ResponseData doesn't use Scrapy at all (e.g. it always gets a response from a local cache).
So, if I'm not mistaken, it means the logic for the callback and for page objects should be different:
- For the callback: check the type of the first argument.
- For page objects: find providers for all dependencies, and check whether any of the providers needs a Response.
The tricky part here is "if any of the providers need a Response"; currently there is no way of doing it. I think we can use dependency injection again here, and inspect their __init__ method; if Response or its subclass is found, then mark the provider as requiring a Response. This (or some other way of doing it, e.g. having an attribute requires_scrapy_response = True) should be documented - at least in a comment for now.
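A sketch of the provider-side check described above; the requires_scrapy_response attribute name is taken from this comment, everything else is an assumption:

from typing import get_type_hints
from scrapy.http import Response

def provider_requires_response(provider_cls) -> bool:
    # Route 1: an explicit attribute on the provider class wins.
    explicit = getattr(provider_cls, "requires_scrapy_response", None)
    if explicit is not None:
        return bool(explicit)
    # Route 2: inspect __init__ for a Response (or subclass) annotation.
    hints = get_type_hints(provider_cls.__init__)
    return any(
        isinstance(hint, type) and issubclass(hint, Response)
        for hint in hints.values()
    )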
I've just refactored the source code. Could you please take another look?
Co-Authored-By: Mikhail Korobov <[email protected]>
@kmike I'll wait for your feedback before implementing more test cases (for example, testing the middleware and the parts pointed out by the coverage bot).
scrapy_po/utils.py
Outdated
# Provider not found.
continue

spec = inspect.getfullargspec(provider)
I haven't checked this in detail, but it may be better to use andi.inspect here, because this way you can work around some attr.s-specific issues, and potentially support Union.
Output is quite different :/
(Pdb) inspect.getfullargspec(provider)
FullArgSpec(args=['self', 'response'], varargs=None, varkw=None, defaults=None, kwonlyargs=[], kwonlydefaults=None, annotations={'response': <class 'scrapy.http.response.Response'>})
(Pdb) andi.inspect(provider)
{'response': []}
Unlike andi.plan, you need to pass the __init__ method explicitly:
In [1]: import andi
In [2]: from scrapy.http import Response
In [3]: class Foo:
...: def __init__(self, resp: Response):
...: pass
...:
In [4]: andi.inspect(Foo)
Out[4]: {'resp': []}
In [5]: andi.inspect(Foo.__init__)
Out[5]: {'resp': [scrapy.http.response.Response]}
I actually wonder if this logic (https://github.com/scrapinghub/andi/blob/e26bcfd9f946aead542cacac1fa4462af7d01443/andi/andi.py#L289) should be moved to andi.inspect, so that there are fewer surprises. What do you think, @ivanprado?
@kmike I've just refactored this method to make use of andi.inspect instead of inspect.getfullargspec, as you had requested.
@@ -21,6 +22,10 @@ class AutoextractProductResponse:

@provides(AutoextractProductResponse)
class AutoextractProductProvider(PageObjectInputProvider):

    def __init__(self, response: DummyResponse):
What do you think about making it explicit that this thing needs a URL, not a DummyResponse? Not sure how best to do it.
I'm not sure. In any case, a DummyResponse instance does have a valid URL and references the original request, so users/developers can consult meta and other attributes.
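To illustrate that point, a DummyResponse built from a request might look roughly like this (a sketch of the idea, not necessarily the actual class):

from scrapy.http import Response

class DummyResponse(Response):
    def __init__(self, url, request):
        # Keep a valid URL and a reference to the original request,
        # so request.meta and other attributes stay reachable.
        super().__init__(url=url, request=request)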
@victor-torres the approach and the proposed code refactoring make sense, please go ahead with further changes! Could you please add some integration-style tests as well, to check that the response is actually not being downloaded in cases where it shouldn't be?
Hi @kmike, thank you for your last review. I've just updated the pull request with some comments answering some of your questions, and I have also introduced new test cases and refactored some parts of the code. Looking forward to your thoughts.
def is_provider_using_response(provider):
    """Check whether injectable provider makes use of a valid Response."""
    for argument, possible_types in andi.inspect(provider.__init__).items():
        for cls in possible_types:
            # assumed completion of the truncated snippet: a DummyResponse
            # annotation doesn't count as needing a real Response
            if issubclass(cls, Response) and not issubclass(cls, DummyResponse):
                return True
    return False
I think we should fail with a useful message here in case cls is not expected (i.e. it is not any type of Response).
Are we allowing other types of injectables in the future? Perhaps having to check this is out of the scope of this method. What do you think? Should we raise a TypeError here?
I'm thinking of the case where somebody declares a provider with a wrong __init__. It would be more meaningful to fail here with a TypeError saying something like Type foo not supported for arguments for providers __init__. Only Response is supported. or Too many arguments for provider __init__, only empty or one supported.
Otherwise, it will fail on invocation, but the message will be much less informative. What do you think?
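A sketch of the early validation being suggested (the error-message wording is illustrative, not from the codebase):

from scrapy.http import Response

def check_provider_argument(cls):
    """Fail early with a readable message instead of at invocation time."""
    if not (isinstance(cls, type) and issubclass(cls, Response)):
        raise TypeError(
            f"Type {cls!r} is not supported in a provider's __init__; "
            "only Response (and its subclasses) is supported."
        )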
Well, it makes sense. But I still think it's out of the scope of this method.
Think of that Config injectable you've suggested in another comment: would it make sense to check for subclasses of that Config here too? I don't think so.
But let me know if you have a different opinion. Also, your two cents here would be helpful, @kmike.
Ok, forget about it. I think I'm mixing things up here 😝 and this is very minor.
@victor-torres thanks for the PR, nice to meet you! I see that this PR introduces some kind of manual …
Hey, @ivanprado. Thank you for your comments. Would you mind explaining a little bit more about your thoughts? Maybe giving an example?
@victor-torres this is just an idea, so I don't want to mess this PR with it; it is out of the scope of it. But let's imagine we have some kind of Conf injectable. Today a provider looks like this:

@provides(AutoextractProductResponse)
class AutoextractProductProvider(PageObjectInputProvider):
    def __init__(self, response: DummyResponse):
        self.response = response
    ...

Or alternatively, if you require the configuration to configure some behaviour, you could declare:

@provides(AutoextractProductResponse)
class AutoextractProductProvider(PageObjectInputProvider):
    def __init__(self, response: DummyResponse, conf: Conf):
        self.response = response
        self.conf = conf
    ...

You would also need a provider for Conf:

@provides(Conf)
class ConfProvider(PageObjectInputProvider):
    # It could read a file or access a database depending on the environment
    ...

This would require some injection mechanism using andi for providers as well. A difficulty is that some providers might depend on other providers. Anyway, I'm not sure we really need something like that. It is only an idea. Forget about it.
Oh, I see, @ivanprado. That also crossed my mind when developing the …
Thanks @victor-torres! The PR looks good to me as-is, let's merge it.
I noticed that when DummyRequests are used, limits such as concurrency, download delays, and AutoThrottle are not applied. The reason is that such requests skip the downloader, so none of these limits can be applied to DummyRequests. Whether this is a useful behaviour or not depends on the case.
I have analyzed my last comment more deeply, and it seems that …
@ivanprado, I think this is the expected behavior. DummyRequests are meant to skip downloads, so it makes sense not to check for concurrent requests, delays, or autothrottle settings, since we won't be making any download at all. If your parser needs a regular Request, it will be downloaded, and I believe all of those settings will be respected.

I suspect the problem here is when you are using third-party libraries to acquire content for a page object. AutoExtract is a good example. Before the DummyRequests implementation, there was a delay between AutoExtract API requests caused by regular Requests going through Scrapy's scheduler. The only issue here was that the crawler was downloading a Response that was going to be discarded. That's a waste of network resources.

I believe that if a third-party library needs to implement some kind of rate limit or delay between its requests, it must do it on its side of the source code, or try to leverage Scrapy's mechanisms to do so, for example, by making use of Requests to fetch its data. Let me know if you have other considerations or if you would like to suggest another way to overcome this limitation.
@victor-torres I think your comment makes a lot of sense. My intention was just to state the current behaviour: in the presence of DummyRequests, these limits are not applied. Whether that is a problem or not, and what the solution would be, is another question. I think your ideas are in the right direction anyway.
Nice to see that we are on the same page, @ivanprado. Just documenting here that, after talking today, we discussed exposing Scrapy settings to the Injectables. That might at least provide some information to the so-called third-party libraries so that they can enforce the spider's concurrency settings. Cheers!
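A hypothetical sketch of that settings idea (every name and signature here is an assumption, not code from this PR): a provider could expose the crawler's settings so that injectables can read the spider's concurrency options.

from scrapy.settings import Settings

@provides(Settings)  # assumes the @provides decorator from this codebase
class SettingsProvider(PageObjectInputProvider):
    def __init__(self, crawler):  # assumes providers can receive the crawler
        self.settings = crawler.settings

    def __call__(self):
        # An injectable declaring `settings: Settings` would receive this.
        return self.settings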