
Skipping downloads when they're not necessary to execute parsers #10

Merged: 16 commits merged into scrapinghub:master on Apr 15, 2020

Conversation

victor-torres (Contributor, Author)

This is done by inspecting parsers and injectable dependencies.
We search for type annotations and the use of DummyRequests.

@codecov

codecov bot commented Apr 9, 2020

Codecov Report

Merging #10 into master will increase coverage by 0.66%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master      #10      +/-   ##
==========================================
+ Coverage   98.01%   98.68%   +0.66%     
==========================================
  Files           6        6              
  Lines         101      152      +51     
==========================================
+ Hits           99      150      +51     
  Misses          2        2              
Impacted Files Coverage Δ
scrapy_po/middleware.py 100.00% <100.00%> (ø)
scrapy_po/page_input_providers.py 95.00% <100.00%> (ø)
scrapy_po/utils.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 236b6ae...212547f.

scrapy_po/middleware.py: review comment (outdated, resolved)
scrapy_po/middleware.py: review comment (resolved)
def is_response_going_to_be_used(self, request, spider):
"""Check whether the request's response is going to be used."""
callback = get_callback(request, spider)
plan, _ = build_plan(callback, {})
kmike (Member) commented Apr 10, 2020

👍 for making it work end-to-end, but it seems the current implementation is not complete.

There are 2 things to consider:

  1. Has the user annotated response as DummyResponse in a callback?
  2. Does a dependency of any of the arguments need to access a Scrapy Response (e.g. a page object asks for ResponseData)?

Then I think the actions should be these:

  1. The user hasn't annotated response as DummyResponse. Always download the response and pass it to the callback.
  2. There is response: DummyResponse, and page objects don't need a response. Don't download; pass a DummyResponse instance to the callback.
  3. There is response: DummyResponse, but page objects need a Response. Download a response, but pass a DummyResponse instance to the callback, as the user asked for that.

The next question is how to detect

  1. whether response is annotated as DummyResponse, and
  2. whether a page object needs a response.

Currently this is done by checking whether there is an argument named response. I think this is not correct, both for the callback and for page objects, but for different reasons.

Callback: Scrapy passes the response as a positional argument, not as a keyword argument. This means the name of the argument doesn't matter: in the case of def parse(self, resp, page: ProductPage, response: DummyResponse), the response should be downloaded.

Page objects / injectables: argument names and types don't matter; what matters is the implementation of the providers for the types which the page object uses. Examples:

  • There can be e.g. page object which receives response argument of type SplashResponse.
  • There can be a page object which receives a resp argument of type ResponseData, and the provider for ResponseData needs access to the response.
  • There can be a page object which receives response argument of type ResponseData, but a provider which is currently registered for ResponseData doesn't use Scrapy at all (e.g. it always gets a response from a local cache).

So, if I'm not mistaken, it means that the logic for the callback and for page objects should be different:

  1. For callback - check type of the first argument.
  2. For page object - find providers for all dependencies, and check if any of the providers need a Response.

The tricky part here is "if any of the providers need a Response"; currently there is no way of doing it. I think we can use dependency injection again here, and inspect their __init__ method; if Response or its subclass is found, then mark the provider as requiring a Response. This (or some other way of doing it - e.g. having an attribute requires_scrapy_response = True) should be documented - at least in a comment for now.
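
For illustration, here is a rough sketch of the detection logic proposed above, assuming DummyResponse subclasses scrapy.http.Response; the function names and the DummyResponse import path are hypothetical, not this PR's actual code:

import inspect

import andi
from scrapy.http import Response

from scrapy_po.utils import DummyResponse  # import path assumed


def callback_wants_dummy_response(callback) -> bool:
    # Scrapy passes the response positionally, so only the position of the
    # first (non-self) argument matters, not its name.
    spec = inspect.getfullargspec(callback)
    args = [arg for arg in spec.args if arg != "self"]
    if not args:
        return False
    annotation = spec.annotations.get(args[0])
    return isinstance(annotation, type) and issubclass(annotation, DummyResponse)


def provider_requires_response(provider_cls) -> bool:
    # A provider needs a real download if any __init__ argument is annotated
    # with scrapy.http.Response (or a subclass other than DummyResponse).
    for possible_types in andi.inspect(provider_cls.__init__).values():
        for cls in possible_types:
            if issubclass(cls, Response) and not issubclass(cls, DummyResponse):
                return True
    return False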

victor-torres (Contributor, Author)

I've just refactored the source code. Could you please take another review?

scrapy_po/middleware.py: review comment (outdated, resolved)
victor-torres (Contributor, Author)

@kmike I'll wait for your feedback before implementing more test cases (for example, testing the middleware and parts pointed out by the coverage bot).

scrapy_po/utils.py: review comment (outdated, resolved)
# Provider not found.
continue

spec = inspect.getfullargspec(provider)
kmike (Member)

I haven't checked this in detail, but it can be better to use andi.inspect here, because this way you can work around some attr.s-specific issues, and potentially support Union

victor-torres (Contributor, Author) commented Apr 13, 2020

Output is quite different :/

(Pdb) inspect.getfullargspec(provider)
FullArgSpec(args=['self', 'response'], varargs=None, varkw=None, defaults=None, kwonlyargs=[], kwonlydefaults=None, annotations={'response': <class 'scrapy.http.response.Response'>})

(Pdb) andi.inspect(provider)
{'response': []}

kmike (Member)

Unlike andi.plan, you need to pass the __init__ method explicitly:

In [1]: import andi

In [2]: from scrapy.http  import Response

In [3]: class Foo:
   ...:     def __init__(self, resp: Response):
   ...:         pass
   ...:

In [4]: andi.inspect(Foo)
Out[4]: {'resp': []}

In [5]: andi.inspect(Foo.__init__)
Out[5]: {'resp': [scrapy.http.response.Response]}

kmike (Member)

I actually wonder if this logic (https://github.com/scrapinghub/andi/blob/e26bcfd9f946aead542cacac1fa4462af7d01443/andi/andi.py#L289) should be moved to andi.inspect, so that there are fewer surprises. What do you think, @ivanprado?

victor-torres (Contributor, Author)

@kmike I've just refactored this method to make use of andi.inspect instead of inspect.getfullargspec as you had requested.

scrapy_po/utils.py: review comment (outdated, resolved)
@@ -21,6 +22,10 @@ class AutoextractProductResponse:

@provides(AutoextractProductResponse)
class AutoextractProductProvider(PageObjectInputProvider):

    def __init__(self, response: DummyResponse):
kmike (Member)

What do you think about making it explicit that this thing needs a URL, not a DummyResponse? Not sure how best to do it.

victor-torres (Contributor, Author)

I'm not sure. In any case, a DummyResponse instance does have a valid URL and references the original request, so users/developers can consult meta and other attributes.
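
For reference, a minimal sketch of what such a DummyResponse could look like, assuming it subclasses Scrapy's Response and keeps a reference to the originating request (the actual implementation in this PR may differ):

from scrapy.http import Request, Response


class DummyResponse(Response):
    # A placeholder response that skips downloading: it keeps a valid URL and
    # a reference to the original request (so request.meta and other
    # attributes stay accessible), but carries no downloaded body.
    def __init__(self, url: str, request: Request = None):
        super().__init__(url=url, request=request)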

kmike (Member) commented Apr 13, 2020

@victor-torres the approach and the proposed code refactoring make sense; please go ahead with further changes!

Could you please add some integration-style tests as well, to check that the response is actually not being downloaded in cases where it shouldn't be?

victor-torres (Contributor, Author)

Hi @kmike

Thank you for your last review.

I've just updated the pull request with some comments answering some of your questions and I have also introduced new test cases and refactored some parts of the code.

Looking forward to your thoughts.

def is_provider_using_response(provider):
"""Check whether injectable provider makes use of a valid Response."""
for argument, possible_types in andi.inspect(provider.__init__).items():
for cls in possible_types:
ivanprado (Contributor)

I think we should fail with a useful message here in case cls is not expected (i.e. it is not any type of Response).

victor-torres (Contributor, Author)

Are we allowing other types of injectables in the future?

Perhaps it's out of the scope of this method to check this. What do you think? Should we raise a TypeError here?

ivanprado (Contributor)

I'm thinking of the case where somebody declares a provider with a wrong __init__. It would be more meaningful to fail here with a TypeError saying something like: "Type foo not supported as an argument for a provider's __init__; only Response is supported." or "Too many arguments for the provider's __init__; only zero or one are supported."

Otherwise, it will fail on invocation but the message will be much less informative. What do you think?
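
A rough sketch of what such an early validation could look like (the function name and the error messages are illustrative, not this PR's actual code):

import andi
from scrapy.http import Response


def validate_provider_signature(provider_cls):
    # Fail early with a clear message if a provider declares an unsupported
    # __init__ signature, instead of failing later at invocation time.
    arguments = andi.inspect(provider_cls.__init__)
    if len(arguments) > 1:
        raise TypeError(
            f"Too many arguments in {provider_cls.__name__}.__init__: "
            "only zero or one (a Response) are supported."
        )
    for name, possible_types in arguments.items():
        if not any(issubclass(cls, Response) for cls in possible_types):
            raise TypeError(
                f"Argument '{name}' of {provider_cls.__name__}.__init__ has an "
                "unsupported type: only Response (or a subclass) is supported."
            )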

victor-torres (Contributor, Author)

Well, it makes sense. But I still think it's out of the scope of this method.

Think of that Config injectable you've suggested in another comment. Would it make sense to check for subclasses of this Config here too? I don't think so.

But let me know if you have a different opinion. Also, your two cents here would be helpful, @kmike.

ivanprado (Contributor)

Ok, forget about it. I think I'm mixing things up here 😝 and this is very minor.

ivanprado (Contributor)

@victor-torres thanks for the PR, nice to meet you! I see that this PR introduces some kind of manual injection into the providers (setting the request or the dummy request). I wonder if we could extend the concept and make the providers also injectable. I thought a little bit about it but I didn't find the answer, so I'm leaving the question open here.

victor-torres (Contributor, Author)

Hey, @ivanprado. Thank you for your comments.

Would you mind explaining a little bit more about your thoughts? Maybe giving an example?

ivanprado (Contributor)

@victor-torres this is just an idea, so I don't want to mess up this PR with it. It's out of its scope.

But let's imagine we have some kind of Conf object that allows configuring some particular behaviour. You could declare a provider as usual:

@provides(AutoextractProductResponse)
class AutoextractProductProvider(PageObjectInputProvider):

    def __init__(self, response: DummyResponse):
        self.response = response
    ...

Or alternatively, if you need the configuration to adjust some behaviour, you could declare:

@provides(AutoextractProductResponse)
class AutoextractProductProvider(PageObjectInputProvider):

    def __init__(self, response: DummyResponse, conf: Conf):
        self.response = response
        self.conf = conf
    ...

You would also need a provider for Conf:

@provides(Conf)
class ConfProvider(PageObjectInputProvider):
    # It could read a file or access a database, depending on the environment.
    ...

This would require some injection mechanism using andi.plan similar to what is applied for the callback, but applied to the provider itself.
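
For illustration only, a rough sketch of how providers' own dependencies could be resolved recursively, using andi.inspect as earlier in this PR (all other names are hypothetical, and providers are assumed to be callables that return the provided instance):

import andi
from scrapy.http import Response


def build_provider(provider_cls, providers_by_type, response):
    # providers_by_type maps a provided type (e.g. Conf) to its provider class
    # (e.g. ConfProvider); `response` is the Response/DummyResponse for the
    # current request.
    kwargs = {}
    for name, possible_types in andi.inspect(provider_cls.__init__).items():
        for cls in possible_types:
            if issubclass(cls, Response):
                kwargs[name] = response
                break
            if cls in providers_by_type:
                # Recursively build the provider of the dependency, then call
                # it to obtain the dependency instance (no cycle detection in
                # this sketch).
                dependency_provider = build_provider(
                    providers_by_type[cls], providers_by_type, response)
                kwargs[name] = dependency_provider()
                break
    return provider_cls(**kwargs)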

A difficulty is that some providers might depend on other providers.

Anyway, I'm not sure we really need something like that. It is only an idea. Forget about it.

victor-torres (Contributor, Author) commented Apr 14, 2020

Oh, I see @ivanprado. That also crossed my mind when developing the build_providers method. Agreed that it might be out of the current scope, but it's a very interesting feature :D

kmike (Member) commented Apr 15, 2020

Thanks @victor-torres! The PR looks good to me as-is, let's merge it.
It seems we all have many ideas for improvements, which is good :)

@kmike kmike merged commit 7bb6ed1 into scrapinghub:master Apr 15, 2020
@victor-torres victor-torres deleted the skip-downloads branch April 15, 2020 18:46
ivanprado (Contributor)

I noticed that when DummyRequest is used, some functionality of Scrapy does not work:

  • CONCURRENT_REQUESTS
  • CONCURRENT_REQUESTS_PER_DOMAIN
  • CONCURRENT_REQUESTS_PER_IP
  • RANDOMIZE_DOWNLOAD_DELAY
  • DownloaderAwarePriorityQueue
  • AutoThrottle

The reason is that the request skips the downloader, so none of these limits can be applied to DummyRequests.

Whether this is a useful behaviour or not depends on the case.

ivanprado (Contributor)

I have analyzed my last comment more deeply and it seems that CONCURRENT_REQUESTS is actually enforced even in the presence of DummyRequest. What is not working is everything that needs to use the downloader slots. So it doesn't work for:

  • CONCURRENT_REQUESTS_PER_DOMAIN
  • CONCURRENT_REQUESTS_PER_IP
  • RANDOMIZE_DOWNLOAD_DELAY
  • DownloaderAwarePriorityQueue
  • AutoThrottle

victor-torres (Contributor, Author)

@ivanprado, I think this is the expected behavior.

DummyRequests are meant to skip downloads, so it makes sense not to check for concurrent requests, delays, or autothrottle settings, since we won't be making any download at all. If your parser needs a regular Request, it will be downloaded, and I believe all of those settings will be respected.

I suspect the problem here is when you are using third-party libraries to acquire content for a page object. Auto Extract is a good example.

Before the DummyRequests implementation, there was a delay between Auto Extract API requests caused by regular Requests going through Scrapy's scheduler. The only issue was that the crawler was downloading a Response that was going to be discarded. That's a waste of network resources.

I believe that if a third-party library needs to implement some kind of rate limit or delay between its requests, it must do so on its own side, or try to leverage Scrapy's mechanisms to do so, for example by making use of Requests to fetch its data.

Let me know if you have other considerations or if you would like to suggest another way to overcome this limitation.

ivanprado (Contributor)

@victor-torres I think your comment makes a lot of sense.

My intention was just to state the current behaviour: in the presence of DummyRequest, when a third-party library is used to fetch content, some Scrapy limits still apply (CONCURRENT_REQUESTS) but not others.

Whether that is a problem or not, and what the solution would be, is another question. I think your ideas are in the right direction anyway.

victor-torres (Contributor, Author)

Nice to see that we are on the same page, @ivanprado.

Just documenting here that, after talking today, we discussed exposing Scrapy settings to the Injectables. That might at least provide some information to third-party libraries so that they can enforce the spider's concurrency settings.
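
A minimal sketch of what that could look like, following the @provides pattern used above; SettingsProvider is a hypothetical name, the import path is assumed, and it is assumed the middleware (which has access to crawler.settings) would pass the settings to the provider the same way it passes the response today:

from scrapy.settings import Settings

from scrapy_po.page_input_providers import PageObjectInputProvider, provides  # path assumed


@provides(Settings)
class SettingsProvider(PageObjectInputProvider):

    def __init__(self, settings: Settings):
        self.settings = settings

    def __call__(self) -> Settings:
        # A page object (or a client library it drives) could then read, e.g.,
        # settings.getint("CONCURRENT_REQUESTS_PER_DOMAIN") and throttle its
        # own requests accordingly.
        return self.settings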

Cheers!
