
Skipping downloads when they're not necessary to execute parsers #10

Merged: 16 commits merged into scrapinghub:master on Apr 15, 2020

Conversation

victor-torres (Contributor, Author)

This is done by inspecting parsers and injectable dependencies.
We search for type annotations and the use of DummyRequests.

@codecov

codecov bot commented Apr 9, 2020

Codecov Report

Merging #10 into master will increase coverage by 0.66%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master      #10      +/-   ##
==========================================
+ Coverage   98.01%   98.68%   +0.66%     
==========================================
  Files           6        6              
  Lines         101      152      +51     
==========================================
+ Hits           99      150      +51     
  Misses          2        2              
Impacted Files Coverage Δ
scrapy_po/middleware.py 100.00% <100.00%> (ø)
scrapy_po/page_input_providers.py 95.00% <100.00%> (ø)
scrapy_po/utils.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 236b6ae...212547f.

scrapy_po/middleware.py: review comment (outdated, resolved)
scrapy_po/middleware.py: review comment (resolved)
def is_response_going_to_be_used(self, request, spider):
"""Check whether the request's response is going to be used."""
callback = get_callback(request, spider)
plan, _ = build_plan(callback, {})
kmike (Member) commented Apr 10, 2020

👍 for making it work end-to-end, but it seems the current implementation is not complete.

There are 2 things to consider:

  1. Has the user annotated response as DummyResponse in a callback?
  2. Does a dependency of any of the arguments need to access a Scrapy Response (e.g. a page object asks for ResponseData)?

Then I think the actions should be these:

  1. The user hasn't annotated response as DummyResponse. Always download the response and pass it to the callback.
  2. There is response: DummyResponse, and page objects don't need a response. Don't download; pass a DummyResponse instance to the callback.
  3. There is response: DummyResponse, but page objects need a Response. Download a response, but pass a DummyResponse instance to the callback, as the user asked for that.

The next question is how to detect

  1. whether response is annotated as DummyResponse, and
  2. whether a page object needs a response.

Currently this is done by checking whether there is an argument named response. I think this is not correct, both for the callback and for page objects, but for different reasons.

Callback: Scrapy passes the response as a positional argument, not as a keyword argument. This means the name of the argument doesn't matter: in the case of def parse(self, resp, page: ProductPage, response: DummyResponse), the response should be downloaded.

Page objects / injectables: argument names and types don't matter; what matters is the implementation of the providers for the types which the page object uses. Examples:

  • There can be e.g. page object which receives response argument of type SplashResponse.
  • There can be a page object which receives a resp argument of type ResponseData, and the provider for ResponseData needs access to the response.
  • There can be a page object which receives response argument of type ResponseData, but a provider which is currently registered for ResponseData doesn't use Scrapy at all (e.g. it always gets a response from a local cache).

So, if I'm not mistaken, it means that the logic for the callback and for page objects should be different:

  1. For callback - check type of the first argument.
  2. For page object - find providers for all dependencies, and check if any of the providers need a Response.

The tricky part here is "if any of the providers need a Response"; currently there is no way of doing it. I think we can use dependency injection again here, and inspect their __init__ method; if Response or its subclass is found, then mark the provider as requiring a Response. This (or some other way of doing it - e.g. having an attribute requires_scrapy_response = True) should be documented - at least in a comment for now.
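
For illustration, here is a rough sketch of the detection logic proposed above, assuming DummyResponse subclasses scrapy.http.Response; the function names and the DummyResponse import path are hypothetical, not this PR's actual code:

import inspect

import andi
from scrapy.http import Response

from scrapy_po.utils import DummyResponse  # import path assumed


def callback_wants_dummy_response(callback) -> bool:
    # Scrapy passes the response positionally, so only the position of the
    # first (non-self) argument matters, not its name.
    spec = inspect.getfullargspec(callback)
    args = [arg for arg in spec.args if arg != "self"]
    if not args:
        return False
    annotation = spec.annotations.get(args[0])
    return isinstance(annotation, type) and issubclass(annotation, DummyResponse)


def provider_requires_response(provider_cls) -> bool:
    # A provider needs a real download if any __init__ argument is annotated
    # with scrapy.http.Response (or a subclass other than DummyResponse).
    for possible_types in andi.inspect(provider_cls.__init__).values():
        for cls in possible_types:
            if issubclass(cls, Response) and not issubclass(cls, DummyResponse):
                return True
    return False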

victor-torres (Contributor, Author)

I've just refactored the source code. Could you please take another review?

scrapy_po/middleware.py: review comment (outdated, resolved)
victor-torres (Contributor, Author)

@kmike I'll wait for your feedback before implementing more test cases (for example, testing the middleware and parts pointed out by the coverage bot).

scrapy_po/utils.py: review comment (outdated, resolved)
# Provider not found.
continue

spec = inspect.getfullargspec(provider)
kmike (Member)

I haven't checked this in detail, but it can be better to use andi.inspect here, because this way you can work around some attr.s-specific issues, and potentially support Union

victor-torres (Contributor, Author) commented Apr 13, 2020

Output is quite different :/

(Pdb) inspect.getfullargspec(provider)
FullArgSpec(args=['self', 'response'], varargs=None, varkw=None, defaults=None, kwonlyargs=[], kwonlydefaults=None, annotations={'response': <class 'scrapy.http.response.Response'>})

(Pdb) andi.inspect(provider)
{'response': []}

kmike (Member)

Unlike andi.plan, you need to pass the __init__ method explicitly:

In [1]: import andi

In [2]: from scrapy.http  import Response

In [3]: class Foo:
   ...:     def __init__(self, resp: Response):
   ...:         pass
   ...:

In [4]: andi.inspect(Foo)
Out[4]: {'resp': []}

In [5]: andi.inspect(Foo.__init__)
Out[5]: {'resp': [scrapy.http.response.Response]}

kmike (Member)

I actually wonder if this logic (https://github.com/scrapinghub/andi/blob/e26bcfd9f946aead542cacac1fa4462af7d01443/andi/andi.py#L289) should be moved to andi.inspect, so that there are fewer surprises. What do you think, @ivanprado?

victor-torres (Contributor, Author)

@kmike I've just refactored this method to make use of andi.inspect instead of inspect.getfullargspec as you had requested.

scrapy_po/utils.py: review comment (outdated, resolved)
@@ -21,6 +22,10 @@ class AutoextractProductResponse:

@provides(AutoextractProductResponse)
class AutoextractProductProvider(PageObjectInputProvider):

    def __init__(self, response: DummyResponse):
kmike (Member)

What do you think about making it explicit that this thing needs a URL, not a DummyResponse? Not sure how best to do it.

victor-torres (Contributor, Author)

I'm not sure. In any case, a DummyResponse instance does have a valid URL and references the original request, so users/developers can consult meta and other attributes.
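
For reference, a minimal sketch of what such a DummyResponse could look like, assuming it subclasses Scrapy's Response and keeps a reference to the originating request (the actual implementation in this PR may differ):

from scrapy.http import Request, Response


class DummyResponse(Response):
    # A placeholder response that skips downloading: it keeps a valid URL and
    # a reference to the original request (so request.meta and other
    # attributes stay accessible), but carries no downloaded body.
    def __init__(self, url: str, request: Request = None):
        super().__init__(url=url, request=request)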

kmike (Member) commented Apr 13, 2020

@victor-torres the approach and the proposed code refactoring make sense; please go ahead with further changes!

Could you please add some integration-style tests as well, to check that the response is actually not being downloaded in cases where it shouldn't be?

victor-torres (Contributor, Author)

Hi @kmike

Thank you for your last review.

I've just updated the pull request with some comments answering some of your questions and I have also introduced new test cases and refactored some parts of the code.

Looking forward to your thoughts.

def is_provider_using_response(provider):
"""Check whether injectable provider makes use of a valid Response."""
for argument, possible_types in andi.inspect(provider.__init__).items():
for cls in possible_types:
ivanprado (Contributor)

I think we should fail with a useful message here in case cls is not expected (i.e. it is not any type of Response).

victor-torres (Contributor, Author)

Are we allowing other types of injectables in the future?

Perhaps it's out of the scope of this method to check this. What do you think? Should we raise a TypeError here?

ivanprado (Contributor)

I'm thinking of the case where somebody declares a provider with a wrong __init__. It would be more meaningful to fail here with a TypeError saying something like: "Type foo not supported as an argument for a provider's __init__; only Response is supported." or "Too many arguments for the provider's __init__; only zero or one are supported."

Otherwise, it will fail on invocation but the message will be much less informative. What do you think?
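
A rough sketch of what such an early validation could look like (the function name and the error messages are illustrative, not this PR's actual code):

import andi
from scrapy.http import Response


def validate_provider_signature(provider_cls):
    # Fail early with a clear message if a provider declares an unsupported
    # __init__ signature, instead of failing later at invocation time.
    arguments = andi.inspect(provider_cls.__init__)
    if len(arguments) > 1:
        raise TypeError(
            f"Too many arguments in {provider_cls.__name__}.__init__: "
            "only zero or one (a Response) are supported."
        )
    for name, possible_types in arguments.items():
        if not any(issubclass(cls, Response) for cls in possible_types):
            raise TypeError(
                f"Argument '{name}' of {provider_cls.__name__}.__init__ has an "
                "unsupported type: only Response (or a subclass) is supported."
            )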

victor-torres (Contributor, Author)

Well, it makes sense. But I still think it's out of the scope of this method.

Think of that Config injectable you've suggested in another comment. Would it make sense to check for subclasses of this Config here too? I don't think so.

But let me know if you have a different opinion. Also, your two cents here would be helpful, @kmike.

ivanprado (Contributor)

Ok, forget about it. I think I'm mixing things up here 😝 and this is very minor.

ivanprado (Contributor)

@victor-torres thanks for the PR, nice to meet you! I see that this PR introduces some kind of manual injection into the providers (setting the request or the dummy request). I wonder if we could extend the concept and make the providers also injectable. I thought a little bit about it but I didn't find the answer, so I'm leaving the question open here.

victor-torres (Contributor, Author)

Hey, @ivanprado. Thank you for your comments.

Would you mind explaining a little bit more about your thoughts? Maybe giving an example?

ivanprado (Contributor)

@victor-torres this is just an idea, so I don't want to mess up this PR with it. It's out of its scope.

But let's imagine we have some kind of Conf object that allows configuring some particular behaviour. You could declare a provider as usual:

@provides(AutoextractProductResponse)
class AutoextractProductProvider(PageObjectInputProvider):

    def __init__(self, response: DummyResponse):
        self.response = response
    ...

Or alternatively, if you need the configuration to adjust some behaviour, you could declare:

@provides(AutoextractProductResponse)
class AutoextractProductProvider(PageObjectInputProvider):

    def __init__(self, response: DummyResponse, conf: Conf):
        self.response = response
        self.conf = conf
    ...

You would also need a provider for Conf:

@provides(Conf)
class ConfProvider(PageObjectInputProvider):
    # It could read a file or access a database, depending on the environment.
    ...

This would require some injection mechanism using andi.plan similar to what is applied for the callback, but applied to the provider itself.
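
For illustration only, a rough sketch of how providers' own dependencies could be resolved recursively, using andi.inspect as earlier in this PR (all other names are hypothetical, and providers are assumed to be callables that return the provided instance):

import andi
from scrapy.http import Response


def build_provider(provider_cls, providers_by_type, response):
    # providers_by_type maps a provided type (e.g. Conf) to its provider class
    # (e.g. ConfProvider); `response` is the Response/DummyResponse for the
    # current request.
    kwargs = {}
    for name, possible_types in andi.inspect(provider_cls.__init__).items():
        for cls in possible_types:
            if issubclass(cls, Response):
                kwargs[name] = response
                break
            if cls in providers_by_type:
                # Recursively build the provider of the dependency, then call
                # it to obtain the dependency instance (no cycle detection in
                # this sketch).
                dependency_provider = build_provider(
                    providers_by_type[cls], providers_by_type, response)
                kwargs[name] = dependency_provider()
                break
    return provider_cls(**kwargs)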

A difficulty is that some providers might depend on other providers.

Anyway, I'm not sure we really need something like that. It is only an idea. Forget about it.

victor-torres (Contributor, Author) commented Apr 14, 2020

Oh, I see @ivanprado. That also crossed my mind when developing the build_providers method. Agreed that it might be out of the current scope, but it's a very interesting feature :D

kmike (Member) commented Apr 15, 2020

Thanks @victor-torres! The PR looks good to me as-is, let's merge it.
It seems we all have many ideas for improvements, which is good :)

@kmike kmike merged commit 7bb6ed1 into scrapinghub:master Apr 15, 2020
@victor-torres victor-torres deleted the skip-downloads branch April 15, 2020 18:46
ivanprado (Contributor)

I noticed that when DummyRequest is used, some functionality of Scrapy does not work:

  • CONCURRENT_REQUESTS
  • CONCURRENT_REQUESTS_PER_DOMAIN
  • CONCURRENT_REQUESTS_PER_IP
  • RANDOMIZE_DOWNLOAD_DELAY
  • DownloaderAwarePriorityQueue
  • AutoThrottle

The reason is that the request skips the downloader, so none of these limits can be applied to DummyRequests.

Whether this is a useful behaviour or not depends on the case.

ivanprado (Contributor)

I have analyzed my last comment more deeply and it seems that CONCURRENT_REQUESTS is actually enforced even in the presence of DummyRequest. What is not working is everything that needs to use the downloader slots. So it doesn't work for:

  • CONCURRENT_REQUESTS_PER_DOMAIN
  • CONCURRENT_REQUESTS_PER_IP
  • RANDOMIZE_DOWNLOAD_DELAY
  • DownloaderAwarePriorityQueue
  • AutoThrottle

victor-torres (Contributor, Author)

@ivanprado, I think this is the expected behavior.

DummyRequests are meant to skip downloads, so it makes sense not to check for concurrent requests, delays, or autothrottle settings, since we won't be making any download at all. If your parser needs a regular Request, it will be downloaded, and I believe all of those settings will be respected.

I suspect the problem here is when you are using third-party libraries to acquire content for a page object. Auto Extract is a good example.

Before the DummyRequests implementation, there was a delay between Auto Extract API requests caused by regular Requests going through Scrapy's scheduler. The only issue was that the crawler was downloading a Response that was going to be discarded. That's a waste of network resources.

I believe that if a third-party library needs to implement some kind of rate limit or delay between its requests, it must do so on its own side, or try to leverage Scrapy's mechanisms to do so, for example by making use of Requests to fetch its data.

Let me know if you have other considerations or if you would like to suggest another way to overcome this limitation.

ivanprado (Contributor)

@victor-torres I think your comment makes a lot of sense.

My intention was just to state the current behaviour: in the presence of DummyRequest, when a third-party library is used to fetch content, some Scrapy limits still apply (CONCURRENT_REQUESTS) but not others.

Whether that is a problem or not, and what the solution would be, is another question. I think your ideas are in the right direction anyway.

victor-torres (Contributor, Author)

Nice to see that we are on the same page, @ivanprado.

Just documenting here that, after talking today, we discussed exposing Scrapy settings to the Injectables. That might at least provide some information to third-party libraries so that they can enforce the spider's concurrency settings.
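
A minimal sketch of what that could look like, following the @provides pattern used above; SettingsProvider is a hypothetical name, the import path is assumed, and it is assumed the middleware (which has access to crawler.settings) would pass the settings to the provider the same way it passes the response today:

from scrapy.settings import Settings

from scrapy_po.page_input_providers import PageObjectInputProvider, provides  # path assumed


@provides(Settings)
class SettingsProvider(PageObjectInputProvider):

    def __init__(self, settings: Settings):
        self.settings = settings

    def __call__(self) -> Settings:
        # A page object (or a client library it drives) could then read, e.g.,
        # settings.getint("CONCURRENT_REQUESTS_PER_DOMAIN") and throttle its
        # own requests accordingly.
        return self.settings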

Cheers!
