-
Notifications
You must be signed in to change notification settings - Fork 20
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
introduce ZyteAPITextResponse and ZyteAPIResponse to store raw Zyte Data API Response #10
Conversation
Codecov Report
@@ Coverage Diff @@
## main #10 +/- ##
==========================================
+ Coverage 91.95% 92.70% +0.74%
==========================================
Files 2 3 +1
Lines 87 137 +50
==========================================
+ Hits 80 127 +47
- Misses 7 10 +3
|
fd33ed9
to
8812a05
Compare
remove 'Content-Encoding' header when returning responses
I was going to suggest changing the class names, because eventually we probably want to allow non-browser responses to be However, now I am thinking this naming is OK, and what needs to change to address #15 is only that we use I imagine there is a case for users wanting to know if the response content comes from browser or non-browser data, but I don’t think having different classes so that this can be checked with |
hey! I think that here we should focus on exposing raw API response, as JSON. Auto-filling httpResponseHeaders also looks out of scope, it's a separate discussion. @Gallaecio's questions:
This sounds good to me. I'm not sure we would even need to override this behavior in future.
That's correct. Though browserHtml responses are always text responses.
I'd keep it out of scope here. We can add more helpers later, and discuss it separately.
I think that's fine not to implement it now, and focus on returning the raw data.
Let's keep this out of scope. What do you think about implementing the following behavior?
|
Sounds good to me. I do think 3 could be eventually handled like 4, since content detection is possible without headers, only not ideal, but even then it can be handled separately later. |
Sounds good to me as well. I've got an idea on how to implement point 4 but testing it with a lot of difference cases is indeed quite tricky. I'll have an update tomorrow after the test cases are sorted out, to see if we're okay with approach. |
Yes, this one is already implemented. Albeit,
This is implemented but I've added some tests in 910085b. Thanks for the reminder!
Implemented this on e3214d8 |
README.rst
Outdated
<https://docs.scrapy.org/en/latest/topics/request-response.html#scrapy.http.Request.meta>`_ | ||
key to download a request using Zyte API. Full list of parameters is provided in the | ||
`Zyte API Specification <https://docs.zyte.com/zyte-api/openapi.html#zyte-openapi-spec>`_. | ||
To enable every request to be sent through Zyte API, you can set the following |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure the setting below makes every request to be sent through Zyte API. If I'm reading the code correctly, it first checks if zyte_api meta key is present and the value is true-ish, and uses ZYTE_API_DEFAULT_PARAMS only in this case. So, parameters are only applied for requests which are already marked explicitly as Zyte API requests.
That's actually the behavior I'm fine with :) It looks useful e.g. to set default geolocation for Zyte API requests made from a spider. On the other hand, making every request going through Zyte API by using this option looks quite problematic; it'd require more thought. For example,
- would this example mean that robots.txt files will be downloaded through Zyte API with browserHtml: True?
- would it mean that e.g. a POST request won't work properly, while looking like a regular request in the spider code, and without raising any kind of warning?
It seems some feature to "enable Zyte API transparently" makes sense after we put a bit more effort into "transparent" integration of scrapy Requests with Zyte API direct downloader, it doesn't look as straightforward as setting default parameters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure the setting below makes every request to be sent through Zyte API. If I'm reading the code correctly, it first checks if zyte_api meta key is present and the value is true-ish, and uses ZYTE_API_DEFAULT_PARAMS only in this case. So, parameters are only applied for requests which are already marked explicitly as Zyte API requests.
That's right. #13 attempted to not change the behavior fully. It only addressed to prevent the user from repeating the same Zyte API parameters in different parts of the code. In any case, hopefully it should serve as a base point to enable all requests go through Zyte API. :)
would this example mean that robots.txt files will be downloaded through Zyte API with browserHtml: True?
That's a good point. This also extends to other things like downloading an image, file, etc.
I think one option is to prevent browserHtml
from being set in the ZYTE_API_DEFAULT_PARAMS
. The same is true for javascript
, product
, articles
, etc. I think one way to think of it is having ZYTE_API_DEFAULT_PARAMS
support only a handful of parameters, like geolocation
or httpResponseBody
(though it needs to toggled off when browserHtml
is turned on), etc.
This could be a bit tricky since Zyte API offers a lot of features that doesn't only cater to being a simply proxy. I think for users to have a seamless experience when using Zyte API, we should have an interface that could cover general Scrapy use. This could mean only downloading the httpResponseBody
and perhaps httpResponseHeaders
by default which is available to all scrapy.Request
unless overridden by the zyte_api
param in meta.
We can open up another PR to explore the different options here.
would it mean that e.g. a POST request won't work properly, while looking like a regular request in the spider code, and without raising any kind of warning?
I think this could be alleviated somehow by the proposed interface in #20.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think @kmike’s main point was that the documentation needs an update, as it is currently quite misleading. The other points are about issues if we decided to make the implementation match the current documentation, which I don’t think we want, at least not at this point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, thanks for clarifying! What do you think about this doc update? 2adc8a6
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The documentation change looks great.
I do wonder if we should cause zyte_api: {}
to enable Zyte Data API with default parameters. I think that is what I would expect to happen as a user.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm a good point. There could be a few cases for this:
- The user is aware that
ZYTE_API_DEFAULT_PARAMS
is set in the settings and wish to use exactly the same values. Thus,zyte_api
would be represented as an empty{}
. ZYTE_API_DEFAULT_PARAMS
is not in the settings. The user might've thought that it was set but was actually not. This would result in the requests to be executed without using Zyte API.
We could use meta={"zyte_api": {}}
to represent it but it kinda feels unnatural as an empty-dict. Perhaps we could also support meta={"zyte_api": True}
as another alternative. The {}
is still a valid use-case though since the user might have a dynamic process to determine the contents of meta field. For example:
meta= {"zyte_api": {}} # default
if does_it_look_like_we_need_javascript():
meta["zyte_api"].update({"javascript": True})
if how_about_using_fr_region():
meta["zyte_api"].update({"geolocation": "FR"})
yield scrapy.Request(url, self.callback_func, meta=meta)
To summarize, we could use {}
and True
to enable Zyte API Requests bearing in mind the need for ZYTE_API_DEFAULT_PARAMS
be set.
Any thoughts on this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to having {}
and True
work and behave the same. At some point Zyte Data API may (should? 🙂) work without parameter other than the URL (e.g. consider browserHtml
or httpResponseBody
enabled if no other output is enabled), in which case zyte_api=True
may become a common pattern.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to be the only remaining fix to make, the PR looks good to me otherwise 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Refactored the code to accept True
and {}
in f5a9bb0.
Co-authored-by: Adrián Chaves <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work!
🎉 thanks @BurnzZ! 🚀 |
Fixes #9
The approach of storing the Zyte API response inside the
response.meta
was avoided since it will break the expectation in Scrapy thatresponse.meta
comes directly fromresponse.request.meta
, as what this code snippet embodies:In addition, this is what the official Scrapy docs says about
response.meta
:Thus, this PR introduces these new classes which inherits from Scrapy's
Response
andTextResponse
respectively:ZyteAPITextResponse
ZyteAPIResponse
This enables users to retrieve the raw Zyte Data API response as a
dict
viaresponse.zyte_api_response
.TODO: