-
Notifications
You must be signed in to change notification settings - Fork 15
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add SwitchPage #103
base: master
Are you sure you want to change the base?
Add SwitchPage #103
Conversation
Codecov Report
Additional details and impacted files@@ Coverage Diff @@
## master #103 +/- ##
=======================================
Coverage 99.56% 99.56%
=======================================
Files 25 25
Lines 921 927 +6
=======================================
+ Hits 917 923 +6
Misses 4 4
|
web_poet/pages.py
Outdated
@abc.abstractmethod | ||
async def switch(self) -> Injectable: | ||
"""Return the right :ref:`page object class <page-objects>` based on | ||
the received input.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a question about the expected behavior of the implementing framework (e.g. scrapy-poet) when the Switch PO has chosen the PO to parse the page.
Specifically, would the frameworks need to check if the chosen PO is declared in any of the rules' instead_of
parameter?
Approach 1
One use case I can think of is:
if layout_1(self.response):
return ProductLayout1Page
elif layout_2(self.response):
return ProductLayout2Page
else:
return AutomaticExtractionPage
In this example, the last resort is Automatic Extraction (e.g. by ML models). If the instead_of
parameters are supported, then it would be of great convenience to the user to simply rely on the other overrides stored in the registry to return a more apt PO.
Approach 2
Alternatively, users can simply do:
if layout_1(self.response):
return ProductLayout1Page
elif layout_2(self.response):
return ProductLayout2Page
else:
raise web_poet.DelegateFallback
(Reference: #26)
In this case, it'd be up to the implementing framework to decide if the instead_of
parameters are used, or perhaps some other means of declaring and using the fallback PO via some user-defined settings.
Approach 3
The other approach is not to use the instead_of
parameters at all since it should exactly follow what PO class the user has returned.
Other approaches
Could perhaps be a combination of the ideas above to give a finer grain of control to the user; or perhaps completely something else.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have no clear idea on what the best approach here is. I think your questions aligns a bit with one of my open question:
Do we need to have some override mechanism for candidate page object classes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we don't need to support overrides for candidate page object classes, at least in the first version.
web_poet/pages.py
Outdated
required input, and return the output of its | ||
:meth:`~web_poet.pages.ItemPage.to_item` method.""" | ||
page_object_class = await self.switch() | ||
# page_object = page_object_class(...) # TODO: pass the right inputs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is tricky since dependencies are dynamic.
For example:
└── ProductSelectPage
├── ProductLayout1Page → needs httpResponseBody
├── ProductLayout2Page → needs browserHtml
└── AutomaticExtractionPage → needs AutoExtractProductData
Providing all three of the dependencies should solve it but it could be expensive.
There should be a way for the Switch/Select/Choose PO to declare the dependency of the PO to use. This won't work in the current way of how dependencies are provided since the Switch/Select/Choose PO is already built as an instance and getting another dependency input is something we need to work on.
Alternatively, I'm thinking if we should glance at the perspective of solving this via overrides with some slight tweaks. For example:
ApplyRule(
for_patterns="example.com",
use=web_poet.DynamicPO,
instead_of=ProductSelectPage,
to_return=Product,
)
The web_poet.DynamicPO
is simply a sentinel class which serves as a stand-in while we still don't know the final PO to use.
If this is present, then the behavior of ApplyRule
changes a bit:
- instantiate whatever's in the
instead_of
parameter and resolve any dependencies that it needs (e.g.httpResponseBody
) - call the
.switch()
method (or other names we can think of for this) - this determines the class to be placed in the
use
parameter - use the
ApplyRule
as usual
The concept of overrides could fit in here since the Switch/Select/Choose PO is essentially overridden by the PO it has selected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The proposed solution indeed works best for a scenario where the inputs are the same for all layout-specific page objects, and where the required layout is likely to be random (i.e. A/B testing scenario, as opposed to special URLs where the same URL is likely to get the same layout when you refresh).
In this random scenario, requesting httpResponseBody to determine the layout, determining that the layout is 2, and then fetching browserHtml to parse it with layout 2, would not work, because by the time you get layout 2 the response may be for layout 1.
I think an approach in line with the current proposal may still make sense, at least for some scenarios like the A/B test one.
I wonder if it would make sense to implement 2 different solutions for the 2 different scenarios. Specially because to allow for incremental request of inputs based on their underlying cost, we will probably need a significantly more complicated approach.
Note to self: sublayout support. |
Applied the improvement proposal by @kmike. It is very clean, and saves me a lot of trouble. It also makes nesting (i.e. multi-layout page object classes that use other multi-layout page object classes) work by default, I believe. Considering the problems of input mixing and override handling out of scope, my only remaining thoughts are:
|
@@ -68,6 +68,25 @@ async def to_item(self) -> ItemT: | |||
) | |||
|
|||
|
|||
class MultiLayoutPage(ItemPage[ItemT]): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we should support fields for MultiLayoutPage, or not.
Currently all fields defined for MultiLayoutPage are silently ignored.
There is also a use case when you need a "partial" layout, layout of a region of a page. In this case, only some fields are extracted by the layout; others are common. It seems we have a few options:
- Use a base class with common fields. Create layouts which inherit from the base class and handle necessary regions. Use them in MultiLayoutPage.
- Allow to define fields on the MultiLayoutPage. When layout is picked, combine the fields from the layout with the fields defined in the class itself, to compute the final item. Supporting to_item for layouts can be more tricky in this case, as it's not clear which item they should use.
A separate, but related issue is if it's possible to use 2 or more regions, with different layouts, in the same page object.
If fields are supported, it seems it makes sense to move the logic to ItemPage. I think it may simplify typing, and inheritance as well. E.g. layouts can be used with ProductPage from zytedata/zyte-common-items#19 without using multiple inheritance.
It seems that if fields are not supported, it's better to keep MultiLayoutPage as a separate class, and probably raise an error or issue a warning if fields are defined. There is one argument for keeping it separate and not supporting fields: it'd allow to define fields named layout
. If we provide a standard method named layout
, and allow to define fields, it's not possible to have a field of the same name.
Sorry for a braindump :) What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A disadvantage of using MultiLayoutPage without fields: it's not possible to use fields in the code which uses the page object. So let's say we have ProductPage, it uses fields for data extraction. Then, it's refactored to use MultiLayoutPage as a base class. It means that the fields are no longer supported, and so the code which uses this page may break.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am slightly against field support in MultiLayoutPage.
In addition to those 2 options you mention, I can think of a 3rd that is a variation on 1 based precisely on how you amended my API proposal for MultiLayoutPage:
- Use composition rather than inheritance: define a page object class that can extract the common fields, make it a dependency of the layouts that extract additional fields, and use it in the to_item of those layouts (which would be the layouts which MultiLayoutPage handles).
We would still have the same problem as with the lack of fields in MultiLayoutPage, i.e. you could not access the fields of the dependency layout through the layout that uses it (other than accessing the dependency directly, e.g. layout.dependency.field
). But I wonder if we could implement a getattr fallback mechanism to solve this issue for both scenarios: allow MultiLayoutPage to expose the fields of the object that its layout() method returns, and allow layouts with other layouts as dependencies to expose the fields of their dependencies.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm also not sure about support @fields
in MultiLayoutPage
.
It would seem it'd be best to keep it's task simple wherein it simply identifies and returns the PO instance based on the layout.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is one argument for keeping it separate and not supporting fields: it'd allow to define fields named
layout
. If we provide a standard method namedlayout
, and allow to define fields, it's not possible to have a field of the same name.
This is a valid point, but I think we could switch to get_layout
to avoid this issue, and it would be a better method name anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Gallaecio can you please explain the last idea in more details, e.g. with some code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this looks nice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We were discussing it today on a call with @proway2. They implemented a multi-layout page object, tried a few approaches. In short - having fields on the final page object is a must :) That's the reason the documented approach here won't work well for them. Taking union of all dependencies is fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently they define all the fields, and call to the self._layout in each field. It's a lot of boilerplate; exactly something a library should be solving.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems having fields availabe is required, but not necessarily being able to define top-level fields.
Co-authored-by: Kevin Lloyd Bernal <[email protected]>
…common, partial layout
assert item2.url == url | ||
assert item2.text == "b" | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's also add a test for use cases when a MultiLayoutPage
subclass also uses multiple MultiLayoutPage
s underneath. I think this could be a use case when users import POs from different packages and repacking them.
class MyMultiLayoutPage(MultiLayoutPage[SomeItem]):
response: HttpResponse
layout_page_us: LayoutPageUS
layout_page_uk: LayoutPageUK
async def get_layout(self) -> ItemPage[SomeItem]:
if self.response.css(".origin::text") == "us":
return self.layout_page_us.get_layout()
return self.layout_page_uk.get_layout()
Might also be worth creating a doc about this as well.
page1: FullPage1 | ||
page2: FullPage2 | ||
|
||
async def get_layout(self) -> ItemPage[FullItem]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
async def get_layout(self) -> ItemPage[FullItem]: | |
async def get_layout(self) -> ItemPage: |
Removes the need for the type: ignore[return-value]
below.
This should be okay since we're expecting ItemPage
(and its subclasses) to be returned. For some reason, mypy fixates on the FullItem
return type.
""" | ||
|
||
@abc.abstractmethod | ||
async def get_layout(self) -> ItemPage[ItemT]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if this sounds right but how about to_layout()
? In that way, we establish some sort of consistency with the PO's API naming, i.e. to_item()
.
Changes:
SwitchPage
class, that can be subclassed to create special page object classes that call other page object classes based on the received input.It took me a while to write it in a way that did not presume too much prior knowledge from the reader, and I am still unhappy with the result, so I would love some feedback here.
Arguments for the proposed API:
switch
method in the current proposal), as opposed to defining it on each page object class. Otherwise, handling priorities and corner cases could become cumbersome.switch
if no right page object class is found, or from theto_item
method of the selected page object class if, despite that page object class getting selected fromswitch
, the input it gets does not match some other expectation.To do:
Do we need to have some override mechanism for candidate page object classes? i.e. Does there need to be a way through rules to replace a candidate page object class instead of replacing the whole switch page object class?(out of scope)Implement input passing to the selected candidate page object class (andi? inspect and simple type matching?).(layout page object classes are now declared as inputs, letting web-poet frameworks handle this automatically through dependency injection)Provide a nice exception if there is a missing input?(not relevant given the resolution of the bullet point above)