introduce PageObjectRegistry with @handle_urls annotations #27
Conversation
Codecov Report
@@ Coverage Diff @@
## master #27 +/- ##
=========================================
Coverage 100.00% 100.00%
=========================================
Files 5 6 +1
Lines 127 209 +82
=========================================
+ Hits 127 209 +82
…nstructor: from_override_rules()
Adjust the code in line with the refactoring of ResponseData into HttpResponse from this PR: #30
docs/intro/overrides.rst
Outdated
----------------------------------

The :meth:`~.PageObjectRegistry.get_overrides` method from the ``web_poet.default_registry``
allows discovery and retrieval of all :class:`~.OverrideRule` from your project.
I think get_overrides is not doing discovery, and also it doesn't limit rules to a certain project:
- allows discovery and retrieval of all :class:`~.OverrideRule` from your project.
+ allows retrieval of all :class:`~.OverrideRule` in the registry.
Good point. Updated this on 37e5ba1
docs/intro/overrides.rst
Outdated
:meth:`~.PageObjectRegistry.get_overrides` relies on the fact that all essential
packages/modules which contains the :func:`web_poet.handle_urls`
annotations are properly loaded.
- annotations are properly loaded.
+ annotations are properly loaded (i.e. imported).
Overall, I think it may be a bit clearer if, in some places further in the docs, "loaded" is replaced with "imported". consume_modules is not doing anything fancy, it just imports modules.
Updated this on 37e5ba1
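For illustration, "just imports modules" can be sketched with the standard library alone. This is a minimal sketch, not web-poet's actual implementation: the helper name import_recursively is hypothetical, and the stdlib package html is used only because it is always available.

```python
import importlib
import pkgutil


def import_recursively(module_name):
    """Import a module and, if it is a package, every submodule beneath it.

    Hypothetical stand-in for what consume_modules() is described as doing.
    Returns the dotted names that were imported.
    """
    module = importlib.import_module(module_name)
    imported = [module.__name__]

    # Only packages have __path__; a plain module has nothing to walk.
    path = getattr(module, "__path__", None)
    if path is not None:
        prefix = module.__name__ + "."
        for info in pkgutil.walk_packages(path, prefix=prefix):
            importlib.import_module(info.name)
            imported.append(info.name)
    return imported


# Importing is the whole point: any decorators in those modules run as a
# side effect of the import.
names = import_recursively("html")
```

Because the only effect is the import itself, anything a module registers at import time becomes visible afterwards, regardless of who triggered the import.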
docs/intro/overrides.rst
Outdated
rules = default_registry.get_overrides()

# Fortunately, `get_overrides()` provides a shortcut for the lines above:
rules = default_registry.get_overrides(consume=["external_package_A.po", "another_ext_package.lib"])
Because of the caveats, I'm not sure it makes sense to provide a consume keyword for get_overrides: the code above reads as if it would return only POs from the listed modules. With
consume_modules("external_package_A.po", "another_ext_package.lib")
rules = default_registry.get_overrides()
the caveat is more clear IMHO, and it's the same amount of code. Do you think we can remove this parameter?
Yeah, we can make it more explicit by having to access consume_modules() externally.
Removed this on 37e5ba1
There are two main ways we recommend in solving this.

**1. Priority Resolution**
I think the main use case here could be the following:
- There is a library of page objects
- In your project you want to have a different version for one of POs available in the library
In this case there is no need to modify any override rules, it's a matter of using handle_urls decorator in your own project, and setting a priority higher than the default.
@kmike to clarify, are you suggesting something like the following?
If we have the conflict like:
rules = default_registry.get_overrides()
print(rules)
# OverrideRule(for_patterns=Patterns(include=['site_2.com'], exclude=[], priority=500), use=<class 'ecommerce_page_objects.site_2.EcomSite2'>, instead_of=<class 'common_items.ProductGenericPage'>, meta={})
# OverrideRule(for_patterns=Patterns(include=['site_2.com'], exclude=[], priority=500), use=<class 'gadget_sites_page_objects.site_2.GadgetSite2'>, instead_of=<class 'common_items.ProductGenericPage'>, meta={})
It can be fixed in the project as something like:
from web_poet import handle_urls
handle_urls("site_2.com", overrides=ProductGenericPage, priority=600)(EcomSite2)
At the moment, this won't work, since the PageObjectRegistry stores the PO class as the key: if a rule already exists for a given PO, any new rule registered for it is ignored. Reference.
Do you think it's worth refactoring this from Dict[<PO class>, OverrideRule] into Dict[<PO class>, List[OverrideRule]] instead?
Though this would affect the other implementations like search_overrides()
and how it's being read in scrapy-poet as well. Nonetheless, we're still quite early in the process and we can still accommodate this change.
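To make the two storage shapes concrete, here is a toy comparison. It is a sketch under stated assumptions: Rule, SingleRuleRegistry, and MultiRuleRegistry are hypothetical names invented for illustration, not web-poet classes, and the "first rule wins" behavior follows the description in this comment rather than the library source.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Rule:
    """Hypothetical, simplified stand-in for OverrideRule."""
    pattern: str
    use: type
    instead_of: type
    priority: int = 500


class SingleRuleRegistry:
    """Dict[<PO class>, Rule]: a second rule for the same PO is ignored."""

    def __init__(self):
        self.data: Dict[type, Rule] = {}

    def register(self, rule):
        # setdefault keeps the existing rule if the PO class is already a key.
        self.data.setdefault(rule.use, rule)


class MultiRuleRegistry:
    """Dict[<PO class>, List[Rule]]: every rule for a PO is kept."""

    def __init__(self):
        self.data: Dict[type, List[Rule]] = defaultdict(list)

    def register(self, rule):
        self.data[rule.use].append(rule)


class ProductGenericPage:
    pass


class EcomSite2:
    pass


library_rule = Rule("site_2.com", EcomSite2, ProductGenericPage, priority=500)
project_rule = Rule("site_2.com", EcomSite2, ProductGenericPage, priority=600)

single = SingleRuleRegistry()
multi = MultiRuleRegistry()
for rule in (library_rule, project_rule):
    single.register(rule)
    multi.register(rule)
```

With the single-rule shape, the priority=600 rule never makes it into the registry; with the list shape both rules survive and a later step can choose between them.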
Let's say we have ShopProductPage defined in a common library, along with many other POs:
# shop_crawling_lib
@handle_urls("shop.com", overrides=GenericProductPage)
class ShopProductPage(ItemWebPage):
    # ...

@handle_urls("shop2.com", overrides=GenericProductPage)
class Shop2ProductPage(ItemWebPage):
    # ...

@handle_urls("shop3.com", overrides=GenericProductPage)
class Shop3ProductPage(ItemWebPage):
    # ...
A developer installs and uses this library:
consume_modules("shop_crawling_lib")
Most POs are fine, but the PO for shop.com doesn't fit the bill. So the developer creates a new PO for this:
@handle_urls("shop.com", overrides=GenericProductPage, priority=600)
class ShopProductPage(ItemWebPage):
    # ...
Then we need to make sure all the annotations are discovered:
consume_modules("shop_crawling_lib")
consume_modules("my_project") # order doesn't matter
rules = default_registry.get_overrides()
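The effect of the higher priority can be sketched as follows. This is a simplified model, not how web-poet or scrapy-poet resolves rules: Rule and pick_override are hypothetical names, matching is reduced to a hostname comparison, and PO classes are represented by strings to keep the example self-contained.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class Rule:
    """Hypothetical, simplified stand-in for OverrideRule."""
    domain: str
    use: str          # dotted name of the overriding PO (string for brevity)
    instead_of: str   # name of the PO being overridden
    priority: int = 500


def pick_override(rules, url, instead_of):
    """Return the highest-priority rule matching url, or None.

    Matching here is a plain hostname check; real URL patterns are richer.
    """
    host = urlparse(url).hostname
    candidates = [
        r for r in rules
        if r.instead_of == instead_of and r.domain == host
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda r: r.priority)


rules = [
    # From the shared library, default priority.
    Rule("shop.com", "shop_crawling_lib.ShopProductPage", "GenericProductPage"),
    Rule("shop2.com", "shop_crawling_lib.Shop2ProductPage", "GenericProductPage"),
    # From the developer's own project, higher priority.
    Rule("shop.com", "my_project.ShopProductPage", "GenericProductPage", priority=600),
]

chosen = pick_override(rules, "https://shop.com/item/1", "GenericProductPage")
untouched = pick_override(rules, "https://shop2.com/item/1", "GenericProductPage")
```

Since selection depends only on priority, the order in which the library and the project are imported does not change the outcome.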
docs/intro/overrides.rst
Outdated
were recently added. This could lead to a `silent-error` of receiving a different
set of rules than expected.

For this approach, you can use the :meth:`~.PageObjectRegistry.search_overrides`
Hm, I was thinking about importing POs directly. How would you do inclusion in search_overrides?
Added another in-depth example for search_overrides() in 7457331.
- ``my_page_obj_project`` `(since it's the same module as the file above)`
- ``other_external_pkg.po``
- ``another_pkg.lib``
Plus any other modules imported in the same process. I think the example only works if the code above is a whole script, not a module within a larger package.
Good one. Updated this on 37e5ba1
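The caveat is easier to see with a toy, process-wide registry. This is a sketch only: the handle_urls and get_overrides below are simplified stand-ins written for this example, and the module boundaries are imagined in comments rather than real.

```python
_REGISTRY = []  # module-level state, shared by everything in the process


def handle_urls(pattern, overrides, priority=500):
    """Toy decorator: records a rule as a side effect of class definition."""
    def wrapper(cls):
        _REGISTRY.append((pattern, cls, overrides, priority))
        return cls
    return wrapper


def get_overrides():
    """Return every rule registered so far, whoever registered it."""
    return list(_REGISTRY)


class GenericProductPage:
    pass


# Imagine this class lives in my_page_obj_project.
@handle_urls("example.com", overrides=GenericProductPage)
class ExamplePage:
    pass


before = len(get_overrides())


# Imagine this class lives in an unrelated module that some other import
# in the same process happened to pull in.
@handle_urls("unrelated.example", overrides=GenericProductPage)
class UnrelatedPage:
    pass


after = len(get_overrides())
```

The registry cannot tell which import caused a rule to be registered, so a list of "modules whose rules are returned" only holds when the snippet is the whole script.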
web_poet/overrides.py
Outdated
If the ``default_registry`` had other ``@handle_urls`` annotations outside
of the packages/modules listed above, then the corresponding
:class:`~.OverrideRule` won't be returned.
Again, this is only true if they're not imported in some other way.
I think it may be fine to just explain that consume_modules imports modules recursively.
Updated this in 37e5ba1
@BurnzZ the PR looks good to me. There are some ways to improve the docs, but it can be done later. +1 to merge after removing "consume" option from get_overrides method.
…ow rules are imported
Co-authored-by: Mikhail Korobov <[email protected]>
Thanks for the review @kmike ! I've addressed all of your comments alongside the specific commit for the change. Let me know if there's anything left to update here. 🙏
web_poet/utils.py
Outdated
""" | ||
    if value is None:
        return []
    if not isinstance(value, list):
It'd be nice to add a test for tuple, e.g. what should as_list(("foo", "bar", 123)) return? Should it be a single-item list, or a tuple converted to a list?
Good point! After looking into this further, I've added more input types that as_list() should also be able to handle and convert. Updated in 67e7297
Thanks @BurnzZ!
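One possible answer to the tuple question, as a sketch. The thread does not show which behavior 67e7297 settled on, so this is an assumption: tuples and sets are converted element by element, while strings and other scalars become single-item lists.

```python
def as_list(value):
    """Normalize value into a list (assumed behavior, not the merged code)."""
    if value is None:
        return []
    if isinstance(value, list):
        return value
    # Assumption: container types are unpacked. Strings are iterable too,
    # but they are deliberately left to the single-item branch below.
    if isinstance(value, (tuple, set, frozenset)):
        return list(value)
    return [value]


converted = as_list(("foo", "bar", 123))
```

Under these assumptions the tuple from the review question becomes a three-item list, and "foo" stays a one-item list rather than being split into characters.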
This is built as a light-weight version of the new features introduced by this PR: https://github.com/scrapinghub/web-poet/pull/16/files
In contrast, this only contains a minimal set of features that allows developers to:
- use default_registry's @handle_urls annotation for declaring OverrideRules.
- use search_overrides() to find specific OverrideRules amassed from importing multiple Page Object Projects (POP).
This means that other non-essential features from the old PR aren't included:
- registry_pool, which allows keeping track of other non-default PageObjectRegistry instances' OverrideRules
- copy_overrides_from(), which allows copying of OverrideRules from multiple non-default PageObjectRegistry instances
- remove_overrides(), since an exclusion-list strategy may be brittle if the underlying source changes. This results in some unintentional OverrideRules being present in the output if the developer isn't aware of them.
- replace_overrides(), since a more explicit alternative would be creating a new OverrideRule which has the intended attribute replacements.
- the filter param of get_overrides(), since it could break the developer's expectation if new OverrideRules are added to the filter module, which could lead to unintentional OverrideRules being present.
Progress:
To address in other PRs:
- OverrideRule.instead_of to support data types as well