-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for custom attr results to DropLowProbabilityItemPipeline. #125
base: main
Are you sure you want to change the base?
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #125 +/- ##
=======================================
Coverage ? 97.76%
=======================================
Files ? 63
Lines ? 2375
Branches ? 0
=======================================
Hits ? 2322
Misses ? 53
Partials ? 0
|
zyte_common_items/pipelines.py
Outdated
if item_type == "customAttributes": | ||
new_item[item_type] = sub_item | ||
continue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you need special handling for custom attributes?
They don't have probability field, so they shouldn't be dropped?
Is it related to the fact the pipeline only supports zyte-common-items items, and breaks on arbitrary items? If so, it may be good to fix it as well, as it should (hopefully) simplify custom attributes implementation - they won't be a special case anymore (hopefully :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two things: to simplify the flow somewhat (but I agree that with the current flow, which can process several sub-items, the flow is not simplified by this) and to not emit drop_low_probability_item/{processed,kept}/customAttributes
stats which I thought are pointless (but harmless as there are just 2 of them). Should I remove the special handling?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was probably thinking about a slightly different thing:
- Pipeline should handle items with probability field - e.g. with zyte-common-items's item.get_probability(), and (maybe) by taking item["probability"] through itemadapter (but this one could be out of scope).
- For items without probability field, do nothing - don't drop them, don't change the stats.
- There should be some special handling for nested items, where the exact items are inside a dictionary, probably with arbitrary keys.
I'm not sure how best to distinguish between (2) and (3) scenarios though; maybe it can be an explicit configuration, but maybe not.
With this approach customAttributes is just an item without probability field, so it doesn't need any special handling, and there shouldn't be stats logged for it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can omit stats for items where probability is None, or should we make some more changes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally, we should replace item_proba = item.get_probability()
with something smarter, because it currently breaks on all non-zci items. It's possible to do it in another PR, but it seems pretty important to fix this, and it looked somewhat relevant here :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we just check if the item inherits ProbabilityMixin
(or even zyte_common_items.base.Item
?)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would work.
Sometimes users may be converting zci items to dicts before they reach the middleware, and it'd be nice to support probability in non-zci items in some way in general, but it looks like a larger change; just making sure we don't break completely is way better than what's currently in main.
No description provided.