Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for custom attr results to DropLowProbabilityItemPipeline. #125

Open
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

wRAR
Copy link
Member

@wRAR wRAR commented Dec 20, 2024

No description provided.

@wRAR wRAR marked this pull request as draft December 20, 2024 14:57
Copy link

codecov bot commented Dec 24, 2024

Codecov Report

Attention: Patch coverage is 94.11765% with 1 line in your changes missing coverage. Please review.

Please upload report for BASE (main@f87d98b). Learn more about missing BASE report.

Files with missing lines Patch % Lines
zyte_common_items/pipelines.py 94.11% 1 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main     #125   +/-   ##
=======================================
  Coverage        ?   97.76%           
=======================================
  Files           ?       63           
  Lines           ?     2375           
  Branches        ?        0           
=======================================
  Hits            ?     2322           
  Misses          ?       53           
  Partials        ?        0           
Files with missing lines Coverage Δ
zyte_common_items/pipelines.py 96.42% <94.11%> (ø)

Comment on lines 139 to 141
if item_type == "customAttributes":
new_item[item_type] = sub_item
continue
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you need special handling for custom attributes?
They don't have probability field, so they shouldn't be dropped?
Is it related to the fact the pipeline only supports zyte-common-items items, and breaks on arbitrary items? If so, it may be good to fix it as well, as it should (hopefully) simplify custom attributes implementation - they won't be a special case anymore (hopefully :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two things: to simplify the flow somewhat (but I agree that with the current flow, which can process several sub-items, the flow is not simplified by this) and to not emit drop_low_probability_item/{processed,kept}/customAttributes stats which I thought are pointless (but harmless as there are just 2 of them). Should I remove the special handling?

Copy link
Contributor

@kmike kmike Dec 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was probably thinking about a slightly different thing:

  1. Pipeline should handle items with probability field - e.g. with zyte-common-items's item.get_probability(), and (maybe) by taking item["probability"] through itemadapter (but this one could be out of scope).
  2. For items without probability field, do nothing - don't drop them, don't change the stats.
  3. There should be some special handling for nested items, where the exact items are inside a dictionary, probably with arbitrary keys.

I'm not sure how best to distinguish between (2) and (3) scenarios though; maybe it can be an explicit configuration, but maybe not.

With this approach customAttributes is just an item without probability field, so it doesn't need any special handling, and there shouldn't be stats logged for it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can omit stats for items where probability is None, or should we make some more changes?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally, we should replace item_proba = item.get_probability() with something smarter, because it currently breaks on all non-zci items. It's possible to do it in another PR, but it seems pretty important to fix this, and it looked somewhat relevant here :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we just check if the item inherits ProbabilityMixin (or even zyte_common_items.base.Item?)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would work.
Sometimes users may be converting zci items to dicts before they reach the middleware, and it'd be nice to support probability in non-zci items in some way in general, but it looks like a larger change; just making sure we don't break completely is way better than what's currently in main.

@wRAR wRAR marked this pull request as ready for review December 25, 2024 11:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants