Add support for custom attr results to DropLowProbabilityItemPipeline. #125

wRAR · 2024-12-20T14:09:42Z

No description provided.

codecov · 2024-12-24T10:25:51Z

Codecov Report

Attention: Patch coverage is 94.11765% with 1 line in your changes missing coverage. Please review.

Please upload report for BASE (main@f87d98b). Learn more about missing BASE report.

Files with missing lines	Patch %	Lines
zyte_common_items/pipelines.py	94.11%	1 Missing ⚠️

Additional details and impacted files

@@           Coverage Diff           @@
##             main     #125   +/-   ##
=======================================
  Coverage        ?   97.76%           
=======================================
  Files           ?       63           
  Lines           ?     2375           
  Branches        ?        0           
=======================================
  Hits            ?     2322           
  Misses          ?       53           
  Partials        ?        0

Files with missing lines	Coverage Δ
zyte_common_items/pipelines.py	`96.42% <94.11%> (ø)`

kmike · 2024-12-24T18:29:21Z

zyte_common_items/pipelines.py

+                if item_type == "customAttributes":
+                    new_item[item_type] = sub_item
+                    continue


Why do you need special handling for custom attributes?
They don't have a probability field, so they shouldn't be dropped?
Is it related to the fact that the pipeline only supports zyte-common-items items and breaks on arbitrary items? If so, it may be good to fix that as well, as it should (hopefully) simplify the custom attributes implementation: they won't be a special case anymore (hopefully :)

Two things: to simplify the flow somewhat (though I agree that the current flow, which can process several sub-items, is not made simpler by this), and to avoid emitting the drop_low_probability_item/{processed,kept}/customAttributes stats, which I thought were pointless (but harmless, as there are just 2 of them). Should I remove the special handling?
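For reference, the stat names mentioned above follow a prefix/outcome/item-type pattern. A minimal sketch of how such keys could be built and counted, assuming a plain `Counter` in place of the Scrapy stats collector; `stat_key` and `record` are hypothetical helpers, not functions from zyte_common_items:

```python
from collections import Counter

# Prefix taken from the stat names quoted in this thread.
STAT_PREFIX = "drop_low_probability_item"


def stat_key(outcome: str, item_type: str) -> str:
    """Build a key such as drop_low_probability_item/kept/customAttributes."""
    return f"{STAT_PREFIX}/{outcome}/{item_type}"


def record(stats: Counter, outcome: str, item_type: str) -> None:
    """Increment the counter for one outcome of one item type."""
    stats[stat_key(outcome, item_type)] += 1
```

With the special handling in place, neither the `processed` nor the `kept` key would ever be recorded for `customAttributes`.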

I was probably thinking about a slightly different thing:

1. The pipeline should handle items with a probability field, e.g. with zyte-common-items's item.get_probability(), and (maybe) by taking item["probability"] through itemadapter (but this one could be out of scope).

2. For items without a probability field, do nothing: don't drop them, don't change the stats.

3. There should be some special handling for nested items, where the exact items are inside a dictionary, probably with arbitrary keys.

I'm not sure how best to distinguish between scenarios (2) and (3), though; maybe it can be an explicit configuration, but maybe not.

With this approach, customAttributes is just an item without a probability field, so it doesn't need any special handling, and there shouldn't be stats logged for it.
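The three scenarios above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the pipeline's actual code: `FakeItem`, `filter_item`, the `nested` flag (standing in for the "explicit configuration" idea), the simplified stat names, and the 0.1 threshold are all hypothetical, and a `Counter` replaces the Scrapy stats collector.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class FakeItem:
    """Hypothetical stand-in for an item that exposes get_probability()."""
    probability: Optional[float] = None

    def get_probability(self) -> Optional[float]:
        return self.probability


def get_probability(item: Any) -> Optional[float]:
    """Return the item's probability, or None when it has no such field."""
    getter = getattr(item, "get_probability", None)
    if callable(getter):
        return getter()
    if isinstance(item, dict):
        return item.get("probability")
    return None


def filter_item(item: Any, stats: Counter, threshold: float = 0.1,
                nested: bool = False) -> Any:
    """Return the item (possibly filtered), or None if it should be dropped."""
    if nested:
        # Scenario 3: the real items live inside a dict with arbitrary keys.
        kept = {}
        for key, sub_item in item.items():
            result = filter_item(sub_item, stats, threshold)
            if result is not None:
                kept[key] = result
        return kept or None
    probability = get_probability(item)
    if probability is None:
        # Scenario 2: no probability, so no drop and no stats.
        return item
    # Scenario 1: the item carries a probability.
    stats["processed"] += 1
    if probability < threshold:
        stats["dropped"] += 1
        return None
    stats["kept"] += 1
    return item
```

In this sketch a `customAttributes` sub-item is simply a dict without a probability, so it passes through untouched and leaves the counters alone. Whether a nested result should survive when only such sub-items remain is a choice the sketch makes, not something settled in this thread.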

We can omit stats for items where probability is None, or should we make some more changes?

Ideally, we should replace item_proba = item.get_probability() with something smarter, because it currently breaks on all non-zci items. It's possible to do it in another PR, but it seems pretty important to fix this, and it looked somewhat relevant here :)

Should we just check whether the item inherits from ProbabilityMixin (or even zyte_common_items.base.Item)?

This would work.
Sometimes users may be converting zci items to dicts before they reach the middleware, and it'd be nice to support probability in non-zci items in some way in general, but it looks like a larger change; just making sure we don't break completely is way better than what's currently in main.
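A minimal sketch of the inheritance check discussed here. The `ProbabilityMixin` below is a local stand-in so the example runs on its own (the real class lives in zyte_common_items and may differ), and `safe_probability` is a hypothetical helper name:

```python
from typing import Any, Optional


class ProbabilityMixin:
    """Local stand-in for the mixin named in this thread; not the real class."""

    def __init__(self, probability: Optional[float] = None) -> None:
        self._probability = probability

    def get_probability(self) -> Optional[float]:
        return self._probability


def safe_probability(item: Any) -> Optional[float]:
    """Return the probability for mixin-based items, None for anything else."""
    if isinstance(item, ProbabilityMixin):
        return item.get_probability()
    # Dicts and other arbitrary items: no error, just "no probability".
    return None
```

With a guard like this, a plain dict is treated as having no probability even if it has a "probability" key, which matches the point above: it avoids breaking on arbitrary items but does not yet support probability in non-zci items.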

Add support for custom attr results to DropLowProbabilityItemPipeline.

70f6204

Gallaecio approved these changes Dec 20, 2024

wRAR marked this pull request as draft December 20, 2024 14:57

Change the DropLowProbabilityItemPipeline logic for nested items.

719b841

wRAR added 3 commits December 24, 2024 17:28

Fix handling of customAttributes sub-items.

1327741

Update the DropLowProbabilityItemPipeline test.

447c503

Small cleanup.

205221e

kmike reviewed Dec 24, 2024

wRAR marked this pull request as ready for review December 25, 2024 11:05

wRAR added 3 commits December 25, 2024 16:06

Remove special handling of customAttributes in DropLowProbabilityItemPipeline.

5b96eb9

Improve handling of items without probability.

2ebcf5d

Fixes.

1373297
