Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid incorrect currency when no price available #20

Open
wants to merge 9 commits into
base: master
Choose a base branch
from

Conversation

rennerocha
Copy link

This should solve problem described in #12 .

Regex to match currencies is improved, so only currencies strings that are surrounded by some specific characters are matched as a currency, avoiding problems with currencies that can also be substrings of unrelated strings.

This fix also solved two examples that are marked as xfail.

An additional list was added for custom regexes that are not well handled by the other rules (done that to handle the HT problem in one test case)

@codecov
Copy link

codecov bot commented Oct 2, 2019

Codecov Report

Merging #20 into master will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff            @@
##            master       #20   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files            3         3           
  Lines           96       103    +7     
  Branches        21        22    +1     
=========================================
+ Hits            96       103    +7     
Impacted Files Coverage Δ
price_parser/parser.py 100.00% <100.00%> (ø)

Copy link
Contributor

@manycoding manycoding left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's use non-greedy quantifiers wherever possible https://docs.python.org/3/howto/regex.html#greedy-versus-non-greedy

Everything else looks good to me.

@@ -142,7 +162,9 @@ def extract_currency_symbol(price: Optional[str],
for meth, attr in methods:
m = meth(attr) if attr else None
if m:
return m.group(0)
groups = [match for match in m.groups() if match is not None]
if groups:
Copy link
Member

@Gallaecio Gallaecio Oct 15, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to the coverage data, this always evaluates to True with the current tests. It would be great to add a test for the case where it evaluates to False or, if that cannot happen, replace the if by an assert.

@rennerocha
Copy link
Author

@Gallaecio and @manycoding could you please review it again?

""" Return a regex which matches any of ``symbols`` """
return re.compile('|'.join(re.escape(s) for s in symbols))
""" Return a regex which matches any of ``symbols`` surrounded by some special characters """
left_tokens = [r"^", r"\s+?", r"\d+?"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure about the non-greedy here, what's the result of this test case?

    Example(None, '  Prices   in   EUR  ',
            'EUR', None, None),
    Example(None, '13800   ₶  ',
            '₶', '13800', 13800)
            '₶', '13800', 13800),

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That should not be a problem, since their match is not captured. But I guess it’s worth adding a test for it.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both tests are passing. I included them in the test suite just in case...

""" Return a regex which matches any of ``symbols`` """
return re.compile('|'.join(re.escape(s) for s in symbols))
""" Return a regex which matches any of ``symbols`` surrounded by some special characters """
left_tokens = [r"^", r"\s+?", r"\d+?"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That should not be a problem, since their match is not captured. But I guess it’s worth adding a test for it.

Comment on lines +57 to +72
left_tokens = [r"^", r"\s+?", r"\d+?"]
right_tokens = [r"$", r"\s+?", r"\d+?", r"[^a-zA-Z0-9]+?"]

return re.compile(
"|".join(
[
left + "({})".format(re.escape(s)) + right
for left, right in itertools.product(left_tokens, right_tokens)
for s in symbols
]
)
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like this can be simplified a bit:

symbols_pattern = '|'.join(re.escape(s) for s in symbols)
template = r'(?:^|\b|(?<=\d)){}(?:(?=\d)|\b|$)'
return re.compile(template.format(symbols_pattern))

If I’m not wrong, this should make some other changes below unnecessary, since we are no longer using a capture group.

And since this function is now doing more than simply an OR regex, but also imposing word boundaries specific to currency symbols (it allows digits as word boundaries), maybe we need to deprecate the existing function and provide a new, _private one for the new logic.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested and it didn't work in all examples.
I don't consider simple any code that auto generate regexes. Your code may work but it will require more effort on it to solve all test cases and we could try to improve it, but for now, I believe that solving the existing issues is more interesting for the project.

OTHER_PARTICULAR_REGEXES = [
# HT is the French abbreviation for "Hors Tax" (tax not added to the price)
# and it may appear after € currency symbol
r"(€)HT+?",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can’t the same be achieved by adding '€HT' to SAFE_CURRENCY_SYMBOLS?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

€HT is not a currency symbol (only is) , so it won't work just including there. It is a more specific case.

@rennerocha
Copy link
Author

@Gallaecio and @manycoding could you please review again?
Sorry for the extra commits ... I did a rebase and things are a bit messy now :-(

@Gallaecio
Copy link
Member

Did something go wrong while merging/rebasing? https://github.com/scrapinghub/price-parser/pull/20/files shows many changes that seem unrelated.

@rennerocha
Copy link
Author

Did something go wrong while merging/rebasing? https://github.com/scrapinghub/price-parser/pull/20/files shows many changes that seem unrelated.

Yes, it did! Not sure how to fix it. Maybe creating a new fresh PR with only my changes?

@Gallaecio
Copy link
Member

You could try git pull --rebase upstream master first, hopefully there are not too many conflicts.

@Gallaecio
Copy link
Member

Gallaecio commented Nov 5, 2019

I’ve tried rebasing it against master, but it’s far from straightforward. You don’t need to create a new pull request, but I recommend that you delete your local branch, recreate it from master, re-apply your changes (either manually or cherry-picking and fixing conflicts), and push-force your changes:

git checkout master  # make sure you are not in your local avoid-incorrect-currency-when-no-price branch
git branch -D avoid-incorrect-currency-when-no-price  # remove your local avoid-incorrect-currency-when-no-price branch
git fetch upstream  # fetch the upstream remote
git checkout upstream/master -b avoid-incorrect-currency-when-no-price  # re-create your branch from upstream/master
# Apply your changes, either manually or cherry-picking (you can find your commit hashes in this pull request) and solving the corresponding conflicts (easier than rebasing, but maybe still hard)
git push -f origin avoid-incorrect-currency-when-no-price  # update the remote branch, hence this pull request

@rennerocha rennerocha force-pushed the avoid-incorrect-currency-when-no-price branch from ad4b510 to d09418f Compare April 15, 2020 18:45
@rennerocha
Copy link
Author

@Gallaecio
Thanks for the steps you provided to fix my branch :-)
I need to study a bit more about rebasing again ...
PR is ready for review again!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants