-
Notifications
You must be signed in to change notification settings - Fork 50
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Avoid incorrect currency when no price available #20
base: master
Are you sure you want to change the base?
Avoid incorrect currency when no price available #20
Conversation
Codecov Report
@@ Coverage Diff @@
## master #20 +/- ##
=========================================
Coverage 100.00% 100.00%
=========================================
Files 3 3
Lines 96 103 +7
Branches 21 22 +1
=========================================
+ Hits 96 103 +7
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's use non-greedy quantifiers wherever possible https://docs.python.org/3/howto/regex.html#greedy-versus-non-greedy
Everything else looks good to me.
price_parser/parser.py
Outdated
@@ -142,7 +162,9 @@ def extract_currency_symbol(price: Optional[str], | |||
for meth, attr in methods: | |||
m = meth(attr) if attr else None | |||
if m: | |||
return m.group(0) | |||
groups = [match for match in m.groups() if match is not None] | |||
if groups: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to the coverage data, this always evaluates to True
with the current tests. It would be great to add a test for the case where it evaluates to False
or, if that cannot happen, replace the if by an assert
.
@Gallaecio and @manycoding could you please review it again? |
""" Return a regex which matches any of ``symbols`` """ | ||
return re.compile('|'.join(re.escape(s) for s in symbols)) | ||
""" Return a regex which matches any of ``symbols`` surrounded by some special characters """ | ||
left_tokens = [r"^", r"\s+?", r"\d+?"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure about the non-greedy here, what's the result of this test case?
Example(None, ' Prices in EUR ',
'EUR', None, None),
Example(None, '13800 ₶ ',
'₶', '13800', 13800)
'₶', '13800', 13800),
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That should not be a problem, since their match is not captured. But I guess it’s worth adding a test for it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both tests are passing. I included them in the test suite just in case...
""" Return a regex which matches any of ``symbols`` """ | ||
return re.compile('|'.join(re.escape(s) for s in symbols)) | ||
""" Return a regex which matches any of ``symbols`` surrounded by some special characters """ | ||
left_tokens = [r"^", r"\s+?", r"\d+?"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That should not be a problem, since their match is not captured. But I guess it’s worth adding a test for it.
left_tokens = [r"^", r"\s+?", r"\d+?"] | ||
right_tokens = [r"$", r"\s+?", r"\d+?", r"[^a-zA-Z0-9]+?"] | ||
|
||
return re.compile( | ||
"|".join( | ||
[ | ||
left + "({})".format(re.escape(s)) + right | ||
for left, right in itertools.product(left_tokens, right_tokens) | ||
for s in symbols | ||
] | ||
) | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like this can be simplified a bit:
symbols_pattern = '|'.join(re.escape(s) for s in symbols)
template = r'(?:^|\b|(?<=\d)){}(?:(?=\d)|\b|$)'
return re.compile(template.format(symbols_pattern))
If I’m not wrong, this should make some other changes below unnecessary, since we are no longer using a capture group.
And since this function is now doing more than simply an OR regex, but also imposing word boundaries specific to currency symbols (it allows digits as word boundaries), maybe we need to deprecate the existing function and provide a new, _private one for the new logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tested and it didn't work in all examples.
I don't consider simple any code that auto generate regexes. Your code may work but it will require more effort on it to solve all test cases and we could try to improve it, but for now, I believe that solving the existing issues is more interesting for the project.
OTHER_PARTICULAR_REGEXES = [ | ||
# HT is the French abbreviation for "Hors Tax" (tax not added to the price) | ||
# and it may appear after € currency symbol | ||
r"(€)HT+?", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can’t the same be achieved by adding '€HT'
to SAFE_CURRENCY_SYMBOLS
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
€HT
is not a currency symbol (only €
is) , so it won't work just including there. It is a more specific case.
@Gallaecio and @manycoding could you please review again? |
Did something go wrong while merging/rebasing? https://github.com/scrapinghub/price-parser/pull/20/files shows many changes that seem unrelated. |
Yes, it did! Not sure how to fix it. Maybe creating a new fresh PR with only my changes? |
You could try |
I’ve tried rebasing it against git checkout master # make sure you are not in your local avoid-incorrect-currency-when-no-price branch
git branch -D avoid-incorrect-currency-when-no-price # remove your local avoid-incorrect-currency-when-no-price branch
git fetch upstream # fetch the upstream remote
git checkout upstream/master -b avoid-incorrect-currency-when-no-price # re-create your branch from upstream/master
# Apply your changes, either manually or cherry-picking (you can find your commit hashes in this pull request) and solving the corresponding conflicts (easier than rebasing, but maybe still hard)
git push -f origin avoid-incorrect-currency-when-no-price # update the remote branch, hence this pull request |
…ot price-related string
…failures) to list of examples that are correctly handled now
ad4b510
to
d09418f
Compare
@Gallaecio |
This should solve problem described in #12 .
Regex to match currencies is improved, so only currencies strings that are surrounded by some specific characters are matched as a currency, avoiding problems with currencies that can also be substrings of unrelated strings.
This fix also solved two examples that are marked as xfail.
An additional list was added for custom regexes that are not well handled by the other rules (done that to handle the
HT
problem in one test case)