-
Notifications
You must be signed in to change notification settings - Fork 113
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix incorrectly formatted description property #119
Fix incorrectly formatted description property #119
Conversation
…ot work without this)
Codecov Report
@@ Coverage Diff @@
## master #119 +/- ##
=========================================
+ Coverage 87.63% 87.73% +0.1%
=========================================
Files 11 11
Lines 469 473 +4
Branches 101 101
=========================================
+ Hits 411 415 +4
Misses 52 52
Partials 6 6
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #119 +/- ##
=========================================
+ Coverage 87.63% 87.73% +0.1%
=========================================
Files 11 11
Lines 469 473 +4
Branches 101 101
=========================================
+ Hits 411 415 +4
Misses 52 52
Partials 6 6
Continue to review full report at Codecov.
|
@@ -7,5 +7,6 @@ requests | |||
rdflib | |||
rdflib-jsonld | |||
mf2py>=1.1.0 | |||
six | |||
six>=1.11 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jakubwasikowski why did you updated six version? Maybe this is useful for @croqaz in #120
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I did exactly the same thing, but updated to 1.12 :D
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, forgot to mention it in description of PR. I had to select minimal version because six in version 1.10.0 causes errors for python3.4 when installing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So when I added six>=1.12 it still crashes for Python 3.4.
But six>=1.11 works 🤦♂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For 1.11 it works, but for 1.12 it doesn't? Weird 🤦♂️ Especially taking into account that 1.12 is the latest version.
@jakubwasikowski is the code ready to be reviewed or still is in WIP state? |
Hey @ivanprado, the code is ready to be review! Updated title of the PR 👍 |
|
||
from extruct.utils import parse_html | ||
|
||
|
||
# Cleaner which is similar to html_text cleaner, but is less aggressive |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I see, the difference between this cleaner and the html_text one is embedded=False
and frames=False
. It would be nice to include this in the comment and the reasoning about why. I imagine the reason is that we want to include frames and iframes content as well, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @ivanprado! This is because in the previous version the only removed tags were script and style, so probably removing the other ones will be too strict. I can imagine a situation with <embed>
like this:
<div>You can check our product here<embed type="video/webm"
src="/media/examples/video.mp4"
width="250"
height="200">
<embed src="helloworld.swf">in the video</embed>
</div>
And similar thing applies to frames.
Thanks @jakubwasikowski for this fix. Everything looks great. I left some comments to try to understand better why we use a different |
Co-Authored-By: Iván de Prado <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jakubwasikowski understood, fair enough. Thank you for this fix, it looks nice.
Thanks for your review @ivanprado! @croqaz, would you like to take a look as you were mentioned and you're in reviewers now 😄? |
Sure, I'll take a look today. I also have a PR that might need a bit of attention, so it's a good trade 😆 |
So would you like me to review something @croqaz 😄? Add me as reviewer if you'd like to 👍 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good @jakubwasikowski ! 👍
Thanks for your review @ivanprado and @croqaz! 🍻 |
@@ -182,7 +203,8 @@ def _extract_property_value(self, node, items_seen, base_url, force=False): | |||
return self._extract_textContent(node) | |||
|
|||
def _extract_textContent(self, node): | |||
return u"".join(self._xp_clean_text(node)).strip() | |||
clean_node = cleaner.clean_html(node) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hey! I'm concerned about performance implications of this; we're copying & cleaning a tree many times for each page. Could you please run it on a large sample of pages, to see how bad is an impact?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, it makes sense to do so - will check that 👍
@@ -182,7 +203,8 @@ def _extract_property_value(self, node, items_seen, base_url, force=False): | |||
return self._extract_textContent(node) | |||
|
|||
def _extract_textContent(self, node): | |||
return u"".join(self._xp_clean_text(node)).strip() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like _xp_clean_text
is not used any more, can it be removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @lopuhin! Didn't notice that. Sure, will remove it 👍
The following has been done in this PR:
html_text
gets rid of weird new lines).The fix was based on the code pushed by @kmike in this PR: #114. Thanks @kmike!