Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix incorrectly formatted description property #119

Merged
merged 18 commits into from
Jul 19, 2019

Conversation

jakubwasikowski
Copy link
Contributor

@jakubwasikowski jakubwasikowski commented Jul 15, 2019

The following has been done in this PR:

  • Fixed issue Extruct returns incorrectly formatted description property #113 with incorrectly formatted description property.
  • Added new test case with website for which the issue occured.
  • Fixed old test cases (the usage of html_text gets rid of weird new lines).
  • I had to select minimal version because six in version 1.10.0 causes errors for python3.4 when installing.

The fix was based on the code pushed by @kmike in this PR: #114. Thanks @kmike!

@jakubwasikowski jakubwasikowski self-assigned this Jul 15, 2019
@codecov
Copy link

codecov bot commented Jul 17, 2019

Codecov Report

Merging #119 into master will increase coverage by 0.1%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master     #119     +/-   ##
=========================================
+ Coverage   87.63%   87.73%   +0.1%     
=========================================
  Files          11       11             
  Lines         469      473      +4     
  Branches      101      101             
=========================================
+ Hits          411      415      +4     
  Misses         52       52             
  Partials        6        6
Impacted Files Coverage Δ
extruct/w3cmicrodata.py 99.14% <100%> (+0.03%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 50a0915...17c2982. Read the comment docs.

@codecov
Copy link

codecov bot commented Jul 17, 2019

Codecov Report

Merging #119 into master will increase coverage by 0.1%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master     #119     +/-   ##
=========================================
+ Coverage   87.63%   87.73%   +0.1%     
=========================================
  Files          11       11             
  Lines         469      473      +4     
  Branches      101      101             
=========================================
+ Hits          411      415      +4     
  Misses         52       52             
  Partials        6        6
Impacted Files Coverage Δ
extruct/w3cmicrodata.py 99.14% <100%> (+0.03%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 50a0915...670702e. Read the comment docs.

@@ -7,5 +7,6 @@ requests
rdflib
rdflib-jsonld
mf2py>=1.1.0
six
six>=1.11
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jakubwasikowski why did you updated six version? Maybe this is useful for @croqaz in #120

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I did exactly the same thing, but updated to 1.12 :D

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, forgot to mention it in description of PR. I had to select minimal version because six in version 1.10.0 causes errors for python3.4 when installing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So when I added six>=1.12 it still crashes for Python 3.4.
But six>=1.11 works 🤦‍♂

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For 1.11 it works, but for 1.12 it doesn't? Weird 🤦‍♂️ Especially taking into account that 1.12 is the latest version.

@ivanprado
Copy link
Contributor

@jakubwasikowski is the code ready to be reviewed or still is in WIP state?

@jakubwasikowski jakubwasikowski changed the title [WIP] Fix incorrectly formatted description property Fix incorrectly formatted description property Jul 17, 2019
@jakubwasikowski
Copy link
Contributor Author

Hey @ivanprado, the code is ready to be review! Updated title of the PR 👍

@jakubwasikowski jakubwasikowski removed the request for review from lopuhin July 17, 2019 13:06
extruct/w3cmicrodata.py Outdated Show resolved Hide resolved

from extruct.utils import parse_html


# Cleaner which is similar to html_text cleaner, but is less aggressive
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I see, the difference between this cleaner and the html_text one is embedded=False and frames=False. It would be nice to include this in the comment and the reasoning about why. I imagine the reason is that we want to include frames and iframes content as well, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @ivanprado! This is because in the previous version the only removed tags were script and style, so probably removing the other ones will be too strict. I can imagine a situation with <embed> like this:

<div>You can check our product here<embed type="video/webm"
       src="/media/examples/video.mp4"
       width="250"
       height="200">
<embed src="helloworld.swf">in the video</embed> 
</div>

And similar thing applies to frames.

@ivanprado
Copy link
Contributor

Thanks @jakubwasikowski for this fix. Everything looks great. I left some comments to try to understand better why we use a different Cleaner than default html-text.

Co-Authored-By: Iván de Prado <[email protected]>
Copy link
Contributor

@ivanprado ivanprado left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jakubwasikowski understood, fair enough. Thank you for this fix, it looks nice.

@jakubwasikowski
Copy link
Contributor Author

Thanks for your review @ivanprado!

@croqaz, would you like to take a look as you were mentioned and you're in reviewers now 😄?
I first added Konstantin, but forgot that he is on holidays.

@croqaz
Copy link
Member

croqaz commented Jul 19, 2019

@croqaz, would you like to take a look as you were mentioned and you're in reviewers now 😄?

Sure, I'll take a look today. I also have a PR that might need a bit of attention, so it's a good trade 😆

@jakubwasikowski
Copy link
Contributor Author

So would you like me to review something @croqaz 😄? Add me as reviewer if you'd like to 👍

Copy link
Member

@croqaz croqaz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good @jakubwasikowski ! 👍

@jakubwasikowski
Copy link
Contributor Author

Thanks for your review @ivanprado and @croqaz! 🍻

@jakubwasikowski jakubwasikowski merged commit 6df8e19 into master Jul 19, 2019
@jakubwasikowski jakubwasikowski deleted the fix-incorrectly-formatted-description-property branch July 19, 2019 11:35
@@ -182,7 +203,8 @@ def _extract_property_value(self, node, items_seen, base_url, force=False):
return self._extract_textContent(node)

def _extract_textContent(self, node):
return u"".join(self._xp_clean_text(node)).strip()
clean_node = cleaner.clean_html(node)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hey! I'm concerned about performance implications of this; we're copying & cleaning a tree many times for each page. Could you please run it on a large sample of pages, to see how bad is an impact?

Copy link
Contributor Author

@jakubwasikowski jakubwasikowski Jul 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, it makes sense to do so - will check that 👍

@@ -182,7 +203,8 @@ def _extract_property_value(self, node, items_seen, base_url, force=False):
return self._extract_textContent(node)

def _extract_textContent(self, node):
return u"".join(self._xp_clean_text(node)).strip()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like _xp_clean_text is not used any more, can it be removed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @lopuhin! Didn't notice that. Sure, will remove it 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants