Fix incorrectly formatted description property #119

jakubwasikowski · 2019-07-15T18:36:21Z

The following has been done in this PR:

Fixed issue #113 (Extruct returns incorrectly formatted description property).
Added a new test case using the website on which the issue occurred.
Updated the old test cases (using html_text gets rid of the odd newlines).
Set a minimum version for six, because six 1.10.0 causes installation errors on Python 3.4.

The fix was based on the code pushed by @kmike in this PR: #114. Thanks @kmike!

codecov · 2019-07-17T10:19:37Z

Codecov Report

Merging #119 into master will increase coverage by 0.1%.
The diff coverage is 100%.

@@            Coverage Diff            @@
##           master     #119     +/-   ##
=========================================
+ Coverage   87.63%   87.73%   +0.1%     
=========================================
  Files          11       11             
  Lines         469      473      +4     
  Branches      101      101             
=========================================
+ Hits          411      415      +4     
  Misses         52       52             
  Partials        6        6

Impacted Files	Coverage Δ
extruct/w3cmicrodata.py	`99.14% <100%> (+0.03%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 50a0915...17c2982. Read the comment docs.

ivanprado · 2019-07-17T10:35:28Z

requirements.txt

@@ -7,5 +7,6 @@ requests
 rdflib
 rdflib-jsonld
 mf2py>=1.1.0
-six
+six>=1.11


@jakubwasikowski why did you update the six version? Maybe this is useful for @croqaz in #120

Yes, I did exactly the same thing, but updated to 1.12 :D

Sorry, I forgot to mention it in the PR description. I had to set a minimum version because six 1.10.0 causes installation errors on Python 3.4.

So when I added six>=1.12 it still crashes for Python 3.4.
But six>=1.11 works 🤦‍♂

For 1.11 it works, but for 1.12 it doesn't? Weird 🤦‍♂️ Especially taking into account that 1.12 is the latest version.

ivanprado · 2019-07-17T10:36:00Z

@jakubwasikowski is the code ready to be reviewed, or is it still in a WIP state?

jakubwasikowski · 2019-07-17T10:45:34Z

Hey @ivanprado, the code is ready for review! I've updated the title of the PR 👍

ivanprado · 2019-07-17T14:50:39Z

extruct/w3cmicrodata.py


 from extruct.utils import parse_html


+# Cleaner which is similar to html_text cleaner, but is less aggressive


As far as I can see, the difference between this cleaner and the html_text one is embedded=False and frames=False. It would be nice to mention this in the comment, along with the reasoning behind it. I imagine the reason is that we want to include frame and iframe content as well, right?

Hey @ivanprado! This is because the previous version removed only the script and style tags, so removing the other ones as well would probably be too strict. I can imagine a situation with <embed> like this:

<div>You can check our product here<embed type="video/webm" src="/media/examples/video.mp4" width="250" height="200"> <embed src="helloworld.swf">in the video</embed> </div>

A similar thing applies to frames.
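
To illustrate the trade-off discussed in this thread, here is a minimal sketch that uses only the standard library's html.parser, not the lxml Cleaner that the PR actually configures. The TextExtractor class, the extract_text helper, and the sample markup are hypothetical and exist only for this illustration; <object> stands in for the embedded-content tags. It shows that skipping embedded elements also drops any text nested inside them.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text, ignoring everything inside the tags listed in skip_tags."""

    def __init__(self, skip_tags):
        super().__init__()
        self.skip_tags = set(skip_tags)
        self.skip_depth = 0  # greater than 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.skip_tags:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.skip_tags and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.chunks.append(data)


def extract_text(html, skip_tags):
    """Return the whitespace-normalised text of html, minus skipped elements."""
    parser = TextExtractor(skip_tags)
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())


# Hypothetical markup, loosely modelled on the <embed> example in this thread.
HTML = (
    '<div>See it <object data="demo.swf">in the video</object>'
    "<script>track()</script></div>"
)

# Only script and style are dropped, as in the previous behaviour.
lenient = extract_text(HTML, {"script", "style"})
# Embedded content is dropped as well, as a stricter cleaner would do.
strict = extract_text(HTML, {"script", "style", "object"})

print(repr(lenient))
print(repr(strict))
```

The lenient variant keeps the nested text while the strict one loses it, which is the kind of loss the less aggressive cleaner is meant to avoid.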

ivanprado · 2019-07-17T14:56:56Z

Thanks @jakubwasikowski for this fix. Everything looks great. I left some comments to better understand why we use a different Cleaner from the default html-text one.

ivanprado

@jakubwasikowski understood, fair enough. Thank you for this fix, it looks nice.

jakubwasikowski · 2019-07-19T09:40:38Z

Thanks for your review @ivanprado!

@croqaz, would you like to take a look as you were mentioned and you're in reviewers now 😄?
I first added Konstantin, but forgot that he is on holidays.

croqaz · 2019-07-19T09:50:05Z

@croqaz, would you like to take a look as you were mentioned and you're in reviewers now 😄?

Sure, I'll take a look today. I also have a PR that might need a bit of attention, so it's a good trade 😆

jakubwasikowski · 2019-07-19T10:46:55Z

So would you like me to review something @croqaz 😄? Add me as reviewer if you'd like to 👍

croqaz

Looks good @jakubwasikowski ! 👍

jakubwasikowski · 2019-07-19T11:34:50Z

Thanks for your review @ivanprado and @croqaz! 🍻

kmike · 2019-07-19T12:38:32Z

extruct/w3cmicrodata.py

@@ -182,7 +203,8 @@ def _extract_property_value(self, node, items_seen, base_url, force=False):
            return self._extract_textContent(node)

    def _extract_textContent(self, node):
-        return u"".join(self._xp_clean_text(node)).strip()
+        clean_node = cleaner.clean_html(node)


Hey! I'm concerned about the performance implications of this; we're copying and cleaning a tree many times for each page. Could you please run it on a large sample of pages to see how bad the impact is?

Indeed, it makes sense to do so - will check that 👍
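
One way to run such a check is a small timing harness like the sketch below. It uses only the standard library; extract_old and extract_new are hypothetical stand-ins, not extruct functions, and the repeated sample page is synthetic. In a real measurement they would be replaced by the extractor before and after this change, run over a large set of saved pages.

```python
import time


def benchmark(extract, pages, repeat=3):
    """Return the best wall-clock time in seconds for running extract over all pages."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        for page in pages:
            extract(page)
        best = min(best, time.perf_counter() - start)
    return best


# Hypothetical stand-ins for the old and new extraction code paths.
def extract_old(page):
    return "".join(page.split())


def extract_new(page):
    # Simulates extra per-call work, such as copying data before extracting.
    return "".join(list(page).copy()).strip()


# Synthetic sample; a real run would load many saved HTML pages instead.
sample_page = "<div itemprop='description'> text </div>" * 50
pages = [sample_page] * 200

timings = {fn.__name__: benchmark(fn, pages) for fn in (extract_old, extract_new)}
for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f}s")
```

Taking the best of several repeats reduces noise from other processes; comparing the two numbers gives a rough slowdown factor.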

lopuhin · 2019-07-29T10:20:21Z

extruct/w3cmicrodata.py

@@ -182,7 +203,8 @@ def _extract_property_value(self, node, items_seen, base_url, force=False):
            return self._extract_textContent(node)

    def _extract_textContent(self, node):
-        return u"".join(self._xp_clean_text(node)).strip()


Looks like _xp_clean_text is not used any more, can it be removed?

Hey @lopuhin! Didn't notice that. Sure, will remove it 👍

kmike and others added 16 commits June 4, 2019 12:19

[wip] use html_text to get element text content in microdata

7e952b0

Add new test case

ee4da8b

Add expected value

2a7ae11

Merge branch 'html-text' into fix-incorrectly-formatted-description-property

f380f2a

Fix test case for description (it should contain new lines)

e463414

Move cleaning html to extracting content (extracting properties did not work without this)

92d0646

Removed unused import

4e2dbf9

Fix formatting in test

48bb049

Fix test case for custom url

c02c349

Fix test case with custom url and node id

86f47d2

Fix test for umicrodata

c8dc164

Fix test for product ref

f902dad

Fix test for product join None

75686f1

Fix test_w3c_5_2

0f5a632

Fix test case for event, fix formatting

c2b59b1

Fix test for music recording

1e990b3

jakubwasikowski self-assigned this Jul 15, 2019

Add minimal version of six

17c2982

jakubwasikowski requested review from ivanprado and lopuhin July 17, 2019 10:29

ivanprado reviewed Jul 17, 2019

jakubwasikowski changed the title ~~[WIP] Fix incorrectly formatted description property~~ Fix incorrectly formatted description property Jul 17, 2019

croqaz mentioned this pull request Jul 17, 2019

Ignore empty OpenGraph props #120

Open

jakubwasikowski removed the request for review from lopuhin July 17, 2019 13:06

ivanprado reviewed Jul 17, 2019

ivanprado reviewed Jul 17, 2019

Fix comment

670702e

Co-Authored-By: Iván de Prado <[email protected]>

ivanprado approved these changes Jul 19, 2019

croqaz approved these changes Jul 19, 2019

jakubwasikowski merged commit 6df8e19 into master Jul 19, 2019

jakubwasikowski deleted the fix-incorrectly-formatted-description-property branch July 19, 2019 11:35

kmike reviewed Jul 19, 2019

jakubwasikowski mentioned this pull request Jul 22, 2019

Extruct returns incorrectly formatted description property #113

Closed

lopuhin reviewed Jul 29, 2019

jakubwasikowski mentioned this pull request Aug 2, 2019

Make the html cleaning for microdata faster #123

Open
