Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix incorrectly formatted description property #119

Merged
merged 18 commits into from
Jul 19, 2019
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
24 changes: 23 additions & 1 deletion extruct/w3cmicrodata.py
Original file line number Diff line number Diff line change
Expand Up @@ -18,11 +18,32 @@
from urllib.parse import urljoin

import lxml.etree
from lxml.html.clean import Cleaner
from w3lib.html import strip_html5_whitespace
import html_text

from extruct.utils import parse_html


# Cleaner which is similar to html_text cleaner, but is less aggressive
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I see, the difference between this cleaner and the html_text one is embedded=False and frames=False. It would be nice to include this in the comment and the reasoning about why. I imagine the reason is that we want to include frames and iframes content as well, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @ivanprado! This is because in the previous version the only removed tags were script and style, so probably removing the other ones will be too strict. I can imagine a situation with <embed> like this:

<div>You can check our product here<embed type="video/webm"
       src="/media/examples/video.mp4"
       width="250"
       height="200">
<embed src="helloworld.swf">in the video</embed> 
</div>

And similar thing applies to frames.

cleaner = Cleaner(
scripts=True,
javascript=False, # onclick attributes are fine
comments=True,
style=True,
links=True,
meta=True,
page_structure=False, # <title> may be nice to have
processing_instructions=True,
embedded=False, # keep embedded content
frames=False, # keep frames
forms=False, # keep forms
annoying_tags=False,
remove_unknown_tags=False,
safe_attrs_only=False,
)


class LxmlMicrodataExtractor(object):
_xp_item = lxml.etree.XPath('descendant-or-self::*[@itemscope]')
_xp_prop = lxml.etree.XPath("""set:difference(.//*[@itemprop],
Expand Down Expand Up @@ -182,7 +203,8 @@ def _extract_property_value(self, node, items_seen, base_url, force=False):
return self._extract_textContent(node)

def _extract_textContent(self, node):
return u"".join(self._xp_clean_text(node)).strip()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like _xp_clean_text is not used any more, can it be removed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @lopuhin! Didn't notice that. Sure, will remove it 👍

clean_node = cleaner.clean_html(node)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hey! I'm concerned about performance implications of this; we're copying & cleaning a tree many times for each page. Could you please run it on a large sample of pages, to see how bad is an impact?

Copy link
Contributor Author

@jakubwasikowski jakubwasikowski Jul 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, it makes sense to do so - will check that 👍

return html_text.etree_to_text(clean_node)


MicrodataExtractor = LxmlMicrodataExtractor
3 changes: 2 additions & 1 deletion requirements.txt
Original file line number Diff line number Diff line change
Expand Up @@ -7,5 +7,6 @@ requests
rdflib
rdflib-jsonld
mf2py>=1.1.0
six
six>=1.11
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jakubwasikowski why did you updated six version? Maybe this is useful for @croqaz in #120

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I did exactly the same thing, but updated to 1.12 :D

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, forgot to mention it in description of PR. I had to select minimal version because six in version 1.10.0 causes errors for python3.4 when installing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So when I added six>=1.12 it still crashes for Python 3.4.
But six>=1.11 works 🤦‍♂

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For 1.11 it works, but for 1.12 it doesn't? Weird 🤦‍♂️ Especially taking into account that 1.12 is the latest version.

w3lib
html-text
1 change: 1 addition & 0 deletions setup.py
Original file line number Diff line number Diff line change
Expand Up @@ -33,6 +33,7 @@ def get_version():
'rdflib-jsonld',
'mf2py',
'w3lib',
'html-text>=0.5.1',
'six'],
extras_require={
'service': [
Expand Down
2 changes: 1 addition & 1 deletion tests/samples/schema.org/Event.002.json
Original file line number Diff line number Diff line change
Expand Up @@ -37,7 +37,7 @@
"offers": "foo-fighters-everlong-buy.html",
"url": "foo-fighters-everlong.html"},
"type": "http://schema.org/MusicRecording"}],
"video": {"properties": {"description": "Catch this exclusive interview with\n Dave Grohl and the Foo Fighters about their new album, Rope.",
"video": {"properties": {"description": "Catch this exclusive interview with Dave Grohl and the Foo Fighters about their new album, Rope.",
"duration": "T1M33S",
"name": "Interview with the Foo Fighters",
"thumbnail": "foo-fighters-interview-thumb.jpg"},
Expand Down
2 changes: 1 addition & 1 deletion tests/samples/schema.org/MusicRecording.001.json
Original file line number Diff line number Diff line change
Expand Up @@ -37,7 +37,7 @@
"offers": "foo-fighters-everlong-buy.html",
"url": "foo-fighters-everlong.html"},
"type": "http://schema.org/MusicRecording"}],
"video": {"properties": {"description": "Catch this exclusive interview with\n Dave Grohl and the Foo Fighters about their new album, Rope.",
"video": {"properties": {"description": "Catch this exclusive interview with Dave Grohl and the Foo Fighters about their new album, Rope.",
"duration": "T1M33S",
"name": "Interview with the Foo Fighters",
"thumbnail": "foo-fighters-interview-thumb.jpg"},
Expand Down
2 changes: 1 addition & 1 deletion tests/samples/schema.org/product-ref.json
Original file line number Diff line number Diff line change
Expand Up @@ -32,7 +32,7 @@
],
"brand": "ACME",
"name": "Executive Anvil",
"description": "Sleeker than ACME's Classic Anvil, the\n Executive Anvil is perfect for the business traveler\n looking for something to drop from a height.",
"description": "Sleeker than ACME's Classic Anvil, the Executive Anvil is perfect for the business traveler looking for something to drop from a height.",
"mpn": "925872",
"aggregateRating": {
"type": "http://schema.org/AggregateRating",
Expand Down
2 changes: 1 addition & 1 deletion tests/samples/schema.org/product.json
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,7 @@
"properties": {"brand": "ACME",
"name": "Executive Anvil",
"image": "anvil_executive.jpg",
"description": "Sleeker than ACME's Classic Anvil, the\n Executive Anvil is perfect for the business traveler\n looking for something to drop from a height.",
"description": "Sleeker than ACME's Classic Anvil, the Executive Anvil is perfect for the business traveler looking for something to drop from a height.",
"mpn": "925872",
"aggregateRating": {"type": "http://schema.org/AggregateRating",
"properties": {"ratingValue": "4.4",
Expand Down
2 changes: 1 addition & 1 deletion tests/samples/schema.org/product_custom_url.json
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,7 @@
"properties": {"brand": "ACME",
"name": "Executive Anvil",
"image": "http://some-example.com/anvil_executive.jpg",
"description": "Sleeker than ACME's Classic Anvil, the\n Executive Anvil is perfect for the business traveler\n looking for something to drop from a height.",
"description": "Sleeker than ACME's Classic Anvil, the Executive Anvil is perfect for the business traveler looking for something to drop from a height.",
"mpn": "925872",
"aggregateRating": {"type": "http://schema.org/AggregateRating",
"properties": {"ratingValue": "4.4",
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,7 @@
"properties": {"brand": "ACME",
"name": "Executive Anvil",
"image": "http://some-example.com/anvil_executive.jpg",
"description": "Sleeker than ACME's Classic Anvil, the\n Executive Anvil is perfect for the business traveler\n looking for something to drop from a height.",
"description": "Sleeker than ACME's Classic Anvil, the Executive Anvil is perfect for the business traveler looking for something to drop from a height.",
"mpn": "925872",
"aggregateRating": {"type": "http://schema.org/AggregateRating",
"_nodeId_": "aggregateRating",
Expand Down
6 changes: 3 additions & 3 deletions tests/samples/w3c/microdata.5.2.withtext.json
Original file line number Diff line number Diff line change
Expand Up @@ -2,18 +2,18 @@
"name": "Tank Locomotive (DB 80)",
"product-code": "33041",
"scale": "HO"},
"textContent": "Name:\n Tank Locomotive (DB 80)\n Product code:\n 33041\n Scale:\n HO\n Digital:\n Delta",
"textContent": "Name:\nTank Locomotive (DB 80)\nProduct code:\n33041\nScale:\nHO\nDigital:\nDelta",
"type": ["http://md.example.com/loco",
"http://md.example.com/lighting"]},
{"properties": {"name": "Turnout Lantern Kit",
"product-code": "74470",
"scale": "HO",
"track-type": "C"},
"textContent": "Name:\n Turnout Lantern Kit\n Product code:\n 74470\n Purpose:\n For retrofitting 2 C Track\n turnouts.",
"textContent": "Name:\nTurnout Lantern Kit\nProduct code:\n74470\nPurpose:\nFor retrofitting 2 C Track turnouts.",
"type": ["http://md.example.com/track",
"http://md.example.com/lighting"]},
{"properties": {"name": "Express Train Passenger Car (DB Am 203)",
"product-code": "8710",
"scale": "Z"},
"textContent": "Name:\n Express Train Passenger Car (DB Am 203)\n Product code:\n 8710\n Scale:\n Z",
"textContent": "Name:\nExpress Train Passenger Car (DB Am 203)\nProduct code:\n8710\nScale:\nZ",
"type": "http://md.example.com/passengers"}]
Loading