-
Notifications
You must be signed in to change notification settings - Fork 113
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix incorrectly formatted description property #119
Changes from all commits
7e952b0
ee4da8b
2a7ae11
f380f2a
e463414
92d0646
4e2dbf9
48bb049
c02c349
86f47d2
c8dc164
f902dad
75686f1
0f5a632
c2b59b1
1e990b3
17c2982
670702e
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -18,11 +18,32 @@ | |
from urllib.parse import urljoin | ||
|
||
import lxml.etree | ||
from lxml.html.clean import Cleaner | ||
from w3lib.html import strip_html5_whitespace | ||
import html_text | ||
|
||
from extruct.utils import parse_html | ||
|
||
|
||
# Cleaner which is similar to html_text cleaner, but is less aggressive | ||
cleaner = Cleaner( | ||
scripts=True, | ||
javascript=False, # onclick attributes are fine | ||
comments=True, | ||
style=True, | ||
links=True, | ||
meta=True, | ||
page_structure=False, # <title> may be nice to have | ||
processing_instructions=True, | ||
embedded=False, # keep embedded content | ||
frames=False, # keep frames | ||
forms=False, # keep forms | ||
annoying_tags=False, | ||
remove_unknown_tags=False, | ||
safe_attrs_only=False, | ||
) | ||
|
||
|
||
class LxmlMicrodataExtractor(object): | ||
_xp_item = lxml.etree.XPath('descendant-or-self::*[@itemscope]') | ||
_xp_prop = lxml.etree.XPath("""set:difference(.//*[@itemprop], | ||
|
@@ -182,7 +203,8 @@ def _extract_property_value(self, node, items_seen, base_url, force=False): | |
return self._extract_textContent(node) | ||
|
||
def _extract_textContent(self, node): | ||
return u"".join(self._xp_clean_text(node)).strip() | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Looks like There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Hey @lopuhin! Didn't notice that. Sure, will remove it 👍 |
||
clean_node = cleaner.clean_html(node) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. hey! I'm concerned about performance implications of this; we're copying & cleaning a tree many times for each page. Could you please run it on a large sample of pages, to see how bad is an impact? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Indeed, it makes sense to do so - will check that 👍 |
||
return html_text.etree_to_text(clean_node) | ||
|
||
|
||
MicrodataExtractor = LxmlMicrodataExtractor |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -7,5 +7,6 @@ requests | |
rdflib | ||
rdflib-jsonld | ||
mf2py>=1.1.0 | ||
six | ||
six>=1.11 | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @jakubwasikowski why did you updated six version? Maybe this is useful for @croqaz in #120 There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, I did exactly the same thing, but updated to 1.12 :D There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Sorry, forgot to mention it in description of PR. I had to select minimal version because six in version 1.10.0 causes errors for python3.4 when installing. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. So when I added six>=1.12 it still crashes for Python 3.4. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. For 1.11 it works, but for 1.12 it doesn't? Weird 🤦♂️ Especially taking into account that 1.12 is the latest version. |
||
w3lib | ||
html-text |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I see, the difference between this cleaner and the html_text one is
embedded=False
andframes=False
. It would be nice to include this in the comment and the reasoning about why. I imagine the reason is that we want to include frames and iframes content as well, right?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @ivanprado! This is because in the previous version the only removed tags were script and style, so probably removing the other ones will be too strict. I can imagine a situation with
<embed>
like this:And similar thing applies to frames.