-
Notifications
You must be signed in to change notification settings - Fork 148
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add HTML5Parser option #133
Conversation
Codecov Report
@@ Coverage Diff @@
## master scrapy/scrapy#133 +/- ##
==========================================
+ Coverage 99.63% 99.64% +<.01%
==========================================
Files 5 5
Lines 271 278 +7
Branches 48 49 +1
==========================================
+ Hits 270 277 +7
Misses 1 1
Continue to review full report at Codecov.
|
f3484d5
to
b6030f6
Compare
b6030f6
to
6a9e309
Compare
parsel/selector.py
Outdated
@@ -39,8 +44,12 @@ def create_root_node(text, parser_cls, base_url=None): | |||
"""Create root node for text using given parser class. | |||
""" | |||
body = text.strip().replace('\x00', '').encode('utf8') or b'<html/>' | |||
parser = parser_cls(recover=True, encoding='utf8') | |||
root = etree.fromstring(body, parser=parser, base_url=base_url) | |||
if parser_cls != html5parser.HTMLParser: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it should be legible if we define the option as ==
for the case of html5parser
and then else
for the other options
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey!, thanks for review!, I also think that, i'll change it. changed
tests/test_selector.py
Outdated
</body> | ||
</html>''' | ||
sel = self.sscls(text=body, type='html5') | ||
res = sel.xpath('//div/text()').get() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't the test confirm that it gets the div
element? Or maybe I am not understanding the new functionality correctly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, you're right the testcase isn't clear. The difference that this test pretend to show is the difference in parsers view. In this particular case:
In [1]: body = u'''<html>
...: <head></head>
...: <body>
...: <li>one<div></li>
...: <li>two</li>
...: </body>
...: </html>'''
In [2]: from parsel import Selector
In [3]: sel = Selector(body, 'html')
In [4]: sel.get()
Out[4]: '<html>\n <head></head>\n <body>\n <li>one<div>\n <li>two</li>\n </div></li></body>\n </html>'
In [5]: selfive = Selector(body, 'html5')
In [6]: selfive.get()
Out[6]: '<html><head></head>\n <body>\n <li>one<div></div></li>\n <li>two</li>\n \n </body></html>'
As you can see, div
tag is filled/completed/closed in different way with html
than html5parser
, just as extra info, the way html5parser
see is the way that browsers would see that html.
In [7]: sel.xpath('//div').get()
Out[7]: '<div>\n <li>two</li>\n </div>'
In [8]: selfive.xpath('//div').get()
Out[8]: '<div></div>'
EDIT:
Already changed, thanks for the suggestion, test is much clear now.
tests/test_selector.py
Outdated
def test_characters_gt_and_lt(self): | ||
"""HTML5 parser tests: greater and less than symbols work as expected.""" | ||
lt_elem = '20 < 100' | ||
gt_elem = '120 > 100' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we are just defining these two cases, I would prefer if they could be one on its respective test. On the other hand I think we should be adding more cases, and defining them as a list which will also improve the tests below to not repeat code, but iterate that list checking all those cases
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, think that I improved that 👍
@@ -39,8 +44,15 @@ def create_root_node(text, parser_cls, base_url=None): | |||
"""Create root node for text using given parser class. | |||
""" | |||
body = text.strip().replace('\x00', '').encode('utf8') or b'<html/>' | |||
parser = parser_cls(recover=True, encoding='utf8') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think parser_cls
made sense here when both classes had the same API. Now that we are introducing a new class that does not have the same signature, maybe we should switch to a different approach.
For example, instead of a class, we could use a function that returns a valid root
, and modify _ctgroup
so that _parser
values are such functions.
That way, the code here remains independent of the _parser
value.
What do you think?
There is another option to get faster parsing times:
https://html5-parser.readthedocs.io/en/latest/
It's not as straightforward to install as html5lib but it worked well for
the few times I tried it
|
@joaquingx I know feedback took a while… Do you wish to continue working on it? |
PR Solves Issue scrapy/scrapy#83.
As that issue pointed out, there's a lot of requests to add this option: 1 2 3 4 5 6
Based/Continue work of #54 & scrapy/scrapy#1043
Known issues with
html5lib
:html5lib/html5lib-python#96 , for now I just raise a TypeError, is there another/better way to handle this?
I also made some benchmarks, and I obtain that
html5lib
is a waaaaay slower than normalhtml
parser. It's60
times slower when parse: https://gist.github.com/joaquingx/efaf1152beb4e4ea25d6a8afa061ebcf (I used https://gist.github.com/kmike/af647777cef39c3d01071905d176c006 as reference). Due that, should be a good idea to make some alert when user chooseshtml5
as option?