Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add HTML5Parser option #133

Closed
wants to merge 3 commits into from
Closed
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
15 changes: 12 additions & 3 deletions parsel/selector.py
Original file line number Diff line number Diff line change
Expand Up @@ -6,6 +6,7 @@

import six
from lxml import etree, html
from lxml.html import html5parser

from .utils import flatten, iflatten, extract_regex
from .csstranslator import HTMLTranslator, GenericTranslator
Expand All @@ -23,6 +24,10 @@ def __init__(self, *args, **kwargs):
'xml': {'_parser': SafeXMLParser,
'_csstranslator': GenericTranslator(),
'_tostring_method': 'xml'},
'html5': {'_parser': html5parser.HTMLParser,
'_csstranslator': HTMLTranslator(),
'_tostring_method': 'html',
},
}


Expand All @@ -39,8 +44,12 @@ def create_root_node(text, parser_cls, base_url=None):
"""Create root node for text using given parser class.
"""
body = text.strip().replace('\x00', '').encode('utf8') or b'<html/>'
parser = parser_cls(recover=True, encoding='utf8')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think parser_cls made sense here when both classes had the same API. Now that we are introducing a new class that does not have the same signature, maybe we should switch to a different approach.

For example, instead of a class, we could use a function that returns a valid root, and modify _ctgroup so that _parser values are such functions.

That way, the code here remains independent of the _parser value.

What do you think?

root = etree.fromstring(body, parser=parser, base_url=base_url)
if parser_cls != html5parser.HTMLParser:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should be legible if we define the option as == for the case of html5parser and then else for the other options

Copy link
Author

@joaquingx joaquingx Jan 12, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey!, thanks for review!, I also think that, i'll change it. changed

parser = parser_cls(recover=True, encoding='utf8')
root = etree.fromstring(body, parser=parser, base_url=base_url)
else:
parser = parser_cls(namespaceHTMLElements=False)
root = html5parser.fromstring(body, parser=parser)
if root is None:
root = etree.fromstring(b'<html/>', parser=parser, base_url=base_url)
return root
Expand Down Expand Up @@ -158,7 +167,7 @@ class Selector(object):
``text`` is a ``unicode`` object in Python 2 or a ``str`` object in Python 3
``type`` defines the selector type, it can be ``"html"``, ``"xml"`` or ``None`` (default).
``type`` defines the selector type, it can be ``"html"``, ``"xml"``, ``"html5"`` or ``None`` (default).
If ``type`` is ``None``, the selector defaults to ``"html"``.
"""

Expand Down
3 changes: 2 additions & 1 deletion setup.py
Original file line number Diff line number Diff line change
Expand Up @@ -29,7 +29,8 @@ def has_environment_marker_platform_impl_support():
'w3lib>=1.19.0',
'lxml>=2.3',
'six>=1.5.2',
'cssselect>=0.9'
'cssselect>=0.9',
'html5lib',
]
extras_require = {}

Expand Down
37 changes: 35 additions & 2 deletions tests/test_selector.py
Original file line number Diff line number Diff line change
Expand Up @@ -498,7 +498,7 @@ def test_re(self):
["John", "Paul"])
self.assertEqual(x.xpath("//ul/li").re(r"Age: (\d+)"),
["10", "20"])

# Test named group, hit and miss
x = self.sscls(text=u'foobar')
self.assertEqual(x.re('(?P<extract>foo)'), ['foo'])
Expand All @@ -511,7 +511,7 @@ def test_re(self):
def test_re_replace_entities(self):
body = u"""<script>{"foo":"bar &amp; &quot;baz&quot;"}</script>"""
x = self.sscls(text=body)

name_re = re.compile('{"foo":(.*)}')

# by default, only &amp; and &lt; are preserved ;
Expand Down Expand Up @@ -712,6 +712,39 @@ def test_replacement_null_char_from_body(self):
self.assertEqual(u'<html><body><p>Grainy</p></body></html>',
self.sscls(text).extract())

def test_characters_gt_and_lt(self):
"""HTML5 parser tests: greater and less than symbols work as expected."""
lt_elem = '20 < 100'
gt_elem = '120 > 100'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we are just defining these two cases, I would prefer if they could be one on its respective test. On the other hand I think we should be adding more cases, and defining them as a list which will also improve the tests below to not repeat code, but iterate that list checking all those cases

Copy link
Author

@joaquingx joaquingx Jan 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, think that I improved that 👍

body = u'''<html>
<head></head>
<body>
<div id="distance">{0}</div>
<body>
</html>'''

sel = self.sscls(text=body.format(lt_elem), type='html5')
lt_res = sel.xpath('//div[@id="distance"]/text()').get()
self.assertEqual(lt_res, lt_elem, msg='less than(<) parsing does not work as expected')

sel = self.sscls(text=body.format(gt_elem), type='html5')
gt_res = sel.xpath('//div[@id="distance"]/text()').get()
self.assertEqual(gt_res, gt_elem, msg='greater than(>) parsing does not work as expected')

def test_complete_tags(self):
"""HTML5 parser complete/fill tags as expected."""
body = u'''<html>
<head></head>
<body>
<li>one<div></li>
<li>two</li>
</body>
</html>'''
sel = self.sscls(text=body, type='html5')
res = sel.xpath('//div/text()').get()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't the test confirm that it gets the div element? Or maybe I am not understanding the new functionality correctly.

Copy link
Author

@joaquingx joaquingx Jan 12, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you're right the testcase isn't clear. The difference that this test pretend to show is the difference in parsers view. In this particular case:

In [1]:         body = u'''<html> 
   ...:                     <head></head> 
   ...:                        <body> 
   ...:                         <li>one<div></li> 
   ...:                         <li>two</li> 
   ...:                        </body> 
   ...:                 </html>'''                                                                                                                                                                                  

In [2]: from parsel import Selector                                                                                                                                                                                 

In [3]: sel = Selector(body, 'html')                                                                                                                                                                                

In [4]: sel.get()                                                                                                                                                                                                   
Out[4]: '<html>\n            <head></head>\n               <body>\n                <li>one<div>\n                <li>two</li>\n               </div></li></body>\n        </html>'

In [5]: selfive = Selector(body, 'html5')                                                                                                                                                                           

In [6]: selfive.get()                                                                                                                                                                                               
Out[6]: '<html><head></head>\n               <body>\n                <li>one<div></div></li>\n                <li>two</li>\n               \n        </body></html>'

As you can see, div tag is filled/completed/closed in different way with html than html5parser, just as extra info, the way html5parser see is the way that browsers would see that html.

In [7]: sel.xpath('//div').get()                                                                                                                                                                                    
Out[7]: '<div>\n                <li>two</li>\n               </div>'

In [8]: selfive.xpath('//div').get()                                                                                                                                                                                
Out[8]: '<div></div>'

EDIT:

Already changed, thanks for the suggestion, test is much clear now.

self.assertEqual(res, None)


class ExsltTestCase(unittest.TestCase):

sscls = Selector
Expand Down