Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added jmespath to selector #27

Closed
wants to merge 2 commits into from
Closed

Conversation

voith
Copy link

@voith voith commented Feb 3, 2016

This PR implements addition of jmespath to parsel Selector as discussed in #25.
I have tried to implement @dangra's suggestion of having nested selectors. The jmespath returns a flattened SelectorList instance.

Examples

>>> import parsel
>>> sscls = parsel.Selector

example of nested selector methods over html containing json

>>> js = u'{"id": "1", "name": "ABCD", "html": "<p>html content</p>"}'
>>> body = u'''<body>
...             <div id=1>not<span>me</span></div>
...             <div class="dos"><p>text</p><a href='#'>foo</a></div>
...             <div class="data" data-attr='{}'><p>json</p></div>
...        </body>'''.format(js)
>>> sel = sscls(text=body)
>>> sel.xpath('//div/@data-attr').jmespath('id').extract()
[u'1']
>>> sel.css("div[data-attr]::attr(data-attr)").jmespath('id').extract()
[u'1']

example of nested selector methods over json containing html

>>> body = u'''<body>
...             <div id=1>not<span>me</span></div>
...             <div class="dos"><p>text</p><a href='#'>foo</a></div>
...         </body>'''
>>> mixed_body = u"""
... {{
...     "id": 32,
...     "name": "ABCD",
...     "description": "{}"
... }}
... """.format(body.replace('"', '\\"').replace('\n', ''))
>>> sel = sscls(text=mixed_body, type='json')
>>> sel.jmespath('description').xpath('//div[@id="1"]').css('span::text').extract()
[u'me']
>>> sel.jmespath('description').css('#1').xpath('./span/text()').extract()
[u'me']

Design Considerations

  1. When type=json, text is tried to be decoded into a valid json object. Incase of failure the selector silently re initializes to the default type i.e html
  2. While translating between xpath and json and vice versa, a new selector instance is created for translating the required data types.
  3. A warning is given if a method is tried to call that is incompatible with the selector when type is json. eg remove_namespaces.

Some Side Notes

  1. I've tried to extend the extract method to output unicode. But I'm not entirely sure about what I've implemented is right. Some help here will be appreciated.
  2. This point will be best explained with an example.
>>> import json, jmespath
>>> text=u'{"id": 1, "name": null}'
>>> print jmespath.search('name', json.loads(text))
None
>>> print jmespath.search('abcd', json.loads(text)) #non existing jmespath
None

In the above cases jmespath returned None when queried for a json null and also for a non- existent jmespath. So ideally, for the first case the extract method should return u'null' and None in the second case. But I couldn't think of a way to distinguish the two.
So I converted this json to equivalent xml and tested the xpath on it. The selector returned None.
So I decided to go with the same. This implementation of jmespath returns None instead of null in either of the two cases.
3. I have tried to match existing test cases for equivalent jmespath. But for testcase test_null_bytes_shouldnt_raise_errors I didn't know what was the right thing to do. In my case, json.loads is unable to decode such strings and so get silently converted to type html.
4. I've not updated the docs right now. I'll update them only if the maintainers would like to merge this PR!

PS : I'm open for suggestions and modifications this pull might need before getting merged!

@codecov-io
Copy link

Current coverage is 100.00%

Merging #27 into master will not affect coverage as of c9b3e3d

Powered by Codecov. Updated on successful CI builds.

@voith
Copy link
Author

voith commented Feb 3, 2016

I have no idea why the pypy tests are failing

@dangra
Copy link
Member

dangra commented Feb 4, 2016

Hi @voith, all the test cases alone worth the work IMO. I scheduled it to review over the weekend. thanks!

@dangra
Copy link
Member

dangra commented Feb 4, 2016

some notes from my quck look:

  • Some new test cases are better placed under a different method. Just add _json suffix to method name.
  • I am not completely happy with adding json_obj and jexpr, I prefer to reuse root and expr if possible. repr() should include type and jmespath as expr
  • any reason check_text and new decorators are public?

@voith
Copy link
Author

voith commented Feb 4, 2016

@dangra I will try to incorporate your suggestions in my next commit.

@voith
Copy link
Author

voith commented Feb 12, 2016

Hey @dangra, I guess you forgot to review this pull. You said you'd review it on the last weekend.

@voith
Copy link
Author

voith commented Mar 10, 2016

Its been more than a month since this pull has been opened. Is there any chance of getting this reviewed?? It's sad that I put in so much effort on this PR and it looks like a waste now!

@eliasdorneles
Copy link
Member

Hello @voith, it's definitely not a waste -- even when a PR is not accepted, this is how discussions evolve.
I'm sorry it's taking long to review it, it's just that we've had so many things happened over the last few weeks.

Please have a little patience, we'll get to it. 👍

@voith
Copy link
Author

voith commented Mar 11, 2016

@eliasdorneles Thanks for saying that! Makes me feel much better :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants