[WIP] text extraction in Selector and SelectorList #127

Open
wants to merge 27 commits into
base: master

kmike
Member

@kmike kmike commented Nov 2, 2018

I've opened it for discussion; it is not a finished solution (yet), but something one can install and try the API with. See #34 for the original proposal.

This PR adds text extraction options to Selector and SelectorList, using the https://github.com/TeamHG-Memex/html-text library.

Problems:

  1. [RESOLVED] Naming issue. In Scrapy it can be convenient to have a response.text() shortcut to use instead of response.css('body').text() or response.selector.text(). But we already have response.text, which is the unicode body. This makes the feature more confusing: the selector's .text() methods are very different from the response.text attribute. From this point of view, the .text_content() name sounds better. Any ideas for a shorter / alternative name? UPD: resolved by making text conversion an argument of .get.

  2. [RESOLVED] There is a circular package dependency: html_text requires parsel, and parsel requires html_text. This is not a problem code-wise, but I haven't checked how well pip can handle it. In a basic case it seems to work, but I wonder if we get issues related to this. It can be solved by changing html_text API and making its parsel dependency optional. UPD: this is fixed in Remove parsel dependency TeamHG-Memex/html-text#15

  3. [RESOLVED] parsel imports private html_text methods. This can be solved by changing html_text API. UPD: fixed at Remove parsel dependency TeamHG-Memex/html-text#15

  4. [RESOLVED] Cleaning is called for each Selector.text() call. So e.g. in the case of sel.css('div').text() each div will be cleaned and copied, instead of cleaning the tree once. I'm not sure how large this problem is though; it is probably inefficient when you need to extract text from nested elements (e.g. from all elements): cleaning will be run multiple times on the same parts of the tree, making sel.xpath("*").text() O(N^2) instead of O(N). An alternative solution is to have sel.cleaned().text() or something like this; .cleaned() may allow lxml Cleaner arguments. But it looks like a separate feature; also, the Cleaner parameters which work best with html-text are not lxml's defaults. UPD: there is a .cleaned() method which supports different Cleaners; the O(N^2) caveat is mentioned in the docs.

  5. [RESOLVED] When the user requests sel.text() from an element which is removed by the Cleaner (e.g. sel.css('script')[0].text()), None is returned. Should it be an empty string? UPD: we (@dangra and I) think None is fine.

  6. [RESOLVED] SelectorList.text() joins text. This is similar to what's proposed in Add method that allows joining the extracted result into a string scrapy#772, but different from SelectorList.get, which returns the first element. If needed, we can support both behaviors, by allowing sep=None, and probably using it by default (or join=None if we rename 'sep' argument to 'join'), meaning "don't join, take first" - or would it be too confusing? UPD: SelectorList no longer joins text; as there is text extraction support in .getall, it is easy to join text on user side.

  7. [RESOLVED] Joining in SelectorList.text can be confusing if SelectorList selects nested elements. UPD: SelectorList no longer joins text.
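To make the O(N^2) caveat from item 4 concrete, here is a minimal sketch that only counts how many nodes a cleaner would have to visit. It uses the standard library's `xml.etree.ElementTree` instead of parsel or lxml, and the helper names (`build_nested`, `subtree_size`, `cost_clean_per_element`, `cost_clean_once`) are hypothetical, chosen for illustration. It models the worst case, a chain of deeply nested elements; flatter trees repeat less work.

```python
import xml.etree.ElementTree as ET


def build_nested(depth):
    """Build <div><div>...</div></div> nested `depth` levels deep."""
    root = ET.Element("div")
    node = root
    for _ in range(depth - 1):
        node = ET.SubElement(node, "div")
    return root


def subtree_size(element):
    """Number of nodes a cleaner would visit for this element's subtree."""
    return sum(1 for _ in element.iter())


def cost_clean_per_element(root):
    """Cleaning is re-run on the subtree of every selected element."""
    return sum(subtree_size(el) for el in root.iter())


def cost_clean_once(root):
    """The whole tree is cleaned a single time and then reused."""
    return subtree_size(root)


tree = build_nested(100)
per_element = cost_clean_per_element(tree)  # 100 + 99 + ... + 1 visits
once = cost_clean_once(tree)                # 100 visits
print(per_element, once)
```

With 100 nested elements, per-element cleaning visits 5050 nodes while cleaning once visits 100, which is the gap that cleaning the tree a single time up front is meant to close.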

TODO:

  • release html-text, to fix tests
  • finish docstrings
  • bring back typing.overload
  • make sure parsel API docs are ok
  • make sure Scrapy API docs are fine (they may pull from parsel API docs)
  • update parsel tutorial
  • clarify .cleaned behavior for non-html/xml selectors
  • clarify text extraction behavior for non-html/xml selectors
  • make sure exception messages are correct
  • add tests
  • document O(N^2) gotcha
  • should we expose all html_text options? - No, we can do it at any time, it's not a requirement to decide on this in this PR.

@codecov

codecov bot commented Nov 2, 2018

Codecov Report

Merging #127 into master will decrease coverage by 1.43%.
The diff coverage is 42.85%.

@@            Coverage Diff             @@
##           master     #127      +/-   ##
==========================================
- Coverage   99.63%   98.19%   -1.44%     
==========================================
  Files           5        5              
  Lines         271      277       +6     
  Branches       48       49       +1     
==========================================
+ Hits          270      272       +2     
- Misses          1        5       +4
Impacted Files Coverage Δ
parsel/selector.py 97.2% <42.85%> (-2.8%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 8fc608e...da7bb80.

"""
if isinstance(cleaner, six.string_types):
    if cleaner not in {'html', 'text'}:
        raise ValueError("cleaner must be 'html', 'text' or "
Member Author

There is one gotcha: this exception is raised in .get as well, but .get accepts two more values: "auto" and None. Is it worth fixing?
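One possible way to address this, sketched under assumptions: the validation takes the set of accepted values from its caller, so the error message always matches the method that was called. The helper name `resolve_cleaner_name`, the `allowed` parameter and the `GET_ALLOWED` constant are hypothetical and not part of this PR.

```python
def resolve_cleaner_name(cleaner, allowed=("html", "text")):
    """Validate a cleaner name against the values the caller accepts.

    `allowed` is a hypothetical parameter: .cleaned() would keep the
    default, while .get() would pass its wider set of values.
    """
    if cleaner not in allowed:
        expected = ", ".join(repr(name) for name in allowed)
        raise ValueError(
            "cleaner must be one of %s; got %r" % (expected, cleaner)
        )
    return cleaner


# Hypothetical set of values accepted by .get(), per the comment above.
GET_ALLOWED = ("auto", "html", "text", None)

# Accepted when called on behalf of .get() ...
resolve_cleaner_name("auto", allowed=GET_ALLOWED)

# ... but rejected with the narrower default used by .cleaned().
try:
    resolve_cleaner_name("auto")
except ValueError as exc:
    message = str(exc)
    print(message)
```

The trade-off is one extra argument on an internal helper in exchange for messages that do not mislead users of .get.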

if cleaner == 'html':
    cleaner = self._html_cleaner
elif cleaner == 'text':
    cleaner = self._text_cleaner
Member Author

An alternative is to make these attributes public and ask users to pass them: sel.cleaned(sel.TEXT_CLEANER) instead of sel.cleaned('text').

@kmike kmike changed the title from "[WIP] Selector.text and SelectorList.text methods" to "[WIP] text extraction in Selector and SelectorList" on Dec 12, 2018
@codecov

codecov bot commented May 30, 2019

Codecov Report

Attention: Patch coverage is 77.14286%, with 8 lines in your changes missing coverage. Please review.

Project coverage is 90.88%. Comparing base (780b6e6) to head (69456c1).
Report is 3 commits behind head on master.

❗ Current head 69456c1 differs from pull request most recent head 852bbef. Consider uploading reports for the commit 852bbef to get more accurate results

Files Patch % Lines
parsel/selector.py 77.14% 4 Missing and 4 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #127      +/-   ##
==========================================
- Coverage   92.18%   90.88%   -1.30%     
==========================================
  Files           5        5              
  Lines         448      472      +24     
  Branches       91       99       +8     
==========================================
+ Hits          413      429      +16     
- Misses         26       30       +4     
- Partials        9       13       +4     

Member

@Gallaecio Gallaecio left a comment

Looks good to me so far.

parsel/selector.py
kmike and others added 3 commits November 10, 2022 17:22
# Conflicts:
#	parsel/selector.py
#	tests/test_selector.py
Co-authored-by: Adrián Chaves
# Conflicts:
#	parsel/selector.py
#	tests/test_selector.py
#	tests/typing/selector.py
#	tox.ini
@dangra
Member

dangra commented Apr 24, 2024

still 👍 from far away 🚢

Comment on lines +143 to +155
To extract all text of one or more elements and all their child elements,
formatted as plain text taking into account HTML tags (e.g. ``<br/>`` is
translated as a line break), set ``text=True`` in your call to
:meth:`~parsel.selector.Selector.get` or
:meth:`~parsel.selector.Selector.getall` instead of including
``::text`` (CSS) or ``/text()`` (XPath) in your query::

>>> selector.css('#images').get(text=True)
'Name: My image 1\nName: My image 2\nName: My image 3\nName: My image 4\nName: My image 5'

See :meth:`Selector.get` for additional parameters that you can use to change
how the extracted plain text is formatted.

Member Author

It looks like for many use cases .get(text=True) could provide more reasonable behavior than /text() or ::text in a selector. From this point of view, I wonder if we should make it one of the first examples, and review many other examples as well. But it seems we can also do it separately, not as a part of this PR, so I'm not working on it.
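As a rough illustration of the difference, here is a standard-library sketch that contrasts raw text nodes (roughly what ``::text`` or ``/text()`` selects) with formatted plain text in which ``<br/>`` becomes a line break. It is not the html-text algorithm and does not use parsel; the `PlainTextSketch` class and the sample markup are assumptions made for the example.

```python
import re
from html.parser import HTMLParser


class PlainTextSketch(HTMLParser):
    """Collect text nodes and turn <br> into a line break.

    A simplified stand-in for illustration only; html-text handles many
    more cases (block elements, punctuation, hidden content, ...).
    """

    def __init__(self):
        super().__init__()
        self.text_nodes = []  # raw text nodes, similar to a text() query
        self.parts = []       # text nodes plus inserted line breaks

    def handle_starttag(self, tag, attrs):
        # <br/> also ends up here: the default handle_startendtag()
        # delegates to handle_starttag().
        if tag == "br":
            self.parts.append("\n")

    def handle_data(self, data):
        self.text_nodes.append(data)
        # Source whitespace (including newlines) collapses to one space.
        self.parts.append(re.sub(r"\s+", " ", data))

    def text(self):
        lines = "".join(self.parts).split("\n")
        return "\n".join(" ".join(line.split()) for line in lines).strip()


html = (
    "<div id='images'>"
    "Name: <b>My image 1</b><br/>"
    "Name: <b>My image 2</b><br/>"
    "</div>"
)
parser = PlainTextSketch()
parser.feed(html)
parser.close()

nodes = parser.text_nodes  # fragments the caller would have to join
result = parser.text()     # one formatted string
print(nodes)
print(result)
```

The text-node list leaves joining and whitespace handling to the caller, while the formatted result is already close to what a reader sees on the page, which is the argument for showing .get(text=True) early in the docs.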

4 participants