Extruct returns incorrectly formatted description property #113

Closed
jakubwasikowski opened this issue May 27, 2019 · 2 comments

@jakubwasikowski
Contributor

It seems that extruct incorrectly extracts a microdata description property when the property contains HTML tags.

See the description below, extracted from https://www.monsterpetsupplies.co.uk/cat/cat-flea-tick/johnsons-4-fleas-cats-kittens:

>>> import extruct
>>> import requests
>>> from w3lib.html import get_base_url
>>> r = requests.get('https://www.monsterpetsupplies.co.uk/cat/cat-flea-tick/johnsons-4-fleas-cats-kittens')
>>> base_url = get_base_url(r.text, r.url)
>>> data = extruct.extract(r.text, base_url=base_url)
>>> data['microdata'][0]['properties']['description']
"Johnsons 4 Fleas Cats & Kittens - 3 Treatment Pack, 6 Treatment PackFor use with Cats and Kittens over 4 weeks of age between 1 and 11kg.Johnson's 4fleas tablets are an easy to use oral treatment to kill adult fleas found on your pet.Effects on the fleas may be seen as soon as 15 minutes after administration.Between 95 - 100% of fleas will be killed off in the first six hours, but ALL adult fleas will be gone after a day.These tablets can be given directly to the mouth or may be mixed in a small portion f our pet's favourite food and given immediately. Administer a single tablet on an day when fleas are seen on your pet. Repeat on any subsequent day as necessary. Do not give more than one treatment per day.You may notice your pet scratching more than usual for the first half hour after administration; this is completely normal and caused by the fleas reacting to Johnson's 4Fleas tablets.While highly effective by themselves, 4Fleas is great when used as part of a programme to eliminate fleas and their larvae from both pets and their surroundings."

As can be seen, the formatting is broken: for example, there is no space between "Pack" and "For", or between "11kg." and "Johnson's".

The problem is not in the content of the description property itself, which looks correct in the page source:

<p><strong>Johnsons 4 Fleas Cats &amp; Kittens - 3 Treatment Pack, 6 Treatment Pack</strong></p>For use with Cats and Kittens over 4 weeks of age between 1 and 11kg.<br /><br />Johnson's 4fleas tablets are an easy to use oral treatment to kill adult fleas found on your pet.<br /><br />Effects on the fleas may be seen as soon as 15 minutes after administration.<br /><br />Between 95 - 100% of fleas will be killed off in the first six hours, but ALL adult fleas will be gone after a day.<br /><br />These tablets can be given directly to the mouth or may be mixed in a small portion f our pet's favourite food and given immediately. Administer a single tablet on an day when fleas are seen on your pet. Repeat on any subsequent day as necessary. Do not give more than one treatment per day.<br /><br />You may notice your pet scratching more than usual for the first half hour after administration; this is completely normal and caused by the fleas reacting to Johnson's 4Fleas tablets.<br /><br />While highly effective by themselves, 4Fleas is great when used as part of a programme to eliminate fleas and their larvae from both pets and their surroundings.

This is likely caused by the following line:

return u"".join(self._xp_clean_text(node)).strip()

Here html-text should be used instead of ad-hoc text extraction.
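
To illustrate the difference, here is a minimal standard-library sketch. It is not extruct's or html-text's actual code: the TextCollector class, the extract helper, and the BREAK_TAGS set are hypothetical names made up for this example, and the snippet is a shortened version of the markup above. It compares joining text nodes with an empty string against inserting whitespace at paragraph and line-break tags.

```python
from html.parser import HTMLParser

# Tags treated as paragraph/line breaks in this sketch (an assumption,
# not the tag list that html-text actually uses).
BREAK_TAGS = {"p", "br", "div", "li"}


class TextCollector(HTMLParser):
    """Collect text nodes, optionally inserting a space at break tags."""

    def __init__(self, break_on_tags):
        super().__init__(convert_charrefs=True)
        self.break_on_tags = break_on_tags
        self.parts = []

    def _maybe_break(self, tag):
        if self.break_on_tags and tag in BREAK_TAGS:
            self.parts.append(" ")

    def handle_starttag(self, tag, attrs):
        self._maybe_break(tag)

    def handle_endtag(self, tag):
        self._maybe_break(tag)

    def handle_startendtag(self, tag, attrs):
        # Called for self-closing syntax such as <br />.
        self._maybe_break(tag)

    def handle_data(self, data):
        self.parts.append(data)


def extract(html, break_on_tags):
    """Return the text of `html` with whitespace runs collapsed."""
    collector = TextCollector(break_on_tags)
    collector.feed(html)
    collector.close()  # flush any text still buffered by the parser
    return " ".join("".join(collector.parts).split())


snippet = (
    "<p><strong>3 Treatment Pack, 6 Treatment Pack</strong></p>"
    "For use with Cats and Kittens between 1 and 11kg.<br /><br />"
    "Johnson's 4fleas tablets are easy to use."
)

naive = extract(snippet, break_on_tags=False)        # mimics u"".join(...)
layout_aware = extract(snippet, break_on_tags=True)  # whitespace at breaks

print(naive)
print(layout_aware)
```

The naive variant glues "Pack" to "For" and "11kg." to "Johnson's", as in the extracted description, while the layout-aware variant keeps them apart. With html-text installed, html_text.extract_text(snippet) should serve the same purpose without a hand-rolled parser.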

@lopuhin
Member

lopuhin commented May 28, 2019

Added the HTML so that the issue can be reproduced later:

gh-113.html.zip

@jakubwasikowski
Contributor Author

jakubwasikowski commented Jul 22, 2019

The issue has been fixed in PR #119.
I'm closing it 🎉
