Scrapy can not auto detect GBK html encoding #155
Comments
Hi @samuelchen @Gallaecio, the source of the encoding detection problem seems to be the invalid input HTML itself, not w3lib. There is an invalid HTML meta tag. There is
For @samuelchen's problem, w3lib can be updated to be more forgiving/lenient.
Updating (lib: w3lib, file: w3lib/encoding.py)
After this fix, w3lib would detect the encoding as
More details below.
Details
I was able to reproduce the issue with the provided settings:
Test Python script:
Script output:
BeautifulSoup detects the encoding only with the 'lxml' parser, through its fallback encoding detection.
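The difference between a strict and a more forgiving meta-charset pattern can be sketched roughly as follows. This is only an illustration under assumptions: both patterns are simplified stand-ins and not the code in w3lib/encoding.py, the names `STRICT_RE`, `LENIENT_RE`, and `find_charset` are hypothetical, and the malformed tag is an invented example rather than the tag from the reported page.

```python
import re

# Hypothetical strict pattern: requires "charset=" with no surrounding spaces.
STRICT_RE = re.compile(rb"<meta[^>]*charset=[\"']?([\w-]+)", re.IGNORECASE)

# Hypothetical lenient pattern: tolerates whitespace around "=" and after a quote.
LENIENT_RE = re.compile(
    rb"<meta[^>]*charset\s*=\s*[\"']?\s*([\w-]+)", re.IGNORECASE
)


def find_charset(pattern, body):
    """Return the lowercased charset matched by `pattern`, or None."""
    match = pattern.search(body)
    return match.group(1).decode("ascii").lower() if match else None


# Invented malformed declaration: stray spaces around the "=" sign.
malformed = b'<meta http-equiv="Content-Type" content="text/html; charset = GBK">'

strict_result = find_charset(STRICT_RE, malformed)
lenient_result = find_charset(LENIENT_RE, malformed)
print(strict_result, lenient_result)
```

With this invented input, the strict pattern finds nothing while the lenient one recovers the declared charset. Whether a real library should accept such input is a design trade-off, since a looser pattern can also pick up declarations that browsers would ignore.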
@kostalski Thank you for the feedback. I am not able to recall why that HTML was
Ok @samuelchen, no problem 👍
Hi,
Thank you guys for the great framework.
I am using Scrapy to crawl multiple sites. The sites use different encodings.
One site is encoded as 'gbk', and this is declared in the HTML meta tag, but Scrapy cannot auto-detect the encoding.
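To show why a missed declaration matters, here is a minimal sketch using only the Python standard library. The sample string is an invented example, and the choice of UTF-8 as the wrong fallback is an assumption for illustration, not a statement about what Scrapy actually falls back to.

```python
# Invented sample text, encoded the way a GBK site would serve it.
raw = "中文编码".encode("gbk")

# Decoding with the declared encoding recovers the original text.
correct = raw.decode("gbk")

# Decoding with a wrong guess (UTF-8 assumed here) yields replacement characters.
wrong = raw.decode("utf-8", errors="replace")

print(correct)
print(wrong)
```

The bytes are identical in both cases; only the encoding chosen by the detection step decides whether the text comes out readable.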
I tried Beautiful Soup, and it parses the page correctly. So I dug into w3lib and found that the pattern _BODY_ENCODING_BYTES_RE cannot correctly find the encoding in the meta tag. The HTML snippet is below:
My test: