Scrapy can not auto detect GBK html encoding #155

samuelchen · 2020-03-05T09:09:52Z

Hi,

Thank you guys for the great framework.

I am using Scrapy to crawl multiple sites, and the sites use different encodings.
One site is encoded as 'gbk', which is declared in an HTML meta tag, but Scrapy does not auto-detect the encoding.

I tried Beautiful Soup, and it parses the page correctly. So I dug into w3lib and found that the pattern
_BODY_ENCODING_BYTES_RE does not find the encoding declared in the meta tag.

HTML snippet as below:

b'<HTML>\r\n <HEAD>\r\n  <TITLE>\xcd\xf8\xd5\xbe\xb5\xd8\xcd\xbc</TITLE>\r\n  <meta httpequiv="ContentType" content="text/html; charset=gbk" />\r\n  <META NAME="Keywords" CONTENT="\xe5\xd0\xa1\xcb\xb5,\xcd\xf8\xd5\xbe\xb5\xd8\xcd\xbc">'

My test:

>>> from bs4 import BeautifulSoup
>>> from w3lib.encoding import html_body_declared_encoding
>>> b
b'<HTML>\r\n <HEAD>\r\n  <TITLE>\xcd\xf8\xd5\xbe\xb5\xd8\xcd\xbc</TITLE>\r\n  <meta httpequiv="ContentType" content="text/html; charset=gbk" />\r\n  <META NAME="Keywords" CONTENT="\xe5\xd0\xa1\xcb\xb5,\xcd\xf8\xd5\xbe\xb5\xd8\xcd\xbc">'
>>> html_body_declared_encoding(b)
>>> enc = html_body_declared_encoding(b)
>>> enc
>>> print('"%s"' % enc)
"None"
>>> soup = BeautifulSoup(b)
>>> soup.title
<title>网站地图</title>
>>> soup.original_encoding
'gbk'
>>>


kostalski · 2020-11-08T18:47:42Z

Hi @samuelchen @Gallaecio ,

The source of the encoding detection problem seems to be the invalid input HTML itself, not w3lib. The meta tag is malformed: it reads <meta httpequiv="ContentType" ..., whereas valid HTML (per the W3C) would be <meta http-equiv="Content-Type" ... (the dash characters are missing). Because of that, w3lib does not detect the declared encoding.

beautifulsoup4 detects the 'gbk' encoding because it uses a naive regex for fallback encoding detection (lib: beautifulsoup4 file: bs4/dammit.py, line: html_meta = '<\\s*meta[^>]+charset\\s*=\\s*["\']?([^>]*?)[ /;\'">]').
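To illustrate why that fallback succeeds on the malformed tag, here is a minimal sketch that applies the quoted pattern string to the snippet from the report using only the standard `re` module. Compiling it against bytes with `re.I` is an assumption made for this illustration, not necessarily how bs4 uses it internally.

```python
import re

# Pattern string as quoted above from bs4/dammit.py. Compiling it for bytes,
# case-insensitively, is an assumption made for this illustration.
html_meta = '<\\s*meta[^>]+charset\\s*=\\s*["\']?([^>]*?)[ /;\'">]'
pattern = re.compile(html_meta.encode("ascii"), re.I)

# HTML snippet from the original report (note the malformed httpequiv attribute).
body = (
    b'<HTML>\r\n <HEAD>\r\n  <TITLE>\xcd\xf8\xd5\xbe\xb5\xd8\xcd\xbc</TITLE>\r\n'
    b'  <meta httpequiv="ContentType" content="text/html; charset=gbk" />\r\n'
    b'  <META NAME="Keywords" CONTENT="\xe5\xd0\xa1\xcb\xb5,\xcd\xf8\xd5\xbe\xb5\xd8\xcd\xbc">'
)

# The pattern only looks for "charset=" somewhere inside a <meta ...> tag,
# so it never inspects the http-equiv attribute at all.
match = pattern.search(body)
declared = match.group(1).decode("ascii") if match else None
print(declared)
```

Because the pattern ignores the attribute names around `charset=`, the misspelled `httpequiv` has no effect on the result.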

For @samuelchen's problem, w3lib could be updated to be more lenient by changing (lib: w3lib, file: w3lib/encoding.py):
From: _HTTPEQUIV_RE = _TEMPLATE % ('http-equiv', 'Content-Type')
To: _HTTPEQUIV_RE = _TEMPLATE % (r'http-?equiv', r'Content-?Type')

After this fix w3lib would detect the encoding as gb18030. This should have no side effects, but I don't know if it is the right way ;)
What do you think, @Gallaecio?
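A minimal sketch of the effect of the proposed change, using only the standard `re` module. The `TEMPLATE` string here is a simplified, hypothetical stand-in for w3lib's `_TEMPLATE` (the real one may differ), and the two sample tags are illustrative.

```python
import re

# Hypothetical, simplified stand-in for w3lib's attribute template;
# the real _TEMPLATE in w3lib/encoding.py may differ.
TEMPLATE = r'''%s\s*=\s*["']?\s*%s\s*["']?'''

# Current (strict) and proposed (lenient) attribute patterns.
strict = re.compile(TEMPLATE % ('http-equiv', 'Content-Type'), re.I)
lenient = re.compile(TEMPLATE % (r'http-?equiv', r'Content-?Type'), re.I)

valid = '<meta http-equiv="Content-Type" content="text/html; charset=gbk" />'
broken = '<meta httpequiv="ContentType" content="text/html; charset=gbk" />'

# Record whether each pattern recognises each tag.
results = {
    name: (bool(strict.search(tag)), bool(lenient.search(tag)))
    for name, tag in (("valid", valid), ("broken", broken))
}
print(results)
```

With the optional dash (`-?`), the lenient pattern accepts both spellings, while the strict pattern accepts only the valid one.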

More details below.

Details

I was able to reproduce the issue with the following setup:

Python 3.7.9
libs:
-- beautifulsoup4==4.9.3
-- html5lib==1.1
-- lxml==4.6.1
-- w3lib==1.22.0

Test python script:

from w3lib.encoding import html_body_declared_encoding
from bs4 import BeautifulSoup

b = b'<HTML>\r\n <HEAD>\r\n  <TITLE>\xcd\xf8\xd5\xbe\xb5\xd8\xcd\xbc</TITLE>\r\n  <meta httpequiv="ContentType" content="text/html; charset=gbk" />\r\n  <META NAME="Keywords" CONTENT="\xe5\xd0\xa1\xcb\xb5,\xcd\xf8\xd5\xbe\xb5\xd8\xcd\xbc">'
enc = html_body_declared_encoding(b)
print("html_body_declared_encoding: %s" % enc)

for parser in ['html5lib', 'html.parser', 'lxml']:
    soup = BeautifulSoup(b, parser)
    print("soup.original_encoding[parser:{}]: {}".format(parser, soup.original_encoding))

Script output:

html_body_declared_encoding: None
soup.original_encoding[parser:html5lib]: windows-1252
soup.original_encoding[parser:html.parser]: windows-1252
soup.original_encoding[parser:lxml]: gbk

Beautiful Soup detects the encoding only with the 'lxml' parser, through its fallback encoding detection:
lib: beautifulsoup4
file: bs4/dammit.py,
line: html_meta = '<\\s*meta[^>]+charset\\s*=\\s*["\']?([^>]*?)[ /;\'">]'

samuelchen · 2020-11-22T15:22:53Z

@kostalski Thank you for the feedback. I am not able to recall why that HTML had httpequiv="ContentType". I am not sure whether it was converted by some other part of Scrapy or was like that originally. I am sorry about this, it was too long ago to remember.
By the way, GB18030 is compatible with GBK.
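As a small check of that compatibility for the bytes in this report, the sketch below decodes the `<TITLE>` content with both of Python's built-in codecs. It only covers these particular bytes and is not a general proof.

```python
# Title bytes taken from the HTML snippet in the original report.
title = b'\xcd\xf8\xd5\xbe\xb5\xd8\xcd\xbc'

# Decode with both codecs shipped with Python's standard library.
as_gbk = title.decode('gbk')
as_gb18030 = title.decode('gb18030')

print(as_gbk)
print(as_gb18030)
print(as_gbk == as_gb18030)
```

Both decodes give the same title text that Beautiful Soup printed in the original report, so reporting gb18030 instead of gbk would not change how this page is decoded.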

kostalski · 2020-11-22T20:48:29Z

Ok @samuelchen, no problem 👍

Gallaecio added the bug label Mar 12, 2020
