Default downloader fails to get page #355

Open
mfyang opened this issue Jul 23, 2013 · 17 comments · May be fixed by twisted/twisted#12094 or #5911

Comments

@mfyang

mfyang commented Jul 23, 2013

'http://autos.msn.com/research/userreviews/reviewlist.aspx?ModelID=14749'

Looks like the default downloader, implemented with the Twisted library, can't fetch the above URL. I ran 'scrapy shell http://autos.msn.com/research/userreviews/reviewlist.aspx?ModelID=14749' and got the following output.

Traceback (most recent call last):
  File "/usr/local/bin/scrapy", line 5, in <module>
    pkg_resources.run_script('Scrapy==0.17.0', 'scrapy')
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/pkg_resources.py", line 489, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/pkg_resources.py", line 1207, in run_script
    execfile(script_filename, namespace, namespace)
  File "/Library/Python/2.7/site-packages/Scrapy-0.17.0-py2.7.egg/EGG-INFO/scripts/scrapy", line 4, in <module>
    execute()
  File "/Library/Python/2.7/site-packages/Scrapy-0.17.0-py2.7.egg/scrapy/cmdline.py", line 143, in execute
    _run_print_help(parser, _run_command, cmd, args, opts)
  File "/Library/Python/2.7/site-packages/Scrapy-0.17.0-py2.7.egg/scrapy/cmdline.py", line 88, in _run_print_help
    func(*a, **kw)
  File "/Library/Python/2.7/site-packages/Scrapy-0.17.0-py2.7.egg/scrapy/cmdline.py", line 150, in _run_command
    cmd.run(args, opts)
  File "/Library/Python/2.7/site-packages/Scrapy-0.17.0-py2.7.egg/scrapy/commands/shell.py", line 47, in run
    shell.start(url=url, spider=spider)
  File "/Library/Python/2.7/site-packages/Scrapy-0.17.0-py2.7.egg/scrapy/shell.py", line 43, in start
    self.fetch(url, spider)
  File "/Library/Python/2.7/site-packages/Scrapy-0.17.0-py2.7.egg/scrapy/shell.py", line 85, in fetch
    reactor, self._schedule, request, spider)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/threads.py", line 118, in blockingCallFromThread
    result.raiseException()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/python/failure.py", line 370, in raiseException
    raise self.type, self.value, self.tb
twisted.internet.error.ConnectionDone: Connection was closed cleanly.

But both urllib2.urlopen and requests.get can download the page without problems.
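
For reference, a minimal version of that check (just a sketch of what I ran; urllib2.urlopen behaves the same way):

import requests

url = 'http://autos.msn.com/research/userreviews/reviewlist.aspx?ModelID=14749'
resp = requests.get(url)
print(resp.status_code)  # 200
print(len(resp.text))    # non-empty HTML body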

@stav
Contributor

stav commented Aug 3, 2013

The initial cause of the error is that there is a cookie header line that is too long:

stav@maia:~$ curl -I http://autos.msn.com/research/userreviews/reviewlist.aspx?ModelID=14749
HTTP/1.1 200 OK
Cache-Control: no-cache
Pragma: no-cache
Content-Length: 195855
Content-Type: text/html; charset=utf-8
Expires: -1
Server: Microsoft-IIS/7.5
X-AspNet-Version: 2.0.50727
Set-Cookie: ResearchBackUrl=/research/userreviews/reviewlist.aspx?ModelID=14749; path=/
Set-Cookie: vReview=rid=441,1165,1248,1269,1272,1284,1417,1434,1455,1723,1800,1857,1875,2379,2396,2406,2439,2456,2734,2901,2944,2991,3046,3059,3157,3313,3576,3613,3634,3672,3986,4106,4227,4367,4461,4739,4857,4984,5073,5106,5275,5388,5406,5559,5592,5764,5771,5808,5838,5893,5962,6055,6198,6229,6332,6543,6546,6549,6826,6835,6839,6855,6881,6919,7021,7065,7112,7124,7196,7223,7329,7398,7411,7577,7579,7696,7698,7757,7759,7787,7973,7989,8136,8188,8189,8201,8231,8271,8285,8298,8346,8465,8482,8510,8521,8579,8613,8642,8744,8754,8812,8858,8875,8948,9000,9048,9116,9208,9223,9428,9468,9494,9561,9753,9844,10021,10063,10071,10091,10093,10120,10169,10193,10199,10212,10267,10317,10336,10361,10376,10446,10452,10481,10494,10500,10528,10535,10547,10556,10590,10607,10609,10619,10624,10625,10629,10662,10690,10706,10734,10753,10762,10772,10776,10819,10840,10861,10873,10902,10922,10932,11020,11031,11044,11046,11102,11132,11159,11173,11218,11227,11244,11336,11356,11434,11446,11453,11484,11531,11536,11545,11553,11559,11566,11577,11589,11595,11598,11636,11668,11706,11764,11784,11785,11792,11797,11799,11818,11829,11855,11857,11885,11943,11946,11955,11957,11963,11990,11997,12017,12059,12062,12105,12146,12163,... >>> longer than 63923
Set-Cookie: MC1=V=3&GUID=56202f9931a94d0e928050b01980dfe6; domain=.msn.com; expires=Mon, 04-Oct-2021 16:00:00 GMT; path=/
X-Powered-By: ASP.NET
Date: Sat, 03 Aug 2013 15:36:52 GMT

This is caught by twisted/protocols/basic.py:

if len(self.__buffer) > self.MAX_LENGTH:  # 16384

But the Scrapy implementation of the transport does not have a loseConnection method, ergo the Exception:

Traceback (most recent call last):
  File "/srv/scrapy/scrapy/scrapy/xlib/tx/_newclient.py", line 1431, in dataReceived
    self._parser.dataReceived(bytes)
  File "/srv/scrapy/scrapy/scrapy/xlib/tx/_newclient.py", line 382, in dataReceived
    HTTPParser.dataReceived(self, data)
  File "/usr/lib/python2.7/dist-packages/twisted/protocols/basic.py", line 556, in dataReceived
    return self.lineLengthExceeded(line)
  File "/usr/lib/python2.7/dist-packages/twisted/protocols/basic.py", line 638, in lineLengthExceeded
    return self.transport.loseConnection()
AttributeError: 'TransportProxyProducer' object has no attribute 'loseConnection'

Which is caught here:

> /srv/scrapy/scrapy/scrapy/xlib/tx/_newclient.py(1432)dataReceived()->None
-> self._parser.dataReceived(bytes)

By the infamous catch-all except: obfuscator:

def dataReceived(self, bytes):
    """
    Handle some stuff from some place.
    """
    try:
        self._parser.dataReceived(bytes)
    except:
        self._giveUp(Failure())
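
A possible local experiment (purely a hypothetical sketch, not a real fix) is to give TransportProxyProducer the loseConnection method that lineLengthExceeded expects, delegating to the transport it wraps, so an over-long header closes the connection instead of raising AttributeError inside that catch-all:

# Hypothetical monkey patch against the bundled copy of the Twisted client code.
from scrapy.xlib.tx import _newclient

def loseConnection(self):
    # _producer holds the real transport while the proxy is still active.
    if self._producer is not None:
        self._producer.loseConnection()

_newclient.TransportProxyProducer.loseConnection = loseConnection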

@boyce-ywr

is it fixed?

@nyov
Contributor

nyov commented Mar 29, 2016

The code in xlib/tx/_newclient.py hasn't changed from what @stav wrote down. So there is no fix there. But if the issue persists with Twisted > 13, then it's (still) a bug in the twisted project, as the bundled tx code isn't used with newer Twisted versions.

If there has been a fix for this upstream, it may still be too much trouble to backport it to the old pre-13 xlib/tx code. So I would propose closing this (and reporting it to Twisted if the issue persists).

@redapple
Contributor

FTR, this still fails with Twisted 16.4

@0xbf00

0xbf00 commented Sep 11, 2018

I'm running into this issue again with

scrapy shell https://macupdate.com

This command produces

2018-09-11 17:57:04 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: mac_scraper)
2018-09-11 17:57:04 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 18.7.0dev0, Python 3.7.0 (default, Jun 29 2018, 20:13:13) - [Clang 9.1.0 (clang-902.0.39.2)], pyOpenSSL 18.0.0 (OpenSSL 1.1.0i  14 Aug 2018), cryptography 2.3.1, Platform Darwin-17.7.0-x86_64-i386-64bit
2018-09-11 17:57:04 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'mac_scraper', 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'LOGSTATS_INTERVAL': 0, 'NEWSPIDER_MODULE': 'mac_scraper.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['mac_scraper.spiders']}
2018-09-11 17:57:04 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage']
2018-09-11 17:57:04 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-09-11 17:57:04 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-09-11 17:57:04 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-09-11 17:57:04 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-09-11 17:57:04 [scrapy.core.engine] INFO: Spider opened
2018-09-11 17:57:05 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.macupdate.com/robots.txt> from <GET https://macupdate.com/robots.txt>
2018-09-11 17:57:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.macupdate.com/robots.txt> (referer: None)
2018-09-11 17:57:06 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.macupdate.com/> from <GET https://macupdate.com>
2018-09-11 17:57:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.macupdate.com/robots.txt> (referer: None)
2018-09-11 17:57:06 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.macupdate.com/> (failed 1 times): [<twisted.python.failure.Failure builtins.ValueError: not enough values to unpack (expected 2, got 1)>]
2018-09-11 17:57:06 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.macupdate.com/> (failed 2 times): [<twisted.python.failure.Failure builtins.ValueError: not enough values to unpack (expected 2, got 1)>]
2018-09-11 17:57:07 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET https://www.macupdate.com/> (failed 3 times): [<twisted.python.failure.Failure builtins.ValueError: not enough values to unpack (expected 2, got 1)>]
Traceback (most recent call last):
  File "/usr/local/bin/scrapy", line 11, in <module>
    sys.exit(execute())
  File "/usr/local/lib/python3.7/site-packages/scrapy/cmdline.py", line 150, in execute
    _run_print_help(parser, _run_command, cmd, args, opts)
  File "/usr/local/lib/python3.7/site-packages/scrapy/cmdline.py", line 90, in _run_print_help
    func(*a, **kw)
  File "/usr/local/lib/python3.7/site-packages/scrapy/cmdline.py", line 157, in _run_command
    cmd.run(args, opts)
  File "/usr/local/lib/python3.7/site-packages/scrapy/commands/shell.py", line 73, in run
    shell.start(url=url, redirect=not opts.no_redirect)
  File "/usr/local/lib/python3.7/site-packages/scrapy/shell.py", line 48, in start
    self.fetch(url, spider, redirect=redirect)
  File "/usr/local/lib/python3.7/site-packages/scrapy/shell.py", line 114, in fetch
    result = threads.blockingCallFromThread(reactor, self._schedule, request, spider)
  File "/usr/local/lib/python3.7/site-packages/twisted/internet/threads.py", line 122, in blockingCallFromThread
    result.raiseException()
  File "/usr/local/lib/python3.7/site-packages/twisted/python/failure.py", line 467, in raiseException
    raise self.value.with_traceback(self.tb)
twisted.web._newclient.ResponseFailed: [<twisted.python.failure.Failure builtins.ValueError: not enough values to unpack (expected 2, got 1)>]

I've had trouble debugging the actual underlying issue, but the server is also sending an overly large header field, so I suspect the cause is the same.
How should one go about fixing this (locally)? Since Twisted is likely not going to fix this (see here), I've tried setting a larger MAX_LENGTH constant in twisted/protocols/basic.py. However, that seems to have no effect for me...
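
One alternative I have not verified: instead of editing twisted/protocols/basic.py, raise the limit at runtime on the HTTP/1.1 client parser class. This is only a sketch, assuming the parser in use is twisted.web._newclient.HTTPParser (a LineReceiver) and that its line-length limit is what is being exceeded:

# Hypothetical runtime override; there is no official Scrapy/Twisted setting for this.
from twisted.web import _newclient

# HTTPParser inherits MAX_LENGTH (16384 bytes by default) from
# twisted.protocols.basic.LineReceiver; overriding it on the subclass leaves
# the installed Twisted sources untouched.
_newclient.HTTPParser.MAX_LENGTH = 65536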

@0xbf00

0xbf00 commented Sep 18, 2018

I've written up a workaround here.

@nyov
Contributor

nyov commented Sep 19, 2018

@0xbf00 Thanks for providing a working workaround.
That does seem like rather an overkill solution (putting a TLS MITM proxy beneath Scrapy) 🤣
(I wouldn't even mind much if mitmproxy weren't so strict about up-to-date dependency requirements, which means I can't easily use a current version.)

I tried to build you a more internal solution, assuming the only problem is the line length:

# myproject/settings.py

### Force HTTP1.0 Handler
DOWNLOAD_HANDLERS = {
    'http': 'scrapy.core.downloader.handlers.http.HTTP10DownloadHandler',
    'https': 'scrapy.core.downloader.handlers.http.HTTP10DownloadHandler',
}

#TODO?# MAX_HTTP_LINE_LENGTH = 65536
DOWNLOADER_HTTPCLIENTFACTORY = 'myproject.downloader.ScrapyHTTPClientFactory'
DOWNLOADER_CLIENTCONTEXTFACTORY = 'myproject.downloader.ScrapyClientContextFactory'

# myproject/downloader.py
from OpenSSL import SSL

from scrapy.core.downloader.webclient import (
    ScrapyHTTPPageGetter as HTTPPageGetter,
    ScrapyHTTPClientFactory as HTTPClientFactory,
)
from scrapy.core.downloader.contextfactory import \
    ScrapyClientContextFactory as ClientContextFactory


class ScrapyBadHTTPPageGetter(HTTPPageGetter):

    delimiter = b'\n'
    # Maximum Line Length of LineReceiverProtocol
    MAX_LENGTH = 65536

    # no idea how to get at settings here, so scratch that
    #def __init__(self, *a, **kw):
    #    self.MAX_LENGTH = settings.getint('MAX_HTTP_LINE_LENGTH', 16384)


class ScrapyHTTPClientFactory(HTTPClientFactory):

    protocol = ScrapyBadHTTPPageGetter


class ScrapyClientContextFactory(ClientContextFactory):

    def __init__(self):
        # default method is SSLv23_METHOD
        self.method = SSL.SSLv23_METHOD

However, this still doesn't seem to work on your domain "https://www.macupdate.com/" (YMMV):
The error with this now is an SSL handshake failure: Error: [('SSL routines', 'ssl3_read_bytes', 'sslv3 alert handshake failure')]. That's not an issue of Scrapy IMO, but the server trying to negotiate something stupid.

/edit: I figured out this is because of the missing SNI support in the HTTP10DownloadHandler/OpenSSL combination.

But perhaps you can manage to make that work by changing the ClientContextFactory, which is why I provided an override of ScrapyClientContextFactory here as well? (No idea actually)

(The better solution would be to fix it in the HTTP/1.1 download handler instead, but that class is a lot more involved, so I haven't managed to fix it there so far. And HTTP/1.0 is usually still good enough for many sites.)
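
For completeness, registering a custom handler for the HTTP/1.1 path would be wired up the same way (hypothetical class name; actually raising the header-line limit inside such a handler is the part I haven't solved):

# myproject/settings.py (sketch)
DOWNLOAD_HANDLERS = {
    'http': 'myproject.handlers.LenientHTTP11DownloadHandler',
    'https': 'myproject.handlers.LenientHTTP11DownloadHandler',
}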

@0xbf00

0xbf00 commented Sep 19, 2018

@nyov Thanks for your input! I know that my workaround is not ideal, but it works for me and it involves no fiddling with scrapy and twisted internals. Ideally, this could be fixed upstream, but I am not the person to do this.

@alvarolloret

receiving this error in 2020. Still no fix?

@elacuesta
Member

@alvarolloret I cannot currently reproduce with any of the URLs posted in this thread. Could you post yours?

@alvarolloret

I actually had a long list of URLs (around 15,000), and about 0.5% gave this error. Once I ran it again with only the ones that had failed, the error disappeared :)

@vp777

vp777 commented Oct 25, 2021

same error with Twisted 21.7.0:
scrapy shell https://spotless.tech/

@elacuesta
Member

elacuesta commented Oct 26, 2021

Seems like this last site sends some ASCII art with its headers:

$ curl -I https://spotless.tech
HTTP/1.1 200 sP0tL3sS sP0tlLesS (╯°□°)╯︵ ┻━┻
░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
░░░░░░░░░░▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄░░░░░░░░░
░░░░░░░░▄▀░░░░░░░░░░░░▄░░░░░░░▀▄░░░░░░░
░░░░░░░░█░░▄░░░░▄░░░░░░░░░░░░░░█░░░░░░░
░░░░░░░░█░░░░░░░░░░░░▄█▄▄░░▄░░░█░▄▄▄░░░
░▄▄▄▄▄░░█░░░░░░▀░░░░▀█░░▀▄░░░░░█▀▀░██░░
░██▄▀██▄█░░░▄░░░░░░░██░░░░▀▀▀▀▀░░░░██░░
░░▀██▄▀██░░░░░░░░▀░██▀░░░░░░░░░░░░░▀██░
░░░░▀████░▀░░░░▄░░░██░░░▄█░░░░▄░▄█░░██░
░░░░░░░▀█░░░░▄░░░░░██░░░░▄░░░▄░░▄░░░██░
░░░░░░░▄█▄░░░░░░░░░░░▀▄░░▀▀▀▀▀▀▀▀░░▄▀░░
░░░░░░█▀▀█████████▀▀▀▀████████████▀░░░░
░░░░░░████▀░░███▀░░░░░░▀███░░▀██▀░░░░░░
░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
Server: Sp0tw3b
Date: Tue, 26 Oct 2021 12:07:07 GMT
Content-Type: text/html
Content-Length: 33015
Connection: keep-alive
Last-Modified: Tuesday, 26-Oct-2021 12:07:07 GMT
Cache-Control: no-store, no-cache, must-revalidate, proxy-revalidate, max-age=0
Accept-Ranges: bytes

which makes Twisted choke on this line. There is no b":" in the received header, hence the ValueError:

>>> a, b = b"foobar".split(b":", 1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: not enough values to unpack (expected 2, got 1)

AFAICT, these are not RFC-compliant headers: "Each header field consists of a name followed by a colon (":") and the field value" (RFC 2616, section 4.2).
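
A tolerant client would essentially have to skip such lines instead of unpacking them, something like this hypothetical helper (not Twisted's actual code, just to illustrate the failure mode):

def parse_header_line(line):
    # Lines without a colon are not valid RFC 2616 header fields; a lenient
    # parser would ignore them rather than raise ValueError.
    if b":" not in line:
        return None
    name, value = line.split(b":", 1)
    return name.strip(), value.strip()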

@vp777

vp777 commented Oct 26, 2021

Holy mother of god.
Good job spotting it. I have a spider running on a big list of hosts; I will update if another host pops up.

@manojbhatt123

Still getting this error for URL "https://www.accenture.com/ro-en/services/data-analytics-index#block-what-we-think" in Nov 2022. Still no fix?

@wRAR
Member

wRAR commented Jan 29, 2023

> Still getting this error for URL "https://www.accenture.com/ro-en/services/data-analytics-index#block-what-we-think" in Nov 2022. Still no fix?

FTR, this URL works just fine for me (just like all of the URLs mentioned earlier). We should probably close this.

@Gallaecio
Member

Gallaecio commented Apr 25, 2023

Looks like this issue still applies. twisted/twisted#8570 is not fixed. It can currently be reproduced with https://www.vapestore.co.uk/ due to their Content-Security-Policy header, but if we need to reproduce it in the future we just need to trigger a response with a header exceeding Twisted's limit.
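
For a self-contained reproduction (a sketch, assuming only that the client rejects header lines beyond its internal limit), a tiny local server that sends one oversized header should trigger the same failure when fetched with scrapy shell http://127.0.0.1:8080/:

# repro_server.py - hypothetical local reproduction helper
from http.server import BaseHTTPRequestHandler, HTTPServer

class OversizedHeaderHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        # One header value well past the ~16 KB line limit discussed above.
        self.send_header("X-Big", "a" * 70000)
        self.send_header("Content-Length", "2")
        self.end_headers()
        self.wfile.write(b"ok")

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8080), OversizedHeaderHandler).serve_forever()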
