Default downloader fails to get page #355

Open
mfyang opened this issue Jul 23, 2013 · 17 comments · May be fixed by twisted/twisted#12094 or #5911

Comments

@mfyang

mfyang commented Jul 23, 2013

'http://autos.msn.com/research/userreviews/reviewlist.aspx?ModelID=14749'

Looks like the default downloader, implemented with the Twisted library, can't fetch the above URL. I ran 'scrapy shell http://autos.msn.com/research/userreviews/reviewlist.aspx?ModelID=14749' and got the following output.

Traceback (most recent call last):
  File "/usr/local/bin/scrapy", line 5, in <module>
    pkg_resources.run_script('Scrapy==0.17.0', 'scrapy')
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/pkg_resources.py", line 489, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/pkg_resources.py", line 1207, in run_script
    execfile(script_filename, namespace, namespace)
  File "/Library/Python/2.7/site-packages/Scrapy-0.17.0-py2.7.egg/EGG-INFO/scripts/scrapy", line 4, in <module>
    execute()
  File "/Library/Python/2.7/site-packages/Scrapy-0.17.0-py2.7.egg/scrapy/cmdline.py", line 143, in execute
    _run_print_help(parser, _run_command, cmd, args, opts)
  File "/Library/Python/2.7/site-packages/Scrapy-0.17.0-py2.7.egg/scrapy/cmdline.py", line 88, in _run_print_help
    func(*a, **kw)
  File "/Library/Python/2.7/site-packages/Scrapy-0.17.0-py2.7.egg/scrapy/cmdline.py", line 150, in _run_command
    cmd.run(args, opts)
  File "/Library/Python/2.7/site-packages/Scrapy-0.17.0-py2.7.egg/scrapy/commands/shell.py", line 47, in run
    shell.start(url=url, spider=spider)
  File "/Library/Python/2.7/site-packages/Scrapy-0.17.0-py2.7.egg/scrapy/shell.py", line 43, in start
    self.fetch(url, spider)
  File "/Library/Python/2.7/site-packages/Scrapy-0.17.0-py2.7.egg/scrapy/shell.py", line 85, in fetch
    reactor, self._schedule, request, spider)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/threads.py", line 118, in blockingCallFromThread
    result.raiseException()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/python/failure.py", line 370, in raiseException
    raise self.type, self.value, self.tb
twisted.internet.error.ConnectionDone: Connection was closed cleanly.

But both urllib2.urlopen and requests.get can download the page without problems.
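
For reference, a minimal version of that check (just a sketch of what I ran; urllib2.urlopen behaves the same way):

import requests

url = 'http://autos.msn.com/research/userreviews/reviewlist.aspx?ModelID=14749'
resp = requests.get(url)
print(resp.status_code)  # 200
print(len(resp.text))    # non-empty HTML body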

@stav
Contributor

stav commented Aug 3, 2013

The initial cause of the error is that there is a cookie header line that is too long:

stav@maia:~$ curl -I http://autos.msn.com/research/userreviews/reviewlist.aspx?ModelID=14749
HTTP/1.1 200 OK
Cache-Control: no-cache
Pragma: no-cache
Content-Length: 195855
Content-Type: text/html; charset=utf-8
Expires: -1
Server: Microsoft-IIS/7.5
X-AspNet-Version: 2.0.50727
Set-Cookie: ResearchBackUrl=/research/userreviews/reviewlist.aspx?ModelID=14749; path=/
Set-Cookie: vReview=rid=441,1165,1248,1269,1272,1284,1417,1434,1455,1723,1800,1857,1875,2379,2396,2406,2439,2456,2734,2901,2944,2991,3046,3059,3157,3313,3576,3613,3634,3672,3986,4106,4227,4367,4461,4739,4857,4984,5073,5106,5275,5388,5406,5559,5592,5764,5771,5808,5838,5893,5962,6055,6198,6229,6332,6543,6546,6549,6826,6835,6839,6855,6881,6919,7021,7065,7112,7124,7196,7223,7329,7398,7411,7577,7579,7696,7698,7757,7759,7787,7973,7989,8136,8188,8189,8201,8231,8271,8285,8298,8346,8465,8482,8510,8521,8579,8613,8642,8744,8754,8812,8858,8875,8948,9000,9048,9116,9208,9223,9428,9468,9494,9561,9753,9844,10021,10063,10071,10091,10093,10120,10169,10193,10199,10212,10267,10317,10336,10361,10376,10446,10452,10481,10494,10500,10528,10535,10547,10556,10590,10607,10609,10619,10624,10625,10629,10662,10690,10706,10734,10753,10762,10772,10776,10819,10840,10861,10873,10902,10922,10932,11020,11031,11044,11046,11102,11132,11159,11173,11218,11227,11244,11336,11356,11434,11446,11453,11484,11531,11536,11545,11553,11559,11566,11577,11589,11595,11598,11636,11668,11706,11764,11784,11785,11792,11797,11799,11818,11829,11855,11857,11885,11943,11946,11955,11957,11963,11990,11997,12017,12059,12062,12105,12146,12163,... >>> longer than 63923
Set-Cookie: MC1=V=3&GUID=56202f9931a94d0e928050b01980dfe6; domain=.msn.com; expires=Mon, 04-Oct-2021 16:00:00 GMT; path=/
X-Powered-By: ASP.NET
Date: Sat, 03 Aug 2013 15:36:52 GMT

This is caught by twisted/protocols/basic.py:

if len(self.__buffer) > self.MAX_LENGTH:  # 16384

But the Scrapy implementation of the transport does not have a loseConnection method, ergo the Exception:

Traceback (most recent call last):
  File "/srv/scrapy/scrapy/scrapy/xlib/tx/_newclient.py", line 1431, in dataReceived
    self._parser.dataReceived(bytes)
  File "/srv/scrapy/scrapy/scrapy/xlib/tx/_newclient.py", line 382, in dataReceived
    HTTPParser.dataReceived(self, data)
  File "/usr/lib/python2.7/dist-packages/twisted/protocols/basic.py", line 556, in dataReceived
    return self.lineLengthExceeded(line)
  File "/usr/lib/python2.7/dist-packages/twisted/protocols/basic.py", line 638, in lineLengthExceeded
    return self.transport.loseConnection()
AttributeError: 'TransportProxyProducer' object has no attribute 'loseConnection'

Which is caught here:

> /srv/scrapy/scrapy/scrapy/xlib/tx/_newclient.py(1432)dataReceived()->None
-> self._parser.dataReceived(bytes)

By the infamous catch-all except: obfuscator:

def dataReceived(self, bytes):
    """
    Handle some stuff from some place.
    """
    try:
        self._parser.dataReceived(bytes)
    except:
        self._giveUp(Failure())
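
A possible local experiment (purely a hypothetical sketch, not a real fix) is to give TransportProxyProducer the loseConnection method that lineLengthExceeded expects, delegating to the transport it wraps, so an over-long header closes the connection instead of raising AttributeError inside that catch-all:

# Hypothetical monkey patch against the bundled copy of the Twisted client code.
from scrapy.xlib.tx import _newclient

def loseConnection(self):
    # _producer holds the real transport while the proxy is still active.
    if self._producer is not None:
        self._producer.loseConnection()

_newclient.TransportProxyProducer.loseConnection = loseConnection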

@boyce-ywr

is it fixed?

@nyov
Contributor

nyov commented Mar 29, 2016

The code in xlib/tx/_newclient.py hasn't changed from what @stav wrote down. So there is no fix there. But if the issue persists with Twisted > 13, then it's (still) a bug in the twisted project, as the bundled tx code isn't used with newer Twisted versions.

If there has been a fix for this upstream, it may still be too much trouble to backport it to the old pre-13 xlib/tx code. So I would propose closing this (and reporting it to Twisted if the issue persists).

@redapple
Contributor

FTR, this still fails with Twisted 16.4

@0xbf00

0xbf00 commented Sep 11, 2018

I'm running into this issue again with

scrapy shell https://macupdate.com

This command produces

2018-09-11 17:57:04 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: mac_scraper)
2018-09-11 17:57:04 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 18.7.0dev0, Python 3.7.0 (default, Jun 29 2018, 20:13:13) - [Clang 9.1.0 (clang-902.0.39.2)], pyOpenSSL 18.0.0 (OpenSSL 1.1.0i  14 Aug 2018), cryptography 2.3.1, Platform Darwin-17.7.0-x86_64-i386-64bit
2018-09-11 17:57:04 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'mac_scraper', 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'LOGSTATS_INTERVAL': 0, 'NEWSPIDER_MODULE': 'mac_scraper.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['mac_scraper.spiders']}
2018-09-11 17:57:04 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage']
2018-09-11 17:57:04 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-09-11 17:57:04 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-09-11 17:57:04 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-09-11 17:57:04 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-09-11 17:57:04 [scrapy.core.engine] INFO: Spider opened
2018-09-11 17:57:05 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.macupdate.com/robots.txt> from <GET https://macupdate.com/robots.txt>
2018-09-11 17:57:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.macupdate.com/robots.txt> (referer: None)
2018-09-11 17:57:06 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.macupdate.com/> from <GET https://macupdate.com>
2018-09-11 17:57:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.macupdate.com/robots.txt> (referer: None)
2018-09-11 17:57:06 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.macupdate.com/> (failed 1 times): [<twisted.python.failure.Failure builtins.ValueError: not enough values to unpack (expected 2, got 1)>]
2018-09-11 17:57:06 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.macupdate.com/> (failed 2 times): [<twisted.python.failure.Failure builtins.ValueError: not enough values to unpack (expected 2, got 1)>]
2018-09-11 17:57:07 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET https://www.macupdate.com/> (failed 3 times): [<twisted.python.failure.Failure builtins.ValueError: not enough values to unpack (expected 2, got 1)>]
Traceback (most recent call last):
  File "/usr/local/bin/scrapy", line 11, in <module>
    sys.exit(execute())
  File "/usr/local/lib/python3.7/site-packages/scrapy/cmdline.py", line 150, in execute
    _run_print_help(parser, _run_command, cmd, args, opts)
  File "/usr/local/lib/python3.7/site-packages/scrapy/cmdline.py", line 90, in _run_print_help
    func(*a, **kw)
  File "/usr/local/lib/python3.7/site-packages/scrapy/cmdline.py", line 157, in _run_command
    cmd.run(args, opts)
  File "/usr/local/lib/python3.7/site-packages/scrapy/commands/shell.py", line 73, in run
    shell.start(url=url, redirect=not opts.no_redirect)
  File "/usr/local/lib/python3.7/site-packages/scrapy/shell.py", line 48, in start
    self.fetch(url, spider, redirect=redirect)
  File "/usr/local/lib/python3.7/site-packages/scrapy/shell.py", line 114, in fetch
    result = threads.blockingCallFromThread(reactor, self._schedule, request, spider)
  File "/usr/local/lib/python3.7/site-packages/twisted/internet/threads.py", line 122, in blockingCallFromThread
    result.raiseException()
  File "/usr/local/lib/python3.7/site-packages/twisted/python/failure.py", line 467, in raiseException
    raise self.value.with_traceback(self.tb)
twisted.web._newclient.ResponseFailed: [<twisted.python.failure.Failure builtins.ValueError: not enough values to unpack (expected 2, got 1)>]

I've had trouble debugging the actual underlying issue, but the server is also sending an overly large header field, so I suspect the cause is the same.
How should one go about fixing this (locally)? Since Twisted is likely not going to fix this (see here), I've tried setting a larger MAX_LENGTH constant in twisted/protocols/basic.py. However, that seems to have no effect for me...
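
One alternative I have not verified: instead of editing twisted/protocols/basic.py, raise the limit at runtime on the HTTP/1.1 client parser class. This is only a sketch, assuming the parser in use is twisted.web._newclient.HTTPParser (a LineReceiver) and that its line-length limit is what is being exceeded:

# Hypothetical runtime override; there is no official Scrapy/Twisted setting for this.
from twisted.web import _newclient

# HTTPParser inherits MAX_LENGTH (16384 bytes by default) from
# twisted.protocols.basic.LineReceiver; overriding it on the subclass leaves
# the installed Twisted sources untouched.
_newclient.HTTPParser.MAX_LENGTH = 65536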

@0xbf00

0xbf00 commented Sep 18, 2018

I've written up a workaround here.

@nyov
Contributor

nyov commented Sep 19, 2018

@0xbf00 Thanks for providing a working workaround.
That does seem like rather an overkill solution (putting a TLS MITM proxy beneath Scrapy) 🤣
(I wouldn't even mind much if mitmproxy weren't so strict about up-to-date dependency requirements, which means I can't easily use a current version.)

I tried to build you a more internal solution, assuming the only problem is the line length:

# myproject/settings.py

### Force HTTP1.0 Handler
DOWNLOAD_HANDLERS = {
    'http': 'scrapy.core.downloader.handlers.http.HTTP10DownloadHandler',
    'https': 'scrapy.core.downloader.handlers.http.HTTP10DownloadHandler',
}

#TODO?# MAX_HTTP_LINE_LENGTH = 65536
DOWNLOADER_HTTPCLIENTFACTORY = 'myproject.downloader.ScrapyHTTPClientFactory'
DOWNLOADER_CLIENTCONTEXTFACTORY = 'myproject.downloader.ScrapyClientContextFactory'

# myproject/downloader.py
from OpenSSL import SSL

from scrapy.core.downloader.webclient import (
    ScrapyHTTPPageGetter as HTTPPageGetter,
    ScrapyHTTPClientFactory as HTTPClientFactory,
)
from scrapy.core.downloader.contextfactory import \
    ScrapyClientContextFactory as ClientContextFactory


class ScrapyBadHTTPPageGetter(HTTPPageGetter):

    delimiter = b'\n'
    # Maximum Line Length of LineReceiverProtocol
    MAX_LENGTH = 65536

    # no idea how to get at settings here, so scratch that
    #def __init__(self, *a, **kw):
    #    self.MAX_LENGTH = settings.getint('MAX_HTTP_LINE_LENGTH', 16384)


class ScrapyHTTPClientFactory(HTTPClientFactory):

    protocol = ScrapyBadHTTPPageGetter


class ScrapyClientContextFactory(ClientContextFactory):

    def __init__(self):
        # default method is SSLv23_METHOD
        self.method = SSL.SSLv23_METHOD

However, this still doesn't seem to work on your domain "https://www.macupdate.com/" (YMMV):
The error with this now is an SSL handshake failure: Error: [('SSL routines', 'ssl3_read_bytes', 'sslv3 alert handshake failure')]. That's not an issue of Scrapy IMO, but the server trying to negotiate something stupid.

/edit: I figured out this is because of the missing SNI support in the HTTP10DownloadHandler/OpenSSL combination.

But perhaps you can manage to make that work by changing the ClientContextFactory, which is why I provided an override of ScrapyClientContextFactory here as well? (No idea actually)

(The better solution would be to fix it in the HTTP/1.1 download handler instead, but that class is a lot more involved, so I haven't managed to fix it there so far. And HTTP/1.0 is usually still good enough for many sites.)
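
For completeness, registering a custom handler for the HTTP/1.1 path would be wired up the same way (hypothetical class name; actually raising the header-line limit inside such a handler is the part I haven't solved):

# myproject/settings.py (sketch)
DOWNLOAD_HANDLERS = {
    'http': 'myproject.handlers.LenientHTTP11DownloadHandler',
    'https': 'myproject.handlers.LenientHTTP11DownloadHandler',
}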

@0xbf00

0xbf00 commented Sep 19, 2018

@nyov Thanks for your input! I know that my workaround is not ideal, but it works for me and it involves no fiddling with scrapy and twisted internals. Ideally, this could be fixed upstream, but I am not the person to do this.

@alvarolloret

receiving this error in 2020. Still no fix?

@elacuesta
Member

@alvarolloret I cannot currently reproduce with any of the URLs posted in this thread. Could you post yours?

@alvarolloret

I actually had a long list of URLs (around 15,000), and about 0.5% gave this error. Once I ran it again with only the ones that had failed, the error disappeared :)

@vp777

vp777 commented Oct 25, 2021

same error with Twisted 21.7.0:
scrapy shell https://spotless.tech/

@elacuesta
Member

elacuesta commented Oct 26, 2021

Seems like this last site sends some ASCII art with its headers:

$ curl -I https://spotless.tech
HTTP/1.1 200 sP0tL3sS sP0tlLesS (╯°□°)╯︵ ┻━┻
░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
░░░░░░░░░░▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄░░░░░░░░░
░░░░░░░░▄▀░░░░░░░░░░░░▄░░░░░░░▀▄░░░░░░░
░░░░░░░░█░░▄░░░░▄░░░░░░░░░░░░░░█░░░░░░░
░░░░░░░░█░░░░░░░░░░░░▄█▄▄░░▄░░░█░▄▄▄░░░
░▄▄▄▄▄░░█░░░░░░▀░░░░▀█░░▀▄░░░░░█▀▀░██░░
░██▄▀██▄█░░░▄░░░░░░░██░░░░▀▀▀▀▀░░░░██░░
░░▀██▄▀██░░░░░░░░▀░██▀░░░░░░░░░░░░░▀██░
░░░░▀████░▀░░░░▄░░░██░░░▄█░░░░▄░▄█░░██░
░░░░░░░▀█░░░░▄░░░░░██░░░░▄░░░▄░░▄░░░██░
░░░░░░░▄█▄░░░░░░░░░░░▀▄░░▀▀▀▀▀▀▀▀░░▄▀░░
░░░░░░█▀▀█████████▀▀▀▀████████████▀░░░░
░░░░░░████▀░░███▀░░░░░░▀███░░▀██▀░░░░░░
░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
Server: Sp0tw3b
Date: Tue, 26 Oct 2021 12:07:07 GMT
Content-Type: text/html
Content-Length: 33015
Connection: keep-alive
Last-Modified: Tuesday, 26-Oct-2021 12:07:07 GMT
Cache-Control: no-store, no-cache, must-revalidate, proxy-revalidate, max-age=0
Accept-Ranges: bytes

which makes Twisted choke on this line. There is no b":" in the received header, hence the ValueError:

>>> a, b = b"foobar".split(b":", 1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: not enough values to unpack (expected 2, got 1)

AFAICT, these are not RFC-compliant headers: "Each header field consists of a name followed by a colon (":") and the field value" (RFC 2616, section 4.2).
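
A tolerant client would essentially have to skip such lines instead of unpacking them, something like this hypothetical helper (not Twisted's actual code, just to illustrate the failure mode):

def parse_header_line(line):
    # Lines without a colon are not valid RFC 2616 header fields; a lenient
    # parser would ignore them rather than raise ValueError.
    if b":" not in line:
        return None
    name, value = line.split(b":", 1)
    return name.strip(), value.strip()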

@vp777

vp777 commented Oct 26, 2021

Holy mother of god.
Good job spotting it. I have a spider running on a big list of hosts; I will update if another host pops up.

@manojbhatt123

Still getting this error for URL "https://www.accenture.com/ro-en/services/data-analytics-index#block-what-we-think" in Nov 2022. Still no fix?

@wRAR
Member

wRAR commented Jan 29, 2023

> Still getting this error for URL "https://www.accenture.com/ro-en/services/data-analytics-index#block-what-we-think" in Nov 2022. Still no fix?

FTR, this URL works just fine for me (just like all of the URLs mentioned earlier). We should probably close this.

@Gallaecio
Member

Gallaecio commented Apr 25, 2023

Looks like this issue still applies. twisted/twisted#8570 is not fixed. It can currently be reproduced with https://www.vapestore.co.uk/ due to their Content-Security-Policy header, but if we need to reproduce it in the future we just need to trigger a response with a header exceeding Twisted's limit.
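
For a self-contained reproduction (a sketch, assuming only that the client rejects header lines beyond its internal limit), a tiny local server that sends one oversized header should trigger the same failure when fetched with scrapy shell http://127.0.0.1:8080/:

# repro_server.py - hypothetical local reproduction helper
from http.server import BaseHTTPRequestHandler, HTTPServer

class OversizedHeaderHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        # One header value well past the ~16 KB line limit discussed above.
        self.send_header("X-Big", "a" * 70000)
        self.send_header("Content-Length", "2")
        self.end_headers()
        self.wfile.write(b"ok")

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8080), OversizedHeaderHandler).serve_forever()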
