Default downloader fails to get page #355
The initial cause of the error is a cookie header line that is too long. This is caught by Twisted's line-length check, but the Scrapy implementation of the transport does not have a handler for that case, so the failure ends up in the infamous catch-all.
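For illustration, here is a minimal pure-Python sketch (not Twisted's actual code) of the check that twisted.protocols.basic.LineReceiver performs: any line longer than MAX_LENGTH is routed to lineLengthExceeded() instead of lineReceived(), and by default that aborts the connection.

```python
# Sketch of LineReceiver-style line-length policing. The 16384 default
# matches twisted.protocols.basic.LineReceiver.MAX_LENGTH.
MAX_LENGTH = 16384

def split_header_lines(data, max_length=MAX_LENGTH):
    """Split raw header bytes into lines, separating over-long ones."""
    ok, too_long = [], []
    for line in data.split(b"\r\n"):
        if len(line) > max_length:
            too_long.append(line)  # LineReceiver would drop the connection here
        else:
            ok.append(line)
    return ok, too_long

raw = b"Content-Type: text/html\r\nSet-Cookie: " + b"x" * 20000
ok, too_long = split_header_lines(raw)
# too_long now holds the oversized Set-Cookie line
```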
|
is it fixed? |
The code in question is the old pre-Twisted-13 xlib/tx code. Even if there has been a fix for this upstream, it may still be too much trouble to backport it. So I would propose closing this (and reporting it to Twisted if the issue persists). |
ftr, this still fails with Twisted 16.4 |
I'm running into this issue again with scrapy shell https://macupdate.com. This command produces:

2018-09-11 17:57:04 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: mac_scraper)
2018-09-11 17:57:04 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 18.7.0dev0, Python 3.7.0 (default, Jun 29 2018, 20:13:13) - [Clang 9.1.0 (clang-902.0.39.2)], pyOpenSSL 18.0.0 (OpenSSL 1.1.0i 14 Aug 2018), cryptography 2.3.1, Platform Darwin-17.7.0-x86_64-i386-64bit
2018-09-11 17:57:04 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'mac_scraper', 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'LOGSTATS_INTERVAL': 0, 'NEWSPIDER_MODULE': 'mac_scraper.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['mac_scraper.spiders']}
2018-09-11 17:57:04 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage']
2018-09-11 17:57:04 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-09-11 17:57:04 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-09-11 17:57:04 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-09-11 17:57:04 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-09-11 17:57:04 [scrapy.core.engine] INFO: Spider opened
2018-09-11 17:57:05 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.macupdate.com/robots.txt> from <GET https://macupdate.com/robots.txt>
2018-09-11 17:57:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.macupdate.com/robots.txt> (referer: None)
2018-09-11 17:57:06 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.macupdate.com/> from <GET https://macupdate.com>
2018-09-11 17:57:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.macupdate.com/robots.txt> (referer: None)
2018-09-11 17:57:06 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.macupdate.com/> (failed 1 times): [<twisted.python.failure.Failure builtins.ValueError: not enough values to unpack (expected 2, got 1)>]
2018-09-11 17:57:06 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.macupdate.com/> (failed 2 times): [<twisted.python.failure.Failure builtins.ValueError: not enough values to unpack (expected 2, got 1)>]
2018-09-11 17:57:07 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET https://www.macupdate.com/> (failed 3 times): [<twisted.python.failure.Failure builtins.ValueError: not enough values to unpack (expected 2, got 1)>]
Traceback (most recent call last):
File "/usr/local/bin/scrapy", line 11, in <module>
sys.exit(execute())
File "/usr/local/lib/python3.7/site-packages/scrapy/cmdline.py", line 150, in execute
_run_print_help(parser, _run_command, cmd, args, opts)
File "/usr/local/lib/python3.7/site-packages/scrapy/cmdline.py", line 90, in _run_print_help
func(*a, **kw)
File "/usr/local/lib/python3.7/site-packages/scrapy/cmdline.py", line 157, in _run_command
cmd.run(args, opts)
File "/usr/local/lib/python3.7/site-packages/scrapy/commands/shell.py", line 73, in run
shell.start(url=url, redirect=not opts.no_redirect)
File "/usr/local/lib/python3.7/site-packages/scrapy/shell.py", line 48, in start
self.fetch(url, spider, redirect=redirect)
File "/usr/local/lib/python3.7/site-packages/scrapy/shell.py", line 114, in fetch
result = threads.blockingCallFromThread(reactor, self._schedule, request, spider)
File "/usr/local/lib/python3.7/site-packages/twisted/internet/threads.py", line 122, in blockingCallFromThread
result.raiseException()
File "/usr/local/lib/python3.7/site-packages/twisted/python/failure.py", line 467, in raiseException
raise self.value.with_traceback(self.tb)
twisted.web._newclient.ResponseFailed: [<twisted.python.failure.Failure builtins.ValueError: not enough values to unpack (expected 2, got 1)>]

I've had trouble debugging the actual underlying issue, but the server is also sending an overly large header field, and so I suspect the issue is the same. |
I've written up a workaround here. |
@0xbf00 Thanks for providing a working workaround. I tried to build you a more internal solution, assuming the only problem is the line length:

```python
# myproject/settings.py

### Force HTTP1.0 Handler
DOWNLOAD_HANDLERS = {
    'http': 'scrapy.core.downloader.handlers.http.HTTP10DownloadHandler',
    'https': 'scrapy.core.downloader.handlers.http.HTTP10DownloadHandler',
}
#TODO?# MAX_HTTP_LINE_LENGTH = 65536
DOWNLOADER_HTTPCLIENTFACTORY = 'myproject.downloader.ScrapyHTTPClientFactory'
DOWNLOADER_CLIENTCONTEXTFACTORY = 'myproject.downloader.ScrapyClientContextFactory'
```

```python
# myproject/downloader.py

from OpenSSL import SSL
from scrapy.core.downloader.webclient import (
    ScrapyHTTPPageGetter as HTTPPageGetter,
    ScrapyHTTPClientFactory as HTTPClientFactory,
)
from scrapy.core.downloader.contextfactory import \
    ScrapyClientContextFactory as ClientContextFactory


class ScrapyBadHTTPPageGetter(HTTPPageGetter):
    delimiter = b'\n'
    # Maximum line length of the LineReceiver protocol
    MAX_LENGTH = 65536
    # no idea how to get at settings here, so scratch that
    #def __init__(self, *a, **kw):
    #    self.MAX_LENGTH = settings.getint('MAX_HTTP_LINE_LENGTH', 16384)


class ScrapyHTTPClientFactory(HTTPClientFactory):
    protocol = ScrapyBadHTTPPageGetter


class ScrapyClientContextFactory(ClientContextFactory):
    def __init__(self):
        # default method is SSLv23_METHOD
        self.method = SSL.SSLv23_METHOD
```

However, this still doesn't seem to work on your domain "https://www.macupdate.com/" (YMMV).

/edit: I figured out this is because of the missing SNI in the HTTP10Downloader/OpenSSL combo. But perhaps you can manage to make that work by changing the context factory. (The better solution would be to fix it in the HTTP/1.1 downloader instead, but that class is a lot more involved, so I couldn't manage to fix it there so far. And HTTP/1.0 is usually still good enough for many sites.) |
@nyov Thanks for your input! I know that my workaround is not ideal, but it works for me and it involves no fiddling with Scrapy internals. |
receiving this error in 2020. Still no fix? |
@alvarolloret I cannot currently reproduce with any of the URLs posted in this thread. Could you post yours? |
I actually had a long list of urls (around 15 000), and about ~0.5% gave this error. Once I ran it again with only the ones that gave me the error, it disappeared :) |
same error with Twisted 21.7.0: |
Seems like this last site sends some ASCII art with its headers, which makes Twisted choke: those lines contain no colon, so the header split fails:

```python
>>> a, b = b"foobar".split(b":", 1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: not enough values to unpack (expected 2, got 1)
```

AFAICT, these are not RFC-compliant headers: "Each header field consists of a name followed by a colon (":") and the field value" (RFC 2616, section 4.2). |
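A lenient parser could simply skip such non-compliant lines instead of raising. The sketch below is hypothetical (it is not Scrapy's or Twisted's actual parsing code); it uses bytes.partition(), which never raises, instead of the split() call shown above:

```python
def parse_headers_leniently(lines):
    """Parse raw header lines, silently dropping lines without a colon."""
    headers = {}
    for raw in lines:
        name, sep, value = raw.partition(b":")
        if not sep:
            continue  # ASCII art or other junk: skip instead of crashing
        headers[name.strip().lower()] = value.strip()
    return headers

headers = parse_headers_leniently([
    b"Content-Type: text/html",
    b"  /\\_/\\  ",            # non-RFC "ASCII art" line, no colon
    b"X-Powered-By: magic",
])
```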
holy mother of god |
Still getting this error for the URL "https://www.accenture.com/ro-en/services/data-analytics-index#block-what-we-think" in Nov 2022. Still no fix? |
FTR, this URL works just fine for me (just like all of the URLs mentioned earlier). We should probably close this. |
Looks like this issue still applies. twisted/twisted#8570 is not fixed. It can be currently reproduced with https://www.vapestore.co.uk/ due to their |
'http://autos.msn.com/research/userreviews/reviewlist.aspx?ModelID=14749'
Looks like the default downloader implemented with twisted lib can't fetch the above url. I ran 'scrapy shell http://autos.msn.com/research/userreviews/reviewlist.aspx?ModelID=14749', and got the following output.
But both urllib2's urlopen and requests.get can download the page smoothly.