Bad request to Splash & HTTP status code is not handled or not allowed #168

Open
linukey opened this issue Mar 7, 2018 · 4 comments

@linukey

linukey commented Mar 7, 2018

Hi kmike, I am using scrapy-splash and have run into an issue. The first time I run 'scrapy crawl toutiao' it works correctly, but the second time I run it, it fails.

I found that the issue is related to the headers I add: without the headers every run works, but with the headers the second run fails.

The Lua script and the project code follow. I need your help, thanks.

code:

import scrapy
import json
from scrapy_splash import SplashRequest
from scrapy.http.headers import Headers

script = """ 
function main(splash)
  splash:init_cookies(splash.args.cookies)
  assert(splash:go{
                    splash.args.url,
                    headers=splash.args.headers,
                    http_method=splash.args.http_method,
                    body=splash.args.body,
                  })

  assert(splash:wait(0.5))

  local entries = splash:history()
  local last_response = entries[#entries].response

  return {
    headers = last_response.headers,
    cookies = splash:get_cookies(),
    html = splash:html(),
    url = splash:url(),
    http_status = last_response.status,
  }
end
"""

HEADERS = Headers({
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding': 'compress',
    'Accept-Language': 'en-US',
    'Connection': 'keep-alive',
    'Cache-Control': 'no-cache',
    'Pragma': 'no-cache',
    'Host':'m.toutiao.com',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent':'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Mobile Safari/537.36'
})

class MySpider(scrapy.Spider):
    name = "toutiao"

    def __init__(self):
        self.start_url = "https://m.toutiao.com"

    def start_requests(self):
        yield SplashRequest(url=self.start_url,
                            callback=self.parse_result,
                            endpoint='execute',
                            cache_args=['lua_source'],
                            args={'lua_source': script, 'http_method': 'GET'},
                            headers=HEADERS)

    def parse_result(self, response):
        print("ok")
        print(response.headers)

The first run works correctly:

ok
{b'Vary': [b'Accept-Encoding, Accept-Encoding, Accept-Encoding'], b'Timing-Allow-Origin': [b'*'], b'Set-Cookie': [b'tt_webid=653006869922952004; Max-Age=7776000'], b'Transfer-Encoding': [b'chunked'], b'Content-Type': [b'text/html; charset=utf-8'], b'Connection': [b'keep-alive'], b'X-Tt-Timestamp': [b'152040098.652'], b'X-Ss-Set-Cookie': [b'tt_webid=653006899221952004; Max-Age=7776000'], b'Server': [b'Tengine'], b'Via': [b'cache1.cn406[13,0]'], b'Content-Encoding': [b'gzip'], b'Eagleid': [b'dcb54e411524000986256455e'], b'Date': [b'Wed, 07 Mar 2018 05:21:38 GMT']}

The second run fails with an error:

2018-03-07 13:18:54 [scrapy.core.engine] INFO: Spider opened
2018-03-07 13:18:54 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-03-07 13:18:54 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-03-07 13:18:55 [scrapy_splash.middleware] WARNING: Bad request to Splash: {'info': {'message': 'Lua error: [string "..."]:14: attempt to index field \'?\' (a nil value)', 'type': 'LUA_ERROR', 'source': '[string "..."]', 'error': "attempt to index field '?' (a nil value)", 'line_number': 14}, 'description': 'Error happened while executing Lua script', 'error': 400, 'type': 'ScriptError'}
2018-03-07 13:18:55 [scrapy.core.engine] DEBUG: Crawled (400) <GET https://m.toutiao.com via http://172.17.0.2:8050/execute> (referer: None)
2018-03-07 13:18:55 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <400 https://m.toutiao.com>: HTTP status code is not handled or not allowed
2018-03-07 13:18:55 [scrapy.core.engine] INFO: Closing spider (finished)
@linukey
Author

linukey commented Mar 7, 2018

The 'Bad request to Splash' error may be caused by 'local last_response = entries[#entries].response', but I don't know how to fix it.

@nirvana-msu

I have a similar issue. For some requests which I make, splash:history() returns an empty array, which makes subsequent indexing into entries[#entries] throw an error. What could cause Splash to not populate the history? And how to get resulting headers and http status in this case?

@kmike
Member

kmike commented Mar 15, 2018

Yeah, that can be the problem. It is caused by the cache: when a response is fetched from the in-memory cache, it doesn't get a record in splash:history(). I don't have a good workaround right now; it makes sense to check that the history is not empty before taking the last entry.
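
A minimal sketch of that check, assuming the script from the original report is otherwise unchanged. The names GUARDED_SCRIPT and summarize_result are hypothetical and only used for illustration. When the history is empty, last_response stays nil, so the headers and http_status keys are simply absent from the returned table, and the Python side has to tolerate their absence. This only avoids the Lua error; it does not recover the real headers or status of a cached response.

```python
# Sketch only: GUARDED_SCRIPT and summarize_result are hypothetical names,
# not part of scrapy-splash or Splash.
GUARDED_SCRIPT = """
function main(splash)
  splash:init_cookies(splash.args.cookies)
  assert(splash:go{
    splash.args.url,
    headers=splash.args.headers,
    http_method=splash.args.http_method,
    body=splash.args.body,
  })
  assert(splash:wait(0.5))

  local entries = splash:history()
  -- The history may be empty (for example when the response came from
  -- the cache), so only index the last entry when there is one.
  local last_response = nil
  if #entries > 0 then
    last_response = entries[#entries].response
  end

  return {
    -- These two fields evaluate to nil, and are therefore omitted,
    -- when no history entry exists.
    headers = last_response and last_response.headers,
    cookies = splash:get_cookies(),
    html = splash:html(),
    url = splash:url(),
    http_status = last_response and last_response.status,
  }
end
"""


def summarize_result(data):
    """Return (http_status, headers) from a decoded script result dict.

    Either value is None when the script could not read it from
    splash:history().
    """
    return data.get("http_status"), data.get("headers")
```

In the spider from the report, this string would be passed as the 'lua_source' argument in place of script.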

@nirvana-msu

@kmike I am fine with disabling the cache (in fact, I would prefer to do that). It seems that this is not possible until scrapinghub/splash#339 is merged? Related issues: scrapinghub/splash#203, scrapinghub/splash#519.
