oh-bugimporters should do per-domain backoff #81

Open
ghost opened this issue Sep 3, 2014 · 0 comments

Comment by paulproteus:

Some bug trackers (openhatch.org/bugs/ especially...) report HTTP 504 Gateway Timeout if you request more than 1-2 bugs per second.

The way Scrapy handles this now is via the retry middleware (http://doc.scrapy.org/en/0.12/topics/downloader-middleware.html#module-scrapy.contrib.downloadermiddleware.retry), which re-queues the request but doesn't insist on any time delay.

It'd be nice to have a custom RetryMiddleware that did per-domain backoff. (Note
that we're sort of abusing the Scrapy architecture; we're supposed to have one
"spider" class per domain, but instead we only have one.)

One way to do this is to provide a custom subclass of
scrapy.contrib.downloadermiddleware.retry.RetryMiddleware and then override the
_retry method.
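
For concreteness, here is a rough sketch of what that subclass could look like. It is untested and layers several assumptions on top of the stock RetryMiddleware: the class name, the BACKOFF_BASE knob, and the per-domain counter are invented here, and the trick of returning a Deferred from process_request to stall a request depends on the middleware manager resolving Deferreds, which may not hold on 0.12 as-is:

```python
# Sketch only -- class name, BACKOFF_BASE, and the _backoff dict are
# invented for illustration; they are not part of Scrapy.
from urlparse import urlparse  # Python 2, matching the Scrapy 0.x era

from twisted.internet import reactor
from twisted.internet.task import deferLater

from scrapy.contrib.downloadermiddleware.retry import RetryMiddleware


class PerDomainBackoffRetryMiddleware(RetryMiddleware):
    BACKOFF_BASE = 2.0  # seconds before a retried domain is hit again
    _backoff = {}       # domain -> consecutive retry count

    def _retry(self, request, reason, spider):
        # Count the retry against the request's domain, then let the
        # stock _retry decide whether to actually re-queue it.
        domain = urlparse(request.url).netloc
        self._backoff[domain] = self._backoff.get(domain, 0) + 1
        return RetryMiddleware._retry(self, request, reason, spider)

    def process_request(self, request, spider):
        retries = self._backoff.get(urlparse(request.url).netloc, 0)
        if not retries:
            return None  # domain is healthy; download immediately
        # Exponential backoff: 2s, 4s, 8s, ...  Returning a Deferred asks
        # Scrapy to wait for it before downloading -- an assumption; it
        # requires the middleware manager to resolve Deferreds.
        delay = self.BACKOFF_BASE * (2 ** (retries - 1))
        return deferLater(reactor, delay, lambda: None)

    def process_response(self, request, response, spider):
        # A response that isn't retry-worthy clears the domain's backoff.
        if response.status not in self.retry_http_codes:
            self._backoff.pop(urlparse(request.url).netloc, None)
        return RetryMiddleware.process_response(self, request, response, spider)
```

It would then replace the stock middleware in settings.py (module path hypothetical):

```python
DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.retry.RetryMiddleware': None,
    'mybot.middlewares.PerDomainBackoffRetryMiddleware': 500,
}
```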

That should let us more reliably crawl some of the sites that are quite finicky.


Status: unread
Nosy List: paulproteus
Priority: wish
Imported from roundup ID: 793
Last modified: 2012-11-20.16:04:43
