Crawlers appear to be failing: items in /search/ are out of date #58
http://inside.openhatch.org/crawl-logs/scrapy.2014-09-06.bkCH.log
This terrible bug crash death thing may have something to do with this.
I'm basically sure the problem was that, in our crawl script, curl would wait one minute for openhatch.org to generate a bugimporters configuration file, and after that minute it timed out. I'm not sure whether CloudFlare was making this more interesting. Anyway, it works now, but it'll likely fail tomorrow in the same way, because the caches will be cold. Generating this bugimporters configuration file is extremely slow, and unless we make changes to oh-mainline it's going to remain really slow, so it's going to time out periodically. We have some retry logic in oh-mainline:run_bugimporters.sh (iirc) that retries if the curl fails due to a network error. We could use some retry logic later in the process, too, and that could probably save us.
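Roughly the kind of retry loop I mean, as a sketch; the URL, output path, retry count, and sleep interval here are all made up, not the real run_bugimporters.sh values:

```bash
#!/bin/sh
# Hypothetical retry wrapper around the curl fetch described above.
CONFIG_URL="https://openhatch.org/customs/tracker-config.yaml"  # placeholder URL
OUT=/tmp/bugimporters-config.yaml                               # placeholder path

for attempt in 1 2 3 4 5; do
    # --max-time bounds the whole transfer, mirroring the one-minute
    # timeout above; -f makes curl fail on HTTP error responses.
    if curl -f --max-time 60 -o "$OUT" "$CONFIG_URL"; then
        exit 0
    fi
    echo "fetch attempt $attempt failed; sleeping before retry" >&2
    sleep 30
done

echo "giving up after 5 attempts" >&2
exit 1
```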
Sometimes we get a CloudFlare error instead of a YAML response; we could avoid accepting that as input to parse by checking that the first line of the file isn't …
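The comment above is cut off before naming the exact marker, but since a CloudFlare error page is HTML rather than YAML, one plausible guard (purely a sketch) is to reject anything whose first line looks like HTML:

```bash
#!/bin/sh
# Hypothetical guard: refuse to hand the fetched file to the bug importers
# if it looks like an HTML error page instead of YAML. The exact first line
# of a CloudFlare error page isn't named in this thread, so this only
# checks for an HTML-ish opening tag.
OUT=/tmp/bugimporters-config.yaml  # placeholder path

first_line=$(head -n 1 "$OUT")
case "$first_line" in
    "<!DOCTYPE"*|"<html"*)
        echo "got an HTML error page, not YAML; discarding" >&2
        rm -f "$OUT"
        exit 1
        ;;
esac
```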
Just as an idea: maybe wrapping the run_importer.sh script and calling it for each tracker id explicitly would help? Something like the sketch below. I'm not sure I've really got the big picture of your current bug-importing toolchain, but to me it looks as if the underlying database on openhatch.org is the bottleneck while it tries to compile the info for all known bug trackers at once...
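A rough sketch of that idea; the tracker ids and the --tracker flag are assumptions, since it isn't clear from this thread whether run_importer.sh accepts arguments like this today:

```bash
#!/bin/sh
# Hypothetical wrapper: ask for one tracker at a time instead of compiling
# the configuration for every known tracker in a single slow request.
TRACKER_IDS="scons khmer openhatch"  # placeholder list of tracker ids

for id in $TRACKER_IDS; do
    # Keep going even if one tracker fails, so a single bad tracker
    # doesn't take down the whole nightly import.
    if ! ./run_importer.sh --tracker "$id"; then
        echo "import failed for tracker $id; continuing with the rest" >&2
    fi
done
```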
@dirkbaechle, that is at least a semi-reasonable idea. The downside is that we'd then have to make N requests to the API, which is kind of tragic. But it's a pretty reasonable idea. We could also use a Python profiler and figure out exactly what makes the call to this API endpoint slow.
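One low-effort way to do that profiling, as a sketch; the settings module and the endpoint path are placeholders, and on Django 1.7 or later the script would also need a django.setup() call before using the test client:

```bash
#!/bin/sh
# Hypothetical profiling run for the slow configuration endpoint:
# exercise the view in-process with Django's test client under cProfile,
# then print the most expensive calls.
cat > /tmp/profile_customs.py <<'EOF'
import cProfile
import pstats

from django.test import Client

# Profile one request to the (placeholder) endpoint.
cProfile.run("Client().get('/customs/tracker-config.yaml')", "/tmp/customs.prof")

# Print the 20 most expensive functions by cumulative time.
pstats.Stats("/tmp/customs.prof").sort_stats("cumulative").print_stats(20)
EOF

DJANGO_SETTINGS_MODULE=mysite.settings python /tmp/profile_customs.py
```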
Also, let me say that it is great to see you here, @dirkbaechle!
Sure, I just saw Elana's and your comments and am very interested in this issue...still trying to get things going for the SCons bug tracker. ;) |
I think this is fixed now. @paulproteus manually started a scrape (or three) today, and in the future, things should run happily on cron. @dirkbaechle, let's check things tonight or tomorrow and ensure everything is working!
Okay, and thanks for the info. |
I'm going to reopen this, as crawlers continue to fail. @paulproteus, can you take a look at this log? (Warning: it's 2.5 MB; I manually ran this crawl with …)
We've now received three separate reports of this; @dirkbaechle has noted that his SCons bugs are not yet imported, I've noticed that the OpenHatch move from Roundup to GitHub Issues isn't reflected by what's in /search/, and we got an email report from Michael Crusoe at Khmer that their bugs haven't been imported yet despite setting things up in /customs/.
Given the dates of the various reports and the bugs viewable in /search/, I think our bug crawler has been failing for at least a month now. A request in #1010 would make the last crawl status per project viewable on /customs/, but really, we need to get things up to date and figure out what's wrong before we start adding new features.
I've scheduled some time for @paulproteus and me to take a look at this this weekend. Hopefully we can get this fixed soon.