Crawlers appear to be failing: items in /search/ are out of date #58

Open
ghost opened this issue Sep 3, 2014 · 10 comments

@ghost

ghost commented Sep 3, 2014

We've now received three separate reports of this: @dirkbaechle has noted that his SCons bugs are not yet imported; I've noticed that the OpenHatch move from Roundup to GitHub Issues isn't reflected in /search/; and we got an email report from Michael Crusoe at Khmer that their bugs haven't been imported yet despite setting things up in /customs/.

Given the dates of the various reports and the bugs viewable in /search/, I think our bug crawler has been failing for at least a month now. There is a request in #1010 to make the last crawl status per project viewable on /customs/, but really, we need to get things back up to date and figure out what's wrong before we start adding new features.

I've scheduled some time for @paulproteus and me to take a look at this over the weekend. Hopefully we can get this fixed soon.

@ehashman

ehashman commented Sep 6, 2014

http://inside.openhatch.org/crawl-logs/scrapy.2014-09-06.bkCH.log

This terrible bug-crash-death thing may have something to do with it.

@paulproteus

I'm fairly sure the problem was that in our crawl script, curl would wait one minute for openhatch.org to generate a bugimporters configuration file, and then time out.

I'm not sure whether CloudFlare was making this more interesting.

But anyway, it works now.

But it'll likely fail tomorrow in the same way, because the caches will be cold. Generating this bugimporters configuration file is extremely slow, and unless we make changes to oh-mainline it's going to remain really slow, so it's going to time out periodically.

We have some retry logic in oh-mainline:run_bugimporters.sh (iirc) that retries if curl fails due to a network error. We could use similar retry logic later in the process, and that would probably save us.
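
For illustration only, here's a minimal sketch of what retry-with-a-longer-timeout logic around the configuration fetch could look like. The URL, attempt count, and timeout values are all placeholders, not what run_bugimporters.sh actually does:

```python
import time
import urllib.request

CONFIG_URL = "https://openhatch.org/path/to/bugimporters-config.yaml"  # placeholder URL
MAX_ATTEMPTS = 3          # assumed retry budget
TIMEOUT_SECONDS = 300     # give the server well over the one minute curl allowed

def fetch_config():
    """Fetch the bugimporters configuration, retrying on network errors and timeouts."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            with urllib.request.urlopen(CONFIG_URL, timeout=TIMEOUT_SECONDS) as resp:
                return resp.read()
        except OSError as exc:  # URLError and socket timeouts both subclass OSError
            print("attempt %d failed: %s" % (attempt, exc))
            if attempt < MAX_ATTEMPTS:
                time.sleep(30)  # back off before retrying
    raise RuntimeError("could not fetch the bugimporters configuration")
```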

@ehashman

ehashman commented Sep 6, 2014

Sometimes we get a CloudFlare error page instead of a YAML response; we could make sure we never feed that to the parser by checking that the first line of the file isn't `<!DOCTYPE html>` or similar.
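
Something along these lines, say (the file name and function name are just illustrative):

```python
def looks_like_html(path):
    """Return True if the downloaded file starts like an HTML page rather than YAML."""
    with open(path, "rb") as f:
        first_line = f.readline().lstrip().lower()
    return first_line.startswith(b"<!doctype") or first_line.startswith(b"<html")

if looks_like_html("config.yaml"):  # "config.yaml" is a stand-in for the downloaded file
    raise SystemExit("got an HTML error page instead of YAML; skipping this crawl")
```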

@dirkbaechle

Just an idea: maybe wrapping the run_importer.sh script and calling it explicitly for each tracker id would help? I'm not sure I've really got the big picture of your current bug-importing toolchain, but to me it looks as if the underlying database on openhatch.org is the bottleneck when it tries to compile the info for all known bug trackers at once...
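
Roughly something like this, assuming there is some way to enumerate tracker ids (the script name, ids, and arguments here are guesses, not how oh-mainline actually works):

```python
import subprocess

tracker_ids = ["101", "102", "103"]  # hypothetical tracker ids

for tracker_id in tracker_ids:
    # Run the import for one tracker at a time so a single slow or failing
    # tracker doesn't take down the whole crawl.
    result = subprocess.run(["./run_importer.sh", tracker_id])
    if result.returncode != 0:
        print("tracker %s failed with exit code %d; continuing" % (tracker_id, result.returncode))
```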

@paulproteus

@dirkbaechle, that is at least a semi-reasonable idea.

The downside is that we'd then have to make N requests to the API, which is kind of tragic. But it's a pretty reasonable idea.

We could also use a Python profiler and figure out exactly what makes the call to this API endpoint slow.
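
For example, wrapping the slow code path in cProfile would show where the time goes; `generate_config()` below is just a stand-in for whatever view or helper actually builds the bugimporters YAML:

```python
import cProfile
import pstats

def generate_config():
    ...  # stand-in for the real, slow code path on openhatch.org

profiler = cProfile.Profile()
profiler.enable()
generate_config()
profiler.disable()

# Print the 20 functions with the largest cumulative time.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)
```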

@paulproteus

Also, let me say that it is great to see you here, @dirkbaechle !

@dirkbaechle

Sure, I just saw Elana's and your comments and am very interested in this issue... still trying to get things going for the SCons bug tracker. ;)
But I can't really help here, can I?

@ehashman

ehashman commented Sep 6, 2014

I think this is fixed now. @paulproteus manually started a scrape (or three) today, and in the future, things should run happily on cron. @dirkbaechle, let's check things tonight or tomorrow and ensure everything is working!

ehashman closed this as completed Sep 6, 2014
@dirkbaechle

Okay, and thanks for the info.

@ehashman

ehashman commented Sep 9, 2014

I'm going to reopen this, as crawlers continue to fail.

@paulproteus, can you take a look at this log (warning: 2.5 MB; a crawl I ran manually with run_importers.sh, mostly successful but with some errors), as well as this one and this one (automated crawls that failed)?
