Potential errors when scraping new organization. Skipped repos. #477

jordanperr opened this issue Dec 4, 2020 · 1 comment

jordanperr commented Dec 4, 2020

I am running MASTER.sh to download all data from the NREL GitHub organization (which has 350 repos), but it's taking a very long time and I'm not sure if this is normal. For most repositories in the org, the query returns in under a second. It does appear that the script is scraping over 4,000 repositories (possibly dependencies?).

For some repositories, the query seems to take much longer, and the script prints warning-like messages such as:

    Sending REST query...
    Checking response...
    HTTP/1.1 202 Accepted
    API Status {"limit": 5000, "remaining": 4414, "reset": 1607114323}
    Query accepted but not yet processed. Trying again in 3sec...

Also, for a very small minority of repos, I get the following error-like message:

    GraphQL API error.
    [{"path": ["repository", "dependencyGraphManifests"], "locations": [{"line": 1, "column": 244}], "message": "loading"}]

These two errors do not seem to occur simultaneously.

The script is still humming along, and I will let it finish, but I am wondering whether these errors can simply be ignored.

Update: The script has finished and I am able to view the data using the Jekyll dev server. However, it appears that at least 3 repositories (out of 350) were skipped.

Steps to reproduce (also sketched as a script after the list):

  1. Remove all data from explore/github_data.
  2. Remove all repos and orgs from _explore/input_lists.json, and add "NREL" as an org.
  3. Create a Python environment and install dependencies from requirements.txt.
  4. Set the GITHUB_API_TOKEN environment variable.
  5. Run ./MASTER.sh
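
The steps above, sketched as a rough Python script. This is illustrative only and not part of the project; in particular, the "orgs"/"repos" structure of _explore/input_lists.json is an assumption here, so mirror whatever the file that ships with the repository actually uses.

    import json
    import os
    import shutil
    import subprocess

    # 1. Remove all previously scraped data.
    shutil.rmtree("explore/github_data", ignore_errors=True)
    os.makedirs("explore/github_data", exist_ok=True)

    # 2. Replace the input list with the single NREL org. The "orgs"/"repos"
    #    keys are an assumed schema; copy the layout of the existing file.
    with open("_explore/input_lists.json", "w") as fh:
        json.dump({"orgs": ["NREL"], "repos": []}, fh, indent=2)

    # 3. (Create a Python environment and pip install -r requirements.txt.)

    # 4. The scraper reads the API token from the environment.
    os.environ["GITHUB_API_TOKEN"] = "<your personal access token>"

    # 5. Kick off the full update.
    subprocess.run(["./MASTER.sh"], check=True)
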
jordanperr changed the title from "Very slow scraping of Github API" to "Very slow scraping for new organization" on Dec 4, 2020
jordanperr changed the title from "Very slow scraping for new organization" to "Potential errors when scraping new organization" on Dec 4, 2020
jordanperr changed the title from "Potential errors when scraping new organization" to "Potential errors when scraping new organization. Skipped repos." on Dec 5, 2020

LRWeber (Member) commented Dec 7, 2020

The update can take a long time. Our current daily update typically runs for about an hour.

The warning messages with the 202 Accepted response typically come from the commit activity query and are expected. That response means this particular data requires some additional internal processing on GitHub's side before it can be returned. The initial query triggers that processing, and the script repeats the query after giving GitHub time to finish, at which point the desired data comes back. The commit activity results are then cached on GitHub's side, so responses are immediate for roughly the rest of the day (about 24 hours).
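
For reference, here is a minimal sketch of that retry-on-202 pattern against the commit activity REST endpoint. It is not the project's actual code; it assumes the requests library and the GITHUB_API_TOKEN environment variable from the steps above.

    import os
    import time
    import requests

    def fetch_commit_activity(owner, repo, retries=10, wait=3):
        """Return weekly commit activity, retrying while GitHub computes it."""
        url = f"https://api.github.com/repos/{owner}/{repo}/stats/commit_activity"
        headers = {"Authorization": f"token {os.environ['GITHUB_API_TOKEN']}"}
        for _ in range(retries):
            resp = requests.get(url, headers=headers)
            if resp.status_code == 202:
                # 202 Accepted: GitHub is still computing the statistics, so
                # wait and repeat the query ("Trying again in 3sec..." above).
                time.sleep(wait)
                continue
            resp.raise_for_status()
            return resp.json()
        raise RuntimeError(f"{owner}/{repo}: stats not ready after {retries} attempts")

    # e.g. activity = fetch_commit_activity("NREL", "<repo-name>")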

The generic GraphQL API error message means something went wrong on GitHub's side. Sometimes these are intermittent issues, in which case the script will attempt the query again. Other times, this can be caused by something like an empty repo. A closer examination of _explore/LAST_MASTER_UPDATE.log may reveal what is happening in these cases.
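
For the "loading" case specifically, the retry can be pictured along these lines. This is just a sketch, not the script's implementation; the hawkgirl-preview Accept header is what the dependencyGraphManifests field requires at the moment, as far as I recall.

    import os
    import time
    import requests

    QUERY = """
    query($owner: String!, $name: String!) {
      repository(owner: $owner, name: $name) {
        dependencyGraphManifests(first: 20) {
          nodes { filename }
        }
      }
    }
    """

    def fetch_manifests(owner, name, retries=5, wait=5):
        """Query dependency manifests, retrying while GitHub reports 'loading'."""
        headers = {
            "Authorization": f"token {os.environ['GITHUB_API_TOKEN']}",
            # Preview media type currently needed for dependencyGraphManifests.
            "Accept": "application/vnd.github.hawkgirl-preview+json",
        }
        payload = {"query": QUERY, "variables": {"owner": owner, "name": name}}
        for _ in range(retries):
            body = requests.post("https://api.github.com/graphql",
                                 json=payload, headers=headers).json()
            errors = body.get("errors", [])
            if any(err.get("message") == "loading" for err in errors):
                # The dependency graph is still being generated; wait and retry.
                time.sleep(wait)
                continue
            if errors:
                # Anything else (e.g. an empty repo) is a real failure; surface it.
                raise RuntimeError(errors)
            return body["data"]["repository"]["dependencyGraphManifests"]["nodes"]
        raise RuntimeError(f"{owner}/{name}: dependency graph never finished loading")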
