Support scraping Subversion #35

Open
leebrian opened this issue May 17, 2019 · 6 comments
Labels
code.gov https://github.com/GSA/code-gov/blob/master/HelpWanted.md help wanted

Comments

@leebrian
Collaborator

Being able to scrape Subversion projects would be helpful and is not yet supported. It's a pretty low priority for my agency, but you requested that we add issues for examples of repos not yet supported.

@IanLee1521 IanLee1521 added code.gov https://github.com/GSA/code-gov/blob/master/HelpWanted.md help wanted labels May 17, 2019
@IanLee1521
Member

Hmm. Yeah, pretty low for me too. I'm trying to think how we might do this for arbitrary SVN repos (where we get the metadata itself).

Are there specific SVN hosting tools that we would want / need to target?

I wonder if we can get a list of all of the repositoryURLs from code.gov (cc/ @RicardoAReyes) to try to find the hosting platforms to target..? Guess that is more justification for #29 ;)

@leebrian
Collaborator Author

I think we have about 100-200 projects or so, but we haven't counted yet since no one is really asking internally. And since they aren't scraped properly to determine whether they are excludable, people can't find them, so it's a vicious cycle.

I'm not sure what hosting tools to target. I was reading through the svn book's API chapter, and it seems like we could do a crawl using the svn client to check out every directory and then go through it to find history and comments and maybe enough metadata. I haven't looked at it since then because it seemed like a decent amount of boring work digging into svn history files and such.
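As a rough illustration of the "go through the history and comments" step, here is a minimal Python sketch that parses the XML that `svn log --xml` prints into plain records. It is only a sketch under assumptions: the function name `parse_svn_log` and the sample log are made up for illustration, and nothing here is part of the scraper.

```python
import xml.etree.ElementTree as ET


def parse_svn_log(xml_text):
    """Turn the output of `svn log --xml` into a list of plain dicts."""
    entries = []
    for entry in ET.fromstring(xml_text).iter("logentry"):
        entries.append({
            "revision": int(entry.get("revision")),
            "author": entry.findtext("author", default=""),
            "date": entry.findtext("date", default=""),
            "message": (entry.findtext("msg", default="") or "").strip(),
        })
    return entries


# Made-up sample in the shape `svn log --xml` emits (newest revision first).
SAMPLE = """<log>
<logentry revision="2">
<author>alice</author>
<date>2019-05-17T12:00:00.000000Z</date>
<msg>Add README</msg>
</logentry>
<logentry revision="1">
<author>bob</author>
<date>2019-05-01T09:30:00.000000Z</date>
<msg>Initial import</msg>
</logentry>
</log>"""

if __name__ == "__main__":
    for item in parse_svn_log(SAMPLE):
        print(item["revision"], item["author"], item["message"])
```

Reading the log remotely this way would avoid checking out every directory, though whether the log alone carries enough metadata is still an open question.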

I tried checking all the repos, but https://api.code.gov/repos?size=10000 only returned 1000 of the reported 6565 repos. None of those thousand had subversion and they were all vcs=git.

@gmkarl

gmkarl commented Oct 29, 2019

Hi, I found the help-wanted tag for this issue on code.gov.

You can see more than 1000 repos by passing '&from=[start]' to the code.gov API.
I used the api node.js module to check vcs= for all of them. 2496 don't have a vcs field, 200 have an empty string, 1 has 'zip', and the 3863 others are all some form of 'git'. That's all of them.

Here's a list of all the repository urls by repository ID: repository_ids_and_urls.txt
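The paging and tallying described above could look roughly like the Python sketch below. It deliberately leaves the HTTP call out: `fetch_page` is a hypothetical callable that would wrap a request such as `https://api.code.gov/repos?size=...&from=...` and return a list of repo records. The response layout and any authentication are not spelled out in this thread, so they are assumptions left to the caller.

```python
from collections import Counter


def fetch_all(fetch_page, page_size=1000):
    """Collect every record by advancing the `from` offset until a short page arrives.

    `fetch_page(offset, size)` is a hypothetical callable supplied by the caller.
    """
    records, offset = [], 0
    while True:
        page = fetch_page(offset, page_size)
        records.extend(page)
        if len(page) < page_size:
            return records
        offset += page_size


def tally_vcs(records):
    """Count vcs values, keeping 'field missing' and 'empty string' apart."""
    counts = Counter()
    for rec in records:
        if "vcs" not in rec:
            counts["<missing>"] += 1
        else:
            value = str(rec["vcs"] or "").strip().lower()
            counts[value or "<empty>"] += 1
    return counts


if __name__ == "__main__":
    # In-memory stand-in for the API, for demonstration only.
    fake = [{"vcs": "git"}, {"vcs": "Git"}, {}, {"vcs": ""}, {"vcs": "zip"}]
    print(tally_vcs(fetch_all(lambda off, size: fake[off:off + size], page_size=2)))
```

Keeping missing and empty values in separate buckets mirrors the breakdown reported above.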

@gmkarl

gmkarl commented Oct 29, 2019

I tried doing a simple 'svn co ${url}' on each repositoryURL (so no authentication performed). It only worked on two projects: https://code.gov/projects/doe_office_scientific_technical_information_osti_1_kepler and https://code.gov/projects/doe_office_scientific_technical_information_osti_1_zeptoos. What did I miss?
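For anyone repeating this experiment, a lighter-weight variant might look like the Python sketch below. Note the swap: it runs `svn info` rather than the full `svn co` used above, since `svn info` only asks the server whether the URL is a readable repository. The function names are hypothetical, the `svn` client is assumed to be on the PATH, and the `run` parameter exists only so the loop can be exercised without network access.

```python
import subprocess


def svn_probe_command(url):
    """Build an unauthenticated, non-prompting `svn info` call for one URL."""
    return ["svn", "info", "--non-interactive", url]


def probe(urls, run=subprocess.run):
    """Return the subset of URLs that the svn client can read anonymously."""
    reachable = []
    for url in urls:
        try:
            result = run(svn_probe_command(url), capture_output=True, timeout=60)
        except (OSError, subprocess.TimeoutExpired):
            # svn is not installed, or the server never answered.
            continue
        if result.returncode == 0:
            reachable.append(url)
    return reachable
```

A URL that fails this probe may still be a Subversion repository that requires credentials, so a failure here does not prove the project uses another VCS.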

@leebrian
Collaborator Author

What kind of metadata can you extract from those checkouts? Are you able to populate the code.json elements?

@gmkarl

gmkarl commented Oct 29, 2019

They're just source code repositories, containing branch and tag names, source files, and detailed change history, and that's it. Human intervention would be needed for many fields, but you could auto-populate things like vcs=svn and maybe offer guesses for things like releases, license, e-mail, or description, based on repo content. I'm guessing that even a barebones code.json file is helpful here.
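To make the "barebones code.json" idea concrete, here is a small Python sketch that builds a partial entry from a repository URL plus a top-level listing such as the one `svn list` prints. Only `repositoryURL` and `vcs` come from this thread. The `name` guess and the `guesses` and `needsHumanReview` keys are hypothetical placeholders for illustration and are not claimed to be part of the code.json schema.

```python
from urllib.parse import urlparse

# File names treated as a hint that a license is present (an assumed list).
LICENSE_NAMES = {"license", "license.txt", "license.md", "copying", "copying.txt"}


def barebones_entry(repo_url, top_level_names):
    """Build a partial record; anything uncertain is flagged for a human."""
    # `svn list` marks directories with a trailing slash, e.g. "trunk/".
    names = {n.strip("/").lower() for n in top_level_names}
    path = urlparse(repo_url).path.rstrip("/")
    return {
        "name": path.rsplit("/", 1)[-1] or repo_url,  # guess: last path segment
        "repositoryURL": repo_url,
        "vcs": "svn",
        "guesses": {  # hypothetical key
            "hasLicenseFile": bool(names & LICENSE_NAMES),
            "standardLayout": {"trunk", "branches", "tags"} <= names,
        },
        "needsHumanReview": ["description", "license", "contact email"],  # hypothetical key
    }


if __name__ == "__main__":
    listing = ["trunk/", "branches/", "tags/", "LICENSE"]
    print(barebones_entry("https://svn.example.invalid/repos/demo/", listing))
```

The split between auto-filled fields and a review list follows the point above that human intervention would still be needed for many fields.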


3 participants