import tags from github, and allow fine-grained search on them #20

Open
sbromberger opened this issue Jun 26, 2017 · 11 comments

Comments

@sbromberger
Member

Tags are a great way to categorize software repos and they're already built into GitHub repos. It would be great to be able to filter on "tag:foo" (and, incidentally, "language:C", but that's probably another issue).
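A minimal sketch of how a `tag:foo` / `language:C` filter could work, assuming each repo is available as an object with `name`, `description`, `topics`, and `languages` fields. That shape and the names `parseQuery` and `matchesQuery` are hypothetical and not taken from the site's code; only the two prefixes come from the request above.

```javascript
// Split a query such as "tag:hpc language:C solver" into field filters
// and free-text terms. Only the "tag" and "language" prefixes come from
// the issue; unknown prefixes are treated as ordinary search terms.
function parseQuery(query) {
  const filters = { tag: [], language: [] };
  const terms = [];
  query.trim().split(/\s+/).filter(Boolean).forEach((token) => {
    const sep = token.indexOf(":");
    const field = sep > 0 ? token.slice(0, sep).toLowerCase() : "";
    const value = token.slice(sep + 1).toLowerCase();
    if (Object.prototype.hasOwnProperty.call(filters, field) && value) {
      filters[field].push(value);
    } else {
      terms.push(token.toLowerCase());
    }
  });
  return { filters, terms };
}

// True when the repo satisfies every filter and contains every free-text term.
// The repo shape ({name, description, topics, languages}) is an assumption.
function matchesQuery(repo, parsed) {
  const topics = repo.topics.map((t) => t.toLowerCase());
  const languages = repo.languages.map((l) => l.toLowerCase());
  const text = (repo.name + " " + (repo.description || "")).toLowerCase();
  return (
    parsed.filters.tag.every((t) => topics.includes(t)) &&
    parsed.filters.language.every((l) => languages.includes(l)) &&
    parsed.terms.every((term) => text.includes(term))
  );
}
```

Matching is exact and case-insensitive, so `language:C` does not also match `C++`. Whether that is the desired behavior is a design choice for the search UI.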

@IanLee1521
Member

Yeah, language is possibly a separate issue.

As far as tags, this was difficult at the time I started the site, but should be possible now that GitHub added "Topics" (aka tags aka labels) for repositories: https://github.com/blog/2309-introducing-topics

@LRWeber
Member

LRWeber commented Aug 24, 2017

Repo "language" and "topics" data is now being collected. (And displayed in the Explore tab!)
This information could potentially be incorporated into the search functionality.

@IanLee1521
Member

Some of this work has begun under @hauten's lead.

See also: https://github.com/LLNL/llnl.github.io/tree/add-topics

@IanLee1521 IanLee1521 added this to To do in Summer 2019 May 13, 2019
@gonsie
Member

gonsie commented Jun 12, 2019

@angfl97 Could you do some data analysis to help us understand the categories already in use?

  1. What topics are already in use by our repos, and how many repos fall into each topic?
  2. Another way to broadly categorize the repos would be based on organization (other than LLNL). How many non-LLNL organizations do we have in the catalog, and how many repos are in each?
  3. I also like @sbromberger's idea. We have the data about which languages are used: what is the set of unique languages, and how many repos use each one?
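For question 2, a rough sketch of a per-organization count, assuming the collected JSON keeps repos under `obj.data` keyed by `owner/name` strings. The function name `countReposByOrg` and its `homeOrg` parameter are hypothetical.

```javascript
// Count repos per owning organization, optionally skipping a "home" org.
// Assumes obj.data is keyed by "owner/name" strings (an assumption about
// the collected data, not something confirmed here).
function countReposByOrg(obj, homeOrg) {
  const counts = new Map();
  Object.keys(obj.data).forEach((nameWithOwner) => {
    const owner = nameWithOwner.split("/")[0];
    if (homeOrg && owner.toLowerCase() === homeOrg.toLowerCase()) {
      return; // skip the home organization
    }
    counts.set(owner, (counts.get(owner) || 0) + 1);
  });
  // Same {name, value} shape as a word list, highest counts first
  return Array.from(counts, ([name, value]) => ({ name, value }))
    .sort((a, b) => b.value - a.value);
}
```

Calling `countReposByOrg(obj, "LLNL")` would list only the non-LLNL organizations, and `countReposByOrg(obj)` would include every owner.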

@LRWeber
Member

LRWeber commented Jun 13, 2019

It may be worth noting that logic for answering some of these questions exists to generate our "word cloud" visualizations at the bottom of the explore page and individual repo pages.

The cloud-generator takes a list of {name: aWord, value: wordCount} objects, which is what these functions output. They may be worth a look.

// Topics: turn json obj into desired word list
// (repoNameWOwner is read from the enclosing scope; null means "all repos")
function reformatData(obj) {
    var wordDict = {};
    var repos = (repoNameWOwner == null) ? Object.keys(obj["data"]) : [repoNameWOwner];
    repos.forEach(function (repo) {
        if (obj["data"].hasOwnProperty(repo)) {
            var topicNodes = obj["data"][repo]["repositoryTopics"]["nodes"];
            for (var i = 0; i < topicNodes.length; i++) {
                var aWord = topicNodes[i]["topic"]["name"];
                // hasOwnProperty avoids relying on a non-standard Array.prototype.contains
                if (!wordDict.hasOwnProperty(aWord)) {
                    wordDict[aWord] = 0;
                }
                wordDict[aWord] += 1;
            }
        }
    });
    var data = [];
    for (var aWord in wordDict) {
        if (wordDict.hasOwnProperty(aWord)) {
            var datpair = {name: aWord, value: wordDict[aWord]};
            data.push(datpair);
        }
    }
    // Prioritize highest counts
    data.sort((a, b) => b.value - a.value);
    return data;
}

// Languages: turn json obj into desired word list
// (repoNameWOwner is read from the enclosing scope; null means "all repos")
function reformatData(obj) {
    var wordDict = {};
    var repos = (repoNameWOwner == null) ? Object.keys(obj["data"]) : [repoNameWOwner];
    repos.forEach(function (repo) {
        if (obj["data"].hasOwnProperty(repo)) {
            var langNodes = obj["data"][repo]["languages"]["nodes"];
            for (var i = 0; i < langNodes.length; i++) {
                var aWord = langNodes[i]["name"];
                // hasOwnProperty avoids relying on a non-standard Array.prototype.contains
                if (!wordDict.hasOwnProperty(aWord)) {
                    wordDict[aWord] = 0;
                }
                if (repoNameWOwner == null) {
                    wordDict[aWord] += 1; // across multiple repos, count once per repo
                } else {
                    wordDict[aWord] += langNodes.length - i; // across single repo, count by usage rank
                }
            }
        }
    });
    var data = [];
    for (var aWord in wordDict) {
        if (wordDict.hasOwnProperty(aWord)) {
            var datpair = {name: aWord, value: wordDict[aWord]};
            data.push(datpair);
        }
    }
    // Prioritize highest counts
    data.sort((a, b) => b.value - a.value);
    return data;
}

@angela-flores-wdc
Member

I made an Excel workbook with the stats @gonsie asked for.

Here is the link

@gonsie
Member

gonsie commented Jun 14, 2019

For those not traversing the link, these topics are mentioned in 4 or more repositories:

  • hpc
  • scientific-computing
  • cpp
  • parallel-computing
  • mpi
  • visualization
  • llnl
  • python
  • high-order
  • finite-elements
  • c-plus-plus
  • data-viz
  • computational-science
  • simulation
  • blt
  • gov

I was hoping that we'd get some topics outside of the typical "hpc" stuff, but I guess not. The language tags are sort of interesting:

Language    Count
shell       292
python      252
C           210
C++         202
Makefile    174
CMake       113
HTML        85

But I'm not sure that's immediately useful. There are 13 repos using AWK... maybe digging into the lesser used languages would be cool.
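Both the "4 or more repositories" cutoff and the idea of looking at lesser-used languages amount to a threshold filter over a `{name, value}` word list. A small sketch follows; the helper name `filterByCount` is hypothetical, and the sample list mixes two counts quoted in this thread with one made-up entry.

```javascript
// Keep word-list entries whose count lies in [min, max]; both bounds are optional.
function filterByCount(wordList, { min = 0, max = Infinity } = {}) {
  return wordList.filter((entry) => entry.value >= min && entry.value <= max);
}

// Sample data: the shell and AWK counts are quoted in this thread,
// "example-lang" is made up for illustration.
const sampleLanguages = [
  { name: "shell", value: 292 },
  { name: "AWK", value: 13 },
  { name: "example-lang", value: 2 },
];

const widelyUsed = filterByCount(sampleLanguages, { min: 4 });  // shell, AWK
const lesserUsed = filterByCount(sampleLanguages, { max: 20 }); // AWK, example-lang
```

Because the filter preserves input order, a list that is already sorted by count stays sorted.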

What I do think is actually useful are the repos we are pulling from non-LLNL organizations. The top 5 (most repos) come from:

Some of these projects would be very cool to highlight on their own as they sort of represent a whole ecosystem of interrelated repos. These are also the places where we get the most external interaction.

@hauten
Contributor

hauten commented Jun 15, 2019

Would be awesome if more repos had topics. I've done a couple of inventories over the last year, and something like <10% of repos have them. Maybe this can encourage PIs: our portal (not to mention GitHub) will provide more visibility to repos that have topics.
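A rough way to repeat that inventory from the collected data, assuming the same `obj.data[repo].repositoryTopics.nodes` layout that the word-cloud functions above read. The function name `topicCoverage` is hypothetical.

```javascript
// Report how many repos have at least one topic, and which ones have none.
// Assumes obj.data[repo].repositoryTopics.nodes is an array for every repo.
function topicCoverage(obj) {
  const repos = Object.keys(obj.data);
  const untagged = repos.filter(
    (repo) => obj.data[repo].repositoryTopics.nodes.length === 0
  );
  const withTopics = repos.length - untagged.length;
  return {
    total: repos.length,
    withTopics: withTopics,
    fraction: repos.length === 0 ? 0 : withTopics / repos.length,
    untagged: untagged, // candidates for a nudge to the repo owners
  };
}
```

The `untagged` list could feed a reminder to maintainers, while `fraction` tracks whether coverage improves over time.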

@hauten
Contributor

hauten commented Jun 19, 2019

See https://github.com/LLNL/llnl.github.io/blob/new-home-page/radiuss/README.md for a list of tags on radiuss repos - will aim to use that list & the notes above as starting points for standardizing tags across other LLNL repos
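One possible approach to standardizing is an alias map that folds variant spellings into a single canonical tag before counting or searching. The sketch below is illustrative only: the two alias pairs use topics that appear in the list earlier in this thread, but treating them as synonyms, the choice of canonical form, and the name `normalizeTopics` are all assumptions.

```javascript
// Hypothetical alias table: variant spelling -> canonical tag.
const TOPIC_ALIASES = {
  "c-plus-plus": "cpp",
  "data-viz": "visualization",
};

// Lower-case each topic, map known variants to their canonical tag,
// and drop duplicates while keeping first-seen order.
function normalizeTopics(topics, aliases = TOPIC_ALIASES) {
  const seen = new Set();
  topics.forEach((topic) => {
    const key = topic.trim().toLowerCase();
    const canonical = Object.prototype.hasOwnProperty.call(aliases, key)
      ? aliases[key]
      : key;
    seen.add(canonical);
  });
  return Array.from(seen);
}
```

Normalizing at read time leaves the repos' own topics untouched, which may be easier than asking every maintainer to rename tags.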

@IanLee1521
Member

@hauten -- Maybe list our standard tags on https://github.com/LLNL/llnl.github.io/blob/master/about/using-github.md ?

@IanLee1521
Member

Actually, for the docs, we can start the listing here: https://github.com/LLNL/llnl.github.io/tree/master/categories

@hauten hauten moved this from To do to In progress in 2019 Summer Improvements Jul 1, 2019
@angela-flores-wdc angela-flores-wdc moved this from In progress to To do in 2019 Summer Improvements Jul 15, 2019
@hauten hauten removed their assignment Apr 17, 2020

6 participants