Where do academics hide their code?

Sorrel Harriet
5 min read · Jan 28, 2019

First off, this is kind of a trick question. My guess is we are hiding far less of it than we used to, and that’s a guess based on a small-scale analysis of data published by Gateway to Research (GtR) and GitHub. Looking at this data, it seems that academics are increasingly inclined to ‘hide’ their code on the world’s most popular repository hosting platform (still GitHub!). This is good news for various reasons (some of which we’ll come to later), but it also leads me to wonder:

What can the code on GitHub tell us about the emerging habits of software developers in academia?

I’ll start by presenting the results of this small-scale descriptive analysis, and then I’ll talk about why I found these results interesting. Maybe you’ll agree, maybe you won’t — either way, I’m hoping it will spark further discussion. I’ll also explain why “hiding code on GitHub” isn’t necessarily a contradiction in terms.

What the data reveals

  • >5000 Software and Technical Products (STPs) were funded by UK Research and Innovation (UKRI) council grants between 2006 and 2018
  • EPSRC funded around half of them
  • ~1/5 STPs have supporting URLs associated with a source-code hosting platform which supports version control (>80% are GitHub URLs)
  • ~1/10 STPs reference public GitHub repositories…
  • …of which Python is the most popular language
  • Use of GitHub increased between 2010 and 2015
  • The average number of contributors is 3
  • The average number of forks is 7
  • The average active lifetime of a repository is 2 years
Figure 1: Number of repos created by year (530 in total)

It ought to be pointed out that Figure 1 probably has a chunk of data missing! When you consider that projects started <5 years ago may be ongoing (and thus absent from the GtR data), it seems reasonable to expect that the true figures from 2016 onward are higher than those shown.
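For anyone wanting to reproduce the hosting-platform share quoted in the summary list, here is a minimal sketch of how supporting URLs could be grouped by platform. The host list and the function names are my own assumptions for illustration, not the ones used in the original analysis.

```python
from collections import Counter
from urllib.parse import urlparse

# Assumed list of code hosts that support version control; the original
# analysis may have used a different list.
CODE_HOSTS = {
    "github.com": "GitHub",
    "gitlab.com": "GitLab",
    "bitbucket.org": "Bitbucket",
    "sourceforge.net": "SourceForge",
}


def hosting_platform(url):
    """Return the platform name for a URL, or None if it is not a known code host."""
    if "://" not in url:
        # urlparse only recognises the host when a scheme is present.
        url = "http://" + url
    host = (urlparse(url).hostname or "").lower()
    for domain, name in CODE_HOSTS.items():
        # Match the domain itself or any subdomain such as "www.github.com".
        if host == domain or host.endswith("." + domain):
            return name
    return None


def platform_counts(urls):
    """Count how many URLs point at each known code hosting platform."""
    platforms = (hosting_platform(url) for url in urls)
    return Counter(name for name in platforms if name is not None)
```

Dividing each count by the total number of STPs would give shares comparable to the ones listed above, provided the same URLs and host list were used.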

The same logic applies to the chart below, which also tells us that Python is the dominant language of research repositories on GitHub, but for how much longer, I wonder? Interestingly, it’s JavaScript that dominates GitHub overall, and I suspect we may start seeing a bit more of it in research too (no, seriously!), though there are other contenders of course.

Figure 2: Repos created between 2006 and 2018 in top 7 most popular languages

The 2-year average active lifetime is also problematic given the right skew in the repository age data: an average can be pulled upward by a few long-lived repositories. For this reason I’d be more inclined to look at Figure 3 for a clue about the ‘longevity’ of research software projects on GitHub.
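To illustrate why an average can mislead on skewed data, here is a small sketch comparing the mean and the median. It assumes "active lifetime" means the time between a repository's `created_at` and `pushed_at` timestamps (both are fields in the GitHub REST API's repository response); the article does not state the definition actually used, and the lifetimes below are invented purely for illustration.

```python
from datetime import datetime
from statistics import mean, median

GITHUB_TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"  # e.g. "2015-03-01T12:00:00Z"


def active_lifetime_years(created_at, pushed_at):
    """Years between repository creation and last push (an assumed definition)."""
    created = datetime.strptime(created_at, GITHUB_TIME_FORMAT)
    pushed = datetime.strptime(pushed_at, GITHUB_TIME_FORMAT)
    return (pushed - created).days / 365.25


def summarise(lifetimes):
    """Return (mean, median) so the two can be compared side by side."""
    return mean(lifetimes), median(lifetimes)


# Invented lifetimes in years, NOT the article's data:
# mostly short-lived repositories plus one long-lived outlier.
example_lifetimes = [0.1, 0.2, 0.3, 0.5, 1.0, 2.0, 9.9]
average, middle = summarise(example_lifetimes)
print(f"mean = {average:.1f} years, median = {middle:.1f} years")
```

With these made-up numbers the mean comes out at about 2 years while the median is only half a year, which is the kind of gap a single long-lived repository can create.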

Figure 3: Proportion of 530 repositories active within 30 days, 6 months, 1 year
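
A breakdown like the one in Figure 3 could be computed roughly as follows. This is a sketch under my own assumptions: a repository counts as "active" within a window if its last push falls inside it, 6 months is approximated as 182 days, and the windows are cumulative. The original analysis may have defined activity differently.

```python
from datetime import datetime, timedelta

# Assumed window lengths in days; "6 months" is approximated as 182 days.
WINDOWS = {"30 days": 30, "6 months": 182, "1 year": 365}


def proportion_active(last_pushes, as_of):
    """Return the share of repositories whose last push falls within each window.

    last_pushes: list of datetime objects, one per repository.
    as_of: the datetime at which the snapshot was taken.
    """
    total = len(last_pushes)
    shares = {}
    for label, days in WINDOWS.items():
        cutoff = as_of - timedelta(days=days)
        active = sum(1 for pushed in last_pushes if pushed >= cutoff)
        shares[label] = active / total if total else 0.0
    return shares
```

Because the windows are cumulative, a repository pushed to last week is counted in all three, so the shares can only grow as the window widens.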

Why this is interesting

If you’ve ever had anything to do with research software engineering you’ll probably be familiar with some of the arguments for using a source-code hosting platform such as GitHub. In case you aren’t, I’ll briefly summarise my top 3:

1. Research integrity.

The fact you are being transparent about your processes ought to encourage robust practices.

2. Impact.

In theory, sharing your code ought to contribute to the impact of your research (assuming other researchers use it!). In practice, measuring the impact of research software is still a grey area, yet the optimist in me says it is only a matter of time before it is properly recognised (check out http://blog.impactstory.org).

3. Collaboration.

Git exists to facilitate collaboration, and GitHub is a Git repository hosting service (for those who needed reminding). If you want to collaborate effectively within a team, that fact alone gives you reason to use it.

Now, I’m going to go out on a limb here and suggest that reason Number 3 is ‘underexplored’…

While it’s encouraging to learn that academics who write code are increasingly inclined toward GitHub and other platforms, the stats I’d associate with collaborative development (i.e. contributors, forks, longevity) are less impressive. To me this suggests there’s more we could be doing together, and that this togetherness might bleed into (or out of) the way we work more generally (i.e. the processes surrounding the writing of code). And, if you agree there’s more we could be doing together, you might also appreciate why ‘hiding code on GitHub’ isn’t as crazy as it sounds.

“If a developer pushes code to GitHub and there is no-one around to fork it, will they receive any pull requests?”

Conclusion

I would take 3 things from this small-scale analysis:

  1. Academics writing software is something we can expect to see more of.
  2. Academics who write software want to do so transparently and with the best of intentions in terms of tools, code quality etc.
  3. Academics who write software might still need support and encouragement to work collaboratively while following effective processes and workflows. This, in turn, might contribute to the future impact of research software.

Point 3 in particular is speculative, which is why I am calling for help from anyone who would position themselves in the small yet mighty Figure 4 intersection*. I would like to hear your stories about developing software for academic research, and you can share them here: bit.ly/academic-software-survey

Figure 4: Academics who write software (including Research Software Engineers)

*10,000 is a figure I made up. I don’t know the actual number of UK academics who write software. Does anyone? I imagine it is small but getting bigger.
