It’s been just over three months since we first announced our collaboration with GitHub (a code hosting service) and figshare (an open data repository) exploring the idea of “code as a research object”. There’s been much progress and discussion of late, and we wanted to share an update with you, as well as discuss next steps and yet-unanswered questions.
But first, a quick look back at the thought behind the project…
Part of the Science Lab’s mission is to work with other community members to build technical prototypes that move science on the web forward. In particular, we want to show that many problems can be solved by making existing tools and technology work together, rather than by starting from scratch. The reason behind that is two-fold: (1) most of the stuff needed to change behaviors already exists in some form and (2) the smartest minds are usually outside of your organization.
This project has a few layers to it, ranging from the strictly technical to the social. Below we discuss a few of the main tenets.
A means of pushing code from GitHub to figshare
This project exists to show how, using open technologies, you can get two services to talk to one another through server-side (particularly OAuth) as well as browser-based technologies. This allows the user to move information from one service to the other seamlessly on the web, as well as collect meaningful information needed for reuse.
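To make that server-side handoff concrete, here is a minimal sketch, assuming a figshare OAuth token has already been obtained. The upload endpoint and metadata field names are illustrative placeholders, not the exact API the prototype uses; the GitHub archive URL pattern, however, is a well-known path GitHub serves for any ref.

```python
import json
import urllib.request


def github_archive_url(owner, repo, ref="master"):
    """Build the URL of a downloadable tarball snapshot of a repository.

    GitHub serves an archive of any branch or tag at this path.
    """
    return "https://github.com/{0}/{1}/archive/{2}.tar.gz".format(owner, repo, ref)


def build_upload_request(api_url, token, metadata):
    """Prepare an authenticated POST handing the snapshot's metadata to
    the data repository (figshare, in the prototype). `api_url` is a
    placeholder; the OAuth token authorizes the call on the user's behalf.
    """
    return urllib.request.Request(
        api_url,
        data=json.dumps(metadata).encode("utf-8"),
        headers={
            "Authorization": "Bearer " + token,  # OAuth bearer token
            "Content-Type": "application/json",
        },
    )
```

The point of the sketch is the shape of the flow: the browser side only needs to know the repository and ref, and the server side exchanges an OAuth grant for a token it can use to deposit on the user’s behalf.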
You can access the current prototype here:
It provides two different entry points for those looking to push their code from GitHub to figshare — one through the intermediary site linked above, the other from your repository via a bookmarklet or browser extension.
The goal of the prototype is to test a user flow that deliberately starts in the user’s code repository, see how that flow maps to the needs of the researcher, and keep the process as frictionless as possible.
If you try the prototype, you’ll notice that once you click the “Get a DOI” bookmarklet, you’re asked to fill in a few fields of information about the code. That’s an initial reflection of the conversation going on here — the fields to be collected are subject to change as we nail down the minimal set needed.
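As one guess at what that minimal set might look like, the form’s fields could be modelled as a small required-fields check before the deposit is sent. The field names here are placeholders for discussion, not the settled list:

```python
def snapshot_metadata(title, authors, description, license_name):
    """Collect the (hypothetical) minimal fields for a code snapshot,
    refusing to proceed if any required field is empty."""
    missing = [name for name, value in [
        ("title", title),
        ("authors", authors),
        ("description", description),
        ("license", license_name),
    ] if not value]
    if missing:
        raise ValueError("required fields missing: " + ", ".join(missing))
    return {
        "title": title,
        "authors": authors,
        "description": description,
        "license": license_name,
    }
```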
Produce code and documentation so that other systems can do this too.
We’ve seen in recent weeks that both figshare and Zenodo have built this functionality into their services, using the code generated from this collaboration. We think that’s great, and it was one of the aims of this project: to build out a prototype to start the conversation, and provide the building blocks for others to repurpose for their own systems. The more choice for users, the better; and the more code with a DOI — able to be discovered, reused, shared and viewed on equal footing with published papers or datasets — the better, in our opinion.
The aim of this was not to start and finish solely with GitHub and figshare, but instead show how two repositories (or a code-hosting service and a data repository, to be precise) could move information between one another using open APIs. (An API is the interface implemented by an application which allows other applications to communicate with it. APIs are really useful for allowing research and information to move freely on the web and be easily reused.)
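To make “open API” concrete: GitHub exposes repository metadata as JSON at `https://api.github.com/repos/{owner}/{repo}`, and any deposit tool can read it. Here is a small sketch of pulling out the fields such a tool might need; the sample response is heavily abbreviated (real responses carry many more fields), and which fields a deposit tool actually wants is our assumption:

```python
import json

# Abbreviated example of a GitHub repository API response.
SAMPLE_RESPONSE = """
{
  "full_name": "octocat/Hello-World",
  "description": "My first repository on GitHub!",
  "html_url": "https://github.com/octocat/Hello-World",
  "default_branch": "master"
}
"""


def deposit_fields(api_json):
    """Map a repository API response onto the fields a deposit tool
    might carry over to the archive record."""
    repo = json.loads(api_json)
    return {
        "title": repo["full_name"],
        "description": repo["description"],
        "source_url": repo["html_url"],
        "branch": repo["default_branch"],
    }
```

Because the interface is open and documented, any archive — not just figshare — can consume the same response and build its own deposit record from it.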
For this project, the DOI is assigned to a snapshot of the code archived by figshare, not the code repository itself. This is useful for a few reasons. If the repository disappears, the DOI will still resolve to the archived version, and if a user decides to switch code hosting services (for example), the DOI can remain the same and will still resolve to the archived version of the code.
There are a number of services out there that also archive content and data as well as mint DOIs (Dryad and DataVerse, for starters). We wanted to make sure the prototype was designed in a way that made the process as transparent and reusable as possible, whether others want to apply it to their own repository or just see what we did under the hood.
Discussion about best practice and “metadata for reuse”
When we first started discussions around our latest “Code as a research object” project, one of the main topics that arose was reuse. It’s one thing for code and software to have an identifier that the community trusts so that it can be integrated into scholarly publishing systems, but what about the researchers looking to use that identifier to build or reuse the code in their own work? What information is needed for the code to be discovered, picked up, forked and run by someone else outside of their lab? In short, what would the ideal README look like?
In our post back in February we pointed the community towards a few ideas and encouraged their feedback and comments in this discussion thread. Some fascinating contributions emerged in that thread, and we’re working on sifting them into a best practices document.
Here are a few ideas we tossed out as a starting point:
What does the code do? For what field? (Short descriptor)
What’s needed to get the code to run? Is it part of a larger codebase? Links to relevant repositories or tools used to run the code.
Link to the documentation on how the code is used.
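Pulling those starting points together, a minimal README covering the fields above might look something like the sketch below. The project name, links and license are placeholders, not a prescription:

```markdown
# frobnicate-scripts

Analysis scripts for [a hypothetical project] in computational biology:
takes raw sequencing counts and produces normalized expression tables.

## Requirements

- Python 2.7 with numpy and pandas
- Part of the larger [lab-pipeline](https://example.org/lab-pipeline) codebase

## Usage

See [the documentation](https://example.org/docs) for how to run the
scripts against your own data.

## License

MIT. Please cite this code via the DOI of its archived snapshot.
```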
But one thing we’re still grappling with in the comments is how low or high to set the bar for reproducibility for code, especially when catering to an audience that may not identify as “computational” per se, or have much experience with software development beyond running some data analysis.
A point was also raised about our use of the term “metadata”, which can carry a certain “machine readable or bust” connotation in various circles. We were framing our “ideal README” as the “human-readable” metadata — the overarching context needed for a lay researcher to glean enough about the piece of code to put it in context and possibly even reuse it.
We’ve woven some of these fields into our prototype to show what that could look like as part of the build, and we’ll continue to hone it over the coming weeks. Care to help us? Get in touch. We’d love to hear your thoughts.
Why DOIs again? Isn’t this already being done?
There’s also been a lot of discussion about software citation itself, which pre-dates our small project; we’re by no means the first group to work on this or even to cite code.
The choice to link to a data repository that also mints Digital Object Identifiers (DOIs) was a deliberate one: even though URLs and other unique identifiers can be cited, DOIs are still seen as the gold standard for citing research objects. In many ways, being able to assign DOIs to code (as well as data) signals to the research community that these scholarly contributions are on par with the published paper, using a trusted mechanism already understood and used by many to track citations, impact and use.
Testing – We had a number of publishers add their names to our call for testers. We’re keen to see how we can close that feedback loop, from researchers archiving a snapshot of their code in a data repository to receiving a DOI they then use in publication. We’re also still looking for feedback from researchers. Let us know what you think.
Role of institutional repositories – Is there a role for the institutional repository or library in archiving snapshots of the code? How can they help?
Best practice for code reuse and citation – We’ll be working to take what’s in this discussion thread on minimal metadata for code and turn it into a resource. Also have a look at this thread on how to cite software. We’d love your thoughts.