Code as a research object: a new project

Part of the Science Lab’s mission is to work with other community members to build technical prototypes that move science on the web forward. In particular, we want to show that many problems can be solved by making existing tools and technology work together, rather than by starting from scratch. The reasoning behind that is twofold: (1) most of the stuff needed to change behaviors already exists in some form, and (2) the smartest minds are usually outside of your organization.

Our newest project extends our existing work around “code as a research object”, exploring how we can better integrate code and scientific software into the scholarly workflow. The project will test building a bridge that will allow users to push code from their GitHub repository to figshare, providing a Digital Object Identifier for the code (a gold standard of sorts in science, allowing persistent reference linking). We will also be working on a “best practice” standard (think a MIAME standard for code), so that each research object has enough documentation for others to use it meaningfully.

The project will be a collaboration between the Science Lab, Arfon Smith (GitHub; co-founder of Zooniverse), and Mark Hahnel and his team at figshare.

Why code?

Scientific research is becoming increasingly reliant on software. But despite an ever-increasing amount of the academic process being described in code, research communities do not yet treat these products as a fundamental component or “first-class research object” (see our background post here for more). Until recent years, the sole “research object” in discussion was the published paper, the main means of packaging together the data, methods and research to communicate findings. The web is changing that, making it easier to unpack the components such as data and code for the community to remix, reuse, and build upon.

A number of scientists are pushing the envelope, testing out new ways of bundling their code, data and methods together. But outside of copying and pasting lines of code into a paper or, if we’re lucky, including it in a supplementary information file alongside a paper, the code is still often separated from the documentation needed for others to meaningfully use it to validate and reproduce experiments. And that’s if it’s shared openly at all.

Code can go a long way in helping academia move toward the holy grail that is reproducibility. Unfortunately, academics whose main research output is the code they produce often cannot get the recognition they deserve for creating it. There is also a problem with versioning: citing a paper written about software (as is common practice) gives no indication of which version, or release in GitHub terms, was used to generate the results.
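To make the versioning problem concrete, here is a minimal sketch in Python of the information a citable code snapshot might carry. It is an illustration only, not the design of the project: the `CodeSnapshot` class, its field names, the repository name, and the DOI shown are all hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CodeSnapshot:
    """Hypothetical record for one citable snapshot of a repository."""
    owner: str        # GitHub account or organization (illustrative)
    repo: str         # repository name (illustrative)
    release_tag: str  # the release the results were generated with
    commit_sha: str   # the exact commit behind that release
    doi: str          # identifier issued by the archive (placeholder here)

    def source_url(self) -> str:
        """Link to the repository tree at the tagged release."""
        return f"https://github.com/{self.owner}/{self.repo}/tree/{self.release_tag}"

    def citation(self) -> str:
        """Citation string that names the release, not just the project."""
        short_sha = self.commit_sha[:7]
        return (
            f"{self.owner}/{self.repo}, release {self.release_tag} "
            f"(commit {short_sha}). doi:{self.doi}"
        )


# Placeholder values: none of these refer to a real repository or DOI.
snapshot = CodeSnapshot(
    owner="example-lab",
    repo="analysis-code",
    release_tag="v1.2.0",
    commit_sha="0123456789abcdef0123456789abcdef01234567",
    doi="10.5555/example",
)
print(snapshot.citation())
print(snapshot.source_url())
```

The point of the sketch is that the release tag and commit travel with the identifier, so a reader of the citation knows exactly which state of the code produced the results.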

What we’re testing

figshare and GitHub are two of the leading repositories for data and code (figshare for data; GitHub for code). Open data repositories like figshare have led the way in recent years in changing our practices in relation to data, championing the idea of data as a first-class research object. figshare and others such as Harvard’s Dataverse and Dryad have helped change how we think of data as part of the research process, providing citable endpoints for the data itself that the community trusts (DOIs), clear licensing, and easy ways to download, remix, and reuse information. One of the main objectives here is that the exact code used in a particular investigation can be accessed by anyone and persists in the form it was in when cited.

This project will test whether having a means of linking code repositories to those commonly used for data will allow for software and code to be better incorporated into existing credit systems (by having persistent identifiers for code snapshots) and how seamless we can make these workflows for academics using tools they are already familiar with. We’ve seen this tested with data over recent years, with sharing of detailed research data associated with increased citation rates (Piwowar, 2007). This culture shift of publishing more of the product of research is an ongoing process and we’re keen to see software and code elevated to the same status as the academic manuscript.

We believe that having the code closely associated with the data it executes on (or generates) will help reduce the barriers to reproducing and building upon the work of others. This is already being tested in the reverse, with computational scientists nesting their data with the code in GitHub (Carl and Ethan, for example). We want to find out if formally linking the two to help ease that pain will change behavior.

We are also looking to foster a culture of reuse with academic code. While we know there are lots of variables in this space, we are actively soliciting feedback from the community to help determine best practices for licensing and workflows.

How to get involved

(UPDATE: Want to help us test? Instead of sending us an email, how about adding yourself to this issue in our GitHub repository? More about that here.)

Mark and Arfon will be joining us for our next Mozilla Science Lab community call on December 12, 2013. Join us to hear more about the project. Have a question you’d like to ask? Add it to the etherpad!

We’re also looking for computational researchers and publishers to help us test out the implementation. Shoot us an email if you’d like to participate.

4 responses

  1. Franklin Chen wrote:

    I’m happy to see efforts to improve the replicability and sharing of scientific research! Too often I’ve seen research data and programs to manage them being just privately backed up (or not) and then inaccessible after a while.

  2. Scott Edmunds wrote:

    This is a very timely scheme, as at GigaScience, for the reasons you outline above, we’ve been experimenting for about a year now on archiving bundles of code and pipelines with DOIs in our GigaDB repository (e.g. http://dx.doi.org/10.5524/100044), and F1000Research have also recently started trialling a similar pipeline with Zenodo (http://blog.f1000research.com/2013/10/11/open-access-software-our-recent-software-repository-collaborations/). We’ve also started issuing DOIs to workflows after persuading DataCite to include them as a resource type in their latest schema (see: http://schema.datacite.org/meta/kernel-3/example/datacite-example-workflow-v3.0.xml), so would computational workflows fit into this project? Not sure I can make the call as our Asian timezone is a bit problematic, but it would be great to compare notes with you (plus F1000Research, researchobject.org and anyone else working in this area) to make sure we have best practices and standards ironed out, and if you are looking for participants and guinea pigs we’d definitely be interested.

    1. Kaitlin Thaney wrote:

      Thanks, Scott! Will be in touch. :)

  3. JunZhao wrote:

    Hi, how are you guys related to the Research Object community project, http://researchobject.org, which stemmed from the original idea that was published back in 2007?