Part of the Science Lab’s mission is to work with other community members to build technical prototypes that move science on the web forward. In particular, we want to show that many problems can be solved by making existing tools and technology work together, rather than by starting from scratch. The reasoning is twofold: (1) most of what is needed to change behaviors already exists in some form, and (2) the smartest minds are usually outside of your organization.
Our newest project extends our existing work around “code as a research object”, exploring how we can better integrate code and scientific software into the scholarly workflow. The project will test building a bridge that will allow users to push code from their GitHub repository to figshare, providing a Digital Object Identifier for the code (a gold standard of sorts in science, allowing persistent reference linking). We will also be working on a “best practice” standard (think a MIAME standard for code), so that each research object has sufficient documentation to make it possible to meaningfully use.
Scientific research is becoming increasingly reliant on software. But despite there being an ever-increasing amount of the academic process described in code, research communities do not yet treat these products as a fundamental component or “first-class research object” (see our background post here for more). Up until recent years, the sole “research object” in discussion was the published paper, the main means of packaging together the data, methods and research to communicate findings. The web is changing that, making it easier to unpack the components such as data and code for the community to remix, reuse, and build upon.
A number of scientists are pushing the envelope, testing out new ways of bundling their code, data and methods together. But outside of copying and pasting lines of code into a paper or, if we’re lucky, including it in a supplementary information file alongside a paper, the code is still often separated from the documentation needed for others to meaningfully use it to validate and reproduce experiments. And that’s if it’s shared openly at all.
Code can go a long way in helping academia move toward the holy grail that is reproducibility. Unfortunately, academics whose main research output is the code they produce often cannot get the recognition they deserve for creating it. There is also a problem with versioning: citing a paper written about software (as is common practice) gives no indication of which version, or release in GitHub terms, was used to generate the results.
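To make the versioning problem concrete, here is a minimal sketch of what a citable code snapshot might record. Everything in it is an assumption made for illustration: the CodeSnapshot class, its field names, the repository name, and the placeholder values are hypothetical, and this is not the format the project will use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CodeSnapshot:
    """Hypothetical record pinning a citation to one exact version of the code."""

    repository: str  # e.g. "owner/name" on GitHub (illustrative value below)
    release: str     # the release (tag) used to generate the results
    commit: str      # full commit hash that the release points to
    doi: str = ""    # persistent identifier, once one has been assigned

    def citation(self) -> str:
        """Build a short citation string naming the exact version used."""
        text = f"{self.repository}, release {self.release} (commit {self.commit[:7]})"
        if self.doi:
            text += f". doi:{self.doi}"
        return text


# Placeholder values only; none of these refer to a real repository or DOI.
snapshot = CodeSnapshot(
    repository="example-lab/analysis-code",
    release="v1.2.0",
    commit="0123456789abcdef0123456789abcdef01234567",
)
print(snapshot.citation())
```

Unlike a citation to a paper about the software, a record like this tells the reader which release and commit produced the results.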
What we’re testing
figshare and GitHub are two of the leading repositories for data and code (figshare for data; GitHub for code). Open data repositories like figshare have led the way in recent years in changing our practices in relation to data, championing the idea of data as a first-class research object. figshare and others such as Harvard’s Dataverse and Dryad have helped change how we think of data as part of the research process, providing citable endpoints for the data itself that the community trusts (DOIs), offering clear licensing, and making it easy to download, remix, and reuse information. One of the main objectives here is that the exact code used in a particular investigation can be accessed by anyone and persists in the form it was in when cited.
This project will test whether having a means of linking code repositories to those commonly used for data will allow for software and code to be better incorporated into existing credit systems (by having persistent identifiers for code snapshots) and how seamless we can make these workflows for academics using tools they are already familiar with. We’ve seen this tested with data over recent years, with sharing of detailed research data associated with increased citation rates (Piwowar, 2007). This culture shift of publishing more of the product of research is an ongoing process and we’re keen to see software and code elevated to the same status as the academic manuscript.
We believe that having the code closely associated with the data it executes on (or generates) will help reduce the barriers to reproducing and building upon the work of others. This is already being tested in the reverse, with computational scientists nesting their data with the code in GitHub (Carl and Ethan, for example). We want to find out if formally linking the two to help ease that pain will change behavior.
We are also looking to foster a culture of reuse with academic code. While we know there are lots of variables in this space, we are actively soliciting feedback from the community to help determine best practices for licensing and workflows.
How to get involved
Mark and Arfon will be joining us for our next Mozilla Science Lab community call on December 12, 2013. Join us to hear more about the project. Have a question you’d like to ask? Add it to the etherpad!
We’re also looking for computational researchers and publishers to help us test out the implementation. Shoot us an email if you’d like to participate.