Q&A and Conversations About Code as a Research Object

This is the third of three blog posts touching on highlights of our community call on December 12. Previous posts presented updates from the community and questions and answers arising from Ed Lazowska’s discussion of the new Moore/Sloan data science centers project. Today’s post is about the response to Mozilla Science Lab’s Code as a Research Object project.

The Code as a Research Object project was discussed in the meeting by Arfon Smith of GitHub and Mark Hahnel of figshare. You can read the notes from their discussion starting on line 224 of the meeting Etherpad.

Below we’ve captured some questions and answers from the discussion that followed, as well as some comments and references to relevant projects from community members. There are lots of topics to explore further as the project progresses.

(As with previous posts, attribution can be tricky when working with Etherpad notes, especially with a meeting as big and lively as this. We’ve attributed as we could below — if you would like your name added or removed, feel free to get in touch.)

Will this technology interoperate with other VCS/DOI providers?

“This is the plan. Working with GitHub and figshare is just to model (any others with open APIs should work, but we had to start somewhere).” —Kaitlin Thaney

Will posting code from GitHub to figshare imply CC-BY licensing of the code (AFAIK content on figshare is currently CC-BY)?

Related question: authorship. Can I fork someone’s repository and archive it on figshare under my name? I suppose a “let’s trust users” approach will work for a while, but not forever.

“Great question, and one where we’ve been discussing what’s best. Perhaps if a license.txt file is not already included, default to MIT or BSD? Open to suggestions.” —Kaitlin Thaney

License file first and then default to BSD/MIT sounds like a good solution to me. The other option is to just have folks choose an explicit license from a list of standard code licenses.

I am not a fan of ‘defaults’ for legal issues. Force people to choose, then you are sure they actually thought about the question.

“Note ‘defaults’ can just go to lots of licenses that you need to read and that can conflict with each other.” —Raniere Silva

And don’t forget ‘other’, if there’s a checklist for licenses. In particular, I expect many non-trivial packages to have multiple licenses for different parts of the code.
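The "license file first, otherwise make people choose" idea from the comments above could look roughly like the Python sketch below. Everything named here is an assumption made for illustration: the function `resolve_license`, the file names it looks for, and the short checklist in `KNOWN_LICENSES` are hypothetical and not part of the project's actual design.

```python
from pathlib import Path
from typing import Optional

# Hypothetical file names (lowercased) treated as a license declaration.
LICENSE_FILENAMES = {"license", "license.txt", "license.md", "copying"}

# Hypothetical checklist shown to the user; "other" covers mixed licensing.
KNOWN_LICENSES = {"MIT", "BSD-3-Clause", "Apache-2.0", "GPL-3.0", "other"}


def resolve_license(repo_dir: str, user_choice: Optional[str] = None) -> str:
    """Return a license label for a repository snapshot.

    A license file found in the repository wins. Otherwise the caller must
    pass an explicit choice; there is deliberately no silent default.
    """
    root = Path(repo_dir)
    for entry in sorted(root.iterdir()):
        if entry.is_file() and entry.name.lower() in LICENSE_FILENAMES:
            return f"file:{entry.name}"
    if user_choice is None:
        raise ValueError("no license file found; an explicit choice is required")
    if user_choice not in KNOWN_LICENSES:
        raise ValueError(f"unknown license: {user_choice}")
    return user_choice
```

Swapping the first `raise` for a `return "MIT"` would give the "default to MIT or BSD" variant instead, so the two positions in the discussion differ by a single line of policy.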

What schema(s) will be used for software metadata? (DOAP, http://schema.org/SoftwareApplication etc.) How will it be entered?

“Good question. DOAP looks like it could work well. I’m not sure how we get people to start writing these descriptors though.” —Arfon Smith
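To make the metadata question a little more concrete, here is a minimal Python sketch that builds a schema.org SoftwareApplication descriptor as JSON-LD. The helper name `software_descriptor` and the choice of fields are assumptions for illustration, and any values passed in the example are made up. Generating the descriptor from a handful of fields is one possible answer to "how will it be entered", not a decision the project has made.

```python
import json
from typing import Dict, List


def software_descriptor(name: str, version: str, repo_url: str,
                        license_url: str, authors: List[str]) -> Dict:
    """Build a small schema.org SoftwareApplication record as a dict."""
    return {
        "@context": "http://schema.org",
        "@type": "SoftwareApplication",
        "name": name,
        "softwareVersion": version,
        "url": repo_url,
        "license": license_url,
        "author": [{"@type": "Person", "name": a} for a in authors],
    }


if __name__ == "__main__":
    # All values below are placeholders, not a real project.
    record = software_descriptor(
        name="example-tool",
        version="1.0.0",
        repo_url="https://github.com/example/example-tool",
        license_url="http://opensource.org/licenses/MIT",
        authors=["A. Researcher"],
    )
    print(json.dumps(record, indent=2))
```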

A typical piece of research software has dependencies whose exact versions should also be tracked. Assuming everything is on Github, will it be possible to send a combined snapshot of all relevant repositories for archiving on figshare?

“At this point I think we’re thinking only about a single repo being archived. There’s no reason this couldn’t be extended further but for a first release I think single repo – single figshare article is useful.” —Arfon Smith

“Don’t submodules help in this regard if the dependencies are all in git repositories along with tarballs using SHAs/hashes? We have been doing work in this area to build more complex projects reproducibly (largely C/C++/Fortran compiled projects).” —Marcus Hanwell

“At least for R packages, RStudio has been working on a package, Packrat, that helps to create a ‘snapshot’ of code and full dependencies to enable reproduction of a particular ‘state’ of code. However, this is simplified by the use of a single language package state.” —Robert M. Flight

“Packrat is great but dependencies are not a huge problem of Rstats. CRAN is a robust centralized archive and it’s easy to get any versions of packages installed on a new box. Packrat is meant to ease immediate collaboration.” —Karthik Ram

“I would argue that getting ‘archived’ versions of packages working to ensure reproducibility can be hard, even in R, having done it once.” —Robert M. Flight
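The "combined snapshot" idea, with every repository pinned by SHA as git submodules do, can be sketched as a small manifest builder. This is only an illustration under stated assumptions: the function `build_manifest`, the manifest layout, and the fingerprint field are hypothetical, and nothing here reflects how figshare or GitHub would actually store a snapshot.

```python
import hashlib
import json
import re
from typing import Dict

# A full git commit id (SHA-1) is 40 lowercase hex characters.
SHA1_RE = re.compile(r"^[0-9a-f]{40}$")


def build_manifest(main_repo: str, main_sha: str,
                   dependencies: Dict[str, str]) -> Dict:
    """Pin a project and its dependencies to exact commits.

    `dependencies` maps a repository URL to a commit SHA. Branch names such
    as "master" are rejected because they do not identify a fixed state.
    """
    pins = {main_repo: main_sha, **dependencies}
    for url, sha in pins.items():
        if not SHA1_RE.match(sha):
            raise ValueError(f"{url} is pinned to {sha!r}, not a full commit SHA")
    body = {"main": main_repo, "pins": dict(sorted(pins.items()))}
    # Fingerprint of the whole set, independent of the order deps were listed.
    digest = hashlib.sha256(
        json.dumps(body, sort_keys=True).encode("utf-8")
    ).hexdigest()
    return {**body, "manifest_sha256": digest}
```

A manifest like this only records which commits belong together. Archiving the actual contents of each repository would still be a separate step.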

What is the advantage of assigning a DOI to software rather than using the GitHub URL?

“DOIs are permanent. You can’t change what a DOI points to, whereas URLs can point to rapidly changing content.” —Konrad Hinsen

The DOI is becoming the standard for the data citation infrastructure.

From what I understand of the DOI, if it points to a URL, you need to update it each time the URL changes (but the DOI is still useful, it is like the BibTeX entry of a paper, right?)

“The idea of a DOI is that the document being pointed at remains the same forever. So you change the associated URL only if the document needs to move. It’s supposed to be a rare event. DOI registrars are supposed to guarantee that the ultimate document is always the same.” —Konrad Hinsen
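The indirection described in this exchange (the DOI stays fixed, and only the location behind it may change) can be sketched in a few lines of Python. `ToyRegistry` is a toy stand-in invented for illustration and is not how a real registrar works. The `resolver_url` helper relies only on the fact that DOIs can be resolved through https://doi.org/, and the DOI in the usage note below is made up.

```python
from urllib.parse import quote


def resolver_url(doi: str) -> str:
    """Build the public resolver link for a DOI via doi.org."""
    prefix, sep, suffix = doi.partition("/")
    if not (prefix.startswith("10.") and sep and suffix):
        raise ValueError(f"not a DOI: {doi!r}")
    return f"https://doi.org/{prefix}/{quote(suffix)}"


class ToyRegistry:
    """Toy model of a registrar: maps each DOI to its current location."""

    def __init__(self):
        self._locations = {}

    def register(self, doi: str, url: str) -> None:
        if doi in self._locations:
            raise ValueError("a DOI is never reassigned to a different object")
        self._locations[doi] = url

    def move(self, doi: str, new_url: str) -> None:
        # Same object, new home: only the location changes, never the DOI.
        if doi not in self._locations:
            raise KeyError(doi)
        self._locations[doi] = new_url

    def resolve(self, doi: str) -> str:
        return self._locations[doi]
```

A citation would store only the DOI, for example the made-up `10.1234/example.5678`. If the archived snapshot moves to a new host, `move` updates the location and every existing citation keeps working.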

Suppose a piece of code is cited with a DOI, then the code gets updated (and/or debugged). What’s the best way to handle this situation, which I expect will be quite a frequent occurrence? —Jeremy Magland

“Put a reference to the working repository (Github etc.) in the object that the DOI points to. Ideally this would take the form of machine-readable metadata, meaning we need standards.” —Konrad Hinsen

“I think we should consider distinguishing between an exact version (for reproducibility) and a DOI for a project (for credit/attribution; citing individual commits would cause dilution).” —Rémi Emonet

“For now, DOIs are for Digital Objects, as the name implies. I don’t know any equivalent for a project — maybe ORCID could be extended for this?” —Konrad Hinsen

How will code versioning work with the DOI? Will future commits to the repository automatically be granted a new DOI? Will the DOI reflect the SHA?

See comments from Geoff Bilder (from CrossRef) on rapid versioning and “just in time” identifiers. People may use any particular commit in their publication, and therefore Geoff proposes creating the identifier when the first citation is detected. As one of the creators of DOIs he’s thought a lot about this issue and is worth chatting to.

Maybe you will not want a new identifier for every single commit? Maybe only tagged commits (which show up as “releases” on GitHub)?

Releases make sense, as a user can link to a specific tag and add some explanatory text.
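The two policies raised here, identifiers for tagged releases and "just in time" identifiers for any other commit once it is first cited, can be combined in one small sketch. The class `JustInTimeIdentifiers` and the `example-id-N` format are invented for illustration; real DOIs would come from a registrar, and detecting "the first citation" is the hard part this sketch leaves out.

```python
from typing import Dict, Iterable


class JustInTimeIdentifiers:
    """Mint identifiers for releases up front and for other commits lazily."""

    def __init__(self, release_shas: Iterable[str] = ()):
        self._ids: Dict[str, str] = {}
        self._counter = 0
        # Tagged releases get an identifier immediately.
        for sha in sorted(set(release_shas)):
            self._mint(sha)

    def _mint(self, sha: str) -> str:
        self._counter += 1
        self._ids[sha] = f"example-id-{self._counter}"  # placeholder format
        return self._ids[sha]

    def has_identifier(self, sha: str) -> bool:
        return sha in self._ids

    def cite(self, sha: str) -> str:
        """Return the identifier for a commit, minting one on first citation."""
        if sha not in self._ids:
            return self._mint(sha)
        return self._ids[sha]
```

With this arrangement, commits that nobody ever cites never consume an identifier, while a commit cited twice gets the same identifier both times.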

Figshare is not nearly as well-suited for actually browsing a code repository as is the Github interface. How will this be handled?

Just add a link to Github to the landing page of the figshare document and everyone will be happy. Use figshare for archiving, Github for developing and sharing.

Granting a DOI to code is supposed to make it permanent; will having a DOI on a GitHub repo prevent the user from deleting it? I suppose there will be a DOI–SHA bridge; what if the git history has sensitive information or legal issues and the branch needs to be filtered? Will the DOI still resolve if the SHA does not exist?

“Yes, because the DOI will be resolving to figshare and not a GitHub address.” —Arfon Smith

I suppose that figshare will store a snapshot of the repository, not just a reference to a commit.

“Correct.” —Mark Hahnel

Could we link up changes in software on Github to figshare via Github hooks? —Scott Chamberlain

“That’s our planned integration point. … which means that the same methods could be used to push to other data repositories that can issue DOIs.” —Arfon Smith
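As a rough picture of the hook-based integration mentioned above, the sketch below turns a GitHub webhook delivery into an archive request. It assumes the payload shape of GitHub's `release` event (an `action` field plus `release.tag_name`, `release.tarball_url` and `repository.full_name`); the function name `archive_request` and the returned dictionary are hypothetical, and the actual deposit into figshare or any other repository is deliberately left out.

```python
from typing import Optional


def archive_request(event_name: str, payload: dict) -> Optional[dict]:
    """Decide whether a webhook delivery should trigger an archive.

    `event_name` is the value of the X-GitHub-Event header. Only published
    releases are archived; pushes and other events are ignored.
    """
    if event_name != "release" or payload.get("action") != "published":
        return None
    release = payload.get("release", {})
    repository = payload.get("repository", {})
    return {
        "repository": repository.get("full_name"),
        "tag": release.get("tag_name"),
        "snapshot_url": release.get("tarball_url"),
    }
```

Because the function only describes what to archive, the same request could in principle be handed to any data repository that issues DOIs, which is the point made in the answer above.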

Will Github improve the licensing of repositories to make selection clearer/more prominent? —Marcus Hanwell

“We’ve been working on this already and encourage people to use http://choosealicense.com/” —Arfon Smith

“That helps, but it would be great to see more here, when creating the repo/additional metadata for repos. Perhaps it will be more fully addressed by generally improved metadata. Looking at a project, I would love to see a UI element showing the license, just like GitHub tells me it is C++/Shell/CSS, etc. in the high-level summary of projects.” —Marcus Hanwell

Comments and References

GigaScience has been doing something similar with archiving and assigning DOIs to software/workflows/etc. GigaScience would definitely be interested in working on some best practice guidelines, etc. with Mozilla, figshare, and GitHub in this project. Citation seems key here as well. If we want to cite code, we should be citing the DOI snapshot version in figshare but we also should be pointing to the live version in GitHub or wherever.

F1000Research is also very keen to collaborate on getting some best practice guidelines together. F1000 have a nice model of ‘versioned’ DOIs, and I would recommend considering applying a DOI to a tag in a github repo, and then versioning the DOI on each incremental tag or software release.

“There are similar efforts across lots of disciplines that are in the planning stage, in ecology you can see ISEES but I believe there are many others.” —Ted Hart

Let X be a DOI on data, and Y a DOI on code. Would be great to be able to ultimately execute Y on X to produce an output Z. (Probably thinking too far ahead).

That can tie into the PROV standard and various workflow tools.

“You can do that with ActivePapers. Even put X and Y and Z in the same digital object.” —Konrad Hinsen

Also WISDM (Web-Interactive Scientific Data Manager) is a prototype version of a tool to do that.

Dat by Max Ogden is a very young idea/project, but I wonder if it could be useful here (it’s probably too far off in terms of stability). It offers “real-time replication and versioning for large tabular data sets”.

Wondering about workflows. Would be nice if workflow of various code and data (research objects) could be integrated into/on top of this. (Also thinking about PROV.)

Workflows are just another level of code. If your workflow tool stores them in a readable format, just put them on GitHub like everything else.

One thing that would be helpful would be citation-like maps of how code snippets diffuse between projects.

AFAIK https://www.synapse.org/ can do something like that.

Some thoughts on credit for software: Daniel S. Katz. Citation and Attribution of Digital Products: Social and Technological Concerns.

This ties well to this NSF-funded project: “Data and Software Preservation for Open Science.”

See Rule 6 of 10 Simple Rules for the Care and Feeding of Scientific Data: “publish your code.”

Regarding metadata — it’s worth thinking of what best practice looks like to make code maximally usable. Like a MIAME standard / MIBBI for code?

“For both data and code, additional metadata needs to be captured at inception that gives hints about length of preservation. We will NOT preserve all data and code. When it comes time to pare down, it would be good if we could do it intelligently.” —John Cobb

Another thing that might be useful for a DOI-centric code repository — peer review of code, stars or kudos for reproducibility verification.

Thanks to all who joined in on the conversation. Stay tuned for more in the coming year and mark your calendars – our next call is January 9.