Code as a research object: Metadata for software discovery

This is cross posted from http://www.arfon.org/json-ld-for-software-discovery-reuse-and-credit

This is a continuation of some work I’ve been doing with the Mozilla Science Lab and their ‘code as a research object’ program. There are multiple aspects to this project, including work on code and GUI prototypes, discussions around best practices for making code reusable, and software citation. This post explores some ideas around linked data and machine readable descriptions of software repositories, with the goal of making software more discoverable and therefore increasing reuse.

JSON-LD

JSON-LD is a way of describing data with additional context (or semantics if you like) so that for a JSON record like this:

{ "name" : "Arfon" }

when there’s an entity called ‘name’ you know that it means the name of a person and not a place.

If you haven’t heard of JSON-LD then there are some great resources here and an excellent short screencast on YouTube here.

One of the reasons JSON-LD is particularly exciting is that it’s a lightweight way of organising JSON-formatted data and giving semantic meaning without having to care about things like RDF data models, XML and the (note the capitals) Semantic Web. Being much more succinct than XML and JavaScript native, JSON has over the past few years become the way to expose data through a web-based API.

JSON-LD offers a way for API providers (and consumers) to share data more easily, with little or no ambiguity about what the data they’re describing means.
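To make the idea of ‘context’ concrete, here is a minimal Python sketch. It assumes a hand-written inline context that maps the short key ‘name’ to the schema.org IRI, and the helper name `expand_keys` is made up for this illustration. It is a toy, not the real JSON-LD expansion algorithm, which also handles nesting, types and remote contexts.

```python
import json

# A toy inline context: each short key maps to a full IRI.
CONTEXT = {"name": "http://schema.org/name"}


def expand_keys(record, context):
    """Replace short keys with the IRIs the context assigns them.

    Keys that the context does not mention are left unchanged.
    """
    return {context.get(key, key): value for key, value in record.items()}


record = {"name": "Arfon"}
expanded = expand_keys(record, CONTEXT)
print(json.dumps(expanded))
```

Once the key is a full IRI, two APIs that both say ‘name’ can be checked to see whether they mean the same thing.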

So what about software?

Over the past few months there’s been a lot of talk about finding ways for researchers to derive (more) credit for code. There are lots of issues at play here but one major factor is that a prerequisite to receiving credit for some piece of code you’ve written is that a peer needs to both be able to find your work and then reuse it.

The problem is, it can be pretty hard to find software unless there’s a standard place to share tools in that language and the author of the code has chosen to publish there. Ruby has RubyGems.org, Python has PyPI, Perl has CPAN but where do I go if I’m looking to find an obscure library written in C++?

Discovering domain, language and function specific software is an even harder problem to crack. Sure, if I write Ruby I can head over to RubyGems to look for a Gem that might solve my problem but I’m relying on both the author to write a descriptive README and my ability to search for terms that include similar language to the author of the package.

For many subjects, where the common languages don’t benefit from canonical package indexes and the function of the software is relatively niche, just finding code that might be useful is a problem.

Towards a (machine readable) description of software

One way to address this discoverability problem is to find a standard way of describing software with context for the terms used. A design goal here should be that these files can be almost entirely automatically generated.

Inspired by the package.json format prescribed by the npm community, and using an ontology described on http://schema.org, below is a relatively short JSON-LD document that describes the Fidgit codebase. Let’s call it ‘code.jsonld’ for now.

Minimal citable form

{
  "@context": "http://schema.org",
  "@type": "Code",
  "name": "Fidgit",
  "codeRepository": "https://github.com/arfon/fidgit",
  "citation": "http://dx.doi.org/10.6084/m9.figshare.828487",
  "description": "An ungodly union of GitHub and Figshare http://fidgit.arfon.org",
  "dateCreated": "2013-10-19",
  "license": "MIT",
  "author": {
    "@type": "Person",
    "name": "Arfon Smith",
    "@id": "http://orcid.org/0000-0002-3957-2474",
    "email": "arfon@github.com"
  }
}

Note that the first two lines (‘@context’ and ‘@type’) define the context for the key/value pairs in the JSON structure, so that ‘name’ means the name of the codebase. You can see the full ontology for ‘Code’ here but this should mostly be straightforward to understand [1].

Once we get to the ‘author’ attribute we enter a new context, that of an individual. As we’re still using the schema.org ontology, this time for type ‘Person’, we only need to set the ‘@type’ attribute here.

There are a bunch more attributes that we could set here but this feels like a minimal set of information that is sufficient for citation (and therefore credit and attribution for the author).
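Since the design goal is that these files can be generated almost automatically, here is a minimal Python sketch of what a generator might look like. The function name `build_code_jsonld` and its parameters are assumptions made for illustration. In practice the values would presumably come from repository metadata rather than being typed in by hand.

```python
import json


def build_code_jsonld(name, repository, doi_url, description,
                      date_created, license_name, author):
    """Assemble the minimal citable description as a plain dict."""
    return {
        "@context": "http://schema.org",
        "@type": "Code",
        "name": name,
        "codeRepository": repository,
        "citation": doi_url,
        "description": description,
        "dateCreated": date_created,
        "license": license_name,
        "author": author,
    }


doc = build_code_jsonld(
    name="Fidgit",
    repository="https://github.com/arfon/fidgit",
    doi_url="http://dx.doi.org/10.6084/m9.figshare.828487",
    description="An ungodly union of GitHub and Figshare http://fidgit.arfon.org",
    date_created="2013-10-19",
    license_name="MIT",
    author={
        "@type": "Person",
        "name": "Arfon Smith",
        "@id": "http://orcid.org/0000-0002-3957-2474",
        "email": "arfon@github.com",
    },
)

# Serialise; in practice this text would be written to a code.jsonld file.
code_jsonld_text = json.dumps(doc, indent=2)
print(code_jsonld_text)
```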

For data archivers

This next example is a slightly modified version of the minimal one. It includes multiple authors [2] and now also has the keywords that archives like figshare and Zenodo require. (Note these keywords should probably be more explicitly structured rather than relying on comma-delimited strings.)


{
  "@context": "http://schema.org",
  "@type": "Code",
  "name": "Fidgit",
  "codeRepository": "https://github.com/arfon/fidgit",
  "citation": "http://dx.doi.org/10.6084/m9.figshare.828487",
  "description": "An ungodly union of GitHub and Figshare http://fidgit.arfon.org",
  "dateCreated": "2013-10-19",
  "license": "MIT",
  "author": [
    {
      "@type": "Person",
      "name": "Arfon Smith",
      "@id": "http://orcid.org/0000-0002-3957-2474",
      "email": "arfon@github.com"
    },
    {
      "@type": "Person",
      "name": "Kaitlin Thaney",
      "@id": "http://orcid.org/0000-0002-7217-4494",
      "email": "kaitlin@mozillafoundation.org"
    }
  ],
  "keywords": "publishing, DOI, credit for code"
}
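On the note about comma-delimited keywords: one possible way to structure them is as a list of separate terms. The Python sketch below shows the conversion. The helper name `split_keywords` is made up for this illustration, and whether archives such as figshare or Zenodo would accept a list here is an open assumption.

```python
def split_keywords(keywords):
    """Turn 'a, b, c' into ['a', 'b', 'c'], dropping empty entries."""
    return [term.strip() for term in keywords.split(",") if term.strip()]


keyword_list = split_keywords("publishing, DOI, credit for code")
print(keyword_list)
```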

For discovery

I started by describing the problem of software discovery and how domain, function and language specific searches for tools are hard. So far these JSON-LD snippets don’t really help with this problem, as we still only have keywords and a description for describing the software’s function and domain.

The schema.org ‘Code’ ontology includes a ‘programmingLanguage’ attribute, which takes care of language-specific searches. At GitHub we’re pretty good at detecting this automatically with Linguist, so it’s not even clear that an author of a piece of software would need to specify this manually (a win).
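As a rough illustration of filling in ‘programmingLanguage’ without asking the author, the Python sketch below guesses a language by counting file extensions. This is not how Linguist works; the extension table, the file listing and the function name `guess_language` are all hypothetical and deliberately tiny.

```python
from collections import Counter
from pathlib import PurePosixPath

# Hypothetical, deliberately tiny extension table; Linguist's real
# detection is far richer than this.
EXTENSION_TO_LANGUAGE = {
    ".rb": "Ruby",
    ".py": "Python",
    ".pl": "Perl",
    ".cpp": "C++",
}


def guess_language(paths):
    """Return the most common known language among the paths, or None."""
    counts = Counter()
    for path in paths:
        language = EXTENSION_TO_LANGUAGE.get(PurePosixPath(path).suffix)
        if language:
            counts[language] += 1
    return counts.most_common(1)[0][0] if counts else None


# Hypothetical file listing, invented for illustration.
files = ["app.rb", "lib/worker.rb", "README.md", "script/setup.py"]
doc = {"@type": "Code", "name": "example-project"}
language = guess_language(files)
if language:
    doc["programmingLanguage"] = language
print(doc)
```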

The challenge when designing a more ‘complete’ ‘code.jsonld’ document is that it’s seemingly rather tough to automate a description of what subject domain the software has been designed for and what the software does.

PLOS ONE has a pretty decent subject taxonomy that I’ve extracted into a machine readable form here, so it’s possible something along these lines could be used to assign a subject domain. Thus far, I’ve been unable to find a good schema for describing academic subjects (or any subject domains). Going deeper and attempting to describe the function of the software as well is also proving challenging.

Feedback please!

At this point I’d love some feedback on these ideas. The goal here is to promote software discovery and reuse, so framing this in what’s possible today is a good place to start reflecting on these ideas. Right now it’s possible to do a pretty advanced search for code on GitHub with facets for programming language, file extension, creation date, username and more. Imagine if you could do the same but add in subject area and software function?
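To give a feel for what such a search might look like, here is a minimal Python sketch of faceted filtering over an index of descriptions. Everything in it is an assumption: the index entries are invented, keywords are stored as lists, and ‘subject’ is a hypothetical key rather than a schema.org attribute.

```python
def search(records, language=None, keyword=None, subject=None):
    """Return the records that match every facet that was supplied."""
    results = []
    for record in records:
        if language and record.get("programmingLanguage") != language:
            continue
        if keyword and keyword not in record.get("keywords", []):
            continue
        # "subject" is a hypothetical key, not a schema.org attribute.
        if subject and record.get("subject") != subject:
            continue
        results.append(record)
    return results


# Hypothetical index entries, invented for illustration.
index = [
    {"name": "tool-a", "programmingLanguage": "C++",
     "keywords": ["imaging"], "subject": "Astronomy"},
    {"name": "tool-b", "programmingLanguage": "C++",
     "keywords": ["alignment"], "subject": "Genetics"},
    {"name": "tool-c", "programmingLanguage": "Ruby",
     "keywords": ["imaging"], "subject": "Astronomy"},
]

matches = search(index, language="C++", subject="Astronomy")
print([record["name"] for record in matches])
```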

One major pitfall with this idea is that in order for an index of ‘code.jsonld’ files to be useful, people have to start making them: a classic chicken-and-egg problem. All is not lost though. Pretty much all of the minimal ‘code.jsonld’ file can be auto-generated and perhaps submitted to authors as a pull request by a friendly robot?

One of the biggest barriers to reusing research software is finding the damn stuff in the first place – does this help?

Links

1. Note the Code ontology on schema.org doesn’t include a license attribute which seems like an oversight.
2. It’s not clear that this is allowed!

4 responses

  1. Brent Shambaugh wrote:

    Arfon, I thought about using JSON-LD with webpayments (as in Manu Sporny’s initiative) to incentivize the creation of linked data. Ideally I’d want it to be donation based.

  2. Brent Shambaugh wrote:

    I have a friend who is a linguist that has a theory “that meaning is an expansion, not a compression, of the real-world”, http://ontolog.cim3.net/forum/ontolog-forum/2010-02/msg00309.html Perhaps there is a way to get back to semantic labelling thinking this way?

    Here is an excerpt:
    “Consistency is what computers demand, but forcing people to do what
    computers want is not ideal. Even if you “assist” them with lots of
    training and big penalties.

    Much better to find a way to assist the computer to interpret data the
    same way as humans. If you try to assist the human to reach the right
    decision by attaching lots of context, why not find a way to assist a
    computer to use that context instead.”

  3. JunZhao wrote:

    Hi Arfon, It is great to see a concrete proposal like this. We have been engaged in a related activity in facilitating the sharing and reuse of scientific workflows through machine readable descriptions. By working very closely with domain scientists, we have also concluded that a minimal approach is essential in encouraging scientists to engage with and create such descriptions; see some of our examples (http://bit.ly/1qy00WN and https://github.com/CHIP-SET/clinicalcodes.article5/). So, I think you are absolutely on track in this regard.

    But it was interesting to see that so far in your examples software code has been mostly treated as a standalone object, apart from being linked with its authors. What about other types of objects that are related to the code, like example data, sample parameter settings etc.? This is the kind of ‘research object’ that we commonly understand: a bundle of artefacts that provides the essential context for reusing and reproducing code and methods. Is this a result of the minimal principle that you are applying or of the discussions with the communities? Would be great to hear your thoughts!

    P.S. Would be nice to see a follow-up post analysing how each piece can be automated by the current tooling or infrastructure, taking GitHub as an example :)

    Jun, on behalf of researchobject.org

  4. Thomas Smith wrote:

    Hi! I would like to point out that there are a lot of existing data sources that include this information. You’ve probably already thought of the fact that a subset of your information is included in e.g. the package.json or the PyPI record. But the Linux distributions also package a lot of software, and it includes e.g. C++ libraries. Here’s BLAST2, the bioinformatics tool, in Debian: https://packages.debian.org/wheezy/blast2 . It includes tags for bioinformatics, the implementation language, etc. The Debian package lists include all of this information in a machine-readable format.

    So, if you want to mess with integrating information from various sources, … the sources are there.