Code review for science: What we learned

The results from our code review pilot are in!

This past August, we launched a pilot with PLOS Computational Biology and some of our colleagues at Mozilla to explore the idea of code review for science. With the help of PLOS, we selected a series of published papers that included code, extracted snippets of between 200 and 500 lines, and put them in front of Mozilla engineers. The samples were not whole software packages, but rather representative of the kind of analysis code one would find in a paper, written in Python, R, Perl, and so on.

This experiment was a means to explore the following:

  • What does code review from a software engineer outside of academia look like? How do they approach the task?
  • To what extent is domain knowledge needed to do a successful code review? Is the code parseable by someone outside of that discipline?
  • What lessons can be learned about code review, possibly to influence and enhance traditional peer review?
  • Does this process surface issues around best practice in writing software and code? If so, what are those issues?
  • Following the review, how useful are the comments to the authors? Does this feedback help them in their work? Could it shift the norms around how they write and share code?

Following the completion of the reviews, Professor Marian Petre (Open University) interviewed the Mozilla volunteers about their experience, and we reached out to the paper authors to share the reviewers’ comments on their code. Marian then interviewed the authors as well, to gauge their thoughts on the reviews and on the process in general.

What did we learn?

The full write-up from the pilot can be read on arXiv.

A few high-level points to tease out:

For many scientists, this was their first experience with code review.

Some of the authors were familiar with code review as a process, but many said this was a new experience for them; for some, it was also their first chance to have a discussion about their code.

“The code was not written for others to use.”

While the scientists aimed to produce readable, reusable code, the reviewers felt the software was less reusable by others than the code they were used to working with. Sparse commenting and documentation added to this, and the reviewers identified it as a blocker for other researchers wanting to build on that scientist’s work.

Context and dialogue are key parts of the review process.

… And I don’t just mean the context of what’s written in the research paper itself. The reviewers felt their comments were shallower than they’d have liked, and recommended an ongoing dialogue with the author as the code is being written, to help iterate, debug, and better understand the context in which the code sits within a piece of work.

With that said, the authors still found the comments useful, particularly feedback on usability, ease of re-use, organization of README files, code structure, performance questions and optimization (“why is this so slow?”).

Both the scientists and the reviewers were frustrated by the “drive-by” nature of this experiment: both wanted a longer conversation, with a chance to go back and forth. This, and the fact that both sides are enthusiastic about taking part in a follow-on, are probably the most important of our findings.

For more, have a look at the full report.

The conversation didn’t stop there …

“In the business of science, all that matters is the figures. The quality of the code is just not on the critical path.” – scientist interviewed in the pilot.

Something interesting happened while this pilot was running: Nature published a feature about our work exploring code review, which sparked quite a lively discussion online. (Which, in many ways, was the point of such a pilot …)

The discussion stemmed from a dissenting comment at the bottom of the Nature piece, from a researcher at Johns Hopkins known for his work on reproducibility, saying that an experiment like this could actively “discourage” the sharing of code.

From the article:

“One worry I have is that, with reviews like this, scientists will be even more discouraged from publishing their code,” says biostatistician Roger Peng at the Johns Hopkins Bloomberg School of Public Health in Baltimore, Maryland. “We need to get more code out there, not improve how it looks.”

A number of others (Titus Brown, Nick Barnes, Carl Boettiger) and I took to Twitter to try to unpack the reasoning behind the comment – was it a fundamental misunderstanding of what we were testing? Was it a question of methods, or of review on the whole? The Nature reporter even chimed in with a snapshot of the researcher’s comments in full from her notes.

The conversation then moved to our blogs, further unpicking the comments and continuing to explore the implications of code review for science when it comes to reproducibility, openness, and collaboration. Carl and Titus both wrote up the discussion, adding their two cents (and making some very valid points), and a researcher in Peng’s lab at Johns Hopkins (note: not the person behind the quote, but a close colleague) chimed in with a post further explaining the comment in the Nature piece. You can read that post by Jeff Leek here.

Jeff’s post leads with a look into their lab’s processes when it comes to detailing code in their work, speaking to how they make all of their code available openly for review and reuse. Then midway down, he starts to explain why he thinks this would *discourage* sharing, and it’s linked to an experience he had where a peer tried to discredit their work once an error was raised (and fixed) in the work. That sort of discrediting is unfortunately common play in research, keeping many from working in an open, constructive and iterative fashion, where feedback is welcomed and not career jeopardizing.

What we realized from this dialogue is that we don’t wholly disagree with Roger’s comment: if code review is done only at the end of work, it becomes another hurdle for scientists to get over in order to publish their research.  But that’s not how it works in open source — in fact, few people in open source would willingly work that way.

Review is supposed to be continuous and participatory; people should have a chance to respond to feedback in order to improve both what they’re working on now and how they work in the future. And as Jeff’s experience shows, working in that fashion should be the norm, not something peers can use against you to discredit your work.

The next stage of our work is to explore how well this works in science, and what it takes to get scientists to adopt more constructive, iterative, collaborative practices.

Many thanks again to our Mozilla colleagues who participated in this study, the PLOS staff, Professor Marian Petre, Greg Wilson, the authors, and everyone who joined in online to discuss this issue.

13 responses

  1. Titus Brown wrote on :

    Fascinating stuff!

  2. Mark R Côté wrote on :

    Looking forward to participating again in the next phase!

  3. Selena Deckelmann wrote on :

    I enjoyed participating and look forward to future work in this area!

    This commentary struck me:

    “Then midway down, he starts to explain why he thinks this would *discourage* sharing, and it’s linked to an experience he had where a peer tried to discredit their work once an error was raised (and fixed) in the work. That sort of discrediting is unfortunately common play in research, keeping many from working in an open, constructive and iterative fashion, where feedback is welcomed and not career jeopardizing.

    What we realized from this dialogue is that we don’t wholly disagree with Roger’s comment: if code review is done only at the end of work, it becomes another hurdle for scientists to get over in order to publish their research. But that’s not how it works in open source — in fact, few people in open source would willingly work that way.”

    Regarding discrediting work — it’s also a problem in open source development, although it’s usually called something different.

    I’ve seen it addressed in a variety of ways:

    * Make reviewing a requirement of participation — the people sharing code *also* review the code of the reviewers.
    * Create review questions that everyone answers (structure helps people avoid blame)
    * Require tests (unit, or other kinds depending on the purpose of the tool) and use automation to determine test coverage
    * Require documentation updates as part of patch submission

    1. Mozilla Science Lab wrote on :

      Thanks, Selena! Great points.

    2. otakucode wrote on :

      What is absolutely maddening is that no one seems concerned that if research has flaws and can be falsified, IT MUST BE! Careers, feelings, etc. be damned! People make policy decisions based on research. Why would we ever let a flawed piece of research stand for one second? Do we WANT to look back and see ourselves as being as foolhardy as the people who opposed the germ theory of disease (and killed thousands of women and babies as a result) because it made doctors feel bad?

      1. Selena Deckelmann wrote on :

        No idea what you’re talking about! I see research falsified all the time. Google for “reproducible research” for some great information about work in this area.

  4. olgabot wrote on :

    Great stuff! I’d love to be involved if you do more of this stuff: @olgabot on twitter.

  5. Aron Ahmadia wrote on :

    This was very interesting, and thanks for posting the results.

    The only information I could find is the summary article on arxiv, and I have many burning questions! Are the comments from the reviewers available? Was all the software open source and released prior to the reviews? What software languages were evaluated? Did the reviews detect defects in the software?

    1. Mozilla Science Lab wrote on :

      @aron – we’ll have Marian on our community call today. Feel free to add those questions to the etherpad! https://etherpad.mozilla.org/sciencelab-calls-nov14

  6. Terry A Davis wrote on :

    I am literally God’s gift to programming. I have divine intellect.

    25 At that time Jesus said, “I praise you, Father, Lord of heaven and earth, because you have hidden these things from the wise and learned, and revealed them to little children. 26 Yes, Father, for this is what you were pleased to do.

    27 “All things have been committed to me by my Father. No one knows the Son except the Father, and no one knows the Father except the Son and those to whom the Son chooses to reveal him.

  7. otakucode wrote on :

    Code is no different from any other kind of experiment detail. If a scientist does not include it in full, they should be assumed to be a liar and a charlatan until they can prove otherwise. That journals publish papers without complete source code is despicable. Would they publish a paper whose experimental procedures consisted of “Do some stuff I won’t talk about…” ? Why is it acceptable for a paper to say the same about how they used computers in their so-called “research”?

  8. Wes Turner wrote on :

    Fascinating study!

    I, myself, have long wondered what sorts of practices practitioners of Open Science can learn from Open Source Collaboration.

    If Test Driven Development and Code Review are essential to producing quality software, why wouldn’t they be essential to producing reproducible science?

    What are the computational biology and standard scientific method vocabulary terms for referring to what software developers call ‘traceability’?

    In reviewing the “Reproducible Academic Publications” listed in “A gallery of interesting IPython notebooks”, I can’t help but wonder whether there is a template (e.g. for cookiecutter) which could help automate the routine parts of a reproducible scientific analysis workflow.

    Abstract, Methods, Data, Findings, Confirms/Disproves (DOI URI/URN/URL)?

    Obviously publishing entails lookup of structured ontological classifiers; but how do I say, with linked data, that “this is a PDF about […] which confirms or disproves […], with this data, collected with this level of blinding”?

    In regards to 5 Star Linked Open Data (5stardata.info), a provenance component (such as W3C PROV) may qualify for a sixth star.

    I happen to have tweeted a few relevant thoughts in Nov 2013.

    Whether IPython notebooks are as amenable to automated testing and code review as ‘regular’ code, I suppose is up for debate.

    I look forward to the development of tools which support and encourage collaborative open science practices.

    Is there a template, with tests, for science, that would make Open Science more actively traceable and reproducible?

  9. Dan Ofer wrote on :

    Hi – is there any chance that the program or some version thereof is still running?