Complexity vs Quality: The Bumpy Relation of Scientific Software
Scientific software is used daily in the physical, environmental, earth and life sciences to make important discoveries. Because it is highly specialized, scientific software is frequently developed by scientists with deep domain knowledge, but not necessarily deep knowledge of the technologies and tools used by the software engineers and developers who build more mainstream applications. As a result, scientific software tends to be highly customized, inflexible, complex, poorly tested, sparsely documented and poorly maintained in the long run.
Reproducible Computational Research
Many issues plaguing scientific software have been discussed in the literature, but the ability to reproduce computational discoveries has taken center stage in recent years. The term reproducible computational research has been coined and is used as an umbrella concept for identifying issues that affect the reproducibility of computational scientific research and for proposing solutions to them.
Some Proposed Solutions
Although the challenge of reproducible computational research is multi-dimensional, some of the proposed solutions are rooted in existing, well-established and robust software engineering practices such as:
- Source code management (SCM)
- Computational Workflow Engines
- Scalable and distributed compute platforms
- Compute and storage hardware virtualization
- Centralized repositories of digital collections of scientific data
In addition, the organized and homogeneous tagging of scientific data with metadata (data about data) has long been a well-established foundation for information retrieval and discovery. The development of consistent metadata and controlled vocabularies is another important component of searching, finding and using scientific data in a manner consistent with reproducible research.
Finally, and to some degree obviously, reproducible computational research depends on the ability of other scientists or research experts to freely access the source code and scientific data used in generating new computational discoveries. These free and open access concepts have been championed by many in the software development community under the umbrella of open source. Open-source code is meant to be a collaborative effort, in which programmers improve the source code and share the changes within the community.
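As a rough illustration of why a controlled vocabulary helps with retrieval, the sketch below tags a few datasets with metadata, checks the tags against a small vocabulary, and then searches by metadata. The field names (`organism`, `data_type`), the allowed values, and the dataset names are all hypothetical and chosen only for this example; they do not come from any particular metadata standard.

```python
from dataclasses import dataclass, field

# Hypothetical controlled vocabulary: the allowed values for each metadata field.
CONTROLLED_VOCABULARY = {
    "organism": {"Homo sapiens", "Mus musculus", "Danio rerio"},
    "data_type": {"genotype", "expression", "sequence"},
}


@dataclass
class Dataset:
    """A scientific dataset plus its descriptive metadata (data about data)."""
    name: str
    metadata: dict = field(default_factory=dict)


def validate(dataset):
    """Return a list of problems; an empty list means the metadata conforms."""
    problems = []
    for key, allowed in CONTROLLED_VOCABULARY.items():
        value = dataset.metadata.get(key)
        if value is None:
            problems.append(f"{dataset.name}: missing field '{key}'")
        elif value not in allowed:
            problems.append(f"{dataset.name}: '{value}' is not a valid {key}")
    return problems


def search(datasets, **criteria):
    """Return the datasets whose metadata matches every given criterion."""
    return [
        d for d in datasets
        if all(d.metadata.get(k) == v for k, v in criteria.items())
    ]


# Made-up catalog entries; the third uses a free-text tag instead of the vocabulary.
catalog = [
    Dataset("run-001", {"organism": "Homo sapiens", "data_type": "genotype"}),
    Dataset("run-002", {"organism": "Mus musculus", "data_type": "expression"}),
    Dataset("run-003", {"organism": "human", "data_type": "genotype"}),
]

invalid = [problem for d in catalog for problem in validate(d)]
human_genotypes = search(catalog, organism="Homo sapiens", data_type="genotype")

print(invalid)
print([d.name for d in human_genotypes])
```

The free-text tag on `run-003` is flagged by `validate`, and it is also missed by the search for `Homo sapiens`, which is the kind of silent retrieval failure that consistent vocabularies are meant to prevent.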
The BioUno open-source project seeks to improve scientific application automation, performance, reproducibility, usability, and management by applying and extending software engineering (SE) best practices in the field of scientific research applications. Deliverables from the project have found a variety of applications in life-science research (bioinformatics, genetics, drug discovery).
- We explore and apply software engineering best practices to support the project mission
- We develop extensions to established SE tools, frameworks and technologies that directly support or indirectly enhance scientific applications
- We develop APIs and integration points that empower scientific applications
- We promote collaboration and reuse by contributing to existing open source projects
- We educate users through blog, wiki, and presentations on the application of SE best practices in scientific applications
- We advocate with software engineers for enabling SE tools and frameworks for use by scientists
BioUno has pioneered the use of continuous integration tools and techniques to create reproducible computational pipelines and to manage computer clusters in support of scientific research applications.
In addition, BioUno has adopted a variety of software engineering best practices to achieve its objectives:
- Revision Control (e.g. Git, Subversion, branching strategies)
- Continuous Integration (e.g. Jenkins-CI, SonarQube, code metrics, reproducible builds)
- Software Testing (e.g. Nestor-QA, TestLink, TDD, code coverage)
- Virtualization (e.g. Docker, Vagrant, VirtualBox)
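One way to picture how these practices support reproducibility is a run manifest: a record of the exact inputs, parameters and tool versions behind a result, reduced to a single fingerprint. The following is a minimal sketch under that assumption, using only the Python standard library. It is not how BioUno or Jenkins record builds, and the function names, file names and version strings are hypothetical.

```python
import hashlib
import json


def sha256_of(data: bytes) -> str:
    """Hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def build_manifest(inputs, parameters, tool_versions):
    """Record what identifies a run: input digests, parameters, tool versions."""
    return {
        "inputs": {name: sha256_of(content) for name, content in inputs.items()},
        "parameters": parameters,
        "tool_versions": tool_versions,
    }


def manifest_id(manifest):
    """Fingerprint a manifest; sort_keys makes the JSON form order-independent."""
    canonical = json.dumps(manifest, sort_keys=True).encode("utf-8")
    return sha256_of(canonical)


# Hypothetical run: in practice the bytes would be read from files in revision control.
inputs = {"samples.csv": b"id,allele\n1,A\n2,G\n"}
tools = {"python": "3.11", "analysis-script": "commit-abc123"}

first_run = manifest_id(build_manifest(inputs, {"iterations": 1000}, tools))
repeat_run = manifest_id(build_manifest(inputs, {"iterations": 1000}, tools))
changed_run = manifest_id(build_manifest(inputs, {"iterations": 2000}, tools))

print(first_run == repeat_run)   # same inputs, parameters and tools
print(first_run == changed_run)  # one parameter differs
```

Two runs with the same fingerprint were configured identically, so a difference in their results points to something the manifest does not capture, such as the operating environment, which is where virtualization comes in.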
Finally, BioUno strives to minimize the open-source proliferation problem. Although the project covers a broad range of technologies and tools, it prefers to contribute actively to existing open-source projects rather than start or release new ones.
BioUno Objectives for Mozilla Science 2015 Global Sprint
The BioUno project is participating in the 2015 Mozilla Science Global Sprint (MSGS 2015) with three main objectives.
- Expose the MSGS participants to the BioUno strategy of using Jenkins, a popular continuous integration system, for managing and building reproducible scientific workflows
- Engage the MSGS participants in hands-on review and enhancement of the BioUno toolkit (Jenkins plugins and API) and gather new ideas for extending it to research applications. In the process, participants will gain valuable experience in creating, maintaining and debugging Jenkins plugins for research applications.
- Create a lasting collaboration with MSGS participants and projects so that the BioUno project can continue to deliver on its mission statement with an expanded pool of active contributors and users.
Check out our etherpad for our ideas, open issues and more information about the sprint. You can help with suggestions, documentation, coding or testing, so there is a way to contribute even if you are not a programmer.
 [Scientific Reproducibility through Computational Workflows and Shared Provenance Representations](http://www.evernote.com/l/AJ8x2KJTSTlGmbrFDKXSR709G2wRjbN32Tk/) (Yolanda Gil, NSF Workshop: 2010)
 [The real Open-Source proliferation problem](http://gondwanaland.com/mlog/2013/10/22/open-source-proliferation-problem/) (Mike Linksvayer, Blog 2013)