Introducing the 2016 Mozilla Science Fellows: Bruno Vieira

Hey everyone, my name is Bruno. I’m from Portugal but I’ve lived in Switzerland (Geneva) and I’m currently based in London. I’m finishing a PhD at WurmLab.github.io (Queen Mary University of London) in Bioinformatics and Population Genomics. Specifically, I’m comparing the genetic diversity of social insect species (like ants and bees) to that of closely related solitary species. My aim is to test the hypothesis that social insect species have lower genetic diversity, since only a small number of individuals in the colony (e.g., the queen and a single male) contribute offspring. Among other reasons, this research is relevant for understanding evolution and for conservation efforts.

Ants, bees and termites are so-called eusocial species, with colonies where usually only the queen and a male reproduce. Cockroaches are solitary species closely related to termites. Photos from https://commons.wikimedia.org/wiki (files WeaverAntDefense.JPG, Honeybee_landing_on_milkthistle02.jpg, Coptotermes_formosanus_shiraki_USGov_k8204-7.jpg and Bush_Cockroach.jpg).

I learned Node.js (JavaScript) and became a full stack web developer in 2013, when I was part of the organisation of a big evolution conference, ESEB 2013 (1500 participants, 400 talks, 800 posters). I built the website and a platform for managing the conference, and I have since realised that Node.js also has many advantages for scientific code.

A: The whole ESEB 2013 team; B: The big auditorium; C: Me and another Bruno keeping the backend up and running during the conference; D: My previous boss and main organiser, Octávio S. Paulo; E and F: The 1500 participants enjoying Portuguese weather and food; G: My current supervisor, Yannick Wurm, giving a talk at ESEB 2013 right before we met for the first time.

For my research, I rely heavily on publicly available data and metadata. Thus I’m very aware of Open Data issues, data discoverability, and metadata quality. I also spend a substantial amount of time reusing and combining Open Source Bioinformatics software into complex data analysis pipelines.

Example of how complex bioinformatics pipelines can get with just a few steps and tools.

The general theme I would like to work on during this Mozilla Fellowship is Open Genomics on the Web. Since the advent of high-throughput DNA sequencing and the fast technological improvements of the last ten years, the field of Genomics has been generating so much data that it is expected to surpass astronomy or YouTube in data volume very soon.

From the paper “Big Data: Astronomical or Genomical?” (Stephens et al. 2015).

We are at a critical point! We need to make sure that all the data being generated – and the methods to analyse it – are open and transparent. Science fundamentally relies on the ability to test a hypothesis, under controlled conditions, and get a predicted result consistently. Without access to all the data, code and documented methods, reproducing an experiment might be impossible or extremely time-consuming. For example, it might take a lab months to figure out the exact conditions used in an experiment originally done by another lab (or by the same lab years before). Additionally, some private companies are trying to capitalise on closed data (by creating data silos/walled gardens), so providing open data alternatives could have a huge positive impact on healthcare.

Openness is so fundamental to Science that it’s ironic that “Open Data” and “Open Source” are concepts we need to fight for. Sadly, it seems that the default in science is now Closed Science. It’s difficult to say how this came about, but the main causes seem to be a mix of misplaced incentives from funding bodies and cultural issues in academia. But things are improving! I hope that during this fellowship we can make a significant contribution towards boosting the state of Open Science. I hope we can improve best practices in reproducibility and create better credit attribution for scientific outputs such as code.

For reproducibility’s sake, any code is better than no code. However, for reusability (i.e., taking an analysis and running it on a different dataset), we need good, maintained, tested code. Thus, communities sometimes arise to develop a common library or set of tools. During my PhD, I developed some Node.js (JavaScript) code for my research that I released under an Open Source license. This attracted contributors, and thus the Bionode.io community was born.

Bionode.io’s motto is “modular and universal bioinformatics”. Our aim is to build highly reusable code and tools that can run everywhere (e.g., a High-Performance Computing server, or a user’s browser) by leveraging Node.js and its fast-growing community. Another distinguishing feature of Bionode is its architecture based around Node.js Streams, which we believe is essential for processing genomic big data in real time and in a scalable way.
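
To give a rough idea of what a stream-based design looks like, here is a minimal sketch that uses only the Node.js core `stream` module. It is not Bionode's actual API: the `fastaParser`, `gcContent` and `analyse` functions are hypothetical names made up for this illustration, and the FASTA parsing is deliberately simplified (it assumes `>` only appears at the start of a record). The point is that each step handles one chunk or record at a time, so the whole dataset never has to fit in memory.

```javascript
const { Readable, Transform, Writable } = require('node:stream');
const { pipeline } = require('node:stream/promises');

// Hypothetical helper: turns incoming text chunks into { id, seq } records.
// Chunks may end in the middle of a record, so the unfinished tail is buffered.
function fastaParser() {
  let buffered = '';
  const toRecord = (text) => {
    const [header, ...lines] = text.split('\n');
    return { id: header.trim(), seq: lines.join('').replace(/\s/g, '') };
  };
  return new Transform({
    readableObjectMode: true,
    transform(chunk, _encoding, done) {
      buffered += chunk.toString();
      const parts = buffered.split('>');
      buffered = parts.pop(); // the last part may still be incomplete
      for (const part of parts) {
        if (part.trim()) this.push(toRecord(part));
      }
      done();
    },
    flush(done) {
      // Emit whatever is left once the input has ended.
      if (buffered.trim()) this.push(toRecord(buffered));
      done();
    },
  });
}

// Hypothetical helper: computes the GC fraction of each record as it passes by.
function gcContent() {
  return new Transform({
    objectMode: true,
    transform(record, _encoding, done) {
      const gc = (record.seq.match(/[GC]/gi) || []).length;
      const fraction = record.seq.length ? gc / record.seq.length : 0;
      done(null, { id: record.id, gc: fraction });
    },
  });
}

// Wires the steps together and collects the results (for demonstration only;
// a real pipeline would keep streaming to a file or another tool).
async function analyse(source) {
  const results = [];
  const collector = new Writable({
    objectMode: true,
    write(record, _encoding, done) {
      results.push(record);
      done();
    },
  });
  await pipeline(source, fastaParser(), gcContent(), collector);
  return results;
}

// Usage: the input arrives in arbitrary chunks, as it would from disk or the network.
const chunks = ['>seq1\nAC', 'GT\n>seq2\nGG', 'CT\n'];
analyse(Readable.from(chunks)).then((results) => console.log(results));
```

Because every step is an independent stream, the same pieces could be recombined with other steps (for example, a download step in front or a filter in between) without rewriting them, which is the kind of modularity the motto refers to.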

Personally, I gained a lot from the Bionode project. For example, I’ve been invited to give talks at places such as the Sanger Institute and to participate in a hackathon in Japan (!). But more importantly, it’s been really exciting to see the Bionode.io community grow. This year, with sponsorship from Repositive.io, we organised a hackathon at Google Campus London. We got around 40 participants, a good mix of half scientists and half JS developers – we managed to prototype five tools in only a day! We are going to do a light version of this hackathon as a three-hour workshop at MozFest 2016.

Me and the founder of Dat, Max Ogden, at the Sanger Institute before giving our talks (November 2014).

Biohackathon 2015 in Nagasaki, Japan. I was invited to give a talk about Bionode, and worked for a week on several things, including GUIX, Docker and CWL.

Bionode hackathon 2016 day at Google Campus London, sponsored by Repositive.io, where 40 participants (scientists and JavaScript developers) prototyped five new tools to fetch and process biological data.

Bionode also supported a student (Julian Mazzitelli) working full-time for three months under a Google Summer of Code fellowship from the Open Bioinformatics Foundation. He developed a workflow engine in Node.js with the aim of making it easier to write modular and scalable bioinformatics pipelines, but his tool is also applicable to data processing workflows in other fields.

Julian Mazzitelli dedicated three months in 2016 to Bionode as a Google Summer of Code student and we finally met in Toronto during the Mozilla fellowship onboarding.

Bionode.io has gone further than I expected for a part-time project, and it has received help from some occasional contributors. With the bandwidth this fellowship provides, I now intend to dedicate myself to it full-time and give it the push it needs to take off.

My goals are to make Bionode.io:

More self-sustainable
- Organise more hackathons to attract contributors, especially those with a more technical Node/JS background
- Improve documentation and online tutorials to attract more users
- Apply for grants to hire a few full-time developers to improve and maintain the project
More useful
- Reach out to specific labs/institutes to learn what they need, how this project can help them, and what features are missing
- Keep improving and working on existing code
More collaborative
- Use our existing connections to other projects like Dat (data versioning and distribution), BioJS (biological visualisation), CWL (common workflow language) to see if together we can build something amazing for Open Genomics
- Use the Mozilla fellowship to make new collaborations

Overall, I hope the Bionode.io community can become big and relevant enough to provide a viable solution and open standard for genomics (and maybe even other big data analysis fields).

Bruno | bmpvieira.com

Acknowledgements:

I would like to thank Adrian Lärkeryd, Aurelia Moser, Emeline Favreau, James Stefan Borrell and Rodrigo Pracana for reviewing and improving the text.