Trillian: making open astronomy data more useful | #mozsprint 2016

Demitri Muna is an astronomer working on the Sloan Digital Sky Survey (SDSS). A longtime friend of Mozilla Science, Demitri has been a strong voice for open collaboration at several of our sprints and workshops. It’s no surprise that Demitri started Trillian, an astronomy computational engine to bridge the gap between astrophysical models and the vast amount of publicly available astronomical data. Demitri also created the SciCoder astroinformatics workshop and develops data visualization software for astronomy data.

I interviewed Demitri to learn more about Trillian and how you can help June 2-3 at #mozsprint.


What is Trillian?

Trillian is… hard to describe in one sentence! Our field is lucky in that we have a tremendous volume of astronomical data, but the tools for individual astronomers to analyze large amounts of it have not kept up. Currently it’s a very manual process: going to numerous, independent archives, performing web searches (if you’re lucky!), downloading files, managing them locally, determining which are good and which aren’t on a telescope-by-telescope basis, then finally running your own analysis code on them. Trillian will flip this around: it lets you upload your analysis code to a server where the data is already collected, curated, and organized, and run it there. Because of the sheer volume of useful data (which is growing quickly!), it is inevitable that we will need to analyze data this way. The genomics community has already demonstrated that this is a successful model.

Why did you start Trillian?

Trillian is really the culmination of numerous limitations I’ve encountered accessing astronomical data over the years. One example from early in my career was when I wanted to find an example of a class F star and I expected to be able to go to a database that had a list of such stars (or any other). Turned out there wasn’t one. Another time I was writing a proposal to get observations of another kind of rare star (thermally-pulsating AGB stars), so I needed a list of stars that were at least likely candidates. To make this determination I needed to combine existing observations from many different wavelengths, and I needed to search a third of the sky. No tools exist to do this, and I didn’t have the resources (or time!) to download the hundreds of terabytes from multiple archives that would be needed to do it right.

There was a funding call a couple of years ago for data science projects. It had the benefit for me of concentrating all of these ideas and others that have been in my mind for many years and focussing them into a specific project. I didn’t get the funding, but I realized I didn’t need it to get the project started, so Trillian was born.

How will this help astronomers be more efficient?

One of the big problems with current tools and methods is that, since so much manual work must be performed by astronomers, we necessarily leave data on the table. An individual only has so much time to search for and manage data from the numerous archives available. It’s one thing to lack the observations we need to make our analyses or models better, but it’s another to know that they might be out on the internet somewhere but are not being used. The other major piece is that anyone will be able to run analyses over arbitrarily large parts of the sky across many wavelengths without having to download a single file, something that is simply impossible today.

What needs to be done to complete your proof-of-concept framework?

There are a few major pieces that will need to be created and connected. An important part of the design is that we are organizing data by location on the sky, not wavelength. This will require (Python) modules that represent each data set along with the ability to download any specified region of the sky (along with the bookkeeping that that entails). We are writing an API that will provide access to data from the user’s custom analysis code. That code will be run inside a Docker container which will also need to be designed. More is planned, such as a web-based user interface and data visualization in general, but that will be enough to produce science results.
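To make the "organize by location on the sky, not wavelength" idea concrete, here is a minimal Python sketch. It is not Trillian's actual code or API: the names `SkyRegion`, `DataSet`, `fetch_region`, and `query_all` are hypothetical, and in-memory tuples stand in for real archive files. It only illustrates how one module per data set could answer the same positional query, so that analysis code asks for a patch of sky and gets back whatever each survey holds there.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SkyRegion:
    """A circular patch of sky (hypothetical type); all angles in degrees."""
    ra: float
    dec: float
    radius: float


def angular_separation(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees, using the haversine formula."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot near 180 degrees.
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))


class DataSet:
    """Hypothetical stand-in for a module that represents one survey."""

    def __init__(self, name, records):
        # Each record is (ra, dec, payload); a real module would track files.
        self.name = name
        self._records = list(records)

    def fetch_region(self, region):
        """Return the records whose position falls inside the region."""
        return [
            rec for rec in self._records
            if angular_separation(region.ra, region.dec, rec[0], rec[1])
            <= region.radius
        ]


def query_all(datasets, region):
    """Ask every data set for the same patch of sky, keyed by data set name."""
    return {ds.name: ds.fetch_region(region) for ds in datasets}
```

With this shape, a call such as `query_all([optical, infrared], SkyRegion(10.0, 0.0, 0.5))` returns a dictionary with one entry per data set, which is the positional, wavelength-agnostic access pattern described above. The download bookkeeping, the user-facing API, and the Docker container are all out of scope for this sketch.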

How are you communicating astronomy to the general public?

I’m involved in a few outreach efforts that are quite successful (Astronomy on Tap and the Astrotweeps Twitter account), but Trillian is in too early a stage to be applicable for presenting to the general public. I can imagine Trillian being a fascinating tool just for random exploration eventually; specifically being able to select any random object in the sky and explore what models have been applied to it, but that’s a little ways off right now.

What problems have you run into while working on Trillian?

Building a significant open source project from scratch is hard! (That could be a t-shirt.) Designing a project like this requires both domain knowledge (astronomy!) and technical software development skills. This necessarily limits the number of people who can immediately jump in. I know several such people who believe in the project, but their time is of course taken by their jobs and research. The hardest thing is getting a foundation built and transitioning the project into one where the work can be broken down into pieces that more people can take on. This in part requires writing a tremendous amount of design documentation up front – necessary and important, but counter to my inclination to want to jump in and write code!

What kind of skills do I need to help you build Trillian?

Python experience is probably the most valuable one as most of the code will be written in that language. However, knowledge of creating Docker containers, populating databases (SQLAlchemy), automating large downloads, and application programming interface (API) design will all be needed. Eventually we will also want a web interface, so web frameworks (Flask) and JavaScript for data visualization will be needed.

What are you hoping to do at the Mozilla Science Global Sprint, June 2-3? Can others help you here?

We received a $10K grant from Ohio State University’s Center for Cosmology and Astroparticle Physics (CCAPP) to purchase a server for Trillian. However, it’s somewhat unfeasible to create an account on that machine for anyone who might want to contribute to Trillian. Trillian is more than a software library; it depends on a database and at least some hosted data. For the Global Sprint, I’d like to solve this problem by designing a Docker container that would contain a “lite” version of Trillian – the database and a small but representative data sample.

 

---

 

Come join us wherever you are June 2-3 at the Mozilla Science Global Sprint to work on Trillian and many other projects! Have your own project? Submissions are open for new project ideas.