Drinking from the Firehose: Jumping into Open Data

It’s now been four weeks since I left academia to join the Mozilla Science Lab team as the inaugural Open Data Training Lead.  Being the first person in a newly created position isn’t new to me.  I held the first Data Services Coordinator position at the University of Washington Libraries and built the Research Data Services unit there.  Working at a non-profit foundation isn’t new to me either.  I spent a couple of years at the Bill & Melinda Gates Foundation traveling the country as a computing trainer in libraries.  That being said, this still feels like a whole new world to me… a fun, engaging and highly collaborative world.

Mozilla Science Lab staff and fellows Escape the Room!

There has been so much going on this month!  We’ve held a Mozilla Open Science Leadership Summit in Toronto, participated in the Open Source & Feelings conference in Seattle, brought on new Mozilla Fellows for Science, and mingled with Data & Society folks in Brooklyn.  The team is also getting things ready for MozFest in London in a few weeks.  Busy, busy, busy!  Even with all that, I’ve had some time to reflect on these experiences and to think through some plans for leading our open data training going forward.

Before I begin, I want to take what I learned about jargon-busting at the Leadership Summit mentioned above to define “open data”.  (You can read about and play along with the jargon-busting exercise here.) The idea behind this exercise is to only use the 1,000 most popular words in the English language to communicate what you’re trying to say.  After some trial and error, here’s what I came up with: open data = groups of facts and figures shared in a way that allows other people to use them to make new understanding of how the world works.  If you have a different definition you like, try it out here and let me know: http://splasho.com/upgoer5/.
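
For anyone who would rather test a definition offline, here is a minimal sketch of the same idea in Python.  It is not the code behind the site linked above.  The word list is a tiny, made-up sample standing in for the real list of about 1,000 words, and the function name `find_jargon` is my own invention for illustration.

```python
import re

# ASSUMPTION: a tiny hand-picked sample standing in for the real list of
# roughly 1,000 common words, which is not reproduced here.
ALLOWED_WORDS = {
    "groups", "of", "facts", "and", "shared", "in", "a", "way",
    "that", "other", "people", "to", "use", "them", "make", "new",
}


def find_jargon(text, allowed=ALLOWED_WORDS):
    """Return words in `text` that are not in `allowed`, in order, without repeats."""
    # Lowercase the text and keep only runs of the letters a-z.
    words = re.findall(r"[a-z]+", text.lower())
    flagged = []
    for word in words:
        if word not in allowed and word not in flagged:
            flagged.append(word)
    return flagged


print(find_jargon("Groups of facts and metadata shared in a repository"))
```

With the sample list above, the words flagged are the ones missing from the sample, not necessarily the ones the real list would reject.  Swapping in a full word list is the only change needed to make the check meaningful.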

Value of open data

So far, most of my professional experience with data has been in the world of academia and research data.  These last two weeks have opened my eyes to how relevant open data is beyond that narrow focus: from the elementary student who wants to get her “hands dirty” with a science project to the octogenarian looking for health resources in his neighborhood.  I was excited to stumble upon initiatives such as Smart Chicago and one of its projects, the Chicago Early Learning Portal, which provides open data that lets residents find and compare early learning programs in their neighborhoods.

Open data encourages transparency, feedback and collaboration, and it advances knowledge in ways that aren’t possible when data is kept on a hard drive in a research lab for a small group of people.  To quote an old African proverb (which also happens to be the frequently stated motto of an open source repository framework): “If you want to go fast, go alone.  If you want to go far, go together.”  I could go on for ages about this but it really deserves its own post, so I’ll move on.

How to make it open

Open data is more than just placing a final research product on a personal website.  It requires context and sustainable formats to make it usable.  Think of a book on a shelf in a library.  If there weren’t information in a catalog (physical or electronic), how would you know what the book was about or where to find it?  What if it were an e-book?  If that book were saved on a Zip disk, would you be able to open it and read it?

Open data follows the same rules.  We need to work with and train researchers and other data producers to identify what information about a dataset (metadata) needs to be collected and maintained with the dataset to make it findable and reusable.  (The Digital Curation Centre in the UK has some great resources for that, by the way.)  There are also places where data can be stored where it is much more likely to be found and expertly maintained than on a personal website.  Check out re3data.org for a catalog of data repositories.  For an example of how to do great open research, check out the Roberts Lab at the University of Washington, particularly their Data and Resource Sharing Plan.
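
To make the idea of metadata a little more concrete, here is a minimal sketch in Python of a check that a dataset description is complete before it is shared.  The field names are my own illustrative assumptions, not the requirements of any particular repository or standard, and the example record is invented.

```python
# ASSUMPTION: a hypothetical minimal set of descriptive fields.  Real
# repositories and metadata standards define their own required elements.
REQUIRED_FIELDS = (
    "title", "creator", "description",
    "date_collected", "file_format", "license",
)


def missing_metadata(record):
    """Return the required fields that are absent or blank in `record`."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        # Treat a missing key, None, or whitespace-only text as missing.
        if value is None or not str(value).strip():
            missing.append(field)
    return missing


# Invented example record, for illustration only.
example_record = {
    "title": "Example dataset title",
    "creator": "Example Researcher",
    "description": "",
    "file_format": "CSV",
}

print(missing_metadata(example_record))
```

A check like this only confirms that the fields are filled in.  Whether the description actually gives someone else enough context to reuse the data still takes a human reader.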

A focus on re-use

The benefits of open data are not realized unless that data is used and reused by others.  Undersea data collected in the 1960s assisted today’s researchers in understanding earthquakes in the Pacific Northwest.  New York City data made freely available during and after Hurricane Sandy allowed for the creation of tools to help the public find shelters, gas and food.  Students in the Data Science for the Social Good program at the University of Chicago combined several datasets from Mexico to develop a plan to reduce maternal mortality there.

Work like this should be encouraged and rewarded.  It should be taught as part of the research process early in a researcher’s career so that it is seen as the norm.  Research already indicates that making data open and citable has a long-lasting effect on citation and reuse, long after a paper is published.  We need to encourage funders to reward the reuse of data in grant proposals and work with institutions to acknowledge data sharing and reuse in the tenure and promotion process.

Open Knowledge Foundation founder Rufus Pollock once said, “The best thing to do with your data will be thought of by someone else.”  This isn’t a statement to be afraid of; it should be embraced.  By sharing your data and using data from others, you too can be that “someone else”.

Jump into the water with me

Drinking from the firehose

This program will be more successful with contributions from the community it is meant to serve.  Feedback is both welcome and wanted.  Are there other areas that should be a priority as we develop plans for the open data training program?  More programs and projects you think I should be investigating?  Tools and resources of which we should be aware?  I want to hear it.  You can leave a comment here, find me on Twitter @shefw, or drop me an email.  I will also be starting a repository on GitHub for development of our Open Data Training plan and sharing our work there as it evolves.  I’ll get that link out when it happens… stay tuned!