
User Data & You: Privacy for Programmers

This was originally posted as a guest post on January 31st, 2014. Since then, it has been requested that I post under my own name.

Introduction

I am a Firefox Engineer at Mozilla. I have worked on Desktop Firefox, Firefox for Android, and Firefox Sync. I currently work on Firefox for Windows 8 Touch (née Firefox Metro). I also serve on Mozilla’s Privacy Council.

On Data Privacy Day, I presented a perspective on what we can do differently. My primary audience is fellow engineers or those engaged in engineering-related activities. When I say ‘we’ I am largely referring to engineering as a group. The remainder of this post is a written expansion on the presentation. The Air Mozilla recording is available here.

Goal

My goal is to start a public discussion about what engineers need to know about user privacy. Eventually the result of this discussion will evolve into a short training or set of best practices so engineers can ship better code with less hassle. Since this is the start of a public discussion, the content below will probably raise more questions than answers.

There be scaly dragons. Ye have been warned.

Privacy? That Word is so Overused. What is it Exactly & Why do I Care?

Privacy is a culturally laden word and definitions vary widely. Privacy means different things to different people at different times, even within the nascent field of privacy engineering. So for sanity’s sake, the following are my table stakes definitions.

Privacy: How & by whom the personal information of an individual is managed with respect to other individuals.

User Data: Any data related to users that they generate or cause to be generated; or that we generate, collect, store, have custody and control over, transfer, process, or hold interest in.

Why do we care? The reason Mozilla exists is to defend and promote the open web. Firefox & FirefoxOS are great products but they are not the raison d’être outlined in the Manifesto; they are means to an end. The Mozilla Manifesto declares that for a healthy web, users must be able to shape their own experiences on it. Ain’t nothing shapes your experience online more than the data generated for, by, and about you. Whoever controls that data controls your experience on the web. So our goal of the open web is directly linked to individuals’ ability to control that data for themselves.

Acknowledging the Elephant in the Room

Let’s start by acknowledging the elephant in the room: whether or not Mozilla products should even handle user data. That would be a rich discussion on its own. This is not that discussion. This discussion assumes we’re going to handle user data. Regardless of your views, let’s agree that there are some things we will need to do differently when we choose to handle user data. Let’s figure out what those are.

Ok, So We Care; There’s Another Team at Mozilla for That.

There is a misconception I run into often that I’d like to clear up. Data safety & user privacy is everyone’s job, but especially an engineer’s job. At the end of the day, engineers make the sausage. No one has more leverage over what gets written than the engineer implementing it. The privacy team is here to help, but there are three of them and hundreds of us. The duty is really on us, not them. Whether our code is fast, correct, elegant, secure, and meets Mozilla’s standards is chiefly our responsibility.

Ok, So it’s Kinda My Job. What do I Need to Think About or do Now?

I have good news & bad news. The good news is that it boils down to writing more stuff down & making more decisions upfront. Stated more formally:

  • More active transparency (writing more stuff down)
  • More proactive planning (making more decisions up front)

Sounds simple, eh? Seasoned engineers should feel their spider sense tingling. It’s not miscalibrated. That’s the bad news. It’s how you do it that matters. The devil is in the details. So let’s tackle the easier one first: what I flippantly referred to as ‘writing more stuff down’.

Active Transparency (aka write more stuff down)

Passive Transparency: unintentional, decisions aren’t actively hidden, but are difficult to locate. May not even be documented

Active Transparency: intentional, everything is written down, easily searchable & locatable by interested parties

If you haven’t heard these terms before, don’t panic. I made them up years ago when I was a volunteer contributor trying to articulate how I was part of an open source project, actively following Bugzilla, but couldn’t figure out what was going on in the /storage module, let alone the rest of the Firefox code base.

Active transparency is functioning transparency. It requires sustained effort. Information, history, and decisions of a feature can be searched for, located, and consumed by all inclined.

Passive transparency is what happens unintentionally. People aren’t trying to hide information from each other. It just happens and no one notices until it is too late to do anything about it.

We often don’t notice because those who code marinate in information. We rarely bother to test whether or not anyone else outside can figure out what we’re living and breathing life into.
Break that habit. You test your code to prove it works; so test your transparency to prove it works (or doesn’t). Ask someone in marketing or engagement to figure out the state of your project or why your design is in its current state. Can they explain your tradeoffs, constraints, or design decisions back to you? Can they even find them?

I hear grumbling already: ‘Sounds like useless paperwork, not worth it’. What you really mean is ‘not worth it to you right now’, but it’s worth much to the people who will be responsible for it after you ship it, and there will be many of them.

One of the ways user data based features differ dramatically from application features of yore is that control will change hands many times over. Future development, operations, database administration, etc. teams cannot read your mind. They also can’t go back in time to read your mind.

As an added bonus, privacy is not the only reason to be actively transparent. Active transparency is vital to building our community. Like open source software, it’s not really open if no one can find it and participate. Active transparency applies to the decision making process as much as to source code.

Proactive Planning (aka Making More Decisions With More People)

Now we move on to the harder part – more decisions you’ll need to make with more people. Getting agreement on requirements is often one of the most difficult and least pleasant parts of an engineer’s craft. When handling user data, it will get harder. Your stakeholders will increase because the number of people who handle the data your feature generates or uses over its lifetime has increased.

The reason for enduring that pain at the beginning is that effective privacy is something you’ll only get one shot at. It’s usually impossible or cost-prohibitive to bolt it on to stuff after it’s built.

Proactive planning decisions will make up the bulk of the rest of the post. They are phrased in question form because the answers will be different for each project. They should not be interpreted as an all-inclusive list. The call to action for you is to answer them (and write the reasons down in a searchable location. Ahem – active transparency!)

30,000 Foot Views

The problem space can be vast. Below are two high level categorizations to jumpstart your problem solving, so that your feature can concretely bring to life the Manifesto’s declaration that users must be able to shape their own experience.

First Way to Slice It

An intuitive place to start is interaction.

Interactions between events and their data, or the data lifecycle

  • Birth
  • Life
  • Death
  • Zombie (braaaains)

Interactions between us and their data

  • How sensitive is this data?
  • Who should have access to it?
  • Who will be responsible for the safety of that data?
  • Who will make decisions about it when unexpected concerns come up?

Interactions between users and their data

  • How will a user see the data?
  • How will a user control it?
  • How will a user export it?

Second Way to Slice It

Another way to group key decisions is by basics plus high level concerns, such as:

  • Benefits & Risks
  • Openness, Transparency, & Accountability
  • Contributors & Third Parties
  • Identities & Identifiers
  • Data Life Cycles & Service History
  • User Control
  • Compatibility & Portability

Things to Think About – Basics

To start off, most of these seem pretty obvious. However there can be gotchas. For example, how identifying a type of data is can be tricky. What is seemingly harmless now could later be shown to be strongly identifying. Let’s consider the locale of your Firefox installation. If you are in en-us (the American English version), locale is not very identifying. Seems obvious. However, for small niche locales, it can be linked to a person.

  • Does your product/feature generate user data?
      • Metadata still counts
  • Does your product/feature store user data?
  • What kind of data & how identifying is it?
  • Are there legal considerations to this feature?
  • How do you authenticate users before they can access their data?
  • Which person or position is responsible for the feature while it remains active?
  • Who makes decisions after the product ships?
      • Figure this out. Now.
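
To make the locale gotcha concrete, here is a minimal sketch of one way to spot values that are rare enough to point at a person. The function name, the threshold of 5, and the sample counts are all assumptions made up for illustration; none of them is a Mozilla policy or real usage data.

```python
from collections import Counter


def rare_values(values, k=5):
    """Return the values shared by fewer than k records.

    The threshold k is an illustrative assumption, not a Mozilla policy.
    """
    counts = Counter(values)
    return {value: n for value, n in counts.items() if n < k}


# Hypothetical sample: one locale string per installation.
locales = ["en-US"] * 1000 + ["de"] * 40 + ["gd"] * 2

# Only the locale with two installations falls under the threshold.
print(rare_values(locales))
```

A check like this only looks at one field at a time. Fields that are harmless alone can still be identifying in combination, so treat it as a starting point for the discussion rather than proof that data is safe.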

Things to Think About – Benefits and Risks

There will always be risk in doing anything. There exists a risk that when I leave my house an anvil will drop on me. That doesn’t mean I never leave my house. I leave my house because the benefits (like acquiring dinner) outweigh the risk. Similarly, there will always be risk when handling user data. That doesn’t mean we shouldn’t handle it, but there had better be benefit to the user. ‘Well, it might be useful later’ is probably not going to cut the mustard at Mozilla as a benefit to users.

  • What is the benefit to users from us storing this data?
  • What are the current alternatives available on the market?
  • What is the risk to users from storing this data?
  • What is the risk and cost to Mozilla from storing this data?
  • Where are you going to store this user data? Whose servers? (If not ours, apply above questions as well)

Things to Think About – Openness, Transparency and Accountability

For a Mozilla audience, this is preaching to the choir.

  • Have the benefits & risks of this feature been discussed on a public forum like a mailing list?
  • Should we exempt detailed discussion of handling really sensitive data?
  • Where is the documentation for our tradeoffs and design decisions, with respect to user data? (*cough* Active transparency!)

Things to Think About – Contributors and Third Parties

The use of third party vendors adds additional nuances, as I alluded to earlier.

  • Are any third party companies or entities involved in this? (ex: Amazon AWS)
  • Do we have a legal agreement governing what they can and can’t do with it?
  • Who makes decisions about access to the data?

At Mozilla, we sometimes release data sets so researchers can contribute knowledge about the open web for the public good.

  • Could researchers access it directly?
  • Do we have plans to release the dataset to researchers?
  • What would we do to de-identify the data before release?
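
As a conversation starter for that last question, here is a minimal sketch of a first de-identification pass: drop the fields that directly name a person and coarsen timestamps to a calendar date. The field names and the record are hypothetical, and this is an assumption about one possible approach, not a description of how Mozilla prepares data sets.

```python
from datetime import datetime

# Hypothetical field names; a real schema will differ.
DIRECT_IDENTIFIERS = {"email", "username", "ip_address"}


def deidentify(record):
    """Return a copy of record with direct identifiers dropped and the
    timestamp coarsened to a calendar date. The input is not modified."""
    cleaned = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    if "timestamp" in cleaned:
        cleaned["date"] = cleaned.pop("timestamp").date().isoformat()
    return cleaned


record = {
    "email": "user@example.com",
    "ip_address": "203.0.113.7",
    "locale": "en-US",
    "timestamp": datetime(2014, 1, 28, 13, 45, 10),
}
print(deidentify(record))
```

Dropping obvious identifiers is necessary but rarely sufficient. What remains can still be linked back to a person, which is exactly why this question belongs in planning and not in the week before release.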

Things to Think About – Identity and Identifiers

There’s probably nothing more personal than someone’s identity.

  • Will this feature have a user identification?
  • Who owns the login/username/identifier?
  • Is it possible to use this feature without supplying an identifier?
  • How will the user manage this identification?
  • Can they delete it?
  • Who can see this identifier?
  • Can the user control who can see their identifier?
  • Can this identifier be linked to the real life identity?
  • Can a single person have multiple identifiers/accounts?

Things to Think About – Data Lifecycles and Service History

This is an area that most application developers will have trouble with because we often don’t think about the mid-life or death of our feature or the data it uses. It ships, it’s out! Onto the next thing!

Not so fast.

  • Which person or position is responsible for the data/feature while it remains active?
  • Who makes decisions after the product ships?
  • Can a user see a record of their activities?
  • What happens to an inactive account and its associated data?
  • When is a user deemed inactive?
  • How will you dispose of user data?
  • What’s the security of the data in storage?
  • How long would we retain the data?
  • Who has access to the data at various stages?
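
Writing the answers down as code forces the decisions. Below is a minimal sketch of a retention rule that classifies an account by how long it has been idle. The 180 and 365 day thresholds and the state names are placeholders I invented; the real values are a product and policy decision.

```python
from datetime import datetime, timedelta

# Illustrative thresholds only; the real values are a product and policy decision.
INACTIVE_AFTER = timedelta(days=180)
DELETE_AFTER = timedelta(days=365)


def account_state(last_active, now):
    """Classify an account as 'active', 'inactive', or 'expired'."""
    idle = now - last_active
    if idle >= DELETE_AFTER:
        return "expired"
    if idle >= INACTIVE_AFTER:
        return "inactive"
    return "active"


now = datetime(2014, 1, 28)
for last_active in (datetime(2014, 1, 1), datetime(2013, 6, 1), datetime(2012, 12, 1)):
    print(last_active.date(), account_state(last_active, now))
```

The point is less the arithmetic than the fact that someone had to pick the numbers and put them where the next team can find them.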

Things to Think About – User Control

To shape their own experiences on the web, users need to have control of their data.

  • How can a user see their data?
  • Can users delete data in this feature?
  • What exactly would deletion mean?
      • How will it happen?
      • What will it include?
      • What about already released anonymized data sets?
      • What about server logs?
      • What about old backups?
  • Is there a case where the user identifier can be deleted, but not necessarily the associated data?
  • Is any of the data created by the user public?
  • What are the default user control settings for this feature?
  • How could a user change them?
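
The deletion questions get easier to answer once you list every place the data can live. Here is a minimal sketch of a deletion request that tracks each location separately, so that 'deleted' cannot quietly mean 'deleted from the main database only'. The location names and the class are hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical list of places a feature's user data can end up.
DATA_LOCATIONS = ("primary_store", "server_logs", "backups", "released_datasets")


@dataclass
class DeletionRequest:
    user_id: str
    # Per-location status: "pending" until someone confirms removal.
    status: dict = field(
        default_factory=lambda: {loc: "pending" for loc in DATA_LOCATIONS}
    )

    def mark_done(self, location):
        if location not in self.status:
            raise KeyError(f"unknown data location: {location}")
        self.status[location] = "done"

    def outstanding(self):
        """Locations where the user's data may still exist."""
        return [loc for loc, state in self.status.items() if state != "done"]


request = DeletionRequest(user_id="example-user")
request.mark_done("primary_store")
print(request.outstanding())
```

Some locations, such as a data set that has already been released, may never reach "done". Better to know that before you promise users a delete button.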

Things to Think About – Compatibility and Portability

In my not-so-humble opinion, it’s not an open web if user data is held for ransom or locked into proprietary formats.

  • Can the user export their data from this service?
  • What format would it be in?
  • Is it possible to use an open format for storage?
  • If not, should we start an effort to make one?
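
For a sense of scale, a first pass at export does not have to be elaborate. Here is a minimal sketch that serializes one user's records to JSON, which is an open and widely supported format. The payload layout, the `format_version` field, and the sample record are assumptions for illustration only.

```python
import json


def export_user_data(user_id, records):
    """Serialize one user's records to a JSON string."""
    payload = {
        "format_version": 1,  # hypothetical versioning field
        "user_id": user_id,
        "records": records,
    }
    return json.dumps(payload, indent=2, sort_keys=True)


exported = export_user_data(
    "example-user",
    [{"type": "bookmark", "url": "https://example.com/"}],
)
print(exported)
```

Whether JSON is the right choice depends on the data. The question to settle up front is whether someone could read the export without our code.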

That’s a Lot of Extra Work, No. Not OK. Not Cool.

Yes, it is.

Handling user data is going to increase your workload. So does writing test coverage. We do it anyway. We write tests to meet our standards for correctness; we must write code that meets our standards for privacy.

I didn’t say it would be easy, but it’s doable. We can do it better and show that the web the world needs can exist.

The Privacy Team is Here to Help. Talk to Them Early and Often

That’s a metric ton of questions to ponder. I don’t expect you to remember them all. The privacy team is working on a new kickoff form and a checklist of considerations to make this process smoother. (Additional note: I spent most of today on just this goal.) They may even merge those two things. For now, use the existing Project Kickoff Form and check out this wiki containing the questions I’ve listed above.

Not sure if you need a review? Just have a question? Something you want to run by them? Drop them an email or pop into the #privacy IRC channel.

Have an Opinion? Join the Effort.

The Mozilla Privacy Council needs more engineers, including volunteer contributors. No one knows more about building software than we do. User-empowering software won’t get built without us. Help shape the training, best practices, the kickoff form, and privacy reviews of new features. To get involved, email stacy at mozilla dot com.

Special thanks to the Metro team for their patience with my delayed code reviews this week.

Thank you for reading. May your clobber builds be short.