⚠️Danger zone⚠️: handling sensitive data in Glean

Co-authored by Alessio Placitelli and Beatriz Rizental.
(“This Week in Glean” is a series of blog posts that the Glean Team at Mozilla is using to try to communicate better about our work. They could be release notes, documentation, hopes, dreams, or whatever: so long as it is inspired by Glean. You can find an index of all TWiG posts online.)

🎵 “Precious and fragile things, need special handling […]” 🎵, and that applies to data, too!

Over the years, a number of projects at Mozilla have had to handle the collection of sensitive data that users explicitly decided to share with us (for example, things as sensitive as full URLs). Most of the time these projects were designed and built on top of our legacy telemetry systems, leaving developers with the daunting task of validating their implementations, asking for security reviews and re-inventing their APIs.

With the advent of Glean, Mozilla’s Data Org took the opportunity to improve this area, allowing our internal customers to build better data products.

Data collection + Pipeline Encryption = ✨

We didn’t really talk about what we mean by “special handling”, did we?

For data that is generally not sensitive (e.g. the amount of available RAM), after a product using Glean submits a ping, it hits the ingestion pipeline. The communication channel between the Glean client and the ingestion server is HTTPS, which means the channel is encrypted from one end (the client) to the other end (the ingestion server). After the ingestion server is hit, unencrypted pings are routed within our ingestion pipeline and dispatched to the destination tables.

For products requesting pipeline encryption to make sure only specific individuals and pipeline engineers can access the data, the path is slightly different. When such a product is enabled in the ingestion pipeline, an encryption key is provisioned and must be configured in the product using Glean before new pings can be successfully ingested into a data store. From that moment on, all the pings generated by the Glean client will look like this:

{ "payload": "eyJhbGciOiJFQ0RILUVTI..." }

Not a lot of info to route things within the pipeline, right? 🤦
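As a small aside, the visible start of that payload is plain base64 and can be decoded on its own. The sketch below decodes only the characters shown in the example above (the rest is elided there and is not guessed at). The result looks like the beginning of a JSON header naming the key agreement algorithm, which is what one would expect from a JWE-style compact token; the post itself does not name the format, so treat that reading as an assumption.

```javascript
// Only the characters visible in the example above; everything after them is elided.
const visiblePrefix = "eyJhbGciOiJFQ0RILUVTI";

// Base64 decodes in groups of 4 characters, so drop the trailing partial group.
const wholeGroups = visiblePrefix.slice(
  0,
  visiblePrefix.length - (visiblePrefix.length % 4)
);

// Buffer is a Node.js global; in a browser, atob(wholeGroups) would do the same here.
const decodedPrefix = Buffer.from(wholeGroups, "base64").toString("utf8");

console.log(decodedPrefix); // prints {"alg":"ECDH-ES
```

Nothing useful for routing is exposed here: the header fragment says how the content was protected, not which product or ping it belongs to.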

Luckily for our pipeline, all Glean ping submissions conform to the HTTP Edge Specification. By knowing the Glean application id (which maps to the document namespace from the HTTP Edge Specification) and the ping type, the pipeline knows everything it needs to route pings to their destination, look up the decryption keys and decrypt the payload before inserting it into the destination table.
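To make the routing idea concrete, here is a minimal sketch of how a submission path could be split into the pieces the pipeline needs. It assumes the path layout from the HTTP Edge Specification, `/submit/<namespace>/<docType>/<docVersion>/<docId>`, and assumes the application id appears in the path with dashes instead of dots. The `decryptionKeyIds` table, the `routePing` function and the all-zero document id are hypothetical and only here for illustration; this is not the pipeline's actual code.

```javascript
// Path layout assumed from the HTTP Edge Specification:
//   /submit/<namespace>/<docType>/<docVersion>/<docId>
function parseSubmissionPath(path) {
  const parts = path.split("/").filter((part) => part.length > 0);
  if (parts.length !== 5 || parts[0] !== "submit") {
    throw new Error(`Not a recognizable submission path: ${path}`);
  }
  const [, namespace, docType, docVersion, docId] = parts;
  return { namespace, docType, docVersion, docId };
}

// Hypothetical key registry, keyed by "<namespace>/<ping type>".
const decryptionKeyIds = new Map([
  ["my-fancy-encrypted-app/baseline", "fancy"],
]);

// Hypothetical routing step: everything it needs comes from the URL, not the body.
function routePing(path) {
  const route = parseSubmissionPath(path);
  const keyId =
    decryptionKeyIds.get(`${route.namespace}/${route.docType}`) ?? null;
  return { ...route, keyId };
}

const routed = routePing(
  "/submit/my-fancy-encrypted-app/baseline/1/00000000-0000-0000-0000-000000000000"
);
console.log(routed);
```

The point of the sketch is that the encrypted body never has to be opened to decide where a ping goes or which key applies to it.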

It’s important to note that only a handful of pipeline engineers are authorized to inspect the encrypted payload (and enabled to fix things if they break!) and only an explicit list of individuals, created when enabling the product in the pipeline, is allowed to access the final data within a secure, locked down environment.

How does the ✨magic✨ happen in the Glean SDKs?

As discussed, ping encryption is not a feature required by all products using Glean. From a client standpoint, it is also a feature that has the potential to significantly increase the size of the final Glean SDK because, in most environments, external dependencies are necessary to encrypt the ping payload. Ideally, we should find a way to make it an opt-in feature, i.e. only users that actually need it pay the (size) price for it. And so we did.

Ping encryption was the perfect use case to implement a new and long discussed feature in the Glean SDKs: plugins. By implementing the ping encryption as a plugin and not a core feature, we achieve the goal of making it an opt-in feature. This strategy also has the added bonus of keeping the encryption initialization parameters out of the Glean configuration object. Win/win.

Since the ping encryption plugin would be the first ever Glean SDK plugin, we needed to figure out our plugin architecture. In a nutshell, the concept we settled on is: plugins are classes that define an action to be performed when a specific Glean event happens. Each event might provide extra context for the action performed by the plugin and might require that the plugin return a modified version of said context. Plugin instances are passed to Glean as initialization parameters.
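The concept can be sketched in a few lines. This is a minimal illustration, not the Glean SDK's actual API: the `Plugin` base class shape, the `EventDispatcher`, the `TaggingPlugin` and the event name `exampleEvent` are all hypothetical, and the sketch keeps at most one plugin per event only to stay short.

```javascript
// Hypothetical base class (names are illustrative, not the SDK's API).
class Plugin {
  constructor(event, name) {
    this.event = event; // the event this plugin reacts to
    this.name = name;
  }

  // Receives the event's context; may return a modified version of it.
  action(_context) {
    throw new Error(`${this.name} does not implement action()`);
  }
}

// Hypothetical dispatcher, built from the plugin instances passed at initialization.
class EventDispatcher {
  constructor(plugins = []) {
    this.pluginsByEvent = new Map();
    for (const plugin of plugins) {
      if (this.pluginsByEvent.has(plugin.event)) {
        throw new Error(`Event ${plugin.event} already has a plugin`);
      }
      this.pluginsByEvent.set(plugin.event, plugin);
    }
  }

  // Returns the plugin's result, or the untouched context when nothing is registered.
  trigger(event, context) {
    const plugin = this.pluginsByEvent.get(event);
    return plugin ? plugin.action(context) : context;
  }
}

// Example plugin: returns a copy of the context with a tag added.
class TaggingPlugin extends Plugin {
  constructor(tag) {
    super("exampleEvent", "TaggingPlugin");
    this.tag = tag;
  }

  action(context) {
    return { ...context, tag: this.tag };
  }
}

const dispatcher = new EventDispatcher([new TaggingPlugin("demo")]);
const tagged = dispatcher.trigger("exampleEvent", { metrics: {} });
const untouched = dispatcher.trigger("someOtherEvent", { metrics: {} });
console.log(tagged, untouched);
```

Keeping the plugin's own parameters in its constructor is what lets them stay out of the main configuration object.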

Let’s put a shape to this, by describing the ping encryption plugin.

The ping encryption plugin is registered to the afterPingCollection event.
- This event calls a plugin action after the ping is collected, but before it is stored and queued for upload. It provides the collected ping payload as context to the plugin action and requires that the action return a JSON object. Whatever the action returns is what will be saved and queued for upload in place of the original payload. If no plugin is registered to this event, collection happens as usual.
The ping encryption plugin action gets the ping payload from this event and returns the encrypted version of this payload.
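The behaviour of this event can be illustrated with a small sketch. Everything here is hypothetical apart from the event name: `collectPing`, `uploadQueue` and the sample payload are made up, and `standInPlugin` only wraps the payload in base64 so the flow is visible. Base64 is an encoding, NOT encryption; the real plugin encrypts the payload instead.

```javascript
// Stand-in for the encryption plugin. WARNING: base64 is not encryption;
// it is used here only so the sketch has no external dependencies.
const standInPlugin = {
  event: "afterPingCollection",
  action(payload) {
    const encoded = Buffer.from(JSON.stringify(payload), "utf8").toString("base64");
    return { payload: encoded };
  },
};

function isJsonObject(value) {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Hypothetical collection step: runs after the ping is assembled,
// before it is stored and queued for upload.
function collectPing(collectedPayload, plugins, queue) {
  const plugin = plugins.find((p) => p.event === "afterPingCollection");
  let toStore = collectedPayload;
  if (plugin) {
    toStore = plugin.action(collectedPayload);
    if (!isJsonObject(toStore)) {
      throw new Error("afterPingCollection action must return a JSON object");
    }
  }
  queue.push(toStore); // whatever the action returned replaces the original payload
  return toStore;
}

const uploadQueue = [];
const original = { ping_info: { seq: 0 }, metrics: {} };

collectPing(original, [], uploadQueue); // no plugin: stored unchanged
collectPing(original, [standInPlugin], uploadQueue); // plugin: stored in wrapped form

console.log(uploadQueue);
```

A real encryption action would most likely be asynchronous; the sketch stays synchronous to keep the control flow easy to follow.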

In order to use this plugin, products using Glean need to pass an instance of it to the Glean SDK of their choice during initialization.

import Glean from "@mozilla/glean/webext";
import PingEncryptionPlugin from "@mozilla/glean/plugins/encryption";

// `uploadStatus` is defined elsewhere by the product: whether upload is enabled.
Glean.initialize(
  "my.fancy.encrypted.app",
  uploadStatus,
  {
    plugins: [
      // The plugin takes the public key provisioned for this product, in JWK form.
      new PingEncryptionPlugin({
        "crv": "P-256",
        "kid": "fancy",
        "kty": "EC",
        "x": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
        "y": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
      })
    ]
  }
);

And that is it. All pings sent from this Glean instance will now be encrypted before they are sent.

Note: The ping encryption plugin is only available on the Glean JavaScript SDK at the moment. Please refer to the Glean book for comprehensive documentation on using the PingEncryptionPlugin.

Limitations and next steps

While the current approach serves the needs of Mozilla’s internal customers, there are some limitations that we are planning to smooth out in the future. For example, in order to be properly routed, products that want to opt into Glean pipeline encryption will need to use a fixed, common prefix in their application id. Another constraint of the current system is that once a product opts into pipeline encryption, all of its pings are expected to be encrypted: the same product won’t be able to send both pipeline-encrypted and pipeline-unencrypted pings.

One final constraint is that the tooling available in the secure environment is limited to Jupyter notebooks.

Acknowledgements

The pipeline encryption support in Glean wasn’t built in a day! This major feature builds on the incremental work that many Mozillians did over the past year (thank you Wesley Dawson, Anthony Miyaguchi, Arkadiusz Komarzewski and anyone else who helped with it!).

And kudos to the first product making use of this neat feature!
