
⚠️Danger zone⚠️: handling sensitive data in Glean

Co-authored by Alessio Placitelli and Beatriz Rizental.
(“This Week in Glean” is a series of blog posts that the Glean Team at Mozilla is using to try to communicate better about our work. They could be release notes, documentation, hopes, dreams, or whatever: so long as it is inspired by Glean. You can find an index of all TWiG posts online.)

🎵 “Precious and fragile things, need special handling […]” 🎵, and that applies to data, too!

Over the years, a number of projects at Mozilla have had to handle the collection of sensitive data that users explicitly decided to share with us (full URLs, for example). Most of the time, these projects were designed and built on top of our legacy telemetry systems, leaving developers with the daunting task of validating their implementations, asking for security reviews and re-inventing their APIs.

With the advent of Glean, Mozilla’s Data Org took the opportunity to improve this area, allowing our internal customers to build better data products.

Data collection + Pipeline Encryption = ✨

We didn’t really talk about what we mean by “special handling”, did we?

For data that is generally not sensitive (e.g. the amount of available RAM), after a product using Glean submits a ping, it hits the ingestion pipeline. The communication channel between the Glean client and the ingestion server is HTTPS, which means the channel is encrypted from one end (the client) to the other end (the ingestion server). After the ingestion server is hit, unencrypted pings are routed within our ingestion pipeline and dispatched to the destination tables.

For products requesting pipeline encryption to make sure only specific individuals and pipeline engineers can access the data, the path is slightly different. When such a product is enabled in the ingestion pipeline, an encryption key is provisioned and must be configured in the product using Glean before new pings can be successfully ingested into a data store. From that moment on, all the pings generated by the Glean client will look like this:

{
"payload": "eyJhbGciOiJFQ0RILUVTI..."
}
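
As an aside, the part of the `payload` value that is visible above is base64url text, and decoding just those characters gives a hint about the envelope format. The sketch below assumes Node.js (for `Buffer`); `decodeBase64Url` is a helper made up for this example and is not part of Glean.

```javascript
// The visible, non-elided prefix of the "payload" value shown above.
// The rest of the value is truncated in the post, so it is not decoded here.
const visiblePrefix = "eyJhbGciOiJFQ0RILUVT";

// Hypothetical helper: decode a base64url string into text (Node.js Buffer).
function decodeBase64Url(text) {
  // base64url swaps "+" and "/" for "-" and "_"; map them back first.
  const base64 = text.replace(/-/g, "+").replace(/_/g, "/");
  return Buffer.from(base64, "base64").toString("utf8");
}

console.log(decodeBase64Url(visiblePrefix)); // prints {"alg":"ECDH-ES
```

The decoded fragment looks like the start of a JOSE-style header naming ECDH-ES as the key agreement algorithm, which is consistent with the elliptic-curve (P-256) key passed to the plugin later in this post. Everything past that prefix is elided, so nothing more can be read from it.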

Not a lot of info to route things within the pipeline, right? 🤦

Luckily for our pipeline, all Glean ping submissions conform to the HTTP Edge Specification. By knowing the Glean application id (which maps to the document namespace from the HTTP Edge Specification) and the ping type, the pipeline knows everything it needs to route pings to their destination, look up the decryption keys and decrypt the payload before inserting it into the destination table.
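
To make the routing idea concrete, here is a minimal sketch, assuming submission paths of the form `/submit/<namespace>/<docType>/<docVersion>/<docId>`. The helper names, the `Map` used as a key store, and the example application id, ping name and document id are all made up for illustration; this is not the pipeline's actual code.

```javascript
// Hypothetical helper: split a submission path, assuming the layout
// /submit/<namespace>/<docType>/<docVersion>/<docId>.
function parseSubmissionPath(path) {
  const parts = path.split("/").filter((part) => part.length > 0);
  if (parts.length !== 5 || parts[0] !== "submit") {
    throw new Error(`Unexpected submission path: ${path}`);
  }
  const [, namespace, docType, docVersion, docId] = parts;
  return { namespace, docType, docVersion, docId };
}

// Hypothetical routing step: the key store is a plain Map standing in for
// whatever the real pipeline uses to look up decryption keys.
function routeEncryptedPing(path, keyStore) {
  const { namespace, docType } = parseSubmissionPath(path);
  const keyId = keyStore.get(namespace);
  if (keyId === undefined) {
    throw new Error(`No decryption key provisioned for ${namespace}`);
  }
  // Everything needed for routing comes from the path, not from the body.
  return { namespace, docType, keyId };
}

// Example values only: the application id, ping name and key id are invented.
const keyStore = new Map([["my-fancy-encrypted-app", "fancy"]]);
const route = routeEncryptedPing(
  "/submit/my-fancy-encrypted-app/my-ping/1/3f6c1f3e-0000-4000-8000-000000000000",
  keyStore
);
console.log(route);
```

The point of the sketch is that the encrypted body is never inspected: the namespace and ping type taken from the path are enough to pick a destination and a key.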

It’s important to note that only a handful of pipeline engineers are authorized to inspect the encrypted payload (and to fix things if they break!), and only an explicit list of individuals, created when enabling the product in the pipeline, is allowed to access the final data within a secure, locked-down environment.

How does the ✨magic✨ happen in the Glean SDKs?

As discussed, ping encryption is not a feature required by all products using Glean. From a client standpoint, it is also a feature that has the potential to significantly increase the size of the final Glean SDK because, in most environments, external dependencies are necessary to encrypt the ping payload. Ideally, we should find a way to make it an opt-in feature, i.e. only users that actually need it pay the (size) price for it. And so we did.

Ping encryption was the perfect use case to implement a new and long-discussed feature in the Glean SDKs: plugins. By implementing ping encryption as a plugin and not a core feature, we achieve the goal of making it an opt-in feature. This strategy also has the added bonus of keeping the encryption initialization parameters out of the Glean configuration object: win/win.

Since the ping encryption plugin would be the first ever Glean SDK plugin, we needed to figure out our plugin architecture. In a nutshell, the concept we settled on is: plugins are classes that define an action to be performed when a specific Glean event happens. Each event might provide extra context for the action performed by the plugin and might require that the plugin return a modified version of said context. Plugin instances are passed to Glean as initialization parameters.

Let’s give this some shape by describing the ping encryption plugin.

  • The ping encryption plugin is registered to the afterPingCollection event.
    • This event calls a plugin action after the ping is collected, but before it is stored and queued for upload. The event provides the collected ping payload as context to the plugin action and requires that the action return a JSON object. Whatever the action returns is what will be saved and queued for upload in place of the original payload. If no plugin is registered to this event, collection happens as usual.
  • The ping encryption plugin action gets the ping payload from this event and returns the encrypted version of this payload.
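
The event and action flow described in the list above can be sketched as follows. This is a toy model, not the Glean SDK source: `ToyPlugin`, `ToyEnvelopePlugin` and `collectPing` are hypothetical names, the action is synchronous for brevity, and base64 encoding stands in for real encryption purely to show where the payload gets swapped. It assumes Node.js for `Buffer`.

```javascript
// Toy plugin base class: a plugin names the event it attaches to and
// provides an action to run when that event fires. Names are hypothetical.
class ToyPlugin {
  constructor(event, name) {
    this.event = event;
    this.name = name;
  }
  action(context) {
    throw new Error(`${this.name} does not implement action()`);
  }
}

// Stand-in for the encryption plugin. Base64 is NOT encryption; it only
// marks the spot where a real plugin would encrypt the payload.
class ToyEnvelopePlugin extends ToyPlugin {
  constructor() {
    super("afterPingCollection", "toyEnvelope");
  }
  action(payload) {
    const encoded = Buffer.from(JSON.stringify(payload)).toString("base64");
    return { payload: encoded };
  }
}

// Toy collection step: run the plugin registered for afterPingCollection,
// if there is one, and keep whatever it returns instead of the original.
function collectPing(payload, plugins = []) {
  const plugin = plugins.find((p) => p.event === "afterPingCollection");
  return plugin ? plugin.action(payload) : payload;
}

// Made-up ping contents, just to have something to collect.
const ping = { metrics: { counter: { "sample.clicks": 1 } } };
const plain = collectPing(ping); // no plugin: payload is untouched
const wrapped = collectPing(ping, [new ToyEnvelopePlugin()]);
console.log(Object.keys(plain)); // the original keys
console.log(Object.keys(wrapped)); // a single "payload" key
```

Because the collection step only ever looks at what the action returns, the core code does not need to know anything about encryption, which is what keeps the feature opt-in.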

In order to use this plugin, products using Glean need to pass an instance of it to the Glean SDK of their choice during initialization.

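import Glean from "@mozilla/glean/webext";
import PingEncryptionPlugin from "@mozilla/glean/plugins/encryption";

// uploadStatus is defined elsewhere in the product: a boolean telling Glean
// whether data upload is currently enabled.
Glean.initialize(
  "my.fancy.encrypted.app",
  uploadStatus,
  {
    plugins: [
      // The plugin takes the public key provisioned for this product,
      // in JSON Web Key (JWK) form. The x and y values are placeholders.
      new PingEncryptionPlugin({
        "crv": "P-256",
        "kid": "fancy",
        "kty": "EC",
        "x": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
        "y": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
      })
    ]
  }
);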

And that is it. All pings sent from this Glean instance will now be encrypted before they are sent.

Note: The ping encryption plugin is only available on the Glean JavaScript SDK at the moment. Please refer to the Glean book for comprehensive documentation on using the PingEncryptionPlugin.

Limitations and next steps

While the current approach serves the needs of Mozilla’s internal customers, there are some limitations that we are planning to smooth out in the future. For example, in order to be properly routed, products that want to opt into Glean pipeline encryption will need to use a fixed, common prefix in their application id. Another constraint of the current system is that once a product opts into pipeline encryption, all of its pings are expected to be encrypted: the same product won’t be able to send both pipeline-encrypted and pipeline-unencrypted pings.

One final constraint is that the tooling available in the secure environment is limited to Jupyter notebooks.

Acknowledgements

The pipeline encryption support in Glean wasn’t built in a day! This major feature builds on incremental work done over the past year by many Mozillians (thank you Wesley Dawson, Anthony Miyaguchi, Arkadiusz Komarzewski and anyone else who helped with it!).

And kudos to the first product making use of this neat feature!