This Week in Glean: What If I Want To Collect All The Data?

(“This Week in Glean” is a series of blog posts that the Glean Team at Mozilla is using to try to communicate better about our work. They could be release notes, documentation, hopes, dreams, or whatever: so long as it is inspired by Glean. All “This Week in Glean” blog posts are listed in the TWiG index.)

Mozilla’s approach to data is “as little as necessary to get the job done”, as espoused in our Firefox Privacy Promise and packaged, in Mozilla’s Lean Data Practices, in a form you can bring into your own organization. If you didn’t already know that Glean is a Mozilla project, you’d find out very quickly by using it. All of its systems are designed around the idea that you’ve carefully considered your instrumentation ahead of time and done some review to ensure the collection aligns with your values.
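That up-front consideration shows up concretely in how Glean collections are declared: every metric is written down in a `metrics.yaml` file before any data flows, complete with a link to its data review. Here’s a minimal sketch of such a declaration (the metric itself and the bug/review URLs are made up for illustration):

```yaml
# metrics.yaml - every Glean collection is declared up front like this.
# The metric below is hypothetical; the bug and review URLs are placeholders.
$schema: moz://mozilla.org/schemas/glean/metrics/2-0-0

urlbar:
  engagement_count:
    type: counter
    description: |
      Counts how often the user engages with the urlbar.
      (Illustrative only; not a real Firefox metric.)
    bugs:
      - https://bugzilla.mozilla.org/show_bug.cgi?id=0000000
    data_reviews:
      - https://bugzilla.mozilla.org/show_bug.cgi?id=0000000#c1
    notification_emails:
      - nobody@mozilla.org
    expires: never
```

Note that `data_reviews` is a required field: the declaration doesn’t pass Glean’s schema checks without it, which is how “have you done your review?” gets enforced mechanically.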

(This happens to have some serious knock-on benefits for data democratization and tooling, allowing Mozilla’s small Data Org to offer some seriously powerful insights on a shoestring budget. You can learn more about that in a talk I gave to Ubisoft at their Data Summit in 2021.)

Less Data, as the saying goes, implies Greater Data and Greatest Data. Or in a less memetic way, Mozilla wants to collect less data… but less than what?

Less than more, certainly. But how much more? How much is too much?

How much is “all”?

Since my brain’s weird, I decided to pursue the thought experiment: “What is the _most_ data you could collect from a piece of software while it’s being used?”

Well, looking at Firefox, every button press and page load and scroll and click and and and… all of that matters. As does the state of Firefox when it’s being clicked and scrolled and so forth. Typing in the urlbar is different if you already have a page loaded. Opening your first tab is different from opening your nine-thousand-two-hundred-and-fiftieth.

And, underneath it all, is the code. How fast is it running? How much memory are we using? All these performance questions that Firefox Telemetry was originally built to answer. Is code on line 123 of file XYZ.cpp running? Is it running well? What do we run next?

For software, recording all of the data means knowing the full state of the program at every expression it runs on every line of code. At every advancement of the Program Counter, we’d need to dump the entire Stack and Heap.

Yikes! That’s gigabytes of data per clock cycle.
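To put a rough number on that (both figures below are illustrative assumptions, not measurements):

```python
# Back-of-envelope cost of dumping full program state at every
# Program Counter advancement. Both inputs are assumptions.
resident_memory = 1 * 2**30  # assume a ~1 GiB Firefox process (stack + heap)
cpu_hz = 3 * 10**9           # assume ~3 billion PC advancements per second

bytes_per_second = resident_memory * cpu_hz
print(f"{bytes_per_second / 2**60:.1f} EiB per second")
# ~2.8 EiB/s: a gigabyte per clock cycle adds up to exabytes every second.
```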

Well, maybe we can be cleverer than this. Another one of those projects Mozilla incubated that now has a whole community of contributors and users (like Rust) is a lightweight record-and-replay debugger called rr. The rr debugger collects a trace of a running piece of software and can deterministically replay it over and over again (backwards, even!), meaning the trace has all the information we need in it.
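In practice a session looks roughly like this (Linux only):

```sh
# Record a Firefox run; rr writes the trace under ~/.local/share/rr by default.
rr record firefox

# Replay the most recent trace. This drops you into a gdb session where
# execution is deterministic and can even run backwards.
rr replay
# (gdb) continue
# (gdb) reverse-continue
```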

So a decent size estimate for “all the data” might be the size of one of these trace recordings. They’re big, but not “full heap and stack at every program counter” big: a short test of Firefox came to about 2GB for a one-minute run (albeit without any user interaction or graphics).

Could Glean collect traces like these? Or bigger ones after, say, a full day’s use? Not easily. Not without modification.

Let’s say we did those modifications. Let’s push this thought experiment further. What does that mean for analysis? Well, we’d have all these recordings we could spin up a VM to replay for us. If we want the number of open tabs, we could replay it and sample that count whenever we wanted.
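As a sketch of what that analysis could look like: rr replays inside gdb, and gdb is scriptable in Python, so sampling a counter over a whole trace might look something like this. (Everything Firefox-specific here is made up: `TabManager::OpenTab` and `gBrowserTabCount` are hypothetical names standing in for whatever the real symbols would be.)

```python
# tab_sampler.py - source this from the gdb session that `rr replay` opens,
# e.g. `(gdb) source tab_sampler.py`. Hypothetical Firefox symbols throughout.
import gdb

samples = []

# Stop the replay whenever a tab is opened. The function name is made up.
bp = gdb.Breakpoint("TabManager::OpenTab")

try:
    while True:
        gdb.execute("continue", to_string=True)
        # Sample the (hypothetical) global tab counter at this moment in time.
        samples.append(int(gdb.parse_and_eval("gBrowserTabCount")))
except gdb.error:
    pass  # the replay ran off the end of the trace

print(f"{len(samples)} tab-open events; max open tabs: {max(samples, default=0)}")
```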

This would be a seismic shift in how instrumentation interacts with analysis. We’d no longer have to ship code to instrument Firefox; we could “simply” (in quotes because using rr requires you to be a programming nerd) replay existing traces and extract whatever new data we needed.

It would also be absolutely horrible. We’d have to store every possible metric just in case we wanted _one_ of them. And there’s so much data in these traces that Mozilla doesn’t want to store: pictures you looked at, videos you watched, videos you uploaded… good grief. We don’t want any of that.

(( I’d like to take a second to highlight that this is a thought experiment: Mozilla doesn’t do this. We don’t have plans to do this. In fact, Mozilla’s Data Privacy Principles (specifically “Limited Data”) and Mozilla’s Manifesto (specifically Principle 4 “Individuals’ security and privacy on the internet are fundamental and must not be treated as optional.”) pretty clearly state how we think about data like this. ))

And processing these traces into a form useful for analysis would take the CPU power of a small country, over and over again.

(( And rr introduces a 20% performance penalty, which really wouldn’t ingratiate us with our users. And it only works on Linux, meaning the data we’d have access to wouldn’t be representative of our user base anyway. ))

And what was the point of this again? Right: we’re here to quantify what “less data” means. But how can we do that, even knowing (as we now do) the size of “all data”? Is the string value of the profile directory’s random portion comparable to the URL the user visits most often? Are those each one piece of data that we can weigh against the N pieces of data in a full rr trace? Mozilla doesn’t think they’re the same, since we categorize (and thus treat) these collections differently.

All in all, figuring out the maximum amount of data you could collect, in order to contextualize how much less of it you’re actually collecting, might not be meaningful.

Oh well.

I guess this means that the only way Mozilla (and you!) can continue to quantify “less data” is by comparing it to “no data” – the least possible amount of data.

:chutten

(( This post is a syndicated copy of the original post. ))