{"id":393,"date":"2022-02-16T15:49:16","date_gmt":"2022-02-16T15:49:16","guid":{"rendered":"https:\/\/blog.mozilla.org\/data\/?p=393"},"modified":"2022-02-16T15:49:16","modified_gmt":"2022-02-16T15:49:16","slug":"this-week-in-glean-what-if-i-want-to-collect-all-the-data","status":"publish","type":"post","link":"https:\/\/blog.mozilla.org\/data\/2022\/02\/16\/this-week-in-glean-what-if-i-want-to-collect-all-the-data\/","title":{"rendered":"This Week in Glean: What If I Want To Collect All The Data?"},"content":{"rendered":"<p>(\u201cThis Week in Glean\u201d is a series of blog posts that the Glean Team at Mozilla is using to try to communicate better about our work. They could be release notes, documentation, hopes, dreams, or whatever: so long as it is inspired by Glean. All \u201cThis Week in Glean\u201d blog posts are listed in the <a href=\"https:\/\/mozilla.github.io\/glean\/book\/appendix\/twig.html\">TWiG index<\/a>).<\/p>\n<p>Mozilla\u2019s approach to data is \u201cas little as necessary to get the job done\u201d as espoused in our <a href=\"https:\/\/www.mozilla.org\/firefox\/privacy\/\">Firefox Privacy Promise<\/a> and put in a shape you can import into your own organization in Mozilla\u2019s <a href=\"https:\/\/www.mozilla.org\/about\/policy\/lean-data\/\">Lean Data Practices<\/a>. If you didn\u2019t already know, you\u2019d find out very quickly by using it that Glean is a Mozilla project. 
All of its systems are designed with the idea that you\u2019ve carefully considered your instrumentation ahead of time, and you\u2019ve done some review to ensure that the collection aligns with your values.<\/p>\n<p>(This happens to have some serious knock-on benefits for data democratization and tooling that allows Mozilla\u2019s small Data Org to offer some seriously-powerful insights on a shoestring budget, which you can learn more about in <a href=\"https:\/\/blog.mozilla.org\/data\/2021\/07\/06\/responsible-data-collection-is-good-actually-ubisoft-data-summit-2021\/\">a talk I gave to Ubisoft at their Data Summit in 2021<\/a>.)<\/p>\n<p>Less Data, as the saying goes, implies Greater Data and Greatest Data. Or in a less memetic way, Mozilla wants to collect less data\u2026 but less than what?<\/p>\n<p>Less than more, certainly. But how much more? How much is too much?<\/p>\n<p>How much is \u201call\u201d?<\/p>\n<p>Since my brain\u2019s weird I decided to pursue this thought experiment of \u201cWhat is the _most_ data you could collect from a software project being used?\u201d.<\/p>\n<p>Well, looking at Firefox, every button press and page load and scroll and click and and and\u2026 all of that matters. As does the state of Firefox when it\u2019s being clicked and scrolled and so forth. Typing in the urlbar is different if you already have a page loaded. Opening your first tab is different from opening your <a href=\"https:\/\/mzl.la\/3HQSaCw\">nine-thousand-two-hundred-and-fiftieth<\/a>.<\/p>\n<p>And, underneath it all, is the code. How fast is it running? How much memory are we using? All these performance questions that Firefox Telemetry was originally built to answer. Is code on line 123 of file XYZ.cpp running? Is it running well? What do we run next?<\/p>\n<p>For software this means to record all of the data, we\u2019ll need to know the full state of the program at every expression it runs in every line of code. 
At every advancement of the Program Counter, we\u2019d need to dump the entire Stack and Heap.<\/p>\n<p>Yikes! That\u2019s gigabytes of data per clock cycle.<\/p>\n<p>Well, maybe we can be cleverer than this. Another one of those projects Mozilla incubated that now has a whole community of contributors and users (like <a href=\"https:\/\/www.rust-lang.org\/\">Rust<\/a>) is a lightweight record-and-replay debugger called <a href=\"https:\/\/rr-project.org\/\">rr<\/a>. The rr debugger collects traces of a running piece of software and can deterministically replay it over and over again (backwards, even!), meaning it has all the information we need in it.<\/p>\n<p>So a decent size estimate for \u201call the data\u201d might be the size of one of these trace recordings. They\u2019re big, but not \u201cfull heap and stack at every program counter\u201d big. A short test run of Firefox was about 2GB for a one minute run (albeit without any user interaction or graphics).<\/p>\n<p>Could Glean collect traces like these? Or bigger ones after, say, a full day\u2019s use? Not easily. Not without modification.<\/p>\n<p>Let\u2019s say we did those modifications. Let\u2019s push this thought experiment further. What does that mean for analysis? Well, we\u2019d have all these recordings we could spin up a VM to replay for us. If we want the number of open tabs, we could replay it and sample that count whenever we wanted.<\/p>\n<p>This would be a seismic shift in how instrumentation interacted with analysis. We\u2019d no longer have to ship code to instrument Firefox, we could \u201csimply\u201d (in quotes because using rr requires you to be a programming nerd) replay existing traces and extract the new data we needed.<\/p>\n<p>It would also be absolutely horrible. We\u2019d have to store every possible metric just in case we wanted _one_ of them. 
And there\u2019s so much data in these traces that Mozilla doesn\u2019t want to store: pictures you looked at, videos you watched, videos you uploaded\u2026 good grief. We don\u2019t want any of that.<\/p>\n<p>(( I\u2019d like to take a second to highlight that this is a thought experiment: Mozilla doesn\u2019t do this. We don\u2019t have plans to do this. In fact, Mozilla\u2019s <a href=\"https:\/\/www.mozilla.org\/privacy\/principles\/\">Data Privacy Principles<\/a> (specifically \u201cLimited Data\u201d) and <a href=\"https:\/\/www.mozilla.org\/about\/manifesto\/\">Mozilla\u2019s Manifesto<\/a> (specifically Principle 4 \u201cIndividuals\u2019 security and privacy on the internet are fundamental and must not be treated as optional.\u201d) pretty clearly state how we think about data like this. ))<\/p>\n<p>And processing these traces into a useful form for analysis to be performed would take the CPU processing power of a small country, over and over again.<\/p>\n<p>(( And rr introduces a 20% performance penalty, which really wouldn\u2019t ingratiate us with our users. And it only works on Linux, meaning the data we\u2019d have access to wouldn\u2019t be <a href=\"https:\/\/data.firefox.com\/\">representative of our user base anyway.<\/a> ))<\/p>\n<p>And what was the point of this again? Right. We\u2019re here to quantify what \u201cless data\u201d means. But how can we do that, even knowing as we do now what the size of \u201call data\u201d is? Is the string value of the profile directory\u2019s random portion comparable to the URL the user visits the most? Are those both 1 piece of data that we can compare to the N pieces of data we get in a full rr trace? 
Mozilla doesn\u2019t think they\u2019re the same, since we <a href=\"https:\/\/wiki.mozilla.org\/Data_Collection#Data_Collection_Categories\">categorize<\/a> (and thus treat) these collections differently.<\/p>\n<p>All in all maybe figuring out the maximum amount of data you could collect in order to contextualize how much less of it you are collecting might not be meaningful.<\/p>\n<p>Oh well.<\/p>\n<p>I guess this means that the only way Mozilla (and you!) can continue to quantify \u201cless data\u201d is by comparing it to \u201cno data\u201d \u2013 the least possible amount of data.<\/p>\n<p>:chutten<\/p>\n<p>(( This post is a syndicated copy of the <a href=\"https:\/\/chuttenblog.wordpress.com\/2022\/02\/16\/this-week-in-glean-what-if-i-want-to-collect-all-the-data\/\">original post<\/a>. ))<\/p>\n","protected":false},"excerpt":{"rendered":"<p>(\u201cThis Week in Glean\u201d is a series of blog posts that the Glean Team at Mozilla is using to try to communicate better about our work. 
They could be release &hellip; <a class=\"go\" href=\"https:\/\/blog.mozilla.org\/data\/2022\/02\/16\/this-week-in-glean-what-if-i-want-to-collect-all-the-data\/\">Read more<\/a><\/p>\n","protected":false},"author":1437,"featured_media":197,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[448297],"tags":[525,448297,72422,448335],"coauthors":[],"_links":{"self":[{"href":"https:\/\/blog.mozilla.org\/data\/wp-json\/wp\/v2\/posts\/393"}],"collection":[{"href":"https:\/\/blog.mozilla.org\/data\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.mozilla.org\/data\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.mozilla.org\/data\/wp-json\/wp\/v2\/users\/1437"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.mozilla.org\/data\/wp-json\/wp\/v2\/comments?post=393"}],"version-history":[{"count":0,"href":"https:\/\/blog.mozilla.org\/data\/wp-json\/wp\/v2\/posts\/393\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/blog.mozilla.org\/data\/wp-json\/wp\/v2\/media\/197"}],"wp:attachment":[{"href":"https:\/\/blog.mozilla.org\/data\/wp-json\/wp\/v2\/media?parent=393"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.mozilla.org\/data\/wp-json\/wp\/v2\/categories?post=393"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.mozilla.org\/data\/wp-json\/wp\/v2\/tags?post=393"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/blog.mozilla.org\/data\/wp-json\/wp\/v2\/coauthors?post=393"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}