{"id":205,"date":"2020-07-16T20:23:23","date_gmt":"2020-07-16T20:23:23","guid":{"rendered":"https:\/\/blog.mozilla.org\/data\/?p=205"},"modified":"2020-07-16T20:23:23","modified_gmt":"2020-07-16T20:23:23","slug":"mozilla-telemetry-in-2020-from-just-firefox-to-a-galaxy-of-data","status":"publish","type":"post","link":"https:\/\/blog.mozilla.org\/data\/2020\/07\/16\/mozilla-telemetry-in-2020-from-just-firefox-to-a-galaxy-of-data\/","title":{"rendered":"Mozilla Telemetry in 2020: From &#8220;Just Firefox&#8221; to a &#8220;Galaxy of Data&#8221;"},"content":{"rendered":"<p><em>(\u201cThis Week in Glean\u201d is a series of blog posts that the Glean Team at Mozilla is using to try to communicate better about our work. They could be release notes, documentation, hopes, dreams, or whatever: so long as it is inspired by Glean. You can find <a href=\"https:\/\/mozilla.github.io\/glean\/book\/appendix\/twig.html\">an index of all TWiG posts online.<\/a>)<\/em><\/p>\n<p><em>This is a special guest post by non-Glean-team member William Lachance!<\/em><\/p>\n<p>In the last year or so, there\u2019s been a significant shift in the way we (Data Engineering) think about application-submitted data @ Mozilla, but although we have a new application-based SDK based on these principles (<a href=\"https:\/\/mozilla.github.io\/glean\/book\/index.html\">the Glean SDK<\/a>), most of our <a href=\"https:\/\/telemetry.mozilla.org\">data tools<\/a> and <a href=\"https:\/\/docs.telemetry.mozilla.org\">documentation<\/a> have not yet been updated to reflect this new state of affairs.<\/p>\n<p>Much of this story is known <em>inside<\/em> Mozilla Data Engineering, but I thought it might be worth jotting them down into a blog post as a point of reference for people outside the immediate team. 
Knowing this may provide some context for some of our activities and efforts over the next year or two, at least until our tools, documentation, and tribal knowledge evolve.<\/p>\n<p>In sum, the key differences are:<\/p>\n<ul>\n<li>Instead of just one application we care about, there are many.<\/li>\n<li>Instead of just caring about (mostly<sup><a href=\"https:\/\/wlach.github.io\/blog\/2020\/07\/mozilla-telemetry-in-2020-from-just-firefox-to-a-galaxy-of-data\/#2020-07-16-mozilla-telemetry-in-2020-footnote-1-definition\" name=\"2020-07-16-mozilla-telemetry-in-2020-footnote-1-return\">1<\/a><\/sup>) one type of ping (the Firefox <em>main<\/em> ping), an individual application may submit <em>many different<\/em> types of pings in the course of its use.<\/li>\n<li>Instead of having both probes (histogram, scalar, or other data type) <em>and<\/em> bespoke parametric values in a JSON schema like the <a href=\"https:\/\/firefox-source-docs.mozilla.org\/toolkit\/components\/telemetry\/data\/environment.html\">telemetry environment<\/a>, there are now only <em>metric types<\/em>, which are explicitly defined as part of each ping.<\/li>\n<\/ul>\n<p>The new world is pretty exciting and freeing, but there is some new domain complexity that we need to figure out how to navigate. I\u2019ll discuss that in my last section.<\/p>\n<h2 id=\"the-old-world-firefox-is-king\">The Old World: Firefox is king<\/h2>\n<p>Up until roughly mid\u20132019, Firefox was the centre of Mozilla\u2019s data world (with the occasional nod to Firefox for Android, which uses the same source repository). 
The Data Platform (often called \u201cTelemetry\u201d) was explicitly designed to cater to the needs of Firefox Developers (and to a lesser extent, product\/program managers) and a set of bespoke tooling was built on top of our data pipeline architecture &#8211; <a href=\"https:\/\/ravitillo.wordpress.com\/2017\/01\/23\/an-overview-of-mozillas-data-pipeline\/\">this blog post from 2017 describes much of it<\/a>.<\/p>\n<p>In outline, the model is simple: on the client side, assuming a given user had not turned off Telemetry, during the course of a day\u2019s operation Firefox would keep track of various measures, called \u201cprobes\u201d. At the end of that duration, it would submit a JSON-encoded \u201cmain ping\u201d to our servers with the probe information and <a href=\"https:\/\/github.com\/mozilla-services\/mozilla-pipeline-schemas\/blob\/97bac7acaaa5cb328d7f0f7348f3ddaaae657eda\/schemas\/telemetry\/main\/main.4.schema.json\">a bunch of other mostly hand-specified junk<\/a>, which would then find its way to a \u201cdata lake\u201d (read: an Amazon S3 bucket). On top of this, we provided a <a href=\"https:\/\/github.com\/mozilla\/python_moztelemetry\/\">python API<\/a> (built on top of <a href=\"https:\/\/spark.apache.org\/docs\/latest\/api\/python\/index.html\">PySpark<\/a>) which enabled people inside Mozilla to query all submitted pings across our usage population.<\/p>\n<p>The only type of low-level object that was hard to keep track of was the list of probes: Firefox is a complex piece of software and there are <em>many<\/em> aspects of it we wanted to instrument to validate performance and quality of the product &#8211; especially on the more-experimental Nightly and Beta channels. 
To solve this problem, a <a href=\"https:\/\/probes.telemetry.mozilla.org\/\">probe dictionary<\/a> was created to help developers find measures that corresponded to the product area that they were interested in.<\/p>\n<p>At a higher level, accessing this type of data using the python API quickly became slow and frustrating: the aggregation of years of Firefox ping data was hundreds of terabytes in size, and even taking advantage of PySpark\u2019s impressive capabilities, querying the data across any reasonably large timescale was slow and expensive. Here, the solution was to create derived datasets which enabled fast(er) access to pings and other derived measures, document them on docs.telemetry.mozilla.org, and then allow access to them through tools like <a href=\"https:\/\/docs.telemetry.mozilla.org\/tools\/stmo.html\">sql.telemetry.mozilla.org<\/a> or the <a href=\"https:\/\/telemetry.mozilla.org\/new-pipeline\/dist.html\">Measurement Dashboard<\/a>.<\/p>\n<h2 id=\"the-new-world-more-of-everything\">The New World: More of everything<\/h2>\n<p>Even in the old world, other products that submitted telemetry <em>existed<\/em> (e.g. Firefox for Android, Firefox for iOS, the venerable FirefoxOS) but I would not call them first-class citizens. Most of our documentation treated them as (at best) weird edge cases. At the time of this writing, you can see this distinction clearly on docs.telemetry.mozilla.org where there is one (fairly detailed) tutorial called \u201cChoosing a Desktop Dataset\u201d while essentially all other products are lumped into \u201cChoosing a Mobile Dataset\u201d.<\/p>\n<div class=\"figure\"><img decoding=\"async\" src=\"https:\/\/wlach.github.io\/files\/2020\/07\/docs-tmo-pic.png\" alt=\"\" \/><\/div>\n<p>While the new universe of mobile products is probably the most notable addition to our list of things we want to keep track of, it\u2019s only one piece of the puzzle. 
Really we\u2019re interested in measuring <em>all the things<\/em> (in accordance with our <a href=\"https:\/\/www.mozilla.org\/en-US\/about\/policy\/lean-data\/\">lean data practices<\/a>, of course) including tools we use to <em>build our products<\/em> like <a href=\"https:\/\/wiki.mozilla.org\/MozPhab\">mozphab<\/a> and <a href=\"https:\/\/mozilla.github.io\/mozregression\">mozregression<\/a>.<\/p>\n<p>In expanding our scope, we\u2019ve found that mobile (and other products) have different requirements that influence what data we would want to send and when. For example, sending one blob of JSON multiple times per day might make sense for performance metrics on a desktop product (which is usually on a fast, unmetered network) but is much less acceptable on mobile (where every byte counts). For this reason, it makes sense to have <em>different ping types<\/em> for the same product, not just one. For example, Fenix (the new Firefox for Android) sends a tiny baseline ping<sup><a href=\"https:\/\/wlach.github.io\/blog\/2020\/07\/mozilla-telemetry-in-2020-from-just-firefox-to-a-galaxy-of-data\/#2020-07-16-mozilla-telemetry-in-2020-footnote-2-definition\" name=\"2020-07-16-mozilla-telemetry-in-2020-footnote-2-return\">2<\/a><\/sup> on every run to (roughly) measure daily active users and a larger metrics ping sent on a (roughly) daily interval to measure (for example) a distribution of page load times.<\/p>\n<p>Finally, we found that naively collecting certain types of data as raw histograms or inside the schema didn\u2019t always work well. For example, encoding session lengths as plain integers <a href=\"https:\/\/bugzilla.mozilla.org\/show_bug.cgi?id=1514392\">would often produce weird results in the case of clock skew<\/a>. For this reason, we decided to <a href=\"https:\/\/mozilla.github.io\/glean\/book\/user\/metrics\/index.html\">standardize on a set of well-defined metrics<\/a> using Glean, which tries to minimize footguns. 
We explicitly no longer allow clients to submit arbitrary JSON or values as part of a telemetry ping: if you have a use case not covered by the existing metrics, <a href=\"https:\/\/wiki.mozilla.org\/Glean\/Adding_or_changing_Glean_metric_types\">make a case for it and add it to the list<\/a>!<\/p>\n<p>To illustrate this, let\u2019s take a (subset) of what we might be looking at in terms of what the Fenix application sends:<\/p>\n<div class=\"figure\"><img decoding=\"async\" src=\"https:\/\/wlach.github.io\/files\/2020\/07\/fenix-pings-diagram.png\" alt=\"\" \/><\/div>\n<p><a href=\"https:\/\/wlach.github.io\/files\/2020\/07\/fenix-pings-diagram.mmd\">mermaid source<\/a><\/p>\n<p>At the top level we segment based on the \u201capplication\u201d (just Fenix in this example). Just below that, there are the pings that this application might submit (I listed three: the baseline and metrics pings described above, along with a \u201cmigration\u201d ping, which tracks metrics when a user migrates from Fennec to Fenix). 
And below <em>that<\/em> there are different types of metrics included in the pings: I listed a few that came out of a quick scan of the Fenix BigQuery tables using my <a href=\"https:\/\/mozilla-schema-dictionary.netlify.app\/#!\/tables\/org_mozilla_fenix.metrics\">prototype schema dictionary<\/a>.<\/p>\n<p>This is actually only the surface-level: at the time of this writing, Fenix has no fewer than 12 different ping types and <em>many<\/em> different metrics inside each of them.<sup><a href=\"https:\/\/wlach.github.io\/blog\/2020\/07\/mozilla-telemetry-in-2020-from-just-firefox-to-a-galaxy-of-data\/#2020-07-16-mozilla-telemetry-in-2020-footnote-3-definition\" name=\"2020-07-16-mozilla-telemetry-in-2020-footnote-3-return\">3<\/a><\/sup> On a client level, the new Glean SDK provides easy-to-use primitives to help developers collect this type of information in a principled, privacy-preserving way: for example, <a href=\"https:\/\/github.com\/mozilla\/data-review\">data review<\/a> is built into every metric type. But what about after it hits our ingestion endpoints?<\/p>\n<p>Hand-crafting schemas, data ingestion pipelines, and individualized ETL scripts for such a large matrix of applications, ping types, and measurements would quickly become intractable. Instead, we (Mozilla Data Engineering) refactored our data pipeline to parse out the information from the Glean schemas and then create tables in our BigQuery datastore corresponding to what\u2019s in them &#8211; this has proceeded as an extension to our (now somewhat misnamed) <a href=\"https:\/\/github.com\/mozilla\/probe-scraper\">probe-scraper<\/a> tool.<\/p>\n<p>You can then query this data directly (see <a href=\"https:\/\/docs.telemetry.mozilla.org\/concepts\/glean\/accessing_glean_data.html\">accessing glean data<\/a>) or build up a derived dataset using our SQL-based ETL system, <a href=\"https:\/\/github.com\/mozilla\/bigquery-etl\/\">BigQuery-ETL<\/a>. 
This part of the equation has been working fairly well, I\u2019d say: we now have a diverse set of products producing Glean telemetry and submitting it to our servers, and the amount of manual effort required to add each application was minimal (aside from adding new capabilities to the platform as we went along).<\/p>\n<p>What hasn\u2019t quite kept pace is our tooling to make navigating and using this new collection of data tractable.<\/p>\n<h2 id=\"what-could-bring-this-all-together\">What could bring this all together?<\/h2>\n<p>As mentioned before, this new world is quite powerful and gives Mozilla a bunch of new capabilities, but it isn\u2019t yet well documented and we lack the tools to easily connect the dots from \u201cI have a product question\u201d to \u201cI know how to write an SQL query \/ Spark job to answer it\u201d or (better yet) \u201cthis product dashboard will answer it\u201d.<\/p>\n<p>Up until now, our de facto answer has been some combination of \u201cUse the probe dictionary \/ telemetry.mozilla.org\u201d and\/or \u201crefer to docs.telemetry.mozilla.org\u201d. I submit that we\u2019re at the point where these approaches break down: as mentioned above, there are many more types of data we now need to care about than just \u201cprobes\u201d (or \u201cmetrics\u201d, in Glean parlance). When we just cared about the main ping, we could write dataset documentation for its recommended access point (<a href=\"https:\/\/docs.telemetry.mozilla.org\/datasets\/batch_view\/main_summary\/reference.html\">main_summary<\/a>) and the raw number of derived datasets was manageable. 
But in this new world, where we have <em>N<\/em> applications times <em>M<\/em> ping types, the number of canonical ping tables is now so large that documenting them all on docs.telemetry.mozilla.org no longer makes sense.<\/p>\n<p>A few months ago, I thought that <a href=\"https:\/\/cloud.google.com\/data-catalog\">Google\u2019s Data Catalog<\/a> (billed as offering \u201ca unified view of all your datasets\u201d) might provide a solution, but on further examination it only solves part of the problem: it provides only a view of your BigQuery tables and it isn\u2019t designed to provide detailed information on the domain objects we care about (products, pings, measures, and tools). You can map some of the properties from these objects onto the tables (e.g. adding a probe\u2019s description field to the column representing it in the BigQuery table), but Data Catalog\u2019s interface for surfacing and filtering this information is rather slow and clumsy and requires detailed knowledge of how these higher-level concepts relate to BigQuery primitives.<\/p>\n<p>Instead, what I think we need is a <em>new system<\/em> which allows a data practitioner (Data Scientist, Firefox Engineer, Data Engineer, Product Manager, whoever) to visualize the set of domain objects relevant to their product\/feature of interest <em>quickly<\/em>, then map them to specific BigQuery tables and other resources (e.g. visualizations using tools like <a href=\"https:\/\/github.com\/mozilla\/glam\">GLAM<\/a>) which allow people to quickly answer questions so we can make better products. 
Basically, I am thinking of some combination of:<\/p>\n<ul>\n<li>The existing probe dictionary (derived from existing product metadata)<\/li>\n<li>A new \u201capplication\u201d dictionary (derived from some simple to-be-defined application metadata description)<\/li>\n<li>A new \u201cping\u201d dictionary (derived from existing product metadata)<\/li>\n<li>A BigQuery schema dictionary (I wrote up a <a href=\"https:\/\/mozilla-schema-dictionary.netlify.app\/\">prototype of this a couple weeks ago<\/a>) to map between these higher-level objects and what\u2019s in our low-level data store<\/li>\n<li>Documentation for derived datasets generated by BigQuery-ETL (ideally stored alongside the ETL code itself, so it\u2019s easy to keep up to date)<\/li>\n<li>A data tool dictionary describing how to easily <em>access<\/em> the above data in various ways (e.g. SQL query, dashboard plot, etc.)<\/li>\n<\/ul>\n<p>This might sound ambitious, but it\u2019s basically just a system for collecting and visualizing various types of documentation\u2014 something we have proven we know how to do. 
And I think a product like this could be incredibly empowering, not only for the internal audience at Mozilla but also for the <em>external<\/em> audience who wants to support us but has valid concerns about what we\u2019re collecting and why: since this system is based entirely on systems which are already open (inside GitHub or Mercurial repositories), there is no reason we can\u2019t make it available to the public.<\/p>\n<div class=\"footnotes\">\n<ol>\n<li id=\"2020-07-16-mozilla-telemetry-in-2020-footnote-1-definition\" class=\"footnote-definition\">Technically, <a href=\"https:\/\/docs.telemetry.mozilla.org\/datasets\/pings.html\">there are various other types of pings<\/a> submitted by Firefox, but the main ping is the one 99% of people care about.\u00a0<a href=\"https:\/\/wlach.github.io\/blog\/2020\/07\/mozilla-telemetry-in-2020-from-just-firefox-to-a-galaxy-of-data\/#2020-07-16-mozilla-telemetry-in-2020-footnote-1-return\">\u21a9<\/a><\/li>\n<li id=\"2020-07-16-mozilla-telemetry-in-2020-footnote-2-definition\" class=\"footnote-definition\">This is actually a capability that the Glean SDK provides, so other products (e.g. Lockwise, Firefox for iOS) also benefit from it.\u00a0<a href=\"https:\/\/wlach.github.io\/blog\/2020\/07\/mozilla-telemetry-in-2020-from-just-firefox-to-a-galaxy-of-data\/#2020-07-16-mozilla-telemetry-in-2020-footnote-2-return\">\u21a9<\/a><\/li>\n<li id=\"2020-07-16-mozilla-telemetry-in-2020-footnote-3-definition\" class=\"footnote-definition\">The scope of this data collection comes from the fact that Fenix is a <em>very<\/em> large and complex application, 
rather than a desire to collect everything just because we can\u2014 smaller efforts like mozregression collect a <a href=\"https:\/\/mozilla.github.io\/mozregression\/documentation\/telemetry.html\">much more limited set of data<\/a>.\u00a0<a href=\"https:\/\/wlach.github.io\/blog\/2020\/07\/mozilla-telemetry-in-2020-from-just-firefox-to-a-galaxy-of-data\/#2020-07-16-mozilla-telemetry-in-2020-footnote-3-return\">\u21a9<\/a><\/li>\n<\/ol>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>(\u201cThis Week in Glean\u201d is a series of blog posts that the Glean Team at Mozilla is using to try to communicate better about our work. They could be release &hellip; <a class=\"go\" href=\"https:\/\/blog.mozilla.org\/data\/2020\/07\/16\/mozilla-telemetry-in-2020-from-just-firefox-to-a-galaxy-of-data\/\">Read more<\/a><\/p>\n","protected":false},"author":1528,"featured_media":197,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[315988,448297],"tags":[525,448297,3969],"coauthors":[],"_links":{"self":[{"href":"https:\/\/blog.mozilla.org\/data\/wp-json\/wp\/v2\/posts\/205"}],"collection":[{"href":"https:\/\/blog.mozilla.org\/data\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.mozilla.org\/data\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.mozilla.org\/data\/wp-json\/wp\/v2\/users\/1528"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.mozilla.org\/data\/wp-json\/wp\/v2\/comments?post=205"}],"version-history":[{"count":0,"href":"https:\/\/blog.mozilla.org\/data\/wp-json\/wp\/v2\/posts\/205\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/blog.mozilla.org\/data\/wp-json\/wp\/v2\/media\/197"}],"wp:attachment":[{"href":"https:\/\/blog.mozilla.org\/data\/wp-json\/wp\/v2\/media?parent=205"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.mozilla.org\/data\/wp-json\/wp\/v2\/categories?po
st=205"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.mozilla.org\/data\/wp-json\/wp\/v2\/tags?post=205"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/blog.mozilla.org\/data\/wp-json\/wp\/v2\/coauthors?post=205"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}