{"id":325,"date":"2021-08-23T17:45:59","date_gmt":"2021-08-23T17:45:59","guid":{"rendered":"https:\/\/blog.mozilla.org\/data\/?p=325"},"modified":"2021-08-23T17:45:59","modified_gmt":"2021-08-23T17:45:59","slug":"this-week-in-glean-why-choosing-the-right-data-type-for-your-metric-matters","status":"publish","type":"post","link":"https:\/\/blog.mozilla.org\/data\/2021\/08\/23\/this-week-in-glean-why-choosing-the-right-data-type-for-your-metric-matters\/","title":{"rendered":"This Week in Glean: Why choosing the right data type for your metric matters"},"content":{"rendered":"<p>(\u201cThis Week in Glean\u201d is a series of blog posts that the Glean Team at Mozilla is using to try to communicate better about our work. They could be release notes, documentation, hopes, dreams, or whatever: so long as it is inspired by Glean. You can find an <a href=\"https:\/\/mozilla.github.io\/glean\/book\/appendix\/twig.html\">index of all TWiG posts online<\/a>.)<\/p>\n<p>One of my favorite tasks that comes up in my day to day adventure at Mozilla is a chance to work with the data collected by this amazing <a href=\"https:\/\/github.com\/mozilla\/glean\/\">Glean<\/a> thing my team has developed. This chance often arises when an engineer needs to verify something, or a product manager needs a quick question answered. I am not a data scientist (and I always include that caveat when I provide a peek into the data), but I do understand how the data is collected, ingested, and organized and I can often guide people to the correct tools and techniques to find what they are looking for.<\/p>\n<p>In this regard, I often encounter challenges in trying to read or analyze data that is related to another common task I find myself doing: advising engineering teams on how we intend Glean to be used and what metric types would best suit their needs. A recent example of this was a quick Q&amp;A for a group of mobile engineers who all had similar questions. 
My teammate <a href=\"https:\/\/blog.mozilla.org\/data\/author\/chuttenmozilla-com\/\">chutten<\/a> and I were asked to explain the differences between <a href=\"https:\/\/mozilla.github.io\/glean\/book\/reference\/metrics\/counter.html\">Counter Metrics<\/a> and <a href=\"https:\/\/mozilla.github.io\/glean\/book\/reference\/metrics\/event.html\">Event Metrics<\/a>, and to help them understand the situations where each is the most appropriate to use. It was a great session and I felt like the group came away with some deeper understanding of <a href=\"https:\/\/docs.telemetry.mozilla.org\/concepts\/glean\/glean.html?highlight=glean#the-glean-design-principles\">the Glean principles<\/a>. But, thinking about it afterwards, I realized that we do a lot of hand-wavy things when explaining why not to do things. Even in our documentation, we aren\u2019t very specific about the overhead of things like Event Metrics. For example, from the Glean documentation section regarding \u201c<a href=\"https:\/\/mozilla.github.io\/glean\/book\/user\/metrics\/adding-new-metrics.html#choosing-a-metric-type\">Choosing a Metric Type<\/a>\u201d in a warning about events:<\/p>\n<p>\u201cImportant: events are the most expensive metric type to record, transmit, store and analyze, so they should be used sparingly, and only when none of the other metric types are sufficient for answering your question.\u201d<\/p>\n<p>This is sufficiently scary to make me think twice about using events! But what exactly do we mean by \u201cthey are the most expensive\u201d? What about recording, transmitting, storing, and analyzing makes them \u201cexpensive\u201d? Well, that\u2019s what I hope to dive into a little deeper with some real numbers and examples, rather than using scary hand-wavy words like \u201cexpensive\u201d and \u201cshould be used sparingly\u201d. I\u2019ll mostly be focusing on events here, since they contain the \u201cscariest\u201d warning. 
So, without further ado, let\u2019s take a look at some real comparisons between metric types, and what challenges someone looking at that data may encounter when trying to answer questions about it or with it.<\/p>\n<p>Our claim is that events are expensive to record, store and transmit; so let\u2019s start by examining that a little more closely. The primary API surface for the <a href=\"https:\/\/mozilla.github.io\/glean\/book\/reference\/metrics\/event.html\">Event Metric Type<\/a> in Glean is the <a href=\"https:\/\/mozilla.github.io\/glean\/book\/reference\/metrics\/event.html#recordobject\"><code>record()<\/code><\/a> function. This function also takes an optional collection of \u201cextra\u201d information in a key-value shape, which is supposed to be used to record additional state that is important to the event. The \u201cextras\u201d, along with the category, name, and (relative) timestamp, make up the data that gets recorded, stored, and eventually transmitted to the ingestion pipeline for storage in the data warehouse.<\/p>\n<p>Since Glean is built with <a href=\"https:\/\/www.rust-lang.org\/\">Rust<\/a> and then provides SDKs in various target languages, one of the first things we have to do is serialize the data from the shiny target language object that Glean generates into something we can pass into the Rust that is at the heart of Glean. It is worth noting that the Glean JavaScript SDK does this a little differently, but the same ideas about events should apply. A similar structure is used to store the data and then transmit it to the telemetry endpoint when the <a href=\"https:\/\/mozilla.github.io\/glean\/book\/user\/pings\/events.html\">Events Ping<\/a> is assembled. 
A real-world example of this serialized event, coming from <a href=\"https:\/\/github.com\/mozilla-mobile\/fenix\/blob\/693fbef88d676e600d92f2ca592889af4aa2a96b\/app\/metrics.yaml#L57\">Fenix\u2019s \u201cEntered URL\u201d event<\/a>, would look like this JSON:<\/p>\n<p><code>{<br \/>\n\"category\": \"events\",<br \/>\n\"extra\": {<br \/>\n\"autocomplete\": \"false\"<br \/>\n},<br \/>\n\"name\": \"entered_url\",<br \/>\n\"timestamp\": 33191<br \/>\n}<\/code><\/p>\n<p>A similar amount of data would be generated every time the metric was recorded, stored and transmitted. So, if the user entered 10 URLs, then we would record this same thing 10 times, each with a different relative timestamp. To take a quick look at how this affects using this data for analysis: if I only needed to know how many users interacted with this feature and how often, I would have to count each event with this category and name for every user. To complicate the analysis a bit further, Glean doesn\u2019t transmit events one at a time; it collects all events during a \u201csession\u201d (or until it hits 500 recorded events) and <a href=\"https:\/\/mozilla.github.io\/glean\/book\/user\/pings\/events.html#contents\">transmits them as an array within an Event Ping<\/a>. This Event Ping then becomes a single row in the data, and nested in a column we find the array of events. In order to even count the events, I would need to \u201cunnest\u201d them and flatten out the data. This involves cross joining each event in the array back to the parent ping record in order to even get at the category, name, timestamp and extras. We end up with some SQL that looks like this (WARNING: this is just an example. 
Don\u2019t try this; it could be expensive, and it shouldn\u2019t work because I left out the filter on the submission date):<\/p>\n<p><code>SELECT *<br \/>\nFROM fenix<br \/>\nCROSS JOIN UNNEST (events) AS event<\/code><\/p>\n<p>For an average day in <a href=\"https:\/\/github.com\/mozilla-mobile\/fenix\">Fenix<\/a> we see 75-80 million Event Pings from clients on our release version, with an average of a little over 8 events per ping. That adds up to over 600 million events per day, and just for Fenix! So when we do this little bit of SQL flattening of the data structure, we end up manipulating over half a billion records for a single day, and that adds up really quickly if you start looking at more than one day at a time. This can take a lot of computer horsepower, both in processing the query and in trying to display the results in some visual representation. Now that I have the events flattened out, I can finally filter for the category and name of the event I am looking for and count how many of that specific event are present. 
Using the <a href=\"https:\/\/github.com\/mozilla-mobile\/fenix\/blob\/693fbef88d676e600d92f2ca592889af4aa2a96b\/app\/metrics.yaml#L57\">Fenix event \u201centered_url\u201d<\/a> from above, I end up with something like this to count the number of clients and events:<\/p>\n<p><code>SELECT<br \/>\nCOUNT(DISTINCT client_info.client_id) AS client_count,<br \/>\nCOUNT(*) AS event_count,<br \/>\nDATE(submission_timestamp) AS event_date<br \/>\nFROM<br \/>\nfenix.events<br \/>\nCROSS JOIN<br \/>\nUNNEST(events.events) AS event  -- Yup, events.events, naming=hard<br \/>\nWHERE<br \/>\nsubmission_timestamp &gt;= '2021-08-12'<br \/>\nAND event.category = 'events'<br \/>\nAND event.name = 'entered_url'<br \/>\nGROUP BY<br \/>\nevent_date<br \/>\nORDER BY<br \/>\nevent_date<\/code><\/p>\n<p>Our query engine is pretty good: this only takes about 8 seconds to process, and it has narrowed down the data it needs to scan to a paltry 150 GB, but this is a very simple analysis of the data involved. I didn\u2019t even dig into the \u201cextra\u201d information, which would require yet another level of flattening through UNNESTing the \u201cextras\u201d array in which it is stored within each individual event.<\/p>\n<p>As you can see, this explodes pretty quickly into some big datasets for just counting things. Don\u2019t get me wrong, this is all very useful if you need to know the sequence of events that led the client to enter a URL; that\u2019s what events are for, after all. To be fair, our lovely Data Engineering folks have taken the time and trouble to create views where these events are already unnested, and so I could have avoided doing it manually and instead used the automatically flattened dataset. 
I wanted to better illustrate the additional complexity that goes on downstream from events and working with the \u201craw\u201d data seemed the best way to do this.<\/p>\n<p>If we really just need to know how many clients interact with a feature and how often, then a much lighter weight alternative recommended by the Glean team would be a Counter Metric. To return to what the data representation of this looks like, we can look at an <a href=\"https:\/\/github.com\/mozilla\/glean\/blob\/795a5d1827de0e8918806bfc377b0a410226340b\/glean-core\/metrics.yaml#L604\">internal Glean metric that counts the number of times Fenix enters the foreground<\/a> per day (since the metrics ping is sent once per day). It looks like this:<\/p>\n<p><code>\"counter\": {<br \/>\n\"glean.validation.foreground_count\": 1<br \/>\n}<\/code><\/p>\n<p>No matter how many times we <a href=\"https:\/\/mozilla.github.io\/glean\/book\/reference\/metrics\/counter.html#add\"><code>add()<\/code><\/a> to this metric, it will always take up that same amount of space right there, only the value would change. So, we don\u2019t end up with one record per event, but a single value that represents the count of the interactions. When I go to query this and find out how many clients this involved and how many times the app moved to the foreground of the device, I can do something like this in SQL (without all the UNNESTing):<\/p>\n<p><code>SELECT<br \/>\nCOUNT(DISTINCT client_info.client_id) AS client_count,<br \/>\nSUM(m.metrics.counter.glean_validation_foreground_count) AS foreground_count,<br \/>\nDATE(submission_timestamp) AS event_date<br \/>\nFROM<br \/>\norg_mozilla_firefox.metrics AS m<br \/>\nWHERE<br \/>\nsubmission_timestamp &gt;= '2021-08-12'<br \/>\nGROUP BY<br \/>\nevent_date<br \/>\nORDER BY<br \/>\nevent_date<\/code><\/p>\n<p>This runs in just under 7 seconds, but the query only has to scan about 5 GB of data instead of the 150 GB we saw with the event. 
And, for comparison, there were only about 8 million of those <code>entered_url<\/code> events per day compared to 80 million foreground occurrences per day. Even with many more incidents to count, the query that used the Counter Metric Type scanned about 1\/30th the amount of data. It is also fairly obvious which query is easier to understand. The foreground count is just a numeric counter value stored in a single row in the database along with all of the other metrics that are collected and sent on the daily metrics ping, and it ultimately results in selecting a single column value. Rather than having to unnest arrays and then count them, I can simply SUM the values stored in the column for the counter to get my result.<\/p>\n<p>Events do serve a beautiful purpose, like building an onboarding funnel to determine how well we retain users and what onboarding path results in that. We can\u2019t do that with counters because they don\u2019t have the richness to be able to show the flow of interactions through the app. Counters also serve a purpose, and can answer questions about the usage of a feature with very little overhead. I just hope that as you read this, you will consider what questions you need to answer and remember that there is probably a well-suited Glean Metric Type just for your purpose, and if there isn\u2019t, you can always request a new metric type! The <a href=\"https:\/\/mozilla.github.io\/glean\/book\/index.html#contact\">Glean Team<\/a> wants you to get the most out of your data while being true to our lean data practices, and we are always available to discuss which metric type is right for your situation if you have any questions.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>(\u201cThis Week in Glean\u201d is a series of blog posts that the Glean Team at Mozilla is using to try to communicate better about our work. 
They could be release &hellip; <a class=\"go\" href=\"https:\/\/blog.mozilla.org\/data\/2021\/08\/23\/this-week-in-glean-why-choosing-the-right-data-type-for-your-metric-matters\/\">Read more<\/a><\/p>\n","protected":false},"author":1757,"featured_media":197,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[315988,323282,448297],"tags":[],"coauthors":[],"_links":{"self":[{"href":"https:\/\/blog.mozilla.org\/data\/wp-json\/wp\/v2\/posts\/325"}],"collection":[{"href":"https:\/\/blog.mozilla.org\/data\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.mozilla.org\/data\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.mozilla.org\/data\/wp-json\/wp\/v2\/users\/1757"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.mozilla.org\/data\/wp-json\/wp\/v2\/comments?post=325"}],"version-history":[{"count":0,"href":"https:\/\/blog.mozilla.org\/data\/wp-json\/wp\/v2\/posts\/325\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/blog.mozilla.org\/data\/wp-json\/wp\/v2\/media\/197"}],"wp:attachment":[{"href":"https:\/\/blog.mozilla.org\/data\/wp-json\/wp\/v2\/media?parent=325"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.mozilla.org\/data\/wp-json\/wp\/v2\/categories?post=325"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.mozilla.org\/data\/wp-json\/wp\/v2\/tags?post=325"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/blog.mozilla.org\/data\/wp-json\/wp\/v2\/coauthors?post=325"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}