{"id":426,"date":"2022-10-27T14:02:16","date_gmt":"2022-10-27T14:02:16","guid":{"rendered":"https:\/\/blog.mozilla.org\/data\/?p=426"},"modified":"2022-10-27T14:02:16","modified_gmt":"2022-10-27T14:02:16","slug":"this-week-in-glean-page-load-data-three-ways-or-how-expensive-are-events","status":"publish","type":"post","link":"https:\/\/blog.mozilla.org\/data\/2022\/10\/27\/this-week-in-glean-page-load-data-three-ways-or-how-expensive-are-events\/","title":{"rendered":"This Week in Glean: Page Load Data, Three Ways (Or, How Expensive Are\u00a0Events?)"},"content":{"rendered":"<p>(\u201cThis Week in Glean\u201d is a series of blog posts that the Glean Team at Mozilla is using to try to communicate better about our work. They could be release notes, documentation, hopes, dreams, or whatever: so long as it is inspired by Glean. All \u201cThis Week in Glean\u201d blog posts are listed in the <a href=\"https:\/\/mozilla.github.io\/glean\/book\/appendix\/twig.html\">TWiG index<\/a>).<\/p>\n<p>At Mozilla we make, <a href=\"https:\/\/getpocket.com\/\">among<\/a> <a href=\"https:\/\/www.mozilla.org\/vpn\">other<\/a> <a href=\"https:\/\/relay.firefox.com\/\">things<\/a>, Web Browsers which we tend to call <a href=\"https:\/\/getfirefox.com\/\">Firefox<\/a>. The central activity in a Web Browser like Firefox is loading a web page. It gets done a lot by each and every one of our users, and so you can imagine that data about pageloads is of important business interest to us.<\/p>\n<p>But exactly because this is done a lot and by every one of our users, this inspires concerns of scale and cost. How much does it cost us to learn more about pageloads?[0]<\/p>\n<p>As with all things in Data, the answer is the same: \u201cWell, it depends.\u201d<\/p>\n<p>In this case it depends on how you record the data. How you record the data depends on what questions you hope to answer with it. 
We\u2019re going to stick to the simplest of questions to make this (highly-suspect) comparison even remotely comparable.<\/p>\n<h2>Option 1: Just the Counts, Ma\u2019am<\/h2>\n<p>I say page loads are done a lot, but how much is \u201ca lot\u201d? If that\u2019s our only question, maybe the data we need is simply a count of pageloads. Glean already has <a href=\"https:\/\/mozilla.github.io\/glean\/book\/reference\/metrics\/counter.html\">a metric type for counting things<\/a>, so it should be fairly <a href=\"https:\/\/dictionary.telemetry.mozilla.org\/apps\/firefox_desktop\/metrics\/browser_engagement_uri_count\">quick to implement<\/a>.<\/p>\n<p>This should be cheap, right? Just a single number? Well, it depends.<\/p>\n<h3>Scale 1: Frequency<\/h3>\n<p>The count of pageloads is just a single number. One, maybe as many as eight, bytes to record, store, transmit, retain, and analyze. But Firefox has to report it more than once, so we need to first scale our cost of \u201cone, maybe as many as eight, bytes\u201d by the number of times we send this information.<\/p>\n<p>When we first implemented Firefox\u2019s pageload count in Glean, I wanted to send it on the builtin \u201cmetrics\u201d ping which is sent once a day from anyone running Firefox that day[1]. In an effort to gain more complete and timely data, we ended up adding it to the builtin \u201cbaseline\u201d ping which is sent (on average for Firefox Desktop) 8 or more times per day.<\/p>\n<p>For our frequency scale we thus use 8\/day.<\/p>\n<h3>Scale 2: Population<\/h3>\n<p>These 8 recordings per day are sent by about <a href=\"https:\/\/data.firefox.com\/dashboard\/user-activity\">200M users over a month<\/a>. 
Days and months aren\u2019t easy to scale between as not all users use Firefox every day, and our population gains new users and loses old users at variable rates\u2026 so I recalculated the Frequency scale to be in terms of months and found that we get 68 pings per month from these roughly 200M users.<\/p>\n<p>So the cost is pretty easy to calculate then? Whatever the cost is of storing and transmitting 200M x 68\/month x eight bytes ~= 109 GB?<\/p>\n<p>Not entirely. But until and unless those other costs are not comparable between options, we can just treat them as noise. This cost, rendered in the size of the data, of about 109GB? It\u2019ll do.<\/p>\n<h2>Option 2: What an Event<\/h2>\n<p>Page loads are interesting not just in how many of them there are, but also about what type of load they are and how long the load took. The order of a page load in between other events might also be of interest: did it happen before or after <a href=\"https:\/\/blog.mozilla.org\/data\/2022\/03\/09\/mozilla-opens-access-to-dataset-on-network-outages\/\">some network trouble<\/a>? Did a bunch of pageloads happen all at once, or spread across the day? We might wish to <a href=\"https:\/\/dictionary.telemetry.mozilla.org\/apps\/firefox_desktop\/metrics\/perf_page_load\">instrument page loads as Glean events<\/a>.<\/p>\n<p>Events are each more expensive than a count. They carry a timestamp (eight bytes) and repeat their names each time they\u2019re recorded (some strings, say fifteen bytes).<\/p>\n<p>(We are not counting the load type or how long the load took in our calculations of the size of an individual sample as we\u2019re still trying to compare methods of answering the same \u201cHow many page loads are there?\u201d question.)<\/p>\n<h3>Scale 3: Page Loads<\/h3>\n<p>\u201cEach time they\u2019re recorded\u201d, huh. Guess that means we get to multiply by the number of page loads. Each Firefox Desktop user, over the course of a month, loads on average 1190 pages[2]. 
This means instead of sending 68 numbers a month, we\u2019re sending 1190 batches of strings a month.<\/p>\n<p>So the comparable cost is whatever the cost is of storing and transmitting 200M x (eight bytes and fifteen bytes) x 1190 ~= 5.47TB.<\/p>\n<p>We\u2019ve jumped an order of magnitude here. And we\u2019re not done.<\/p>\n<h2>Option 3: Custom Pings, and Custom Pings Only<\/h2>\n<p>What if the context we wish to record alongside the event of a page load cannot fit inside Glean\u2019s <a href=\"https:\/\/mozilla.github.io\/glean\/book\/reference\/metrics\/event.html#limits\">prudent \u201cevent\u201d metric type limits<\/a>? What if the collected pageload data would benefit from a retention limit or access control list different from other counts or events? What if you want to submit this data to be uploaded as soon as it has been recorded? In that case, we could send a pageload as a <a href=\"https:\/\/mozilla.github.io\/glean\/book\/user\/pings\/custom.html\">Glean custom ping<\/a>.<\/p>\n<p>We\u2019ve not (yet) done this in Firefox Desktop (at least partially because it complicates ordering amongst other events: the Glean SDK expends a lot of effort to ensure the timestamps between events are reliable. Ping times are client times which are subject to the whims of the user.), so I\u2019m going to get even <a href=\"https:\/\/en.wikipedia.org\/wiki\/Hand-waving\">hand-wavier<\/a> than before as I try to determine how large each individual data sample will be.<\/p>\n<p>A Glean custom ping without any metrics in it comes to around 500 bytes[3]. 
When our data platform ingests the ping and turns it into a row in a dataset, we add <a href=\"https:\/\/docs.telemetry.mozilla.org\/datasets\/pings.html#ping-metadata\">some metadata<\/a> which adds another 300 bytes or so (which only affects storage inside the Data Platform and doesn\u2019t add costs to client storage or client bandwidth).<\/p>\n<p>We could go deeper and cost out the network headers, the costs of using TLS to ensure the integrity of the connection\u2026 but we\u2019d be here all day. So I\u2019m gonna call that 200 bytes to make it a nice round 1000 bytes per ping.<\/p>\n<p>We\u2019re sending these pings per pageload, so the cost is whatever the cost is of storing and transmitting 200M x 1190 x 1000 bytes = 238TB.<\/p>\n<h2>Rule of Thumb: 50x<\/h2>\n<p>There you have it: for each step up the cost ladder you\u2019re adding an extra 50x multiplier to the cost of storing and transmitting the data. The reality\u2019s actually much worse if it\u2019s harder to analyze and reason about the data as it gets more complex (which it in most cases is) because, as you might remember from <a href=\"https:\/\/blog.mozilla.org\/data\/2020\/04\/15\/this-week-in-glean-how-much-does-that-data-cost\/\">one of my previous explorations in costing out metrics<\/a>: it\u2019s the human costs of things (like analysis) that really getcha.<\/p>\n<p>But you have to balance it out. 
If adding more context and information ensures your analyses only have to look in one place for their data instead of trying to tie together loosely-coupled concepts from multiple locations\u2026 if using a custom ping ensures you have everything you need and don\u2019t have to form a committee to resource an engineer to add implementation which needs to be deployed and individually validated\u2026 if you\u2019re willing to bet 50x or 250x the cost on getting it right the first time, then that could be a good price to pay.<\/p>\n<p>But is this the case for you and your data?<\/p>\n<p>Well, it depends.<\/p>\n<p>:chutten<\/p>\n<p>[0]: Avid readers of this blog may notice that this isn\u2019t <a href=\"https:\/\/blog.mozilla.org\/data\/2020\/04\/15\/this-week-in-glean-how-much-does-that-data-cost\/\">the first time I\u2019ve written on the costs of data<\/a>. And it likely won\u2019t be the last!<\/p>\n<p>[1]: How often a \u201cmetrics\u201d ping is sent is a <a href=\"https:\/\/mozilla.github.io\/glean\/book\/user\/pings\/metrics.html#scheduling\">little more complicated<\/a> than \u201conce a day\u201d, but it averages out to about that much so I\u2019m sticking with it <a href=\"https:\/\/en.wikipedia.org\/wiki\/Back-of-the-envelope_calculation\">for this napkin<\/a>.<\/p>\n<p>[2]: Yes there are some wild and wacky outliers included in the figure \u201can average of 1190 page loads\u201d that I\u2019m not bothering to clean up. 
You can <a href=\"https:\/\/knowyourmeme.com\/memes\/spiders-georg\">Page Loads Georg<\/a> to your hearts\u2019 content.<\/p>\n<p>[3]: This is about how many characters the JSON-encoded ping payload comes to, uncompressed.<\/p>\n<p>(This post is a syndicated copy of <a href=\"https:\/\/chuttenblog.wordpress.com\/2022\/10\/27\/this-week-in-glean-page-load-data-three-ways-or-how-expensive-are-events\/\">the original<\/a>.)<\/p>\n","protected":false},"excerpt":{"rendered":"<p>(\u201cThis Week in Glean\u201d is a series of blog posts that the Glean Team at Mozilla is using to try to communicate better about our work. They could be release &hellip; <a class=\"go\" href=\"https:\/\/blog.mozilla.org\/data\/2022\/10\/27\/this-week-in-glean-page-load-data-three-ways-or-how-expensive-are-events\/\">Read more<\/a><\/p>\n","protected":false},"author":1437,"featured_media":422,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[],"tags":[525,30,448297,448330],"coauthors":[311808],"_links":{"self":[{"href":"https:\/\/blog.mozilla.org\/data\/wp-json\/wp\/v2\/posts\/426"}],"collection":[{"href":"https:\/\/blog.mozilla.org\/data\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.mozilla.org\/data\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.mozilla.org\/data\/wp-json\/wp\/v2\/users\/1437"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.mozilla.org\/data\/wp-json\/wp\/v2\/comments?post=426"}],"version-history":[{"count":0,"href":"https:\/\/blog.mozilla.org\/data\/wp-json\/wp\/v2\/posts\/426\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/blog.mozilla.org\/data\/wp-json\/wp\/v2\/media\/422"}],"wp:attachment":[{"href":"https:\/\/blog.mozilla.org\/data\/wp-json\/wp\/v2\/media?parent=426"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.mozilla.org\/data\/wp-json\/wp\/v2\/categories?post=426"},{"taxo
nomy":"post_tag","embeddable":true,"href":"https:\/\/blog.mozilla.org\/data\/wp-json\/wp\/v2\/tags?post=426"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/blog.mozilla.org\/data\/wp-json\/wp\/v2\/coauthors?post=426"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}