{"id":274,"date":"2021-02-24T14:44:45","date_gmt":"2021-02-24T14:44:45","guid":{"rendered":"https:\/\/blog.mozilla.org\/data\/?p=274"},"modified":"2021-04-16T13:26:58","modified_gmt":"2021-04-16T13:26:58","slug":"this-week-in-glean-boring-monitoring","status":"publish","type":"post","link":"https:\/\/blog.mozilla.org\/data\/2021\/02\/24\/this-week-in-glean-boring-monitoring\/","title":{"rendered":"This Week in Glean: Boring Monitoring"},"content":{"rendered":"<p>(\u201cThis Week in Glean\u201d is a series of blog posts that the Glean Team at Mozilla is using to try to communicate better about our work. They could be release notes, documentation, hopes, dreams, or whatever: so long as it is inspired by Glean.)<\/p>\n<p>All \u201cThis Week in Glean\u201d blog posts are listed in the <a href=\"https:\/\/mozilla.github.io\/glean\/book\/appendix\/twig.html\">TWiG index<\/a> (and on the <a href=\"https:\/\/blog.mozilla.org\/data\/category\/glean\/\">Mozilla Data blog<\/a>).<\/p>\n<hr \/>\n<p>Every Monday the Glean team has its weekly Glean SDK meeting. The meeting has two main parts: first, discussing the features and bugs the team is currently investigating or that were requested by outside stakeholders; second, bug triage &amp; monitoring of data that Glean reports in the wild.<\/p>\n<p>Most of the time, looking at our monitoring is boring, and that\u2019s a good thing.<\/p>\n<p>From the beginning the Glean SDK supported extensive <a href=\"https:\/\/mozilla.github.io\/glean\/book\/user\/error-reporting.html\">error reporting<\/a> on data collected by the framework inside end-user applications. Errors are produced when the application tries to record invalid values. That could be a negative value for a counter that should only ever go up, or stopping a timer that was never started. Sometimes this comes down to a simple bug in the code logic and should be fixed in the implementation. 
But often this is due to unexpected and surprising behavior of the application that the developers didn\u2019t anticipate. Do you know all the ways that your Android application can be started? There are a whole lot of events that can launch it, even in the background, and you might sometimes miss instrumenting all the right parts. Of course this should then also be fixed in the implementation.<\/p>\n<h2 id=\"monitoring-firefox-for-android\">Monitoring Firefox for Android<\/h2>\n<p>For our weekly monitoring we look at one application in particular: <a href=\"https:\/\/github.com\/mozilla-mobile\/fenix\/\">Firefox for Android<\/a>. Because errors are reported in the same way as other metrics, we are able to query our database, aggregate the data by specific metrics and errors, generate graphs from it and create dashboards on our instance of <a href=\"https:\/\/redash.io\/\">Redash<\/a>.<\/p>\n<figure><img decoding=\"async\" src=\"https:\/\/tmp.fnordig.de\/blog\/2021\/fenix-error-monitoring.png\" alt=\"Graph of the error counts for different metrics in Firefox for Android\" \/><figcaption aria-hidden=\"true\">Graph of the error counts for different metrics in Firefox for Android<\/figcaption><\/figure>\n<p>The above graph displays error counts for different metrics. Each line is a specific metric and error (such as <code>Invalid Value<\/code> or <code>Invalid State<\/code>). The exact numbers are not important. What we\u2019re interested in is the general trend. Are the errors per metric stable or are there sudden jumps? 
Upward jumps indicate a problem; downward jumps probably mean the underlying bug got fixed and the fix has finally rolled out in an update to users.<\/p>\n<figure><img decoding=\"async\" src=\"https:\/\/tmp.fnordig.de\/blog\/2021\/fenix-error-monitoring2.png\" alt=\"Rate of affected clients in Firefox for Android\" \/><figcaption aria-hidden=\"true\">Rate of affected clients in Firefox for Android<\/figcaption><\/figure>\n<p>We have another graph that doesn\u2019t take the raw number of errors, but averages it across the entire population. A sharp increase in error counts sometimes comes from a small number of clients, whereas the errors for others stay at the same low level. That\u2019s still a concern for us, but knowing that a potential bug is limited to a small number of clients may help with finding and fixing it. And sometimes it\u2019s really just bogus client data that we can dismiss fully.<\/p>\n<p>Most of the time these graphs stay rather flat and boring and we can quickly continue with other work. Sometimes, though, we can catch potential issues in the first days after a rollout.<\/p>\n<figure><img decoding=\"async\" src=\"https:\/\/tmp.fnordig.de\/blog\/2021\/fenix-error-monitoring3.png\" alt=\"Sudden jump upwards in errors for 2 metrics in Firefox for Android Nightly\" \/><figcaption aria-hidden=\"true\">Sudden jump upwards in errors for 2 metrics in Firefox for Android Nightly<\/figcaption><\/figure>\n<p>In this graph from the nightly release of Firefox for Android, two metrics started reporting a number of errors that\u2019s far above any other error we see. We can then quickly find the implementation of these metrics and report that to the responsible team (see the <a href=\"https:\/\/github.com\/mozilla-mobile\/fenix\/issues\/18114\">filed bug<\/a> and the <a href=\"https:\/\/github.com\/mozilla-mobile\/fenix\/pull\/18115\">remediation PR<\/a>).<\/p>\n<h2 id=\"but-cant-that-be-automated\">But can\u2019t that be automated?<\/h2>\n<p>It probably can! 
But it requires more work than throwing together a dashboard with graphs. It\u2019s also not as easy to define thresholds on these changes and when to report them. There\u2019s work underway that will hopefully let us build these dashboards more quickly for any product using the Glean SDK, which we can then also extend to automate more of the reporting. The final goal should be that the product teams themselves are responsible for monitoring their data.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>(\u201cThis Week in Glean\u201d is a series of blog posts that the Glean Team at Mozilla is using to try to communicate better about our work. They could be release &hellip; <a class=\"go\" href=\"https:\/\/blog.mozilla.org\/data\/2021\/02\/24\/this-week-in-glean-boring-monitoring\/\">Read more<\/a><\/p>\n","protected":false},"author":1756,"featured_media":197,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[448297],"tags":[],"coauthors":[],"_links":{"self":[{"href":"https:\/\/blog.mozilla.org\/data\/wp-json\/wp\/v2\/posts\/274"}],"collection":[{"href":"https:\/\/blog.mozilla.org\/data\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.mozilla.org\/data\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.mozilla.org\/data\/wp-json\/wp\/v2\/users\/1756"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.mozilla.org\/data\/wp-json\/wp\/v2\/comments?post=274"}],"version-history":[{"count":0,"href":"https:\/\/blog.mozilla.org\/data\/wp-json\/wp\/v2\/posts\/274\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/blog.mozilla.org\/data\/wp-json\/wp\/v2\/media\/197"}],"wp:attachment":[{"href":"https:\/\/blog.mozilla.org\/data\/wp-json\/wp\/v2\/media?parent=274"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.mozilla.org\/data\/wp-json\/wp\/v2\/categories?post=274"},{"taxonomy":"post_tag","emb
eddable":true,"href":"https:\/\/blog.mozilla.org\/data\/wp-json\/wp\/v2\/tags?post=274"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/blog.mozilla.org\/data\/wp-json\/wp\/v2\/coauthors?post=274"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}