{"id":5286,"date":"2011-12-13T12:25:49","date_gmt":"2011-12-13T19:25:49","guid":{"rendered":"http:\/\/blog.mozilla.org\/metrics\/?p=5286"},"modified":"2019-09-18T12:05:34","modified_gmt":"2019-09-18T19:05:34","slug":"comparing-the-bias-in-telemetry-data-vs-the-typical-firefox-user","status":"publish","type":"post","link":"https:\/\/blog.mozilla.org\/metrics\/2011\/12\/13\/comparing-the-bias-in-telemetry-data-vs-the-typical-firefox-user\/","title":{"rendered":"Comparing the Bias in Telemetry Data vs The Typical Firefox User"},"content":{"rendered":"<p style=\"text-align: justify;\"><a title=\"Telemetry\" href=\"https:\/\/addons.mozilla.org\/en-US\/firefox\/addon\/abouttelemetry\/\">Telemetry<\/a> is\u00a0a feature in Firefox that captures performance metrics such as start up time and\u00a0DNS latency, among others. The number of metrics captured is on the order of a\u00a0couple hundred. The data is sent back to the Mozilla <a title=\"Bagheera\" href=\"https:\/\/github.com\/mozilla-metrics\/bagheera\">Bagheera<\/a> servers, where it is then analyzed by the\u00a0engineers.<\/p>\n<p>The Telemetry feature asks the Nightly\/Aurora (pre-release) users if they would like to submit their anonymized performance data. This resulted in a response rate (number of people\u00a0who opted in divided by the number of people who were asked) of less than\u00a03%. This led to two concerns: a small number of responses (which changed when\u00a0Telemetry became part of the Firefox release) and, more importantly, representativeness: <em>are\u00a0the performance measurements as collected from the 3% representative of those of\u00a0people who chose not to opt in?<\/em><\/p>\n<p>Measuring the bias is not easy unless we have measurements about the users who\u00a0did not opt in. Firefox sends the following pieces of information to the Mozilla servers:\u00a0operating system, Firefox version, extension identifiers and the time for the\u00a0session to be restored. 
This is sent by all Firefox installations unless the\u00a0distribution or user has the feature turned off (this is called the services AMO ping).\u00a0The Telemetry data contains\u00a0the same pieces of information.<\/p>\n<p>What this implies is that we have start up times for i) <em>the users who opted in<\/em>\u00a0<em>to Telemetry and ii) everyone<\/em>. We can now answer the question &#8220;<em>Are the startup<\/em>\u00a0<em>times for the people who opted into Telemetry representative of the typical<\/em>\u00a0<em>Firefox user?&#8221;<\/em><\/p>\n<p>Note: &#8216;everyone&#8217; is <em>almost everyone. <\/em>Very few have this feature turned off.<\/p>\n<p><strong>Data Collection<\/strong><\/p>\n<p>We collected start up times for Firefox 7, 8 and 9 for November 2011 from\u00a0the log files of services.addons.mozilla.org (SAMO). We also took the same information for the\u00a0same period from the Telemetry data contained in HBase (some\u00a0code examples can be found at the end of the article).<\/p>\n<p><strong>Objective<\/strong><\/p>\n<p>Do start up times differ by Firefox version and\/or source, where source can\u00a0be SAMO or Telemetry?<\/p>\n<p><strong>Displays<\/strong><\/p>\n<p>Figure 1 is a boxplot of the log of start up time for Telemetry (tele) vs. SAMO\u00a0(samo) by Firefox version. At first glance it appears the start up times from\u00a0Telemetry are less than those of SAMO. 
But the length of the bars makes it\u00a0difficult to stand by this conclusion.<\/p>\n<div id=\"attachment_5287\" style=\"width: 624px\" class=\"wp-caption aligncenter\"><a href=\"http:\/\/blog.mozilla.org\/metrics\/files\/2011\/12\/box.png\"><img aria-describedby=\"caption-attachment-5287\" decoding=\"async\" loading=\"lazy\" class=\"size-large wp-image-5287 \" title=\"Figure 1:Boxplot of Log SessionRestored for Telemetry\/SAMO by FF Version\" src=\"http:\/\/blog.mozilla.org\/metrics\/files\/2011\/12\/box-1024x1024.png\" alt=\"Figure 1: Boxplot of Log SessionRestored for Telemetry\/SAMO by FF Version\" width=\"614\" height=\"614\" srcset=\"https:\/\/blog.mozilla.org\/metrics\/files\/2011\/12\/box-1024x1024.png 1024w, https:\/\/blog.mozilla.org\/metrics\/files\/2011\/12\/box-150x150.png 150w, https:\/\/blog.mozilla.org\/metrics\/files\/2011\/12\/box-300x300.png 300w, https:\/\/blog.mozilla.org\/metrics\/files\/2011\/12\/box.png 1041w\" sizes=\"(max-width: 614px) 100vw, 614px\" \/><\/a><p id=\"caption-attachment-5287\" class=\"wp-caption-text\">Figure 1: Boxplot of Log SessionRestored for Telemetry\/SAMO by FF Version<\/p><\/div>\n<p style=\"text-align: justify;\">Figure 2 is the difference in the deciles\u00a0of log of start up time. In other words, approximately speaking, the deciles of\u00a0ratio of Telemetry start up time to SAMO start up time. 
The medians hover in the\u00a00.8 region, though the bars are very wide and do not support the quick\u00a0conclusion that Telemetry start up time is smaller.<\/p>\n<div id=\"attachment_5297\" style=\"width: 624px\" class=\"wp-caption aligncenter\"><a href=\"http:\/\/blog.mozilla.org\/metrics\/files\/2011\/12\/box4.png\"><img aria-describedby=\"caption-attachment-5297\" decoding=\"async\" loading=\"lazy\" class=\"size-large wp-image-5297 \" title=\"Figure 2: Difference of Deciles of Logs\" src=\"http:\/\/blog.mozilla.org\/metrics\/files\/2011\/12\/box4-1024x1020.png\" alt=\"Figure 2: Difference of Deciles of Logs\" width=\"614\" height=\"612\" srcset=\"https:\/\/blog.mozilla.org\/metrics\/files\/2011\/12\/box4-1024x1020.png 1024w, https:\/\/blog.mozilla.org\/metrics\/files\/2011\/12\/box4-150x150.png 150w, https:\/\/blog.mozilla.org\/metrics\/files\/2011\/12\/box4-300x298.png 300w, https:\/\/blog.mozilla.org\/metrics\/files\/2011\/12\/box4.png 1030w\" sizes=\"(max-width: 614px) 100vw, 614px\" \/><\/a><p id=\"caption-attachment-5297\" class=\"wp-caption-text\">Figure 2: Difference of Deciles of Logs<\/p><\/div>\n<p style=\"text-align: justify;\">In Figure 3, we have the mean of medians of 1000 samples: red circles are for Telemetry and black for SAMO. The ends of the line segments correspond to the sample 95% confidence interval (based on the sample of sample medians). The CI for the SAMO data lies entirely within that of the Telemetry\u00a0data. This suggests that the two groups do not differ.<\/p>\n<div id=\"attachment_5298\" style=\"width: 624px\" class=\"wp-caption aligncenter\"><a href=\"http:\/\/blog.mozilla.org\/metrics\/files\/2011\/12\/box5.png\"><img aria-describedby=\"caption-attachment-5298\" decoding=\"async\" loading=\"lazy\" class=\"size-large wp-image-5298 \" title=\"Figure 3: Mean of the medians (circles) with their 95% confidence intervals. 
Red is Telemetry, Black is SAMO\" src=\"http:\/\/blog.mozilla.org\/metrics\/files\/2011\/12\/box5-1024x1024.png\" alt=\"Figure 3: Mean of the medians (circles) with their 95% confidence intervals. Red is Telemetry, Black is SAMO\" width=\"614\" height=\"614\" srcset=\"https:\/\/blog.mozilla.org\/metrics\/files\/2011\/12\/box5-1024x1024.png 1024w, https:\/\/blog.mozilla.org\/metrics\/files\/2011\/12\/box5-150x150.png 150w, https:\/\/blog.mozilla.org\/metrics\/files\/2011\/12\/box5-300x300.png 300w, https:\/\/blog.mozilla.org\/metrics\/files\/2011\/12\/box5.png 1050w\" sizes=\"(max-width: 614px) 100vw, 614px\" \/><\/a><p id=\"caption-attachment-5298\" class=\"wp-caption-text\">Figure 3: Mean of the medians (circles) with their 95% confidence intervals. Red is Telemetry, Black is SAMO<\/p><\/div>\n<p><strong>Analysis of Variance<\/strong><\/p>\n<p>For a more numerical approach, we can estimate the analysis of variance\u00a0components. The model is<\/p>\n<p style=\"text-align: center;\"><em>log(startup time) ~ version + src<\/em><\/p>\n<p>(we ignore\u00a0interaction). Since the data is on the order of billions of rows, I instead take 1000 samples of approximately 20,000 rows each (a sampling rate of 0.001%), compute the ANOVA results\u00a0for each, and then average the summary tables of the <em>lm<\/em> function in R. In\u00a0other words, we base our conclusions on the average of the 1000 samples of\u00a0~20,000 rows each. (I should point out that the residuals, as per a quick\u00a0visual check, were roughly Gaussian, and other diagnostics came out clean.)<\/p>\n<p>The average ANOVA does not support a version effect or a source effect (at the\u00a01% level). In other words, the log of start up time is affected neither by the\u00a0version nor by the source (Telemetry\/SAMO).<\/p>\n<pre>               Estimate Std. 
Error     t value   Pr(&gt;|t|)\r\n(Intercept)  8.62635472 0.01171420 736.4390937 0.00000000\r\nvers8       -0.05995627 0.01928947  -3.1089666 0.02922402\r\nvers9       -0.03382135 0.10466330  -0.3247165 0.48286903\r\nvers10      -0.03862282 0.29308642  -0.1418623 0.48228122\r\nsrctele     -0.02290538 0.03946150  -0.5811779 0.45300964<\/pre>\n<p>This is good news! <em><strong>Insofar as start up time is concerned, Telemetry is\u00a0representative of SAMO.<\/strong><\/em><\/p>\n<p><strong>A Different Approach and Some Checks<\/strong><\/p>\n<p>By now, the reader should note that we have answered our question (see the last line\u00a0of the previous section). Two questions remain:<\/p>\n<p>1. Are the samples representative? We are sampling on 3 dimensions: startup\u00a0time, src and version. Consider the 1000 quantiles of startup time, the 2 levels\u00a0of src and 4 levels of version. All in all, we have 1000x2x4 or 8000\u00a0cells. Sampling from the population might result in so many empty cells\u00a0that the joint distribution of the sample is very different from that\u00a0of the population. To confirm that the cell distribution of the samples reflects\u00a0the cell distribution of the population, we computed Chi Square tests comparing the sample cell counts with those of the parent population. All 1000 samples\u00a0passed!<\/p>\n<p>2. Why use samples? We can fit a log-linear regression on the 8000 cell\u00a0counts (i.e. all 1.9 billion data points). This of course loses a lot of power: we are binning the data, so all\u00a0monotonic transformations are equivalent. The model equivalent (using R&#8217;s\u00a0formula language) of the ANOVA described above is<\/p>\n<p style=\"text-align: center;\"><em>log(cell count) ~ src+ver+binned_startup:(src+ver)<\/em><\/p>\n<p>\u00a0If the effects of<em> binned_startup:src<\/em> and<em>\u00a0binned_startup:ver<\/em> are not significant, this corresponds to our conclusion in the\u00a0previous section. And nicely enough, it does! 
The output of <em>summary(aov(glm(&#8230;)))<\/em> is<\/p>\n<pre>summary(aov(glmout &lt;- glm(n~ver+src+sesscut:(ver+src)\r\n                          , family=poisson\r\n                          , data=cells3.parent)))<\/pre>\n<pre>              Df     Sum Sq    Mean Sq   F value Pr(&gt;F)\r\nver            3 4.6465e+14 1.5488e+14 1131.8666 &lt;2e-16 ***\r\nsrc            1 3.2705e+14 3.2705e+14 2390.0704 &lt;2e-16 ***\r\nver:sesscut 3952 5.4969e+13 1.3909e+10    0.1016      1\r\nsrc:sesscut  988 2.0009e+13 2.0252e+10    0.1480      1\r\nResiduals   2967 4.0600e+14 1.3684e+11<\/pre>\n<p><strong>Some R Code and Data Sizes:<\/strong><\/p>\n<p>1. The data for SAMO was obtained from Hive, sent to a text file and then\u00a0imported to blocked R data frames using <a title=\"RHIPE\" href=\"http:\/\/code.google.com\/p\/rhipe\/\">RHIPE<\/a>. All subsequent analysis was done\u00a0using <a href=\"http:\/\/code.google.com\/p\/rhipe\/\">RHIPE<\/a>.<\/p>\n<p>2. The data for Telemetry was obtained from HBase using Pig (<a href=\"http:\/\/code.google.com\/p\/rhipe\/\">RHIPE<\/a> can read\u00a0HBase, but I couldn&#8217;t install it on this particular cluster). The text data was\u00a0then imported as blocked R data frames and placed in the same directory as the<br \/>\nimported SAMO data.<\/p>\n<p>3. Data sizes were a few hundred gigabytes. All computations were done using <a href=\"http:\/\/code.google.com\/p\/rhipe\/\">RHIPE<\/a> (R not on the nodes) on a 350TB\/33-node Hadoop cluster.<\/p>\n<p>4. 
I include some sample code to give a flavor of <a href=\"http:\/\/code.google.com\/p\/rhipe\/\">RHIPE<\/a>.<\/p>\n<p style=\"text-align: justify;\"><strong>Importing text data as Data Frames<\/strong><\/p>\n<pre>map         &lt;- expression({\r\n  ln        &lt;- strsplit(unlist(map.values),\"\\001\")\r\n  a         &lt;- do.call(\"rbind\",ln)\r\n  addonping &lt;- data.frame(ds=a[,1]\r\n                         ,vers=a[,3]\r\n                         ,sesssionrestored=as.numeric(a[,6])\r\n                         ,src=rep(\"samo\",length(a[,6]))\r\n                         ,stringsAsFactors=FALSE)\r\n  rhcollect(runif(1),addonping)\r\n})\r\nz &lt;- rhmr(map=map\r\n          ,ifolder=\"\/user\/sguha\/somequants\"\r\n          ,ofolder=\"\/user\/sguha\/teledf\/samo\"\r\n          ,zips=\"\/user\/sguha\/Rfolder.tar.gz\"\r\n          ,inout=c(\"text\",\"seq\")\r\n          ,mapred=list(mapred.reduce.tasks=120\r\n             ,rhipe_map_buff_size=5000))\r\nrhstatus(rhex(z,async=TRUE),mon.sec=4)<\/pre>\n<p style=\"text-align: justify;\"><strong>Creating Random Samples<\/strong><\/p>\n<pre>map         &lt;- expression({\r\n  y         &lt;- do.call('rbind', map.values)\r\n  p         &lt;- 20000\/1923725302\r\n  for(i in 1:1000){\r\n    zz      &lt;- runif(nrow(y)) &lt; p\r\n    mu      &lt;- y[zz,,drop=FALSE]\r\n    if(nrow(mu)&gt;0)\r\n      rhcollect(i,mu)\r\n  }\r\n})\r\nreduce      &lt;- expression(\r\n    pre={ x &lt;- NULL}\r\n    ,reduce = {\r\n      x     &lt;- rbind(x,do.call('rbind',reduce.values))\r\n    }\r\n    ,post={ rhcollect(reduce.key,x) }\r\n    )\r\nz &lt;- rhmr(map=map,reduce=reduce\r\n          ,ifolder=\"\/user\/sguha\/teledfsubs\/p*\"\r\n          ,ofolder=\"\/user\/sguha\/televers\/dfsample\"\r\n          ,inout=c('seq','seq')\r\n          ,orderby='integer'\r\n          ,partition=list(lims=1,type='integer')\r\n          ,zips=\"\/user\/sguha\/Rfolder.tar.gz\"\r\n          ,mapred=list(mapred.reduce.tasks=72\r\n             
,rhipe_map_buff_size=20))\r\nrhstatus(rhex(z,async=TRUE),mon.sec=5)<\/pre>\n<p><strong>Run Models Across Samples<\/strong><\/p>\n<pre>map        &lt;- expression({\r\n  cuts     &lt;- unserialize(charToRaw(Sys.getenv(\"mcuts\")))\r\n  lapply(map.values, function(y){\r\n    y$tval &lt;- sapply(y$sesssionrestored\r\n                     ,function(r) {\r\n                       if(is.na(r)) return( r)\r\n                       max(min(r,cuts[2]),cuts[1])\r\n                     })\r\n    mdl    &lt;- lm(log(tval)~vers+src,data=y)\r\n    rhcollect(NULL, summary(mdl))\r\n  })})\r\nz &lt;- rhmr(map=map\r\n          ,ifolder=\"\/user\/sguha\/televers\/dfsample\/p*\"\r\n          ,ofolder=\"\/user\/sguha\/televers2\"\r\n          ,zips=\"\/user\/sguha\/Rfolder.tar.gz\"\r\n          ,inout=c(\"seq\",\"seq\")\r\n          ,mapred=list(mapred.reduce.tasks=0))\r\nrhstatus(rhex(z,async=TRUE),mon.sec=4)<\/pre>\n<p><strong>Computing Cell Counts For A Log Linear Model<\/strong><\/p>\n<pre>cuts2                &lt;- wtd.quantile(tms$x,tms$n,\r\n                                     p=seq(0,1,length=1000))\r\ncuts2[1]             &lt;- cuts[1]\r\ncuts2[length(cuts2)] &lt;- cuts[2]\r\nmap.count &lt;- expression({\r\n  cuts       &lt;- unserialize(charToRaw(Sys.getenv(\"mcuts\")))\r\n  z          &lt;- do.call(rbind,map.values)\r\n  z$tval     &lt;- sapply(z$sesssionrestored,function(r)\r\n                  max(min(r,cuts[length(cuts)]),cuts[1]))\r\n  z$sessCuts &lt;-\r\n    factor(findInterval(z$tval,\r\n                        cuts),ordered=TRUE)\r\n  f          &lt;- split(z,list(z$vers,z$sessCuts,z$src),drop=FALSE)\r\n  for(i in seq_along(f)){\r\n    y &lt;- strsplit(names(f)[[i]],\"\\\\.\")[[1]]\r\n    rhcollect(y,nrow(f[[i]])) }\r\n})\r\nz &lt;-\r\n  rhmr(map=map.count,reduce=rhoptions()$templates$scalarsummer\r\n       ,combiner=TRUE,\r\n       ifolder=\"\/user\/sguha\/teledfsubs\/p*\"\r\n       ,ofolder=\"\/user\/sguha\/telecells\"\r\n       
,zips=\"\/user\/sguha\/Rfolder.tar.gz\"\r\n       ,inout=c(\"seq\",\"seq\") ,mapred=\r\n       list(mapred.task.timeout=0\r\n            ,rhipe_map_buff_size=40\r\n            ,mcuts=rawToChar(serialize(cuts2, NULL,\r\n                                ascii=TRUE))))<\/pre>\n","protected":false},"excerpt":{"rendered":"<p>Telemetry \u00a0is\u00a0a feature in Firefox that captures performance metrics such as start up time,\u00a0DNS latency among others. The number of metrics captured is in the order of a\u00a0couple hundred. The data is sent back to the Mozilla Bagheera servers \u00a0which &hellip; <a class=\"go\" href=\"https:\/\/blog.mozilla.org\/metrics\/2011\/12\/13\/comparing-the-bias-in-telemetry-data-vs-the-typical-firefox-user\/\">Continue reading<\/a><\/p>\n","protected":false},"author":263,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[101],"tags":[3651,3649,3652,3654,577],"_links":{"self":[{"href":"https:\/\/blog.mozilla.org\/metrics\/wp-json\/wp\/v2\/posts\/5286"}],"collection":[{"href":"https:\/\/blog.mozilla.org\/metrics\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.mozilla.org\/metrics\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.mozilla.org\/metrics\/wp-json\/wp\/v2\/users\/263"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.mozilla.org\/metrics\/wp-json\/wp\/v2\/comments?post=5286"}],"version-history":[{"count":0,"href":"https:\/\/blog.mozilla.org\/metrics\/wp-json\/wp\/v2\/posts\/5286\/revisions"}],"wp:attachment":[{"href":"https:\/\/blog.mozilla.org\/metrics\/wp-json\/wp\/v2\/media?parent=5286"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.mozilla.org\/metrics\/wp-json\/wp\/v2\/categories?post=5286"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.mozilla.org\/metrics\/wp-json\/wp\/v2\/tags?post=5286"}],"curies":[{"name":"wp","href":"http
s:\/\/api.w.org\/{rel}","templated":true}]}}