{"id":2924,"date":"2014-06-16T16:08:44","date_gmt":"2014-06-16T05:08:44","guid":{"rendered":"http:\/\/blog.mozilla.org\/nnethercote\/?p=2924"},"modified":"2014-06-16T16:08:44","modified_gmt":"2014-06-16T05:08:44","slug":"a-browser-benchmarking-manifesto","status":"publish","type":"post","link":"https:\/\/blog.mozilla.org\/nnethercote\/2014\/06\/16\/a-browser-benchmarking-manifesto\/","title":{"rendered":"A browser benchmarking manifesto"},"content":{"rendered":"<p><em><strong>This post represents my own opinion, not the official position of Mozilla.<\/strong><\/em><\/p>\n<p>I&#8217;ve been working on Firefox for over five years and I&#8217;ve been unhappy with the state of browser benchmarking the entire time.<\/p>\n<h3>The current problems<\/h3>\n<h4>The existing benchmark suites are bad<\/h4>\n<p>Most of the <span class=\"il\">benchmark<\/span> suites are terrible. Consider Hennessy and Patterson&#8217;s categorization of <span class=\"il\">benchmarks<\/span>, from best to worst.<\/p>\n<ol>\n<li>Real applications.<\/li>\n<li>Modified applications (e.g. with I\/O removed to make them CPU-bound).<\/li>\n<li>Kernels (key fragments of real applications).<\/li>\n<li>Toy <span class=\"il\">benchmarks<\/span> (e.g. sieve of Eratosthenes).<\/li>\n<li>Synthetic <span class=\"il\">benchmarks<\/span> (code created artificially to fit a profile of particular operations, e.g. Dhrystone).<\/li>\n<\/ol>\n<p>In my opinion, benchmark suites should only contain benchmarks in categories 1 and 2. There are certainly shades of grey in these categories, but I personally have a high bar, and I&#8217;m likely to consider most programs whose line count is in the hundreds as closer to a &#8220;toy benchmark&#8221; than a &#8220;real application&#8221;.<\/p>\n<p>Very few browser <span class=\"il\">benchmark<\/span> suites contain benchmarks from category 1 or 2. 
I think six or seven of the benchmarks in <a href=\"https:\/\/developers.google.com\/octane\/benchmark\">Octane<\/a> meet this standard. I&#8217;m having trouble thinking of any other existing benchmark suite that contains <em>any<\/em> benchmarks that reach this standard. (This includes Mozilla&#8217;s own Kraken suite.)<\/p>\n<p>Bad benchmark suites hurt the web, because every hour that browser vendors spend optimizing bad benchmarks is an hour they don&#8217;t spend improving browsers in ways that actually help users. I&#8217;ve seen first-hand how much time this wastes.<\/p>\n<h4>Conflicts of interest<\/h4>\n<p>Some of the <span class=\"il\">benchmark<\/span> suites &#8212; including the best-known ones &#8212; come from browser vendors. The conflict of interest is so obvious I won&#8217;t bother discussing it further.<\/p>\n<h4>Opaque scoring systems<\/h4>\n<p>Some of them have opaque scoring systems. Octane (derived from V8bench) is my least favourite here. It also suffers from much higher levels of variance between runs than the benchmarks with time-based scoring systems.<\/p>\n<h3>A better way<\/h3>\n<p>Here&#8217;s how I think a new browser <span class=\"il\">benchmark<\/span> suite should be created. \u00a0Some of these ideas are taken from the <a href=\"http:\/\/www.spec.org\/\">SPEC benchmark suites<\/a>, which have been used for over 20 years to benchmark the performance of C, C++, Fortran and Java programs, among other things.<\/p>\n<h4>A collaborative and ongoing process<\/h4>\n<p>The suite shouldn&#8217;t be owned by a single browser vendor. Ideally, all the major browser vendors (Apple, Google, Microsoft, and Mozilla) would be involved. 
Involvement of experts from outside those companies would also be desirable.<\/p>\n<p>This is a major political and organizational challenge, but it would avoid bias (both in appearance and reality), and would likely lead to a higher-quality suite, because low-quality benchmarks are likely to be shot down.<\/p>\n<p>There should be a public submissions process, where people submit web apps\/sites for consideration. \u00a0And obviously there needs to be a selection process to go with that.<\/p>\n<p>The <span class=\"il\">benchmark<\/span> suite should be open source and access should be free of charge. \u00a0It should be hosted on a public site not owned by a browser vendor.<\/p>\n<p>There should also be an official process for modifying the suite: modifying <span class=\"il\">benchmarks<\/span>, removing <span class=\"il\">benchmarks<\/span>, and adding new <span class=\"il\">benchmarks<\/span>. \u00a0This process shouldn&#8217;t be too fluid. \u00a0Perhaps allowing changes once per year would be reasonable. The current trend of continually augmenting old suites should be resisted.<\/p>\n<h4>High-quality<\/h4>\n<p>All <span class=\"il\">benchmarks<\/span> should be challenging and based on real websites\/apps; \u00a0they should all belong to category 1 or 2 on the above scale.<\/p>\n<p>Each <span class=\"il\">benchmark<\/span> should have two or three inputs of different sizes. \u00a0Only the largest input would be used in official runs, but the smaller inputs would be useful to browser developers for testing and profiling purposes.<\/p>\n<p>All <span class=\"il\">benchmarks<\/span> should have side-effects of some kind that depend on most or all of the computation. 
\u00a0There should be no large chunks of computation whose results are unused, because such <span class=\"il\">benchmarks<\/span> are subject to silly results due to optimizations such as dead code elimination.<\/p>\n<h4>Categorization<\/h4>\n<p><span class=\"il\">Benchmarks<\/span> should not be categorized by browser subsystem. E.g. there shouldn&#8217;t be groups like &#8220;DOM&#8221;, &#8220;CSS&#8221;, &#8220;hardware acceleration&#8221;, and &#8220;code loading&#8221;, because that encourages artificial <span class=\"il\">benchmarks<\/span> that stress a single subsystem. Instead, any categories should be based on application domain: &#8220;games&#8221;, &#8220;productivity&#8221;, &#8220;multimedia&#8221;, &#8220;social&#8221;, etc. If a grouping like that ends up stressing some browser subsystems more than others, well that&#8217;s fine, because it means that&#8217;s what real websites\/apps are doing (assuming the <span class=\"il\">benchmark<\/span> selection is balanced).<\/p>\n<p>There is one possible exception to this rule, which is the JavaScript engine. At least some of the <span class=\"il\">benchmarks<\/span> will be dominated by JS execution time, and because JS engines can be built standalone, the relevant JS engine teams will inevitably want to separate JS code from everything else so that the code can be run in JS shells. It may be worthwhile to pre-emptively do that, where appropriate. If that were done, you wouldn&#8217;t run these JS-only sub-applications as part of an official suite run, but they would be present in the repository.<\/p>\n<h4>Scoring<\/h4>\n<p>Scoring should be as straightforward as possible. Times are best because their meaning is more obvious than anything else, <em>especially<\/em> when multiple times are combined. 
Opaque points systems are awful.<\/p>\n<p>Using times for everything makes most sense when <span class=\"il\">benchmarks<\/span> are all throughput-oriented. \u00a0If some <span class=\"il\">benchmarks<\/span> are latency-oriented (or FPS, or something else), this becomes trickier. \u00a0Even so, a simple system that humans can intuitively understand is best.<\/p>\n<h3>Conclusion<\/h3>\n<p>It&#8217;s time the browser world paid attention to the performance and benchmarking lessons that the CPU and C\/C++\/Fortran worlds learned years ago. It&#8217;s time to grow up.<\/p>\n<p>If you want to learn more, I highly recommend getting your hands on a copy of Hennessy &amp; Patterson&#8217;s <em>Computer Architecture<\/em> and reading the section on measuring performance. This is section 1.5 (&#8220;Measuring and Reporting Performance&#8221;) in the 3rd edition, but I expect later editions have similar sections.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>This post represents my own opinion, not the official position of Mozilla. I&#8217;ve been working on Firefox for over five years and I&#8217;ve been unhappy with the state of browser benchmarking the entire time. The current problems The existing benchmark suites are bad Most of the benchmark suites are terrible. 
Consider Hennessy and Patterson&#8217;s [&hellip;]<\/p>\n","protected":false},"author":139,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[15269],"tags":[],"_links":{"self":[{"href":"https:\/\/blog.mozilla.org\/nnethercote\/wp-json\/wp\/v2\/posts\/2924"}],"collection":[{"href":"https:\/\/blog.mozilla.org\/nnethercote\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.mozilla.org\/nnethercote\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.mozilla.org\/nnethercote\/wp-json\/wp\/v2\/users\/139"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.mozilla.org\/nnethercote\/wp-json\/wp\/v2\/comments?post=2924"}],"version-history":[{"count":0,"href":"https:\/\/blog.mozilla.org\/nnethercote\/wp-json\/wp\/v2\/posts\/2924\/revisions"}],"wp:attachment":[{"href":"https:\/\/blog.mozilla.org\/nnethercote\/wp-json\/wp\/v2\/media?parent=2924"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.mozilla.org\/nnethercote\/wp-json\/wp\/v2\/categories?post=2924"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.mozilla.org\/nnethercote\/wp-json\/wp\/v2\/tags?post=2924"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}