{"id":242,"date":"2010-02-01T11:09:35","date_gmt":"2010-02-01T00:09:35","guid":{"rendered":"http:\/\/blog.mozilla.org\/nnethercote\/?p=242"},"modified":"2010-02-01T11:59:11","modified_gmt":"2010-02-01T00:59:11","slug":"a-win-for-code-hygiene","status":"publish","type":"post","link":"https:\/\/blog.mozilla.org\/nnethercote\/2010\/02\/01\/a-win-for-code-hygiene\/","title":{"rendered":"A win for code hygiene"},"content":{"rendered":"<p>I&#8217;ve <a href=\"http:\/\/blog.mozilla.org\/nnethercote\/2010\/01\/21\/safer-refactoring-in-nanojit\/\">written<\/a> <a href=\"http:\/\/blog.mozilla.org\/nnethercote\/2010\/01\/26\/my-contribution-to-firefox-3-6\/\">recently<\/a> about clean-ups I&#8217;ve been making in Nanojit &#8212; simplifying code, refactoring, adding more internal sanity checking.\u00a0 I&#8217;m a firm believer that these kinds of changes are worth doing, that &#8220;if it ain&#8217;t broke, don&#8217;t fix it&#8221; is not always true.\u00a0 But there&#8217;s always a voice in the back of my head wondering &#8220;am I just polishing this code for little gain?\u00a0 Should I be working on something else instead?&#8221;\u00a0 Yesterday I received confirmation that this work has been worthwhile.\u00a0 But before I can explain why, I need to give some background.<\/p>\n<p>Nanojit serves as the back-end of TraceMonkey.\u00a0 It converts LIR (a low-level intermediate representation) generated by TraceMonkey&#8217;s front-end into native code.\u00a0 So Nanojit is a critical part of TraceMonkey.\u00a0 And TraceMonkey is a critical part of Firefox.<\/p>\n<p>However, code generators (and register allocators) are tricky beasts &#8212; easy to get wrong and hard to debug.\u00a0 (A long time ago I heard someone, I can&#8217;t remember who, say &#8220;code generators and garbage collectors should crash as early and noisily as possible&#8221;.)\u00a0 So with a piece of code this important and inherently difficult, it&#8217;s a good idea to arm yourself with some weapons that give you confidence that it is correct.<\/p>\n<h3>Weapon 1: clean IR semantics<\/h3>\n<p>One way to do this is to give your IR clean semantics.\u00a0 LIR&#8217;s semantics are mostly clean, but there are a few nasty cases.\u00a0 One of the worst is the &#8216;ov&#8217; instruction.\u00a0 It performs an overflow check on an arithmetic (add, mul, sub, neg) operation.<\/p>\n<p>JavaScript numbers are doubles by default, but TraceMonkey sometimes demotes them to integers for speed.\u00a0 TraceMonkey has to check for integer overflow in these cases;\u00a0 if an integer does overflow TraceMonkey has to step back and use doubles for the computation.<\/p>\n<p>Here&#8217;s some example LIR that uses &#8216;ov&#8217;:<\/p>\n<pre> ld6 = ld sp[-8]\r\n\u00a0add1 = add ld6, 1\r\n ov1 = ov add1\r\n xt4: xt ov1 -&gt; pc=0x883ed8 imacpc=0x0 sp+0 rp+0 (GuardID=218)<\/pre>\n<p>The first instructions loads a value from the stack.\u00a0 The second instruction adds 1 to that value.\u00a0 The third instruction performs an overflow check and puts the result in &#8216;ov1&#8217;.\u00a0 The fourth instruction is a guard;\u00a0 if &#8216;ov1&#8217; is true it exits the code block.<\/p>\n<p>So why is &#8216;ov&#8217; semantically nasty?\u00a0 First consider &#8216;add&#8217;, which is a semantically cleaner instruction.\u00a0 Its result (in this case &#8216;add1&#8217;) is a function of its inputs (&#8216;ld6&#8217; and 1).\u00a0 In comparison, &#8216;ov&#8217; does not have this property.\u00a0 The &#8216;ov&#8217; doesn&#8217;t really take &#8216;add1&#8217; as an input &#8212;\u00a0 you can&#8217;t tell by looking at &#8216;add1&#8217; whether the addition overflowed.\u00a0 In fact you can&#8217;t really understand what is happening here at the LIR level;\u00a0 you have to know what happens in the native code that is generated from this LIR.\u00a0 It turns out that the native code for the &#8216;add&#8217; also sets some condition codes, and the native code generated for the &#8216;ov&#8217;\/&#8217;xt&#8217; pair inspects those condition codes.\u00a0 For this to work, the &#8216;add&#8217; has to immediately precede the &#8216;ov&#8217;;\u00a0 if it does not, whatever intermediate native code that is generated could overwrite the condition codes, in which case the behaviour of the guard becomes completely unpredictable.\u00a0 This is obviously bad.<\/p>\n<p>The real problem is that the addition has two outputs:\u00a0 the result, and the overflow status indicator (which maps to condition codes in the hardware).\u00a0 But LIR has no explicit way to represent that second output.\u00a0 So the second output is implicit, and an extra constraint must be imposed on the LIR instead, i.e. &#8216;ov&#8217; must immediately follow an arithmetic operation.\u00a0 Because this constraint is so arbitrary it&#8217;s easy to forget and easy to break.<\/p>\n<p>In May 2009 Julian Seward opened <a href=\"https:\/\/bugzilla.mozilla.org\/show_bug.cgi?id=495569\">a bug<\/a> to improve LIR&#8217;s semantics, giving five suggestions.\u00a0 It generated a lot of discussion and Julian drafted a patch to replace &#8216;ov&#8217; with some cleaner opcodes, but there was enough disagreement that it never went anywhere.\u00a0 But I didn&#8217;t forget about it, and &#8216;ov&#8217; has annoyed me enough times recently that I decided to resurrect Julian&#8217;s idea, this time in <a href=\"https:\/\/bugzilla.mozilla.org\/show_bug.cgi?id=539874\">a new bug<\/a> to escape the baggage of the old one.\u00a0 The idea is simple: if &#8216;ov&#8217; always has to follow an arithmetic operation, then why not create new opcodes that fuse the two parts together?\u00a0 These new opcodes are &#8216;addxov&#8217;, &#8216;subxov&#8217; and &#8216;mulxov&#8217;.\u00a0 With my patch the code from above now looks like this:<\/p>\n<pre> ld6 = ld sp[-8]\r\n\u00a0add1 = addxov ld6, 1 -&gt; pc=0x883ed8 imacpc=0x0 sp+0 rp+0 (GuardID=218)<\/pre>\n<p>The &#8216;addxov&#8217; adds &#8216;ld6&#8217; and 1, puts the result in &#8216;add1&#8217;, and exits if there was an overflow.\u00a0 The generated native code is unchanged, but the implicit output of the addition is now hidden within a single LIR instruction, and it&#8217;s impossible for overflow checks in the native code to become separated from the potentially-overflowing arithmetic operation.<\/p>\n<h3>Weapon 2: strong IR sanity checking<\/h3>\n<p>Compilers are usually built in a pipeline fashion, where you have multiple passes over the code representation, and each pass does a task such as optimisation, register allocation or code generation.\u00a0 It&#8217;s also a really good idea to have a pass that does a thorough sanity check of the code representation.<\/p>\n<p>In November 2008 Jim Blandy filed <a href=\"https:\/\/bugzilla.mozilla.org\/show_bug.cgi?id=463137\">a bug<\/a> suggesting such a pass for Nanojit, one that performs type-checking of the LIR.\u00a0 The bug sat untouched for over six months until some extra discussion occurred and then (once again) Julian wrote a patch.\u00a0 It found at least one case where TraceMonkey was generating bad LIR code, but again the bug stalled, this time because we spent three months merging Mozilla&#8217;s and Adobe&#8217;s copies of Nanojit.\u00a0 I resurrected the bug again recently and added some &#8220;structure checking&#8221; for things like the &#8216;ov&#8217; constraint, and it landed in the TraceMonkey repository on January 21st.\u00a0 Happily, my version of the checker was simpler than Julian&#8217;s;\u00a0 this was possible because LIR had had some other semantic clean-ups (e.g. <a href=\"https:\/\/bugzilla.mozilla.org\/show_bug.cgi?id=520714\">properly distinguishing 64-bit integers from 64-bit floats)<\/a>.\u00a0 My patch replaced a very basic type-checker (called &#8220;SanityFilter&#8221;) that had been in TraceMonkey, and immediately found <a href=\"https:\/\/bugzilla.mozilla.org\/show_bug.cgi?id=538538\">two<\/a> <a href=\"https:\/\/bugzilla.mozilla.org\/show_bug.cgi?id=538484\">bugs<\/a>, one of which involved &#8216;ov&#8217;.<\/p>\n<h3>Synergy<\/h3>\n<p>It&#8217;s not often that a bug report involving your code makes you smile.\u00a0 Yesterday Gary Kwong hit an <a href=\"https:\/\/bugzilla.mozilla.org\/show_bug.cgi?id=543161\">assertion failure<\/a> in Nanojit when fuzz-testing:<\/p>\n<pre>    LIR structure error (end of writer pipeline):\r\n    in instruction with opcode: ov\r\n    argument 1 has opcode: add\r\n    it should be: located immediately prior, but isn't\r\n  One way to debug this:  change the failing NanoAssertMsgf(0, ...) call to a\r\n  printf(...) call and rerun with verbose output.  If you're lucky, this error\r\n  message will appear before the block containing the erroneous instruction.<\/pre>\n<p>In other words, there&#8217;s an &#8216;ov&#8217; following an &#8216;add&#8217;, but there are one or more other LIR instructions between them.\u00a0 This means that the code generated will be bogus, as I explained above.\u00a0 This bug may have been in TraceMonkey for a long time, but it wasn&#8217;t detected until strong internal sanity checking was added.\u00a0 The good news is that the &#8216;ov&#8217;-removal patch (which hasn&#8217;t landed as it&#8217;s still awaiting review) will fix this patch.<\/p>\n<h3>Lessons learned<\/h3>\n<p>This story tied together a number of related ideas for me.\u00a0 In particular:<\/p>\n<ul>\n<li>Clean IR semantics are worth the effort.\u00a0 Hacks may save time initially but they will come back to bite you.<\/li>\n<li>Strong IR sanity checking is worth the effort.\u00a0 (And cleaner semantic allows for stronger checking.)<\/li>\n<li>Listen to Julian, especially when he talks about correctness.\u00a0 (And read his code!\u00a0 VEX\/pub\/libvex_ir.h in the Valgrind source code describes Valgrind&#8217;s IR, which is a wonderfully clean IR, particularly in the way it separates statements (which have side-effects) from expressions (which don&#8217;t).)<\/li>\n<li>Don&#8217;t put too many ideas into one bug report.\u00a0 Too much discussion can kill a bug, or at least put it in a coma.<\/li>\n<li>Follow-through is important.\u00a0 A patch can&#8217;t do much good unless it lands.<\/li>\n<li>Fuzz testing is awesome (but we already knew that).<\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>I&#8217;ve written recently about clean-ups I&#8217;ve been making in Nanojit &#8212; simplifying code, refactoring, adding more internal sanity checking.\u00a0 I&#8217;m a firm believer that these kinds of changes are worth doing, that &#8220;if it ain&#8217;t broke, don&#8217;t fix it&#8221; is not always true.\u00a0 But there&#8217;s always a voice in the back of my head wondering [&hellip;]<\/p>\n","protected":false},"author":139,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[528,617,467],"tags":[],"_links":{"self":[{"href":"https:\/\/blog.mozilla.org\/nnethercote\/wp-json\/wp\/v2\/posts\/242"}],"collection":[{"href":"https:\/\/blog.mozilla.org\/nnethercote\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.mozilla.org\/nnethercote\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.mozilla.org\/nnethercote\/wp-json\/wp\/v2\/users\/139"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.mozilla.org\/nnethercote\/wp-json\/wp\/v2\/comments?post=242"}],"version-history":[{"count":0,"href":"https:\/\/blog.mozilla.org\/nnethercote\/wp-json\/wp\/v2\/posts\/242\/revisions"}],"wp:attachment":[{"href":"https:\/\/blog.mozilla.org\/nnethercote\/wp-json\/wp\/v2\/media?parent=242"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.mozilla.org\/nnethercote\/wp-json\/wp\/v2\/categories?post=242"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.mozilla.org\/nnethercote\/wp-json\/wp\/v2\/tags?post=242"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}