Inline threading, TraceMonkey, etc.
It’s been a long time since I’ve posted here-I wanted to post some interesting results about speeding up SpiderMonkey using inline threading, but it turned out to be really hard and took a long time to get close enough to “interesting results”. At last, my patch is good enough to run SunSpider, and runs it 8% faster than baseline (non-tracing trunk SpiderMonkey from a few weeks ago), 10-20% faster on favorable benchmarks, and 48% faster on 3bit-bits-in-byte. So that’s pretty cool.
Of course, 8% looks pretty puny next to the huge gains of TraceMonkey (congrats to those guys, by the way). But I’m assured that interpreter speedups still count, so I’m chugging along. (Side note: inline threading speeds up SunSpider’s access-fannkuch benchmark by 22%, which has proved difficult to optimize with tracing.) (Side note 2: I’ve been told that TraceMonkey will hugely speed up our static analysis scripts, maybe 10x or so, which is great news.).
Insane, gory detail on inline threading, related optimizations, and detailed performance analyses can be found in bug 442379. I thought I’d go over the key ideas in this post:
Inline threading. Basically, this is yet another interpreter opcode dispatch technique. I previously wrote about opcode dispatch, concluding that direct-threading, in which the opcode is a target address, and the code to start the next op is a single indirect jump instruction is the ultimate efficient dispatch mechanism. It turns out I was wrong.
Inline threading is the “best”, because it gets dispatch down to 0 instructions. The idea is to create a buffer, and copy into it the native code for each opcode to be executed. For example, for a function body like “return a+3”, the opcodes are: JSOP_GETARG (0), JSOP_INT8 (3), JSOP_ADD, JSOP_RETURN. To inline thread this, we create a buffer and fill it with native code like this, using memcpy:
code for JSOP_GETARG
code for JSOP_INT8
code for JSOP_ADD
code for JSOP_RETURN
It’s like a really crude form of JIT compilation.
To run the function, we just jump to the start of the buffer, and then it all runs, with no further dispatch code. I’ve found that an average SpiderMonkey op executes about 35 instructions, so inline threading removes the 4 instructions for indirect threading, reducing this to 31, and should speed up SM by about 11%. Nice!
Hard Stuff. The only problem is that what I just described doesn’t actually work. For one, the compiler doesn’t necessarily compile each SpiderMonkey op handler into a single block of code. In fact, it usually reorders things a bunch, to help with code locality and reduce the number of jump/branch instructions executed on hot paths. So those are too hard to inline. (It could be done by disassembling parts of SpiderMonkey, analyzing the results, and doing code layout again, but that’s a bit much.)
Also, because most jump, branch, and call instructions express their targets using a relative address (an offset from the IP of the following instruction), any jump outside our little inline-threaded buffer becomes a jump to hell. So those would all have to be identified and patched, again with a dissassembler.
In general, small ops that don’t call functions, generate errors, or have much internal control flow can be inlined, but everything else can’t. Fortunately, “small ops” include some really common ones like JSOP_GETVAR and JSOP_POP, but the technique can’t be used without some special treatment for the big ops.
Call Threading/Context Threading. For big ops, I used something called call threading or context threading. Call threading has been used to produce big speedups on some interpreters, but turns out not to help at all for SpiderMonkey. But it can handle big ops in an inline-threading system, and after a lot of work, I at least got it to not slow down SpiderMonkey.
The idea of call threading is to create a native buffer, but fill it with calls to the opcode handlers instead of copying the whole handlers in. With the previous example, it gives you:
Those are x86 call instructions. For this to work, the opcode handlers have to end with ‘ret’ instructions. This means dispatch is 2 instructions, which is better than indirect threading’s 4, and I think better even than direct threading because that’s really 3 instructions if you count getting the op, which I should have before. Also, these call and ret instructions are highly predictable (99.9%+ prediction rate), unlike the indirect jumps used by direct and indirect threading, which are very unpredictable. Since a mispredict is very costly (~16 cycles on Core 2, I think), this gives a big speedup, 30% or so on some interpreters.
Except on SpiderMonkey, where unfortunately it doesn’t work and also slows things down. This was very frustrating.
The reason it doesn’t work is that when you execute a ‘call’ instruction, you push the return address onto the stack, decrementing the stack pointer ($esp) by 4. The opcode handler will then crash if it calls a function. There are actually several reasons why but the most important is that on most systems, the stack pointer has to be 16-byte-aligned when you call a function, and that -4 puts it off.
To make it work, I add extra code to unpush the return address right after the call, and then unpop it back right before the return. This works, but now we’re up to 4 instructions, so we haven’t saved any instructions over indirect threading. (I would love a better solution, but I couldn’t think of one.)
Next, it turns out that in SpiderMonkey, due to clever optimization by Igor & Co., the branch prediction rate is already excellent in practice: 80-100% on benchmarky-type programs. (The trick is make sure there is a separate indirect jump going out of the end of each opcode, so the processor can predict each one independently. Also, the Core 2 indirect branch predictor seems very smart.) So branch mispredicts are only costing an average 0-3 cycles per op. SpiderMonkey takes about 28 cycles per op, so this gives an estimated speedup of 0-12%.
Last, the really tragic thing is that the changes I made to make SpiderMonkey do call threading make GCC shoot itself in the face. For reasons I don’t entirely understand, GCC compiles the op handlers “differently”, so they run at least 5% slower. It took some doing to figure out how to keep the slowdown down to 5%, which makes call threading about even with indirect threading. That’s pretty disappointing, but at least it means I can call thread the big ops without penalty (on average), and then inline the small ops, getting some speedup. With the example code, I generate:
code for JSOP_GETARG
code for JSOP_INT8
(The optimizer issue is incredibly arcane, but I think I traced the problem to an optimization pass called “gcse2-post-reload”, which is really partially-redundant-load elimination after register allocation. Any kind of partial redundancy elimination (PRE) can slow code down. The slowdown effect should be mitigated when using profile-guided optimization (PGO), but I couldn’t get GCC PGO to work on SpiderMonkey to test that theory.)
For more on call threading, see this paper, keeping in mind that the results would probably differ on Core 2 because of its presumed better indirect branch predictor.
Inlining-enabled optimizations. Inline threading some stuff and call threading the rest doesn’t yield exciting gains for SpiderMonkey, maybe 2% overall and 5-20% on the “good” benchmarks. (I guess that’s not too bad, the 2% overall just seems really weak.) But inline threading enables a bunch of other cool optimizations that could speed things up more. I’ve only done the easy ones so far, and that got the 48% speedup on 3bit.
Easy optimization #1 is to stop updating the SpiderMonkey PC. The call and inline threaded code tells what op handler to run next, so the PC is unnecessary. Actually, some ops do use the PC, and I’d like to make them not do that someday, but in the meantime I can just generate code to set the PC in my native code buffer:
mov 0, [pc]
code for JSOP_GETARG
mov 2, [pc]
code for JSOP_INT8
mov 5, [pc]
It looks like I only saved the PC update for JSOP_ADD, but it’s better than that because the standard code has to load the PC, increment it, and then store it back, which I’ve replaced with just one instruction. The PC optimization is relatively easy and saves 0-3 instructions per op, which means a 0-10% speedup by itself. And it actually works.
“Easy” optimization #2 is to specialize certain opcodes. Take JSOP_INT8 as an example. This op pushes an integer onto the interpreter stack. That’s 3 or 4 instructions, to manipulate the stack pointer and store a value to it. But it also has to get the value out of the opcode stream and convert it to a jsval (the tagged value type of the interpreter), so it’s actually 9 instructions. (And JSOP_INT32 is much worse because it has to fetch 4 bytes from the opcode stream and shift and or them together.) But if we’re inlining JSOP_INT8, we can just inline the 3 or 4 instructions using the actual value, a version of JSOP_INT8 “specialized” for the given value. I do the specialization by writing the op in assembly code with a dummy value (0xdeadbeef), finding the location of the dummy as part of interpreter startup, and then patching that location as I inline. This seems crazy, but all the other design options seem crazy in their own ways, so I went with it for now. With specialization in play, the example looks like this:
code for JSOP_GETARG specialized for slot 0
code for JSOP_INT8 specalized for jsval for 3
mov 5, [pc]
Because JSOP_GETARG and JSOP_INT8 use the PC only to get their arguments, which specialization bakes in, we get to remove some more PC updates too.
Compiler nitpickery. In itself, all this stuff works, and it’s been used in various research projects before, and probably some other interpreters by now. But this particular implementation of it depends a fair amount on what the compiler does: at least (a) that the compiler doesn’t reorder small ops too insanely, (b) that you can take the address of labels (to get the start and end of op handlers), (c) patching in some ‘ret’ instructions at runtime (needed to solve an obscure problem I didn’t feel like discussing here), and (d) that the compiler doesn’t deoptimize too badly when you do all this. That seems possible, but it’s not clear yet that this will play well across different versions of GCC and ICC. (Crazy side note: in my tests, GCC 4.2 on Mac regresses SpiderMonkey SunSpider by 5% vs. GCC 4.0.)