Main menu:

Site search

Categories

Archive

Tamarin Tracing Internals IV: Trace Optimization

In part III, I went over how TT generates LIR traces. Now, I’m going to look into the trace optimization and machine code generation process. The code for this is mostly in the nanojit/ directory.

Keep in mind that a trace is always straight-line code in SSA form. This makes optimizations easier to implement, so it has a big effect on the design. By the way, a lot of the material on SSA is confusing, and also goes straight into a lot of complexity that’s not needed for TT, so I’ll give a quick explanation here.

SSA. SSA stands for static single assignment, but don’t bother trying to parse that. It just means that each virtual register in the trace appears on the left-hand side of exactly one assignment statement. This is automatically true for a TT trace, because the virtual register assigned to is implicitly the sequence number of the instruction. Note: locations that are not virtual registers, such as a slot on the data stack, may be assigned to multiple times; the SSA property in TT holds only for virtual registers.

The advantage of SSA is that any time you see a virtual register, you can just look at its one assignment statement, and you immediately know what it was assigned from, what kind of operator was used, whether the operands were constant, etc. Without SSA, there are multiple possible assignments, and an optimizer has to try to discover which assignments can actually reach the current point and then summarize their effects, which is slower and less precise.

Constant Folding. TT performs constant folding as part of LIR generation. I discussed that in part III along with LIR generation, but I mention it again just to have the full list of optimizations here. Constant folding means transforming code like “a = 3 + 4″ to “a = 7″, i.e., replacing a constant expression with the result of evaluating that expression. I should note that constant folding is also used with branches: a branch instruction with a constant conditional expression is dropped entirely, because the next instruction on the trace is simply the actual target of the branch.

Ending Tracing. Trace optimization starts when a “complete” LIR trace has been generated. In principle, the tracer could stop tracing whenever it wanted to, so there’s no particular completeness property. I just mean that optimization doesn’t start until tracing stops. This is necessary because some optimization passes go backward over the trace.

Tracing is stopped by Interpreter::eot (end of trace). (eot is often invoked by the macro EOT defined in Interpreter.h, which records the reason for ending the trace in debug builds only.) Most of eot is just debugging and error checking. The key call is:

compile(strk.location, rtrk.location, assm, tracefrag);

The compile function is defined in LIR.cpp and performs constant subexpression elimination, dead store elimination, and assembly. There is code to perform lifetime splitting just before assembly, but it is guarded by if (false) in the current code.

Interpreter::eot is called if any of these conditions occur:

  • (Interpreter::loopedge) The trace follows more than MAX_XJUMPS “cross-jumps”. A cross-jump is a backward branch that does not go to the loop header (i.e., the start of the trace). A cross-jump indicates the presence of nested loops. In the current standard configuration, MAX_XJUMPS is zero.
  • (Interpreter::pre_trace) The trace contains at least MAX_BLKS guards (i.e. side exits). This is checked before tracing each instruction. In the current standard configuration, MAX_BLKS is 10,000. There is a comment next to this check, “count # of guards to minimize heisenbugs”. I’m not sure what this means, but it might be something to make the point at which traces end more deterministic.
  • (Interpreter::eot_untraceable_prim) The current instruction cannot be traced, e.g., a general native method. This is called by the trace implementation for all untraceable primitives.
  • (Interpreter::eot_if_max_exits) The trace has returned from over MAX_EXITS ActionScript functions. Recall that tracing can go right through both function calls and returns. Tracing can start in the middle of a function, and the trace could go through many returns. This is called when EXITABC is traced. In the current standard configuration, MAX_EXITS is 2.
  • (Interpreter::eot_if_max_copies) This one’s a little tricky. The purpose is to avoid the trace explosion problem: code with k sequential if statements can generate 2^k traces, and if k is too big, this will use up all memory and crash the program. So TT counts the number of copies of an instruction that exist in different traces. If the count exceeds MAX_COPIES, TT ends the trace. Thus, the total memory used can be no more than (#instructions) * MAX_COPIES. Note that TT saves work by counting the number of copies only for CFGMERGE instructions, which are special no-ops that Verifier generates at all control-flow join points. An instruction is copied only if the CFGMERGE above it is copied, so counting CFGMERGEs is good enough. In the current standard configuration, MAX_COPIES is 2.

I understand the purpose of eot_untraceable_prim and eot_if_max_copies, but not the MAX_XJUMPS, MAX_BLKS, and MAX_EXITS conditions, so please comment if you know.

Constant Subexpression Elimination (CSE). This is a standard compiler optimization. CSE replaces code like this:

x = y + z;
w = y + z;

with code like this:

x = y + z;
w = x;

This saves an operation, and may enable additional optimizations now that w is an exact copy of x.

Given the SSA property, two expressions that apply the same pure operation to the same virtual registers, e.g., “v1 + v2″ in “v17 = v1 + v2″ and “v20 = v1 + v2″, are always equivalent. A pure operation is one that has no side effects and depends only on its arguments. In TT, this includes basic arithmetic operators and also functions that are marked PURE-FUNCTION in Forth, e.g., stringlength.

In TT, CSE is performed by nanojit::cse in LIR.cpp (nanojit:: is a namespace qualifier). cse performs a single forward scan of the trace, detecting CSEs and replacing them as it goes. This is the classic “value numbering” technique. Detection is based on a hashmap where the key is the operator and operands. (See LInsHashSet in LIR.cpp, especially LInsHashSet::_equals.)

Replacement is performed by overwriting the redundant operation with a special LIR_tramp (trampoline) instruction. A LIR_tramp is simply a reference to another instruction in the trace. (In detail, a LIR_tramp has a 24-bit offset operand: if the sequence number of the LIR_tramp is i, and the operand is offset, then the instruction is a reference to the instruction at position i+offset.) My abstract CSE example above might look like this in real LIR with tramps:

15 fadd 5, 6 // 15 is sequence number/destination; 5, 6 are operands

24 fadd 5, 6

becomes:

15 fadd 5, 6

24 tramp -9

LIR_tramp isn’t an executable instruction. Rather, X tramp -Y is a directive: “Whenever you see X as an operand, instead read it as X-Y.” It’s almost like a macro definition inside LIR. “Macro expansion” is performed automatically inside LIns::oprnd1 (the getter for the first operand of a LIR instruction) and related methods. The effect is that any operand that CSE can detect as equal to X will be named X thereafter. This, in turn, makes it easy to tell that all these Xs are equal and exposes more opportunities for CSE.

Here’s an example:

Treating tramps in this way is part of value numbering and helps optimize code like this:

3 fadd 1, 2
4 fadd 1, 2

15 fadd 3, 8
16 fadd 4, 8

3 and 4 are clearly redundant. So are 15 and 16, but they have syntactically different operands, so it looks like CSE will miss them. But it doesn’t. It converts the first two instructions to:

3 fadd 1, 2
4 tramp -1

At this point, the next two instructions are now read as:

15 fadd 3, 8
16 fadd 3, 8

and CSE gets the second redundancy as well. This example also shows why CSE is done as a forward pass: CSE applied at one point may create more opportunities for CSE farther down, so we can do all the CSEs in one pass only if we go forward.

Redundant Store Elimination (RSE). This is a restricted type of dead code elimination (DCE). Informally, RSE replaces this:

x = a + b;
x = y + z;

with this:

x = y + z;

The first value of x is overwritten before it can be used anywhere else, so the first statement can be eliminated if it has no side effects.

Note that there are 2 assignments to x here in my example, so it is not in SSA form. This is correct: this TT pass is only applied to store instructions (LIR_st and LIR_sti), which can write to the same location multiple times.

General DCE can also eliminate instructions whose values are not overwritten but are never read anyway, but we can’t do that with TT stores. The reason is that TT stores write values to the interpreter stack, which may be used later once we exit the trace, by the interpreter or the next trace. While looking at the current trace, we really have no idea which of those values is used later on, so we have to keep them all. But TT does apply general DCE to non-store instructions, and for those instructions DCE is incredibly easy.

TT RSE is performed by nanojit::rmStores as a backward pass. rmStores scans the instructions, keeping track of (a) the depth of the stack, and (b) which stack positions (starting from the bottom) are stored to. rmStores does the tracking for both the data stack (sp) and the return stack (rp). For each store instruction, rmStores determines the stack position stored to, and removes the store if that position is (a) above the top of the stack, or (b) is stored to later on (i.e., earlier in the backward scan). Case (b) is just as in the example above. Case (a) picks up situations like storing a value to the top of the stack and then DROPping it.

rmStores must handle side exits specially. As above, we have to make sure the interpreter stacks are “correct” (i.e., look exactly how they would look if the interpreter had been running) when we exit the trace. This applies to side exits as well. So when the backward scan passes a side exit, it must mark everything on the stack as potentially live (by clearing the “stored-to” bits in its scan record).

This is important: it shows that side exits preclude some optimizations.

Assembly. This converts LIR to ISA (instruction set architecture code-the TLA way to say “machine language”).

The TT assembler performs register allocation (mapping the unlimited virtual registers to the very limited ISA registers) simultaneously. Offline compilers do register allocations by applying approximation algorithms to NP-hard graph-coloring problems, but the compilation time is too long for a JIT like TT, so TT uses an integrated single-pass greedy allocator. Note that nanojit/RegAlloc.h is not the register allocator: it’s just a data structure for tracking register mappings and free registers.

Assembly is platform-specific, so TT needs a mechanism to build different assemblers for different platforms. Here’s one of the key bits (in nanojit.h):

#include “Native.h”
#include “LIR.h”
#include “RegAlloc.h”
#include “Fragmento.h”
#include “Assembler.h”

Native.h includes another file, Native*.h (e.g. Nativei386.h), controlled by preprocessor defines. This imports platform size constants, register set definitions, and macros for code generation (e.g., CALL). Assembler.h defines the Assembler class. Assembler.cpp contains platform-independent assembler logic, including methods of Assembler, customized by referring to variables and macros defined in the platform-specific header. There’s also some stuff controlled by defines like NANOJIT_IA32, generally short code snippets that interact closely with otherwise platform-independent code. Finally, there is a Native*.cpp, which contains other methods of Assembler that are defined in a purely platform-specific manner.

The TT assembler works backward. I think this is because it does a few last optimizations which work best in a backward pass. Keep this in mind reading methods like Assembler::genEpilog-the first instruction generated is the return.

Assembler::assemble is the entry point, and Assemble::gen is where the per-instruction work is done.

Assemble::gen a lot of details-I’ll just look at an examples. Here’s how a LIR_imm (place a constant value “immediate operand” in a virtual register) is assembled:

case LIR_imm:
case LIR_imm32:
{
Register rr = prepResultReg(ins, GpRegs);
int32_t val;
if (op == LIR_imm32)
val = ins->imm32();
else
val = ins->imm16();
if (val == 0)
XOR(rr,rr);
else
LDi(rr, val);
break;
}

The first step is to call prepResultReg to pick a register to store the result in. I’ll look at that later, but for now I assume it just works. The next step is to get the constant value itself from the LIR instruction. Finally, we call the LDi macro to generate the instruction, unless the constant is zero, in which case we just XOR the register with itself (x XOR x = 0 for all x), which is faster (although I don’t know why). The macros aren’t very exciting reading–they just do the bit-bashing to generate ISA opcodes, addressing mode bits, and operand encodings.

Register Allocation. This code is pretty complicated, but I think I can outline what it does. The algorithm is conceptually simple; the complexities come from dealing with platform-specific details and special cases.

The algorithm tracks the set of free registers and the mapping from virtual registers to machine storage. The machine storage is represented by the Reservation class, which can name an ISA register, an activation record (stack frame) location, or both.

The first step for most instructions is to allocate registers to use for the operands.

Assembler::findRegFor finds a register that holds the result of a given LIR instruction (e.g., an operand). If the LIR instruction has already been assigned a register, it returns that register. Otherwise, it searches for a free register and records the mapping.

One of the complexities is the second argument to findRegFor, RegisterMask allow. allow represents a set of allowed registers-the returned register must be in this set. This is needed because some operations can only be used on certain registers. In some cases, the value can’t be allocated directly in the allowed set, e.g., because it is computed by an instruction that cannot output to that register. Then TT issues an extra move instruction.

It is possible that there is are no free registers. In this case, the solution is to spill a register. This means we pick a victim LIR instruction currently in a register, and store and load it around the current instruction. As a general example, we might have two different values that need to be computed into eax:

mov eax, ebx  // first instruction writing eax
add ?, ecx       // want to use eax, but it’s occupied
add esi, ?        // want eax from previous
add edx, eax   // first eax again

We spill the first writer of eax like this:

mov eax, ebx
mov ebp[-8], eax // spill eax to memory
add eax, ecx        // eax now available
add esi, eax
mov eax, ebp[-8] // restore first eax
add edx, eax

That’s simple enough, but it looks kind of tricky in TT, because TT doesn’t see the code all at once to make this transformation, but instead does adds the spill and restore code during its backward pass.

In this example, TT would detect that a spill is required when assembling “add esi, ?”. It needs to use eax, but eax is already in use, so at this point, it knows that it has to emit code to restore the victim value to eax. This is done by Assembler::asm_restore, which finds a free memory location for the victim and emits code to restore from that location. It also records that memory location in the victim’s reservation so it will know to spill the value later on. Note that for constant values, asm_restore knows it doesn’t need to load them from memory, but can just use an LDi (load immediate).

Once the operands are allocated registers, the assembler selects a register for the result using  Assembler::prepResultReg. This again calls findRegFor. But in this case, a register was probably already allocated by an later instruction that uses this result, and that register will be returned directly. Now, if the register we select was previously selected to be spilled (in asm_restore), we need to generate the spill code. This is done by calling Assembler::asm_spill, which checks the reservation to see if a memory location has been allocated to this instruction’s result. If so, a store instruction is generated.

The register allocator is closely related to the DCE (dead code elimination) mechanism. Assemble::gen calls Assemble::ignoreInstruction on each instruction to see if no code should be generated. The basic idea is that if no storage has been reserved for the result of an instruction, then nothing ever reads the result, so it is dead (as long as there are no other side effects). LIR_tramps are always ignored, which fits with my earlier description of their being a special kind of nop.

Note that all of this is done backwards. Even the debug output is generated backwards and then reversed for printing. So if you want to read the assembler output in the order that TT processes things, read it backwards.

Assembly/Register Allocation Example. Here’s a bit of debug output from the assembler, showing a spill/restore:

58 qlo   47
010E26AE  movd ecx,xmm0                esi(8) edi(6) xmm0(47)
spill 58
010E26B2  mov -8(ebp),ecx                ecx(58) esi(8) edi(6)
60 eq    58,#0
69 xt    60 -> 11DED2
010E26B5  test ecx,ecx                         ecx(58) esi(8) edi(6)
010E26B7  je 10E3512                           ecx(58) esi(8) edi(6)
GID 49
71 arg   #4
010E26BD  mov edx,4                            ecx(58) esi(8) edi(6)
72 arg   58
73 call  getslotvalue_box
010E26C2  call 309E0:getslotvalue_box        esi(8) edi(6)
restore 58
010E26C7  mov ecx,-8(ebp)                      esi(8) edi(6)

The lines that begin with decimal numbers are LIR instructions. The indented lines that begin with hex addresses are the generated assembly. The assembly lines also show the current mapping of LIR instruction results to ISA registers. There are also some notes about 58 being spilled and restored. The “GID” line indicates a guard.

Let’s read it bottom to top. The last LIR instruction is “call getslotvalue_box”, which is a native call. Native calls potentially overwrite certain registers, including ecx. The is currently a value in ecx, the result of instruction 58. (This reservation was made earlier, i.e., farther down in the trace.) This value must be spilled. But that will happen further up. For now, TT just selects a spill location, -8(ebp), and emits the code to restore that location to ecx. Now, TT can emit the call instruction.

The previous instruction is “arg 58″, which means to load 58 into an argument storage location to passed to a call. The LIR_arg instruction encodes the storage location, and it’s not shown the debug output, but apparently this one is supposed to go in ecx. Because the value is already in ecx, no code is necessary, and none is generated. This logic is accomplished by the function Assembler::findSpecificRegFor, which is a thin wrapper that just calls findRegFor with a single allowed register. As explained above, if the allowed register can be reserved for that instruction, it is, and if not, a move is emitted.

The previous instruction is now “arg #4″, which means to load literal integer 4 into an argument storage location, this time edx. The argument is a constant, so all this has to do is emit an LDi instruction. There is no need to allocate registers because edx is potentially written by the call, so if an instruction was using edx, we would have selected it for spilling when we processed the call. At this time, edx is automatically available.

The previous instruction is “xt 60 -> 11DED2″, an “exit if true” instruction. Exiting from traces is complicated, so I’m going to leave most of this for later. For now, just note that this instruction generates both the comparison instruction test and the branch instruction je. This is because of a classic compilation issue, which is that relational operators get compiled differently according to whether the result is used by a branch or by an arithmetic operator. TT’s solution is to compile the relational operator as part of compiling the branch, and then ignore it later, as shown on this trace.

Now we reach the first LIR instruction, “qlo 47″, which picks out the “low” half of a quad (64-bit operand). The result has been reserved as ecx, but memory has also been reserved (when we generated restore code earlier), so we know we need to spill the result now. After that, we can generate the move instruction for qlo.

Next time: running compiled traces.

Comments

Comment from Andreas
Time: May 28, 2008, 10:56 pm

I think a lot of the configuration values (XJUMPS, BLKS, EXITS) are a result of Ed and his team experimenting with different strategies to contain the code explosion through tail duplication, and to steer what gets recorded and extended when.

If you set the XJUMPS limit to 2, for example, you “outerlined” traces (folding of the outer nested loop into the inner loop), which we use a lot in our trees to deal with simple nested loop.

Without having looked at the code recently, I would assume MAX_EXITS might be an attempted to limit tail duplication as a result of trying to inline outer method scopes. If you start recording in method f and you backtrack into the outer scope (method) g, that transition back into the outer scope (return) might spawn a number of side exits (one for each other scope f can be called from). Since TT does support recording into outer scopes (which we do not in Hotpath), they have to somehow limit this growth.

Comment from Rick Reitmaier
Time: May 30, 2008, 8:01 am

Compiling LIR bottom-up has a number of advantages (like you’ve
mentioned) and we think it makes for a smaller assembler, which
is a major factor for TT.

Like you’ve highlighted walking through the Assembler code can be a bit
confusing. Another way of thinking about it is that the current set of
Reservation’s forms a contract that must be met by downstream code
(i.e. code already generated).

For example we’ve generated :

add eax,ebx

Now if we want to use eax we have to make sure it gets
loaded, so we rematerialize it, like so

mov eax,ebp[-8]
add eax,ebx

Now we’re free to use eax for another instruction:

sub eax,ebx
mov eax,ebp[-8]
add eax,ebx

A really nice part of this flow is that we don’t have to worry
about how eax in the sub gets populated (likewise when we
generated the add, we didn’t know that eax would be loaded
from the stack).

This approach tends to produce reasonable code, in that
remats typically happen close to uses without additional
book-keeping or extra passes, and spills (if needed) happen
at the point when the value is produced.

tip: reading the code as its generated (i.e. backwards) can
be easily accomplished by modifying the 1st if statement in
Assembler::output() to :
if (0 && _outputCache)

Comment from dmandelin
Time: May 30, 2008, 10:15 am

Cool. I think the spill code is hard to understand just because there aren’t standard algorithms for it (vs. say, CSE), so I don’t have much of a starting idea of how it works. One thing I did notice about the design of the register allocator is that it tries to assign the same register to a result as the one it will get used in later, which tends to save copies. This is nice-coalescing can also avoid those copies but I think is much too expensive for TT.

Thanks for the tip, although what tripped me the most at first was that you have to read the code generation code backward-but I’m getting used to that now.

Pingback from dce results
Time: June 19, 2008, 11:24 am

[…] for this is mostly in the nanojit/ directory. Keep in mind that a trace is always straight-line codehttp://blog.mozilla.org/dmandelin/2008/05/28/tamarin-tracing-internals-iv-trace-optimization/Prep roundup: Fagbemi dominates at sectionals Marshfield News HeraldEAU CLAIRE — Oladipo Fagbemi […]

Comment from Indonesia Furniture Handicraft Wholesale Marketplace
Time: June 4, 2010, 8:09 pm

Thanks for this, I’m still learning about what you share here.