Tamarin Tracing Internals III: LIR

Program Form 3: LIR. I believe LIR stands for low-level intermediate representation (although I’ve also heard linear intermediate representation). Typically, in a compiler or VM, LIR is the lowest-level (and last) machine-independent compiler representation, and looks much like a machine-independent assembly language. TT’s LIR plays the same role but has some special features because it is designed specifically for efficient trace compilation. The most important feature is that a LIR trace is always straight-line code, with one or more exit points but no targeted branches. (Actually, each exit does have a target, but the target is not part of the LIR; it’s the address of an IL instruction.)

The LIR instruction set is defined in nanojit/LIR.h. LIR is much simpler than IL, with fewer than 64 opcodes. Most of them are familiar, e.g., LIR_add for integer addition. LIR_imm (“immediate”) sets a constant value, contained directly in the instruction word. All the control-flow instructions are exits: e.g., LIR_x to exit unconditionally, or LIR_xt to exit if a condition is true. The exception is LIR_loop, which jumps back to the start of the trace. TT traces always start at loop headers, so this is important. See also Mason Chang’s post on the LIR instruction set.

A LIR instruction is 32 bits, with 8 bits for the opcode and up to 3 operands. As with machine ISAs, there are different forms of instruction words to accommodate multiple operands, 24-bit immediate operands, etc.
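The exact field layout isn’t given here, but as a rough sketch of this style of encoding (the field widths below are my guess, not TT’s actual formats):

```cpp
#include <cstdint>

// Hypothetical 32-bit instruction word: 8-bit opcode in the low byte,
// three 8-bit operand fields above it. TT's real LIR has several
// distinct word formats (e.g., a 24-bit immediate form).
uint32_t lir_pack(uint8_t op, uint8_t a, uint8_t b, uint8_t c) {
    return uint32_t(op) | (uint32_t(a) << 8) | (uint32_t(b) << 16) | (uint32_t(c) << 24);
}

uint8_t lir_opcode(uint32_t ins)         { return ins & 0xffu; }
uint8_t lir_operand(uint32_t ins, int n) { return (ins >> (8 * (n + 1))) & 0xffu; }
```

With 8-bit operand fields, an operand can only name a nearby instruction, which is why TT needs tricks like trampolines (see the comments below) for longer reaches.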

LIR operands work differently from IL operands. Recall that IL is stack-based, so IL’s IADD takes its two arguments off the stack and pushes their sum. Because machine languages like x86 are register-based, not stack-based, LIR is also register-based. More precisely, LIR uses virtual registers (my terminology, not TT’s), which just means it can have as many registers as it wants. Mapping those registers onto the finite set available on each processor is the job of a lower-level register allocator.

LIR has an interesting implicit scheme of naming the virtual registers. Each LIR instruction has a sequence number, according to its position in the trace. The sequence number is (implicitly) also the name of the virtual register where the result is stored. Later instructions can use this result by sequence number. For example,

// Instruction 13: test whether result of instruction 12 equals immediate #48
13 eq    12,#48
// Instruction 22: exit if result of instruction 13 false
22 xf    13 -> 10C41CA

This design saves space, because result operands don’t need to be named explicitly. It also automatically represents instruction dependencies. Finally, because of this design, LIR is necessarily in SSA form, which will make optimizing traces easier.

Transformation 2->3: IL->LIR. This is the actual trace generation. So while the earlier steps translated a method at a time, in this step TT operates on a trace (code path) at a time. The basic idea is that while TT executes IL, it will generate a trace of LIR instructions for the code path it follows.

Activating Tracing. TT traces always start at loop headers. A loop header is any target of a backward branch; the IL generator ensures backward branches are loop edges. Recursive methods are also a form of loop, and so TT treats a recursive call the same as a loop edge.

Every time the interpreter follows a loop edge, it checks to see if it should begin tracing. This is encoded into vm_*_interp.h with the INTERP_CHECK_LOOPEDGE macro, which wraps a call to Interpreter::loopedge. Interpreter::loopedge maintains a count of the number of times it has seen each loop header address. If the count exceeds a threshold (HOTLOOP, which is 10 in the current version), it calls Interpreter::sot (“start of trace”) to start tracing. sot initializes data structures used by tracing, emits some boilerplate prolog LIR, and sets the interpreter tracing state flag. Finally, control will exit from the interpreter (Interpreter::do_interp, but the return statement is found in the macro _CHECK_MODESWAP, used in INTERP_CHECK_LOOPEDGE), with a return value that tells the main system loop to enter tracing mode.
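The counting logic can be sketched like this (a paraphrase of what `Interpreter::loopedge` is described as doing above, not its actual code):

```cpp
#include <map>

// Count hits per loop-header address; once a header has been seen
// more than HOTLOOP times, report that tracing should start (sot).
const int HOTLOOP = 10;   // threshold in the current TT version

bool should_start_trace(std::map<const void*, int>& counts, const void* header) {
    return ++counts[header] > HOTLOOP;
}
```

Keying the counts by loop-header address means independent loops heat up independently; only a loop that actually runs often gets traced.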

Tracing. Actual tracing of IL is performed by VMInterp::do_trace. do_trace continues interpreting IL, just as in the interpreter (do_interp). In addition, before interpreting each IL instruction, do_trace emits LIR for the instruction. For example, here is the do_trace implementation of my favorite IL op, IADD, from vm_fpu_trace.h:

INTERP_FOPCODE_TRACE_BEGIN(IADD)
interp.trace_binop(LIR_add, sp);
INTERP_FOPCODE_TRACE_END(IADD)
INTERP_FOPCODE_INTERP_BEGIN(IADD)
const int32_t tmp_i_0 = int32_t(sp[-1].i) + int32_t(sp[0].i) ;
INTERP_ADJUSTSP(-1)
sp[0].i = tmp_i_0;
INTERP_INVALBOXTYPE(sp[0])
INTERP_FOPCODE_INTERP_END(IADD)

The second part is an exact copy of the code in vm_fpu_interp.h. I believe this code is produced by the Forth compiler. After preprocessing, the block above looks like this (in VMInterp.ii):

foplabel_TRACE_IADD: { {
interp.trace_binop(LIR_add, sp);
} ip += 1; {
const int32_t tmp_i_0 = int32_t(sp[-1].i) + int32_t(sp[0].i) ;
sp += (-1);
sp[0].i = tmp_i_0;
} goto _goto_ip; }

The only addition for tracing is the call to interp.trace_binop(LIR_add, sp). In principle, trace_binop has a simple job: emit a LIR opcode for a binary operation. In reality, the tracer does some optimizations along the way and also must do some state bookkeeping.

trace_binop. Here is the method signature for trace_binop:

void Interpreter::trace_binop(_LOpcode op, const Boxp sp, BoxUsage insize, BoxUsage outsize);

op is the LIR opcode to emit in the trace. sp points to the top of the interpreter operand (Forth) stack. The other two arguments give the operand sizes, because operands can be 32 or 64 bits, but I’m not going to worry about that just yet, and that aspect of the code is well-separated from the main logic.

Here is the body of trace_binop:

1        if (check_const_defc(2, sp, outsize)) return;
2        LirWriter* lirout = tracefrag->lirbuf->writer();
3        LInsp i = lirout->ins(LOpcode(op), use_q_or_lo(insize, sp-1), use_q_or_lo(insize, sp));
4        varset(sp-1, i);

Line 1 tries a constant-folding optimization: informally, if both of the operands are constants, the tracer will compute the result, which is constant, and emit a LIR_imm instead of the binary operation. But what exactly is a constant operand in LIR? Recall that an operand is just the index of an instruction that computes a value. If that instruction is a LIR_imm, then the operand is constant.

Here I should say a bit about the region tracker. (Fortunately, Graydon explained it to me, so I didn’t have to work hard to figure out that part.) Recall that in the interpreter, the operands are the top two stack elements. At the start of trace_binop, that’s all we know. So in order to get the LIR operands for tracing, trace_binop has to map the stack values to their corresponding LIR operands (i.e., the LIR instructions that computed those stack values).

The region tracker (class RegionTracker in Interpreter.h) maintains this mapping and performs the lookup. Specifically, RegionTracker maintains a map from addresses (const void *) to instructions in the LIR trace buffer (LIns *). The addresses are considered to address fixed-width elements in a range starting from a zero position. This is perfect for tracking a stack. Also, the mapping can be implemented as array lookup, which is simple and fast.
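A minimal sketch of the idea (an int id stands in for the LIns*; the real class is RegionTracker in Interpreter.h, and the 8-byte element width here is my assumption):

```cpp
#include <cstdint>
#include <cstddef>

// Map addresses within a fixed-width-element region (here, 8-byte
// slots, the size of a Box-like stack cell) to the instruction that
// last wrote them. Lookup is plain array indexing.
struct ToyRegionTracker {
    const char* base;
    int map[64];
    explicit ToyRegionTracker(const void* b)
        : base(static_cast<const char*>(b)) {
        for (int& m : map) m = 0;          // 0 = no entry
    }
    size_t slot(const void* addr) const {
        return (static_cast<const char*>(addr) - base) / sizeof(uint64_t);
    }
    void set(const void* addr, int ins) { map[slot(addr)] = ins; }
    int  get(const void* addr) const    { return map[slot(addr)]; }
};
```

Because stack addresses are dense and fixed-width, the address arithmetic turns the map into a flat array rather than a hash table.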

Region tracker operations are wrapped for the interpreter by inline methods like varof, which maps a Box* interpreter stack operand to a LIR instruction, and varset, which emits a store operation for a result and updates the region tracker with the new result.

Back to trace_binop. Line 2 just accesses the LirWriter (LIR.h), the class that does the bit-packing to create LIR instructions.

Line 3 emits the binary operation LIR instruction. The only fancy part is in use_q_or_lo, which uses the region tracker to map an interpreter stack operand to the LIR instruction that created it.

Line 4 emits a LIR_sti to store the result (and updates the region tracker accordingly). This surprised me, because if the result of a LIR operation is implicitly stored in its instruction sequence number, why is an explicit store needed? Looking at example traces, I see that there are few or no corresponding LIR_ld instructions, so most of these stores must be dead. I think the LIR_sti is storing to a memory location that also represents the result, and will be used later for spilling by the register allocator.

Example. LIR is so low-level that nontrivial IL tends to create more LIR than I want to put here, but I did find a readable example of adding 1 to a number. This is from the -Dverbose output of avmshell (\ is a line continuation; I added line breaks to fit this format):

T 11E296     op_OP_increment_plus_000a \
-8:0 -3:10AF520 -3:10AD150 -3:10AD240 -2:0 -2:0 -3:10AD240 d:10
325 int   #11E298
327 sti   #11E298, #4(6)
T 11F4E6      LITC1 \
-8:0 -3:10AF520 -3:10AD150 -3:10AD240 -2:0 -2:0 -3:10AD240 d:10
328 imm   #1
330 sti   #1, #16(8)
T 11F4E8      I2D \
-3:10AF520 -3:10AD150 -3:10AD240 -2:0 -2:0 -3:10AD240 d:10 -3:1
333 quad  #3FF00000:0
335 sti   333, #16(8)
T 11F4EA      DADD \
-3:10AF520 -3:10AD150 -3:10AD240 -2:0 -2:0 -3:10AD240 d:10 d:1
336 fadd  321,333
338 sti   336, #8(8)

The lines beginning with “T” show IL instructions. The “T” itself indicates that the line was printed while the interpreter was in tracing mode. The next number is the address of the IL instruction. After that comes the IL opcode. Finally, the top 8 elements of the interpreter stack are shown, with the topmost at the right. The format of a stack element is type:content, where type is a boxtype constant from Box.h and content is a hex representation of the value.

The indented lines following each “T” line show the LIR generated for that IL instruction. The LIR output format is similar to typical assembly languages, and is:

instruction-sequence-number opcode operands

Operands that are just numbers are references to other instructions by sequence number. Operands like #8 are immediate operands. #16(8) is read “#offset(base pointer)”, i.e., it says to take the value computed by instruction 8 as a pointer, then add 16 to that pointer. Most of the bases seem to be for instructions in the trace prolog that load things like the stack pointer. In particular, at least in this trace, the base pointer 6 is the interpreter return pointer and the base pointer 8 is the interpreter stack pointer.

(Which leads me to ask, what are the return pointer and the stack pointer? They are part of the interpreter state (InterpState in Interpreter.h), which has 4 fields: the frame pointer, Frame* f; the instruction pointer, FOpcodep ip; the stack pointer, Boxp sp; and the return pointer, FOpcodepp rp.

* A Frame (StackTrace.h) represents an activation record (or “stack frame”) of the ActionScript code. Thus, Frames are pushed onto and popped from a stack as ActionScript methods are entered and exited. This stack is used for generating stack traces and security checks that depend on ActionScript execution context.
* The instruction pointer points to the currently executing (Forth) IL instruction.
* The stack pointer points to the top of the Forth data stack.
* The return pointer points to the top of the Forth return stack. This is the stack that implements call and return from Forth subroutines.)
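Collected into one struct, the description above amounts to roughly the following (the real declarations are in Interpreter.h; these typedefs are stand-ins, and treating FOpcode as a byte is my assumption):

```cpp
#include <cstdint>

struct Frame;                 // activation record (StackTrace.h)
struct Box;                   // interpreter stack cell (Box.h)
typedef uint8_t   FOpcode;    // assumption: IL opcodes are byte-sized
typedef FOpcode*  FOpcodep;
typedef FOpcodep* FOpcodepp;
typedef Box*      Boxp;

struct InterpState {
    Frame*    f;   // ActionScript frame, for stack traces / security checks
    FOpcodep  ip;  // currently executing Forth IL instruction
    Boxp      sp;  // top of the Forth data stack
    FOpcodepp rp;  // top of the Forth return stack
};
```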

Let’s see if I can understand how the tracer works:

T 11E296     op_OP_increment_plus_000a \
-8:0 -3:10AF520 -3:10AD150 -3:10AD240 -2:0 -2:0 -3:10AD240 d:10
325 int   #11E298
327 sti   #11E298, #4(6)

This first step is a call to a superinstruction. A call just means pushing the return address (note that it is the current address plus 2) onto the return stack: #4(6) means an offset of 4 from the start of the return stack. In the LIR, the first instruction loads a 32-bit immediate value, and the second instruction stores that value on the return stack. Note also the missing sequence number: LIR instructions are 32 bits, so loading a 32-bit value is done with a load instruction followed by a word holding the value.

T 11F4E6      LITC1 \
-8:0 -3:10AF520 -3:10AD150 -3:10AD240 -2:0 -2:0 -3:10AD240 d:10
328 imm   #1
330 sti   #1, #16(8)

LITC1 (“literal constant 1”?) pushes the integer value 1. Here we load the immediate value 1, then store the value 1 on the data stack (recall that 8 is the stack pointer). I’m not sure why there is a missing sequence number: I think LIR_imm is just one instruction.

T 11F4E8      I2D \
-3:10AF520 -3:10AD150 -3:10AD240 -2:0 -2:0 -3:10AD240 d:10 -3:1
333 quad  #3FF00000:0
335 sti   333, #16(8)

Note that the value “-3:1” has been pushed onto the data stack at the right. I think in this case, the value is actually an unboxed int, so the -3 is just whatever happened to be in that memory location before, and only the 1 is significant.

I2D converts an integer to a double. Here you see we are using LIR_quad (load an immediate 64-bit “quadword”) to push the IEEE 754 double value “3FF0000000000000”, more commonly known as 1.0. So there is no conversion code at all: TT has constant-folded the I2D operation because its operand is the constant int 1.

T 11F4EA      DADD \
-3:10AF520 -3:10AD150 -3:10AD240 -2:0 -2:0 -3:10AD240 d:10 d:1
336 fadd  321,333
338 sti   336, #8(8)

Now in the data stack view, the last value is “d:1”, which means double value 1.

Finally, we do a DADD, or add doubles. This uses LIR_fadd, the floating-point addition operator, on operands 321 (not shown here) and 333, the LIR way to refer to the 1.0 we loaded previously. Finally, we store the result on the stack. Note that we store it 8 units down from the result of I2D: #8(8) instead of #16(8). This is because the stack has one fewer 8-byte element, as DADD has popped two operands and pushed only one.

Here’s the state after this instruction:

T 11F4EC      DUP          -8:0 -3:10AF520 -3:10AD150 -3:10AD240 -2:0 -2:0 -3:10AD240 d:11

The top of the stack is now 10+1=11. Way up above, I said that a loop header is considered hot and gets traced after 10 hits. And here we just incremented something by 1, and get 11. So even though there’s a ton of LIR before this, we know this is actually working on the variable “i” in the original program.

Next time: trace optimization.

Comments

Comment from shaver
Time: May 23, 2008, 8:13 pm

I’ve learned more from your last 3 posts than from actually getting something up and tracing in an experimental hack. I’m not sure how I feel about that.

Comment from Rick Reitmaier
Time: May 29, 2008, 9:51 am

You may have already figured out most of this in subsequent
posts, but I’m only now getting through your wonderful
write-ups and thought I’d post as I go if you don’t mind.

Besides starting a trace along a loop edge, we can also start a trace
from the exit of another trace.

Regarding the LIR_sti being emitted as the RegionTrackers are updated…
These stores write back operation results to the interpreter stack (sp/rp). We
had an earlier version of the code wherein we delayed the stores until we
were in the exit code, and this resulted in an excessive amount of code
in the exits. Not to mention most of it was duplicated many times over.

By placing the stores in the mainline we reduce the duplication and
relieve the exit code from having to manage them. The cost of course
is that we are paying the price of a store on the main trace which
may never be needed. In compile() you’ll notice that rmStores()
attempts to reduce the redundant stores ‘locally’ (that is, between
guards). It also tracks the top of stack(s) and removes any
stores that fall out of range.

Missing sequence numbers in LIR output. We’ve massaged
the output somewhat from what is actually in memory, attempting
to get rid of some of the extraneous information. So using your
example:

328 imm #1
330 sti #1, #16(8)

A LIR_tramp is used to extend the reach of instructions (remember
we use 8-bit relative addressing for the operands) and
was inserted in order to point to 8. So strictly speaking we
probably have something like;

328 imm #1
329 tramp 8
330 sti 328,329,#16

By the way sti is identical to st, except the displacement
is contained directly in the instruction. This is a memory
optimization, which saves an instruction in the case of
small displacements (<256).