Tamarin Tracing Internals, Part I

Tamarin (technically, tamarin-tracing, henceforth TT)-related projects keep peeking up at me from the horizon. First, there’s a good chance I’ll have an intern working on TT this summer. And then there’s this “Tracehydra” idea. It’s a way to connect Spidermonkey’s JS parser with the TT execution engine. This is the plan:

Where “profit” means “run Javascript really fast”. Tracehydra would be the fluffy cloud that translates Spidermonkey bytecode to Tamarin IL (or possibly LIR; the details get confusing fast). (In the interest of reducing confusion slightly, I’ll say that IL stands for intermediate language and is roughly a synonym for bytecode. TT people often refer to their IL as “Forth” because they based the design on Forth or something, but I know nothing more about Forth than that it involves stacks, so that doesn’t help me.)

Specifically, Tracehydra means using Treehydra to translate the Spidermonkey (SM hereafter) C code that interprets each bytecode into C++ that emits equivalent (to the C) Tamarin IL. So I guess it reduces the problem from translating SM bytecode to TT IL to translating SM C cases to TT IL-building code. Put that way, it’s not clear this actually helps, but I think SM bytecode is believed to have complex semantics that would be difficult to code in TT IL by hand, and maybe the C in SM uses fewer, simpler constructs that are easier to translate. Seems possible, anyway.

If I’m gonna make any sense out of this I need to learn something about TT IL and how the TT VM uses it.

Digging in to Tamarin. There doesn’t seem to be a lot of documentation on TT, so I thought I’d write up whatever I managed to figure out, for my own benefit at least. By the way, I’ve probably gotten some things wrong, so any TT experts are highly encouraged to correct me.

I should mention that Chris Double’s Tamarin posts have been invaluable; they are the best source I know of that explains how to actually build and run TT. And Graydon Hoare’s diagrams and comments are what got me started on having any idea where to look for anything in the code.

My first question was, what is the “life cycle” of a program run by TT? The TT shell as it exists today runs ActionScript programs, specifically ABC files (ActionScript bytecode). Once the tracer kicks in, TT is running machine code traces (e.g. x86-64 ISA). (Ugh. This is turning into alphabet soup.) What goes on in between? Here’s what I found.

By the way, this picture gives an overview. (Picture does not exist yet.)

Program Form 0: ActionScript. I followed a sample program, the simplest program I could think of that will get traced (you need a hot loop), through TT. Here’s my program:

  var sum = 0;
  for (var i = 0; i < 1000000; ++i) {
     sum += i;   
  }
  print(sum);

By the way, on my MacBook, this runs in 1.7s in interpreted mode (tracing disabled) and 0.28s with tracing enabled, a 6x speedup. And that includes VM startup time. For comparison, Java runs the equivalent program in 0.13s. (But Java gets the answer “wrong”, because ActionScript uses unlimited-precision integers while Java uses int32s. So this test is unfavorable toward TT.) Excluding startup time, Java takes about 4ms, apparently, the same as C. I’ll have to retest TT excluding startup time, once I figure out how.

Program Form 1: ABC. This is the ActionScript bytecode and is the input to the current Tamarin shell. I don’t need to know too much about this, but the real Tamarin stuff is generated from here, so I peeked at the ABC for my program. ABC is apparently a stack-based bytecode language, much like Java bytecode. Here’s a snippet from my sample program, with my annotations after //:

  // var sum = 0;
  // sum was assigned local variable ‘slot 0′ in ABC file headers
  // Stack starts as: [stuff]
  pushbyte 0
  // Pushed a 0 value onto the stack. Stack is now: [stuff] 0
  getglobalscope
  // Pushed the global ‘this’ onto the stack. Stack is now: [stuff] 0 this
  swap
  // Swapped top 2 elements of stack. Stack is now: [stuff] this 0
  setslot 1
  // Stored top of stack into slot 1 of next stack element. Stack is now: [stuff]

I’m not entirely sure how the compiler writers manage to get all swap, over, dup, pick3, etc. operators right, but the code itself is understandable enough.

Transformation 0->1: ActionScript->ABC. This can be done externally to TT by the ActionScript Compiler, ASC, which is part of the Flex SDK.

Program Form 2: IL. Tamarin IL is yet another stack-based bytecode. This is the IL that the VM executes directly when in interpreter mode. The basic operations include:

  • Stack manipulators, such as DUP, DROP, OVER,
  • Arithmetic and logical operators, such as IADD,
  • Control flow operators, such as LBRT,
  • Object and variable storage operators, such as SETSLOTVALUE_I,
  • Interpreter internals operators, such as _debugenter, verify_x, and
  • Weird stuff like op_ROTNAME2_SWAP_ROT_SETRT.

I think the weird stuff might be “superinstructions”, which are instructions that just implement a short sequence of basic instructions. Apparently they help interpreters run faster by reducing decode overhead, like old CISC processors.

These instructions are defined in files named core/vm_*.h, e.g., vm_min_interp.h. These files are heavily macroized, which allows the actual meaning to be controlled by defining those macros in different ways, although I’m not sure exactly how this feature is used yet. Here’s IADD:

  INTERP_FOPCODE_INTERP_BEGIN(IADD)
    /* IADD None {‘stktop': 0} */
    const int32_t tmp_i_0 = int32_t(sp[-1].i) + int32_t(sp[0].i) ;
   
    INTERP_ADJUSTSP(-1)
    sp[0].i = tmp_i_0;
    INTERP_INVALBOXTYPE(sp[0])
  INTERP_FOPCODE_INTERP_END(IADD)

As you can see, it adds the top two elements of the stack (sp) as integers, cuts the top element off the stack, and stores the sum in the new stack top. There’s also some junk about invalidating the box type, which seems to be some kind of debugging feature. When this is included in VMInterp.cpp, the begin and end macros will turn it into a switch case. From VMInterp.ii in my build:

  foplabel_IADD: { pre_interp(interp, f, ip+-1, sp, rp);
    const int32_t tmp_i_0 = int32_t(sp[-1].i) + int32_t(sp[0].i) ;
    sp += (-1);
    sp[0].i = tmp_i_0;
    do { } while (0);
  goto *k_foplabels_interp[*ip++]; }

pre_interp just prints out a trace of the instruction if the interpreter is in verbose mode and a bunch of other conditions are true. The “computed goto” at the end is an indirect jump to the case for the next instruction. This is some kind of optimization but I’ve never gotten a really convincing answer as to why it works, or if in fact it works, so I won’t go into it here. Processor experts, feel free to educate me.

The main control flow operators (i.e., used for if) seem to be LBRT and LBRF (local branch if true/false?). LBRT branches to a selected point in the IL sequence if the top of the stack is true, interpreted as a boolean. The target is specified as an offset from the address of the LBRT instruction. The target is given in the IL stream in the 2 16-bit units immediately following the LBRT code.

I also wanted to know what a function call looks like in IL, and it was surprisingly hard to figure out for some reason that I still don’t understand. But it looks like a standard call to a JS function is through an opcode w_callprop_only or w_callprop_argcok. These opcodes don’t seem to be defined in the usual vm_*_interp.h, but somehow they are made to branch to foplabel_TRACE_super_or_extern in VMInterp.ii. That code does the usual saving of a return pointer and setting the instruction pointer. Returning is accomplished with a less mysterious w_returnvalue or w_returnvoid opcode.

The type of the instructions is FOpcode, which is a 16-bit int.

Transformation 1->2: ABC->IL. This is where Tamarin starts. This step is really important because any other system that wants to use Tamarin, like Tracehydra, would basically have to do the same thing, just starting from some other form of input instead of ABC.

Tamarin performs this transformation on one method at a time, which is typical of JITs. The transformation is performed by Verifier::verify, which simultaneously verifies the ABC (checks for ill-formed ABC, like the Java bytecode verifier) and outputs Tamarin IL. The entry point to the verifier is apparently Interpreter::verify_x, which is just the implementation of an IL instruction also called verify_x.

That last part was probably pretty confusing. It surprised me, at least. It means the interpreter is already running IL by the time it actually translates and runs any ABC. I think what this means is that when the shell starts up, it starts the interpreter with a bit of boilerplate IL. The IL itself has code to verify and run the ABC program.

[Warning: compiler-expert-level material.] The verifier works as an abstract interpretation of the ABC. It uses the information gathered both to check for problems with the ABC and to guide IL generation. The state of the abstract interpretation is an abstraction of the ABC stack, just a list of types of objects on the stack, along with a lot of flags describing other conditions.

A standard abstract interpreter works as a fixed-point solver that possibly makes several iterations over the code. But TT’s verifier doesn’t work like that at all. The reason is that it requires that the states be equal at every join point, otherwise the ABC is invalid. So if the ABC is valid, then the interpreter never needs to look at a given instruction twice. And because backward branches in the ABC exist only for loops, this means the verifier can just run in a single forward pass through the ABC bytecode sequence. But it’s still really an abstract interpreter. Pretty neat. [End super-hard stuff.]

So the verifier runs over the ABC in sequence, always tracking the current abstract stack state. For each instruction, it generates IL. Part of generating the IL is generating any IL needed to fetch operands; the verifier can use the stack state to figure out where they are. After generating IL, the verifier simulates the effect of the instruction on the stack. This is all handled by a big switch on the ABC opcode, and each case has separate logic to generate IL and simulate the results.

The actual writing of the bits and bytes of IL is delegated to a class MethodWriter, which in turn delegates to ForthWriter (there’s that Forth stuff again). ForthWriter is a small class that maintains a buffer of IL bytecodes. It has a small API with methods like emit_simple(), which emits a simple IL instruction. MethodWriter is a wrapper used to generate Forth from ABC. The MethodWriter API takes ABC instructions as input and then tells the ForthWriter to emit the corresponding Forth bytecodes. For ABC opcodes with a direct IL equivalent, it looks up the equivalent in k_abc_opcode_map, which ultimately comes from vm_*_codepool.h. Otherwise it just has to work a little harder.

Time for a little example. I’ll use the same snippet I used above, with the IL translation trace straight out of the log file:

                  frame: global * * global
  2:pushbyte 0
+000D LITC0
+000E w_ibox
                  frame: global * * global int
  4:getglobalscope
+000F OVER
                  frame: global * * global int global
  5:swap
+0010 SWAP
                  frame: global * * global global int
  6:setslot 1
+0011 LITC4
+0012 w_ck_setslot_box
                  frame: global * * global

I had no idea what any of this was the first time I looked at it. There are 3 kinds of lines. First, the “frame: ” lines are the abstract interpreter’s stack state (“stack frame”) at each point. Second, the lines that start with numbers are ABC bytecodes. Third, the lines that start with “+” are the IL translation of the ABC code just above them.

The first bytecode:

                  frame: global * * global
  2:pushbyte 0
+000D LITC0
+000E w_ibox
                  frame: global * * global int

‘pushbyte 0’ in ABC pushes the value 0 onto the stack. In IL, this is two steps: first, an opcode that pushes the literal C int value 0 (i.e., a machine word of 0 bits), then an opcode that boxes that C int into a Box. Box is Tamarin’s standard value type that can hold anything. To C, a Box looks like an IEEE-754 64-bit NaN. Any double that has an exponent (bits 52-62) of all 1s and a fraction that is not all 0s is a NaN, so counting the sign bit there are really 2^53-2 different NaN bit patterns. Tamarin cleverly uses only one of those in floating-point computations so it can pack other data types, like 32-bit ints, into the 2^53-3 spare NaNs. A Box starting with the bit pattern 1111111111000 is an int, and the rightmost 32 bits contain the int data. See Box.h.

In our example, note how the verifier knows the top of the stack now holds an int.

The second bytecode:

                  frame: global * * global int
  4:getglobalscope
+000F OVER
                  frame: global * * global int global

This is kind of interesting. An object-model-type ABC instruction, getglobalscope (pushes the scope to use for unqualified name lookups), got translated into a stack manipulation IL instruction, OVER (pushes the value just under the top of the stack). Apparently, the global ‘this’ scope goes on the stack at the start of a method, and the interpreter records its position. If no other scopes have been entered (no with blocks), then the verifier can just emit an instruction to pick it from its current depth in the stack, since of course the interpreter also always knows the stack size. If there are other scopes in play, then the verifier emits a w_getouterscope instruction, which calls the interpreter C method Interpreter::getscopeobj. That method goes through a few levels of indirection but is ultimately pretty simple, grabbing a scope chain for the current method and then picking off the first scope. Well, I really don’t know what that means, because I’m not too clear on scope chains yet, but it’s only a few lines of code, anyway.

That’s enough for now.

Comments

Comment from Steven Johnson
Time: May 18, 2008, 12:00 pm

Hi David, nice post. A few comments/corrections/clarifications:

> TT people often refer to their IL as “Forth” because they based the design on Forth or something

The IL is essentially a dialect of Forth. Nothing magic about Forth, it’s just a simple, small, portable language that’s easy to model and pretty close to a portable assembly language.

> Digging in to Tamarin. There doesn’t seem to be a lot of documentation on TT

Sadly, yes, this is the case. Efforts like this are highly welcomed. Mason Chang’s blog is also a good resource: http://masonchang.blogspot.com/

> Once the tracer kicks in, TT is running machine code traces (e.g. x86-64 ISA).

Actually, we don’t yet support x86-64, only x86-32, ARM, and Thumb. (x86-64 work is underway.)

> Tamarin IL is yet another stack-based bytecode

Technically, it’s not a “bytecode” because we’re currently using 16-bit opcodes…

> I think the weird stuff might be “superinstructions”, which are instructions that just implement a short sequence of basic instructions.
> Apparently they help interpreters run faster because by reducing decode overhead, like old CISC processors.

Exactly. Our Forth compiler (fc.py) tries to create superinstructions aggressively as it makes a huge difference in interpreter speed, doubling speed in a lot of cases. The faster the interpreter can run, the less aggressively we have to create traces, which translates into less memory needed for traces. We’re not done yet in improving interpreter speed. There’s more work to be done here, as there’s no point in generating superinstructions for non-time-critical code (e.g. error handling), and conversely you want to be more aggressive in really hot spots… we need to add some profile-driven feedback to the Forth compiler to help it make those decisions; see https://bugzilla.mozilla.org/show_bug.cgi?id=432541 for more info.

> There’s also some junk about invalidating the box type, which seems to be some kind of debugging feature.

Bingo. Basically every Box is a tagged value, but there are times when we don’t bother updating the tag because we know it won’t be examined before it’s set again. Early in development this was the source of lots of heisenbugs, so in Debug builds we always update the tag, but sometimes to a value which generates assertions if you attempt to examine it.

> The “computed goto” at the end is an indirect jump to the case for the next instruction. This is some kind of optimization but I’ve
> never gotten a really convincing answer as to why it works, or if in fact it works, so I won’t go into it here.

It makes a huge difference in speed, something like 20% last I checked (try building both ways and run a performance test and you’ll see). Unfortunately, GCC is just about the only mainstream compiler that supports this feature (AFAIK). Why is it faster? Mainly because so many of our operations are small, so the overhead of dispatching is significant relative to the actual work being done. The code generated for a computed goto is often simpler/faster than for switch, and also performs better with branch prediction on modern processors.

> LBRT and LBRF (local branch if true/false?).

Actually, Long branch (32-bit offset) — there is also a BRF (16-bit offset).

> These opcodes don’t seem to be defined in the usual vm_*_interp.h, but somehow they are made to
> branch to foplabel_TRACE_super_or_extern in VMInterp.ii.

Actually, only in trace mode. What you have to keep in mind is that the interpreter has two distinct loops: do_interp(), which is used when simply interpreting a sequence of opcodes, and do_trace(), which is used when we decide that we are interpreting a “hot” section of code that might need to be jitted. do_trace() has to do considerable extra work, as it does everything that do_interp() does, PLUS build up a series of instructions to be JIT’ed. The reason for the branch to foplabel_TRACE_… is that the JIT only knows how to process the primitive set of Forth operations, so superinstructions are traced as the series of primitives they are composed of. (This is considerably slower, of course, but in do_trace() the time is completely dominated by work done by the JIT so this is generally lost in the noise.)

Pingback from Ajaxian » Having a Tamarin trace a Spidermonkey
Time: May 19, 2008, 4:15 am

[…] Mandelin has posted about Tracehydra, which is the idea that the traced based JIT engine that is being worked on as part of Tamarin […]

Comment from Edwin Smith
Time: May 19, 2008, 6:27 am

Reading with bated breath…

* macros are used in superinstruction defs to mainly enable various kinds of code for different compilers, without stuffing all that into the forth compiler (utils/fc.py).

* the motivation for superinstructions is that when interpreting, CISC is better because it reduces dispatch overhead and does more work per instruction, providing more context to the C++ compiler. All SM’s bytecodes are fat — what we’d call superinstructions in TT.

* the motivation for writing fat opcodes in Forth is that we can generate IL for them easily. we need the IL so we can both interpret and trace these fat opcodes. Tracing C++ means tracing x86 (or whatever) and makes it hard to recover important semantics required for optimization. (the same risk is present in forth — if it’s too low level we can boil away important semantics). it’s a balancing act.

* why forth? the IL is stack machine bytecode, forth is just a “source” dialect for stack machine code. an advantage of handwriting forth is easy factorability because functions do all their side effects on the stack. a disadvantage is obscurity …

* using stack machine IL is somewhat arbitrary. there is evidence that a virtual-register based IL (one-addr or two-addr encoding) would be a better fit. — we don’t have to give up semantic richness to do this especially if treehydra is feasible.

Comment from Callek
Time: June 1, 2008, 9:16 pm

I’m already lost and not even through this whole post… one note, seems you forgot something:

“By the way, this picture gives an overview. (Picture does not exist yet.)”

Perhaps it could exist now? ;-)

Comment from Zsolt Világos
Time: June 23, 2008, 5:09 am

I am qurious about the diagrams created by Graydon Hoare but I could find them, can anyone give me a link.
I also could find Chris Double’s posts.

regards,
Zsolt

Comment from Zsolt Világos
Time: June 24, 2008, 7:06 am

Sorry I mistyped twice: I __couldn’t__ find the mentioned ones.
