Main menu:

Site search

Categories

Archive

Tamarin Tracing Interals, Part II: Forth

The Need for Forth Subroutines. I had a really hard time tracking down how TT adds a pair numbers (ActionScript code like “sum += i”) worked until I finally figured out that ECMAScript “+” is not a primitive operation in TT. This makes perfect sense now, as “+” is complicated: it has to do different things depending on the argument types (floating-point addition, string concatenation).

It would still be possible to make “+” an IL primitive implemented in C++, but it’s better not to, and TT doesn’t. The reason is that TT wants to be able to specialize the code for “+” when tracing. For example, if the arguments are doubles on the trace, then TT could emit only the floating-point addition code, which is faster and smaller than the general code. But this is hard to do right if “+” is an IL primitive. In that case, the tracer would need complex C++ code to perform the specialization for every primitive of this kind. And even the slightest difference between the tracer’s logic and the interpreted code would cause weird VM bugs. As Graydon told me, TT tries hard to avoid this kind of redundancy.

Instead, TT implements ECMAScript “+” (IL OP_add) as a subroutine of IL instructions. I’m not sure what the official TT terminology is, but extern is the word used in code identifiers. To execute an extern in interpreted mode, the system simply jumps to that subroutine, executes the IL instructions that carry out the case logic, and returns when done. In tracing mode, the system can trace and optimize the extern’s IL just as it does any other IL.

Basic Forth. Externs are written in Forth. (The Adobe guys explained about Forth in comments to my last post.) Forth is pretty close to IL, so the Forth compiler can be pretty simple (fc.py, 1900 lines of code). I don’t know Forth, but I did once do a bunch of HP-28S programming, which used a Forthlike language.

Mason Chang has some good pointers on Forth. I’ll also give a quick summary of what I found out. In Forth, the state of the system is pretty much just a stack, and Forth code is pretty much just a sequence of stack operations separated by spaces. The stack operations are called Forth words (“word” as in “command”, no relation to machine words). Take this code snippet:

0 i2d

“0” is a Forth word that means “push 0 onto the stack”. “i2d” is another Forth word, a TT primitive operator that converts the top of the stack from an int to a double. So the total effect of this snippet is to push double 0.0.

We can package this as a Forth subroutine named “float0″ just like this:

: float0 0 i2d ;

“float0″ is now a Forth word, usable like any other. The colon and semicolon start and end a definition. (I think colon and semicolon are “officially” Forth words themselves, although the TT Forth compiler fc.py treats them more like syntax.)

Cases. Another interesting Forth feature is case “statements” (not sure what the correct term is). I think this is a typical feature but the exact way it is defined in TT is specific to TT. In TT, a Forth word can be defined as a case statement, which seems to be used mostly for dynamic dispatch. E.g.:

CASE: toboolvec ( xvalue bt — bool )
( 0 BoxedDouble ) d2b
( 1 BoxedNull ) no-op

( 8 BoxedInt ) i2b ;

This defines a new Forth word toboolvec, which converts a value to a boolean. (I think “vec” is a TT conventional ending for case-defined words.) TT Forth CASE is pretty tricky, and I had to look around for a while to figure out how what it really does.

The first thing to note is that “( … )” is a comment in Forth. The first comment in the CASE above is a stack comment, documenting the effect of the word on the stack. This word pops 2 inputs: xvalue, a boxed value to convert to a boolean, and bt, a boxtype constant that represents the type of xvalue. The word then pushed one value: the boolean representation of xvalue.

Second, unlike a C switch, these CASEs don’t represent the conditions for each case explicitly. Rather, a CASE pops the top value (which should be boxtype, here), which must be an integer k, and then executes the kth word of the body of the case. The comments in the case body document this relationship and make it a lot easier to read.

toboolvec is used by an easier-to-use word, tobool:

: tobool ( xvalue — bool ) boxtype toboolvec ;

boxtype ( x — x i ) pushes the boxtype constant indicating the type of the top of the stack, which must be a boxed value. So tobool just converts the top of the stack to a boolean.

OP_add. Now I actually know enough to understand how OP_add is defined. It’s defined in vm.fs in several pieces, but basically it uses the boxtype word to get the type of both operands, then case words that double dispatch to a type-specific addition operator. When adding two doubles the final addition operator is w_fadd. This is defined:

EXTERN: w_fadd f+ dbox ;

f+ is another name for DADD, which defined:

PRIM: DADD (( f0 f1 — fr ))
interp:{ fr = f0 + f1 ; }
trace:{ interp.trace_binop(LIR_fadd, sp, USE_Q, USE_Q); } ;

(Mason explained how primitives work.)

Sadly, it appears from reading the int+int case that when adding two ints, TT must promote them to doubles because ECMAScript requires numeric addition to work as if the operands were doubles. I wonder how much of a performance penalty this is for array-iteration code and if there are any opportunities to optimize this by figuring out when it’s safe to keep the int representation, perhaps by loop variable interval analysis or speculation…

Comments

Comment from Edwin Smith
Time: May 22, 2008, 9:20 am

A forth subroutine is usually called a “high level word”. the syntax is

: words ;

EXTERN: replaces : and marks the word as callable from C. fc.py only compiles words reachable from the EXTERN entry points. so EXTERN in a forth file is like the “extern” keyword in a C file.

TT habit of appending “vec” to name cases comes from the forth notion of vectorized execution. A CASE is basically a jump vector (table) definition – executing a case word is like “goto *vec[*sp–]”.

int(double(a)+double(b) used in an array index context is a hotspot in TT and would definitely benefit from statically optimizing out when possible, or adding an overflow check to dynamically keep things as ints as long as possible.

when doing int + int it is usually a win to speculate that the result doesnt overflow. in array index contexts, its always a win.

for cyphers that count on wraparound, you want to have a way to optimize out the int->double->int conversion that happens in the overflow case. (in other words cyphers want wraparound behavior).

outside of array index contexts, the jury is still out. Mike Pall made a good case for leaving everything as double.

Comment from dmandelin
Time: May 22, 2008, 10:12 am

Cool, thanks for adding your comments, by the way. This really helps.

I saw the email about Mike Pall’s experience. We were talking about that the other day and array accesses was the one probable exception that came up.

So if you speculate non-overflow on integer addition, and use an integer addition ISA instruction, does that mean you have to add a check for overflow afterward? If so, is that costly?

The cipher thing is interesting. I guess it would be a lot easier if int semantics are available. But I guess there is no int type even in ES4? So are you thinking language extensions, or some kind of pattern recognizer for ES that encodes wraparound semantics for Number?

Comment from Edwin Smith
Time: May 22, 2008, 11:36 am

You can get int semantics in ES3 like this:

~~(a+b)

since >> (and the other bitwise operators) convert to int32 first.

I prototyped adding an overflow check operator in TT’s forth that translated into checking the condition codes at the ISA level, and guarding on not overflowing. it generally works, and it worked well enough to pursue further.

i did hit a snag optimizing it because that deepdown branch can cause a tree branch where you dont want one. both sides of the branch will end up computing (a+b) after you optimize, but now you have two identical tree branches.

Without a technique like UCI’s tree folding, which would eliminate those branches, the penalty for too many branches was higher than the penalty for doing extraneous doubleToInt32()’s. crypto-md5() is a case that showed this behavior.

this was one of the factors that led us to introduce a static abc->IL translation step. in the case we’re talking about its soo easy to handle staticaly, and really tricky to handle dynamicaly.