SquirrelFish

If you’re reading this, chances are that you already know about SquirrelFish, Apple/WebKit’s new JavaScript implementation. Early tests show SquirrelFish to be 60% faster than WebKit 3.1 JS, 46% faster than Spidermonkey, and 52% faster than TT (Tamarin Tracing) on SunSpider.

Clearly we have some work to do. The plan is to improve TT so that hot loops run highly optimized native code; TT’s optimizer is in the early stages, and we think there’s a lot of room for further optimization. For example, as explained in my previous posts, when TT jumps from one trace to another, it has to save the interpreter state to a standard format and then reload the state on the new trace, and no cross-trace optimization is possible. Ideas like trace folding have potential for a big improvement here.

But I still wanted to take a quick peek into why SquirrelFish’s interpreter is so fast. The announcement touts 3 key design features:

  • Using a bytecode interpreter.
  • Using direct-threaded interpreter dispatch.
  • Using a register-based bytecode interpreter.

Bytecode interpreter. The old WebKit JS was based on an AST walk, which is explained in some detail in the announcement. It’s well known that bytecode interpreters are faster, so it’s no surprise Apple made this change. Spidermonkey’s and TT’s interpreters have always been bytecode interpreters.

Direct-threaded dispatch. Interpreters tend to spend a lot of time dispatching bytecode operations. Direct-threaded dispatch is a technique for efficient dispatch.

The obvious way to write a bytecode interpreter is with a switch statement inside a loop:

void run(Bytecode *ip) {
    for (;;) {
        switch (*ip++) {
        case OP_ADD: …
        case OP_JUMP: …
        …
        }
    }
}

Each iteration runs one bytecode instruction. Each case of the switch handles one instruction type. It really doesn’t look like there’s any room for improvement here unless you look at the assembly code generated for the switch dispatch (from a tiny test interpreter I wrote today, comments added by me):

# ip is in %edx
# Check that switch expression (%edx) is in table range
cmpl    $9, (%edx)
leal    4(%edx), %ecx
ja    L26
# Look up case address offset for (%edx) in table (L37)
movl    (%edx), %eax
movl    L37-"L00000000006$pb"(%ebx,%eax,4), %eax
# Add base address to offset
addl    %ebx, %eax
# Indirect jump to computed address
jmp    *%eax

The basic idea is to keep a table of relative offsets to the cases, and then jump using that offset. Because the switch expression could evaluate to anything, the compiler must first generate a range check, so that if the switch expression doesn’t map to any case, the program leaves the switch instead of crashing unpredictably.

But in reality the range check is useless, because the interpreter can control what bytecodes actually appear, so that’s 3 wasted instructions. Also, the table lookup and base+offset address computation seem kind of clunky. I’d prefer dispatch code something like this:

jmp    *%edx

This is direct threading. In principle, the idea is simple: the instruction code (e.g., OP_ADD) is the address of the case target code, rather than a small integer as in the basic switch design. Then this jump is all you need for dispatch.

Coding up direct threading is weird; normal C compilers don’t know how to do it. But it can be done with GCC’s computed goto extension (see the paper on direct threading from the SquirrelFish announcement). See also my tiny interpreter.
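
Here’s a minimal sketch of what direct threading looks like with computed goto. This is just an illustration, not SquirrelFish’s code or my tiny interpreter; the opcodes and the little push-push-add program are made up:

#include <stdio.h>

/* Runs the toy program "push 1, push 1, add" and returns the result.
   Each "opcode" in the program array is the address of its handler,
   taken with GCC's &&label syntax. */
static double run_direct(void) {
    void *program[] = { &&push1, &&push1, &&add, &&halt };
    void **ip = program;

    double stack[16], *sp = stack;

    goto **ip++;              /* dispatch is a single indirect jump */

push1:
    *sp++ = 1.0;
    goto **ip++;
add:
    --sp;
    sp[-1] += sp[0];
    goto **ip++;
halt:
    return sp[-1];
}

int main(void) {
    printf("%g\n", run_direct());   /* prints 2 */
    return 0;
}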

I believe TT and Spidermonkey use an intermediate design called indirect threading, which gets most of the speedup of direct threading, but allows integer opcodes. The opcode is an index into a table of case target addresses. So the dispatch code has to look up the case target, then jump, something like:

movl   TABLE(,%edx,4), %eax
jmp    *%eax

Not too bad. I have no idea how significant the difference between direct and indirect threading is in practice, but even a few percent speedup would be great.
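
For comparison, here’s a sketch of the same toy with indirect threading: the opcodes stay small integers, and dispatch looks the handler address up in a table before jumping. Again, this is only an illustration, not the actual Spidermonkey or TT dispatch code:

#include <stdio.h>

enum { OP_PUSH1, OP_ADD, OP_HALT };

static double run_indirect(void) {
    /* Integer opcodes index a table of handler addresses. */
    void *table[] = { &&push1, &&add, &&halt };

    unsigned char program[] = { OP_PUSH1, OP_PUSH1, OP_ADD, OP_HALT };
    unsigned char *ip = program;

    double stack[16], *sp = stack;

    goto *table[*ip++];       /* dispatch: table lookup, then indirect jump */

push1:
    *sp++ = 1.0;
    goto *table[*ip++];
add:
    --sp;
    sp[-1] += sp[0];
    goto *table[*ip++];
halt:
    return sp[-1];
}

int main(void) {
    printf("%g\n", run_indirect());   /* prints 2 */
    return 0;
}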

Register-based interpreter. “Traditional” interpreter design keeps all operands and intermediate values of expressions on a data stack. (This is completely different from the call stack.) I can think of a few design reasons why this might be, but I really don’t know why it’s been done this way. Anyway, with a stack, bytecode for a line of code like “x = a + b + c” looks like this:

LOAD a  # push a onto stack
LOAD b
ADD      # pop 2 elements, add them, push the result
LOAD c
ADD
STORE x

One nice thing about this is that the algorithm for generating this bytecode is very simple, so it’s not a lot of work to code and runs fast. But there’s a lot of code just to manipulate the stack here, and it might be nice to avoid it. (Real stack-based programs have even more stack manipulation, with DUP, SWAP, ROT, etc. operations.)
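
To show how simple, here’s a sketch of a generator for expressions like the one above: a single post-order walk of the expression tree emits correct stack bytecode, with no slot assignment or liveness analysis needed. The Node type and emit helpers are made up for illustration:

#include <stdio.h>

typedef struct Node {
    struct Node *left, *right;   /* NULL for a leaf */
    int slot;                    /* which variable, for leaves */
} Node;

/* Stand-ins for appending an instruction to a bytecode buffer. */
static void emit(const char *op)  { printf("%s\n", op); }
static void emit_load(int slot)   { printf("LOAD %d\n", slot); }

static void gen(const Node *n) {
    if (!n->left) {              /* leaf: just push the variable */
        emit_load(n->slot);
        return;
    }
    gen(n->left);                /* both operands land on the stack... */
    gen(n->right);
    emit("ADD");                 /* ...where ADD pops them */
}

int main(void) {
    /* a + b + c, with a, b, c in slots 0, 1, 2 */
    Node a = {NULL, NULL, 0}, b = {NULL, NULL, 1}, c = {NULL, NULL, 2};
    Node ab = {&a, &b, 0}, abc = {&ab, &c, 0};
    gen(&abc);                   /* prints LOAD 0, LOAD 1, ADD, LOAD 2, ADD */
    return 0;
}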

A register-based design avoids the stack by using a fixed array of “slots” or “registers” for operands. (These registers are completely different from assembly code registers.) The register-based version of the above line of code would be something like:

ADD temp, a, b   # temp = a + b
ADD x, temp, c

Keep in mind that in the register-based bytecode, where I wrote “x” the bytecode would actually contain something like “0”, if x had been assigned slot 0 in the register table. Note also that the bytecode generator has to analyze the code a bit to decide how many registers are needed and to assign them to the variables and temporaries.

Each design has advantages and disadvantages:

  • Bytecode code size. Stack-based bytecode programs tend to be smaller, because the operands are implicit. But the register-based program gets to omit the stack manipulation instructions, so the stack-based bytecode should be only a little bit smaller. The memory savings shouldn’t matter in most environments, but reading less bytecode will save the interpreter a bit of time.
  • Bytecode instruction count. As shown above, a register-based program will have fewer bytecodes. Fewer bytecodes to execute means faster run times, if it takes the same length of time to execute a bytecode in each design. Which it doesn’t:
  • Operand access. To access operands, the stack-based interpreter just reads and writes the top of the stack or very near it. A register-based interpreter needs a two-step process: first it has to compute the address of the register, then read or write it. But the register-based version doesn’t need to adjust the stack. Still, it seems like more work per instruction for the register version: for ADD, the register version needs to compute 3 operand addresses, while the stack version computes just one (the new stack pointer). See the sketch after this list.
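
Here’s a sketch of what that looks like for ADD in each design. Value just stands in for however the interpreter represents JS numbers, and these handlers are illustrative, not taken from any real engine:

typedef double Value;

/* Stack-based ADD: the operands are implicit, at the top of the stack.
   The only address that gets computed is the new stack pointer. */
static Value *stack_add(Value *sp) {
    sp[-2] = sp[-2] + sp[-1];
    return sp - 1;
}

/* Register-based ADD: the bytecode carries three slot numbers, so the
   interpreter computes three register addresses for this one instruction. */
static const unsigned char *reg_add(Value *regs, const unsigned char *ip) {
    regs[ip[1]] = regs[ip[2]] + regs[ip[3]];   /* dst = lhs + rhs */
    return ip + 4;                             /* skip opcode + 3 operands */
}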

Overall, when you look across instructions, I think the total amount of operand addressing computation is about the same; the stack version just distributes it across more instructions. The stack version might be able to save a few address computations by having special instructions like LOAD0 (to load slot 0, needing no computation). But the register version still has fewer instructions, so fewer dispatches. With direct threading, dispatch is fairly efficient, so this matters less, but an indirect jump is still usually pretty expensive compared to a normal instruction because of the high branch mispredict probability.

Microbenchmarking. I wrote a tiny interpreter that runs a bytecode program for an empty for loop, using doubles as numbers (as in JS), to test out these things for myself. I tried the 4 combinations of 2 design choices: direct threading vs. switch, and stack vs. register. The times for 10M iterations, in milliseconds:

           Stack    Register
Switch     230      90
Direct     105      55

Here you see huge differences. The reason the differences are so big (far bigger than WebKit got) is that my tiny interpreter’s opcodes are very simple. In a real JS interpreter, the code to run for the average operation is a lot longer, so dispatch overhead is a much smaller fraction of total time, and dispatch overhead is the main thing saved by direct-threading and registers.

Note that the speedup of register vs. stack is smaller in the direct-threaded case, as I predicted above. But not that much smaller, because the dispatch really is expensive compared to everything else in this simple interpreter.

You can get my microbenchmark code here.
