Main menu:

Site search

Categories

Archive

Squirrelfishing regexp-dna.js

Recently I’ve been trying to figure out exactly how SquirrelFish Extreme (SFX) is kicking our butts so badly on regexp-dna.js, by about 5x on my machine. Numerically, (WebKit Regular Expression Compiler) WREC provides most of that 5x, but there are some weird twists to the story. My main conclusions: WREC indeed makes for a good regex engine, but also, WREC bails out of any regex with parentheses, and regexp-dna.js contains a bug that favors SFX over SpiderMonkey.

Update: The bug in regexp-dna.js has been fixed in WebKit! Thanks, WebKit team! That was fast. The new code:

for(k in subs)
dnaInput = dnaInput.replace(k, subs[k])
// FIXME: Would like this to be a global substitution in a future version of SunSpider.

Technical details: the design of WREC. There are two main ways to implement regular expressions: using a backtracking matching engine, or by transforming the regex to a finite automaton (NFA, aka “state machine”), which does not backtrack. Most Perl-type regex engines, including both SpiderMonkey’s and WREC, follow the backtracking design. I don’t know the exact history of that choice, but at present it is much easier to implement features like group capture and backreferences in the backtracking design. Also, although some regexes scale only if implemented as NFAs, my tests suggest that many simple regexes, including those in SunSpider, are faster with backtracking.

As of this writing, WREC’s implementation strategy is dirt simple (which is a good thing). There are no transformations or fancy optimizations on the regex. WREC simply generates native code that directly implements the backtracking search. Thus, within a single match operation, there are no function calls, no traversals of regular expression ASTs, and few option tests, so almost all of the overhead is eliminated.

WREC’s code is very easy to read, so if you want to know exactly how it works, just read it in WREC.cpp. It’s also great example code for anyone implementing a compiler for a simple language like regular expressions. The basic plan is to parse the regular expression with functions named things like parseDisjunction (the | operator). Those functions directly call functions like generateDisjunction that generate the native code using the same assembler that the call-threading interpreter uses. There’s also the oddly named “gererateParenthesesResetTrampoline”. Inexplicably preserved typo, or watermark to detect copying of WREC code?

An an example, for the regex /a|bb/, the generated native code looks something like this:

curPos = start
if curPos >= textLength goto CASE2
if text[curPos] != ‘a’ goto CASE2
curPos += 1
goto DONE_MATCHED
CASE2:
curPos = start
if curPos >= textLength goto DONE_FAIL
if text[curPos] != ‘b’ goto DONE_FAIL
curPos += 1
if curPos >= textLength goto DONE_FAIL
if text[curPos] != ‘b’ goto DONE_FAIL
curPos += 1
goto DONE_MATCHED

As you can see, the generated code does almost nothing beyond what is absolutely needed to implement the matching correctly. Also, curPos, textLength, and the start address of text are all kept in fixed registers. Based on a variety of microbenchmarks, I suspect that the critical path for this code is simply the unavoidable work of reading all the characters in the text, so the code is near maximum efficiency.

Backtracking in WREC. One tricky part of a backtracking regex engine is the backtracking. Consider matching the regex /a*a/ against the text ‘a’. First, the engine will greedily match /a*/ with ‘a’, leaving a remaining regex fragment of /a/ and a remaining text of the empty string. When the engine tries to match /a/ against ”, it fails. Now it must backtrack, specifically by going back to the /a*/ step and finding the next longest match. The next longest match is ”, leaving a remaining regex fragment of /a/ and a remaining text of ‘a’, and the engine will finish with a successful match.

I was curious how WREC implemented backtracking. For example, with /a*a/ and a text of a million ‘a’s, I wanted to know if WREC saves all one million shorter matches in case it needed to backtrack them (which seems like a waste of space), or if WREC recomputes them if needed (which could be slow if a lot of backtracking needs to be done). I found that WREC generates code like this for a quantified regex a* (I reordered things to make it easier to read, but the logic is the same):

// First, greedy match
repeatCount = 0
while (next char is ‘a’) {
curPos += 1
repeatCount += 1
}
while (true) {
// Now, try to match the rest
save curPos
… code to match the rest of the regex …
if (rest of regex matches) goto DONE_MATCHED
// backtrack
restore curPos
if curPos <= 0 goto DONE_FAILED
curPos -= 1
repeatCount -= 1
// loop around and try to match the rest from here

}

So the answer to my question is that WREC doesn’t save the intermediate matches, it just keeps track of the length of the text matched against /a*/ (repeatCount in my code) and decrements the position and count in order to backtrack. This is great, because it uses only two words of memory for state (instead of N million) but can find the next longest match very quickly (just 4 instructions).

But this left me wondering, how does WREC handle something like /(a|bb)*x/, where backtracking needs to go back a variable number of characters? Here’s the code that replaces ‘repeatCount-1′ with the backtracking logic needed here:

void GenerateParenthesesNonGreedyFunctor::backtrack(WRECGenerator*) {
// FIXME: do something about this.
CRASH();
}

Interesting. This function isn’t called anywhere, so it’s not a crashing bug in WebKit. The reason it’s not called:

bool WRECParser::parseParentheses(JmpSrcVector&) {
// FIXME: We don’t currently backtrack correctly within parentheses in cases such as
// “c”.match(/(.*)c/) so we fall back to PCRE for any regexp containing parentheses.
m_err = TempError_unsupportedParentheses;
return false;
}

Very interesting. And on my machine, dna-regexp.js runs in about 43 ms (in jsc, the SFX command-line shell), but if I add an empty pair of parens to each regex in the test, it takes 206 ms. Breaking out the time for regexp-dna.js regex matching only (i.e., not including the string building and replace calls), with the parens SFX is 15% slower than SpiderMonkey’s, which is actually still pretty good.

It just so happens that SunSpider’s regexp-dna.js doesn’t use parens in any of its regexes.

But the web does, of course. To find out how common parens are, I instrumented a Firefox build to record regexes being run on web pages and opened some popular pages. Firefox processed 830 unique regexes, of which 287, or just about 1/3, contained parens.

String.replace. The other major portion of regexp-dna.js tests String.replace. Again, SFX is about 5x as fast as SpiderMonkey and I wanted to know why. The calls to replace in regexp-dna all look like this:

dnaInput.replace(‘B’, ‘(c|g|t)’, ‘g’)

This is supposed to replace all occurrences of ‘B’ with ‘(c|g|t)’. The search string is just ‘B’, so no regex processing is needed, just simple string searching. So I started reading WebKit code in StringPrototype.cpp. I saw that there is a special case fast-path for a non-regex search string, which makes sense. In fact, it’s so special and so fast that I couldn’t even find any code to implement the ‘$&’ replacement string syntax required by the ECMAScript standard. Testing SFX:

> ‘aba’.replace(/a/, ‘x$&x’)
xaxba // good: $& is the matched text
> ‘aba’.replace(‘a’, ‘x$&x’)
x$&xba // oops

I also couldn’t find out how the ‘g’ flag was implemented, or even any code to read the third “flags” argument at all. Testing SFX again:

> ‘aba’.replace(‘a’, ‘x’, ‘g’)
xba

Hmmm. So the ‘g’ flag is not implemented at all. That’s OK, it’s not part of the ECMAScript standard, it’s just a SpiderMonkey extension. But that means the benchmark contains a flag that just serves to make SpiderMonkey do a lot more work than SquirrelFish. With that flag removed, so that both programs are doing the same replacement operation, the performance difference goes from 5x to 1.5x. Hopefully SpiderMonkey will have a similar fast path soon, which should bring that to parity.

But I think it’s pretty clear regexp-dna.js has a bug. Either the ‘g’ should be removed, or it should be recoded to do the global replacement in vanilla ECMAScript.

Final words. Based on its performance on the regexes it does handle, WREC is indeed an awesome design. regexp-dna.js, however, is flawed and exaggerates SFX performance.

We could use nanojit to make a regex compiler for SpiderMonkey that would perform as well as WREC. But I don’t know if it’s worthwhile yet. Regex performance is much less important for today’s web than it is for SunSpider–I hope to link to a report on that in a future post.

Comments

Comment from Felix
Time: October 6, 2008, 7:27 pm

regexp-dna.js bug report
https://bugs.webkit.org/show_bug.cgi?id=18989

Comment from Bill McCloskey
Time: October 6, 2008, 7:27 pm

Since you posted a link to the Russ Cox article, I thought I’d point out something that I learned from Ras a while ago: the backtracking approach doesn’t use longest-match semantics as the NFA approach does. So the two techniques differ in more than their performance properties. Here’s an example, written in Python:
import re
re.match(‘(|a)*’, ‘aaaaa’).span() # returns (0, 0)
I would expect the whole string to match, but instead only the epsilon at the beginning of the string matches. This seems really weird to me, but it’s how all backtracking-based matchers work: they pick the “left-most match” rather than the longest one.

It’s too bad that getting NFA-based matchers to do grouping is such a pain in the ass.

Comment from Mark Rowe
Time: October 6, 2008, 8:17 pm

> In fact, it’s so special and so fast that I couldn’t even find any code to implement the ‘$&’ replacement string syntax required by the ECMAScript standard.

Did you file a bug report about this?

Comment from Robert O’Callahan
Time: October 6, 2008, 8:32 pm

Make sure to file a Webkit bug!

Comment from Jeremy
Time: October 6, 2008, 8:48 pm

Minor correction: backtracking regular expression engines *are* NFA engines. What you refer to as NFA implementations above are actually using DFAs, not NFAs.

Same goes for Mr. McCloskey’s comment.

Comment from Cory Sharp
Time: October 6, 2008, 10:17 pm

Bill, you are just observing an abort when matching a repetition of epsilon. Since the epsilon _does_ match then ‘a’ isn’t tried, but epsilon does not advance so the match is complete. Observe re.match(r”(a|b)*”,”aabba”).span() returns (0, 5).

Comment from Cory Sharp
Time: October 6, 2008, 10:25 pm

Bill, my mistake, nevermind.

Pingback from Regex Engines in JavaScript – A River of Words
Time: October 6, 2008, 11:31 pm

[…] led me to this post by what appears to be a member of Google’s Chrome team, regarding the Squirrelfish […]

Comment from Gijs
Time: October 7, 2008, 12:27 am

Bill: really? Because in SpiderMonkey (which according to the article also uses backtracking) I obtain this:

>> (“aaaaa”).match(/(|a)*/);
aaaaa,a

Which seems like what you said you expected…

Comment from Chris Jones
Time: October 7, 2008, 12:29 am

Joel Galenson, Ras Bodik and I came across the /a*a/ example you showed. If you follow Russ’s technique and extend it to ‘n’ /a*/’s followed by ‘n’ /a/’s, the backtracking algorithm takes exponential-squared (!) time matching ‘n’ a’s, worse than Russ’s example. It’s pretty amusing to stall Firefox with a 15-character regexp match (not that only Firefox is so affected).

RE Bill’s comment: making NFA algorithms group is not so bad, but making the returned groups useful and understandable is harder. We started looking into this, but the “future post” Dave mentioned could describe why we might stop ;-).

Comment from Chris Kuklewicz
Time: October 7, 2008, 1:02 am

Bill is right. There are 2 major categories of regular expressions: Perl uses left-biased searches that look at alternatives (a|b) and try ‘a’ first, and only try ‘b’ if ‘a’ fails, here (a|b) is very different from (b|a). Posix searches (e.g. grep) try both ‘a’ and ‘b’ and return the longest possible overall match. Thus (a|b) is the same as (b|a).

A left-biased seach uses back-tracking. A Posix seach uses an NFA. But one can also convert NFA’s into DFA’s for space/time tradeoffs. And one can still return sub-matches in parentheses using a DFA, see the work at http://www.laurikari.net/tre/ and (my) Haskell version at http://hackage.haskell.org/cgi-bin/hackage-scripts/package/regex-tdfa

Comment from Ted Mielczarek
Time: October 7, 2008, 2:55 am

Thanks for a very readable and interesting writeup on a complex technical topic!

Pingback from Ajaxian » Regex performance in modern JSVMs
Time: October 7, 2008, 2:57 am

[…] was the conclusion that David Mandelin of the Tamarin project as he looked into how “SquirrelFish Extreme (SFX) […]

Pingback from Ajax Girl » Blog Archive » Regex performance in modern JSVMs
Time: October 7, 2008, 4:07 am

[…] was the conclusion that David Mandelin of the Tamarin project as he looked into how “SquirrelFish Extreme (SFX) […]

Comment from Boris
Time: October 7, 2008, 5:05 am

Another option for the test is to have more than one ‘B’ char in the string and then to fail the test if they don’t all get replaced.

But yes, please file bugs on both the test and their buggy string-replace…

Comment from Ben Hearsum
Time: October 7, 2008, 6:39 am

I’m a lay-man to this kind of thing but this was a really interesting post nonetheless. Thanks!

Comment from Fritz
Time: October 7, 2008, 7:33 am

I don’t know much about this whole situation, but “trampoline” is a technical term in programming, and therefore may be more than just a typo or watermark. See for example the Wikipedia article at . I first heard the term in connection with work on continuations, in the context of Scheme, I think, but the Wikipedia entry suggests that related technical usages predate that work (which was probably mid to late 80s).

Comment from RichB
Time: October 7, 2008, 8:45 am

The test already has a bug report:

https://bugs.webkit.org/show_bug.cgi?id=18989

Comment from nemo
Time: October 7, 2008, 10:00 am

https://bugs.webkit.org/show_bug.cgi?id=18989
Here’s one of ‘em at least.

Comment from Boris
Time: October 7, 2008, 4:21 pm

Fritz, I assume the typo/watermark part David is talking about is the “gererate”, not the “trampoline”.

Pingback from Javascript News » Blog Archive » Regex performance in modern JSVMs
Time: October 7, 2008, 7:43 pm

[…] was the conclusion that David Mandelin of the Tamarin project as he looked into how “SquirrelFish Extreme (SFX) […]

Comment from Mark Rowe
Time: October 7, 2008, 10:48 pm

It was a simple typo that I fixed in http://trac.webkit.org/changeset/37383.

Pingback from Css howto » Regex performance in modern JSVMs
Time: April 23, 2009, 7:32 am

[…] was the conclusion that David Mandelin of the Tamarin project as he looked into how “SquirrelFish Extreme (SFX) […]

Comment from time
Time: February 1, 2010, 6:55 am

Thanks for the webKit team. The old regexp-dna.js was у real pain for me.

Comment from muhtar
Time: March 3, 2010, 8:25 pm

Minor correction: backtracking regular expression engines *are* NFA engines. What you refer to as NFA implementations above are actually using DFAs, not NFAs.

Comment from Valerij Tomarenko
Time: July 12, 2010, 2:20 pm

WREC simply generates native code that directly implements the backtracking search.

Comment from Tom Jefferson
Time: September 16, 2010, 12:20 am

Decent fix, thanks, the old one really made me headaches.

Comment from kiko
Time: October 15, 2010, 11:40 am

I like your blog,and also like the article,and thank you for provide me so much information :)

Comment from Home decor liquidators
Time: October 19, 2010, 11:27 am

Since I sent a link to the article Russ Cox, I thought I point out something I learned from Ras a while: the approach of tracking no longer uses the semantics of the party as the focus of the NFA does. So the two techniques differ in most of the properties of their performance. Here’s an example written in Python. Re.match reimportation ((| a) * ‘,’ aaaaa ‘) include () returns # (0, 0) I hope that all the string height, but only the epsilon to the top of the chain of parties. This seems very strange to me, but it’s like all based tracking matchers work: choose the “leftmost match” rather than longer. It’s a shame to get matchers NFA clusters is based on making a pain in the ass. Actuaries are professionals who employ actuarial science, which is based in mathematics primarily probability and statistics.

Comment from online coupons
Time: October 28, 2010, 7:39 am

I don’t know much about this whole situation, but “trampoline” is a technical term in programming, and therefore may be more than just a typo or watermark. See for example the Wikipedia article at . I first heard the term in connection with work on continuations, in the context of Scheme, I think, but the Wikipedia entry suggests that related technical usages predate that work (which was probably mid to late 80s).

Comment from gallstones symptoms
Time: November 1, 2010, 12:06 pm

I agree with you. Thanks for sharing an useful post.

Comment from E20-591
Time: November 11, 2010, 1:22 am

Thanks for sharing.Fantastic post. Bookmarked this site and emailed it to a few friends, your post was that great, keep it up.

Comment from Loans Fast
Time: November 12, 2010, 4:06 am

Thanks for sharing wonderful information. We need more information about it.

Comment from News Articles
Time: November 13, 2010, 8:24 pm

Another advantage for the analysis is to accept added than one’s burn in the cord and again to abort the analysis if they don’t all get replaced.

Comment from HTC UNLOCK
Time: November 14, 2010, 6:53 am

Minor correction: backtracking approved announcement engines *are* NFA engines. What you accredit to as NFA implementations aloft are in fact application DFAs, not NFAs. Thanks

Comment from Best Online Casino List
Time: November 16, 2010, 10:12 am

was the cessation that David Mandelin of the Tamarin activity as he looked into how SquirrelFish Extreme (SFX)

Comment from LASIK Surgery Sydney
Time: November 20, 2010, 1:25 am

Another advantage for the assay is to acquire added than one bake in the bond and afresh to arrest the assay if they don’t all get replaced.

Comment from free samples
Time: November 20, 2010, 7:43 am

I aboriginal heard the appellation in affiliation with plan on continuations, in the ambience of Scheme, I think, but the Wikipedia access suggests that accompanying abstruse usages predate that plan (which was apparently mid to backward 80s).

Comment from police baton
Time: November 20, 2010, 8:20 pm

Great information, thanks a lot for the fix, I really do appreciate you helping all of us out.

Comment from Cash Loan Payday
Time: November 22, 2010, 7:11 am

I am very glad to see such information, which I was searching for a long time. I will be coming back to see more posts soon.

Comment from Sexy Wallpapers
Time: November 24, 2010, 7:12 am

Since I sent a link to the article Russ Cox, I thought I point out something I learned from Ras a while: the approach of tracking no longer uses the semantics of the party as the focus of the NFA does. was the cessation that David Mandelin of the Tamarin activity as he looked into how SquirrelFish Extreme (SFX)

Comment from Payday Advance
Time: November 25, 2010, 4:30 am

Thanks for sharing wonderful information. I would like to have some more information about it.

Comment from link building service
Time: November 25, 2010, 9:36 am

gererateParenthesesResetTrampoline”. Inexplicably preserved typo, or watermark to detect copying of WREC code?

Comment from Borrow Money
Time: November 26, 2010, 5:12 am

I don’t know much about this whole situation, but “trampoline” is a technical term in programming, and therefore may be more than just a typo or watermark.

Comment from aftermarket cruise control ideas
Time: November 27, 2010, 3:47 am

minor correction: reverse regular expression * * engines are NFA engines. What is referred to as NFA previous implementations do not really use DFA, NFA. The same goes for the comment by Mr. McCloskey. But what we can do, as flawed as we are, is still see God in other people, and do our best to help them find their own grace. That’s what I strive to do, that’s what I pray to do every day.