WTF-16

I’ve been doing some speed and memory profiling of Firefox string usage to help figure out what to do with string encodings in Mozilla 2: should we go all UTF-8, all UTF-16, or keep the current mishmash of ASCII, ISO-8859-1, UTF-8, and the reviled UCS-2? The full profiling results are in the string analysis bug, but I wanted to mention the big things and a few side notes here.

Basically, it turns out that encodings don’t affect browser performance much, and we might as well keep what we have, at least for the near future. Switching Firefox to all UTF-8 would reduce memory consumption for string data by about 25%, which sounds exciting until you discover that string data represents only about 1-2% of Firefox memory consumption (I got 9 MB out of 400 MB in a torture test of opening 80 pages in tabs). On the speed side, there are a lot of calls to UTF-8/UTF-16 conversion functions, about 30,000 in a test of loading 20 pages, but the time taken by conversions seems to be around 0.05% of total time, although it’s hard to measure accurately.

Interesting fact: in my tests, even CJK web pages (a set of 20 pages from Alexa and a set of 24 Wikipedia articles) use a lot less memory in UTF-8. This is surprising to many (including me) because CJK characters all need at least 2 bytes in UTF-8. But it turns out that they usually don’t need more than 2 bytes, and more importantly, a lot of web content on CJK sites is ASCII style sheets, tags, and scripts.

[Update: This isn’t quite right. CJK characters take 3 bytes in UTF-8. Memory usage is lower in UTF-8 because my CJK test pages (which may or may not be representative) generate even more ASCII string content than I thought. I left details in a comment, but WordPress keeps eating it, so I’ll put it here:

On the Wikipedia test, nsTextFragments contained about 1/2 ASCII characters (HTML elements, scripts, small amounts of Latin-1 content, etc.) and 1/2 CJK characters, so it averages out to about 2 bytes per character in UTF-8 and uses only 1% more space than UTF-16. The current FF code uses 18% less memory than UTF-16 because it encodes strings that are entirely ISO-8859-1 in 1 byte per character, and 27% of strings were able to use that encoding.

nsStringBuffers contained 95% ASCII characters, so they are about 1/2 the size in UTF-8 compared to UTF-16. Current FF takes up 90% as much space as UTF-16 because it encodes 80% of the buffers in UCS-2.

nsStringBuffers took up 4x as much space as nsTextFragments, so they dominate the total and you still see a total space reduction of 26% by going from current FF to all UTF-8.]
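
To make that arithmetic concrete, here is a quick back-of-the-envelope check. This is just a sketch using the rough proportions quoted above, so its output only approximates the measured numbers:

```cpp
#include <cstdio>

int main() {
    // nsTextFragments: ~half ASCII (1 byte each in UTF-8), ~half CJK
    // (3 bytes each), vs. a flat 2 bytes per BMP character in UTF-16.
    double fragUtf8 = 0.5 * 1 + 0.5 * 3;          // = 2.0 bytes/char
    // nsStringBuffers: ~95% ASCII, so UTF-8 is about half of UTF-16.
    double bufUtf8 = 0.95 * 1 + 0.05 * 3;         // = 1.1 bytes/char
    // Current FF, per the figures above: fragments at 82% of UTF-16's
    // size, buffers at 90%.
    double fragCur = 0.82 * 2, bufCur = 0.90 * 2;
    // Buffers held ~4x the data of fragments, so weight them 4:1.
    double utf8 = (4 * bufUtf8 + fragUtf8) / 5;   // = 1.28 bytes/char
    double cur  = (4 * bufCur + fragCur) / 5;     // = 1.77 bytes/char
    printf("all UTF-8 vs current FF: %.0f%% smaller\n",
           100 * (1 - utf8 / cur));               // ~28%, near the measured 26%
    return 0;
}
```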

Even though the profiling shows no performance need to change encodings, in my opinion it would be nice if someday we could go all UTF-8. It’s not worth the huge effort today to reduce memory by 1%, but reducing memory by 1% is still a good thing, especially for mobile devices. And it would be nice if programmers never had to think about encoding decisions or conversions because everything was UTF-8. There is one problem, though: the Windows and Mac OS platform APIs are UTF-16, so some conversions would still be necessary. So going all UTF-16 has some benefits too.

Comments

Comment from Robert O’Callahan
Time: February 14, 2008, 3:47 pm

CJK characters are 3 bytes each in UTF-8. That’s why the memory usage reduction is surprising.

The Windows and Mac text APIs are not a problem for UTF-8. We already have a string copy in the textrun construction path that we could use to convert from UTF-8 to UTF-16 as well.
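
[For anyone curious what that convert-while-copying pass looks like, here is a minimal standalone sketch. This is an illustration, not the actual textrun code; the function name is invented, and it assumes well-formed UTF-8 and skips error handling:

```cpp
#include <cstdint>
#include <string>

std::u16string Utf8ToUtf16(const std::string& in) {
    std::u16string out;
    out.reserve(in.size());  // UTF-16 never needs more code units than UTF-8 bytes
    for (size_t i = 0; i < in.size(); ) {
        uint8_t b = in[i];
        uint32_t cp;
        size_t len;
        if (b < 0x80)      { cp = b;        len = 1; }  // ASCII
        else if (b < 0xE0) { cp = b & 0x1F; len = 2; }  // 2-byte sequence
        else if (b < 0xF0) { cp = b & 0x0F; len = 3; }  // 3-byte (incl. CJK)
        else               { cp = b & 0x07; len = 4; }  // 4-byte (beyond BMP)
        for (size_t j = 1; j < len; ++j)                // fold in continuation bytes
            cp = (cp << 6) | (in[i + j] & 0x3F);
        if (cp <= 0xFFFF) {
            out.push_back(static_cast<char16_t>(cp));
        } else {  // needs a surrogate pair
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 | (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 | (cp & 0x3FF)));
        }
        i += len;
    }
    return out;
}
```

Doing this inside a copy that already has to happen is what makes the conversion essentially free on that path.]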

Comment from Matthew Gertner
Time: February 15, 2008, 1:44 am

Although this will never be the primary consideration, I wouldn’t discount the value of a more consistent API for developers. The Mozilla platform is large, complex, and daunting as it is. It’s pretty baffling for the newbie Mozilla coder to figure out why certain APIs take UTF-8 and others UTF-16, seemingly at random. For people designing new APIs for Mozilla, there is a constant decision of whether to use one (“but then I’ll have to convert to use so-and-so API”) or the other (“but then I’ll have to convert for some other API”). Conversion code isn’t that slow, perhaps, but it does muck up and obfuscate code.

As I say, not the most important factor, perhaps, but all things being equal (and it sounds like they are quite emphatically not equal and that UTF-8 is superior), harmonizing on a single Unicode encoding would be a great step. Given all the big changes planned for Mozilla 2, I’d be disappointed if this one didn’t make the cut.

Comment from Kim Sullivan
Time: February 15, 2008, 2:48 am

UCS-2 is mostly the same as UTF-16; UTF-16 just has the advantage of supporting surrogate pairs. Why exactly do you revile UCS-2, while UTF-16 seems to be considered a viable alternative?

(I agree that not supporting U+10000 through U+10FFFF is a serious drawback, just wanted to know if there was something else to it).
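
[As a concrete illustration of that one difference (a standalone sketch, not Mozilla code): U+10000, the first code point beyond the BMP, becomes a surrogate pair in UTF-16, while UCS-2 has no way to represent it at all.

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    uint32_t cp = 0x10000;               // first code point outside the BMP
    uint32_t v = cp - 0x10000;
    uint16_t hi = 0xD800 | (v >> 10);    // lead surrogate
    uint16_t lo = 0xDC00 | (v & 0x3FF);  // trail surrogate
    // UTF-16 reads these two code units as one character; UCS-2 sees two
    // values from a reserved range and simply cannot express U+10000.
    printf("U+%04X -> 0x%04X 0x%04X\n", (unsigned)cp, (unsigned)hi, (unsigned)lo);
    return 0;
}
```
]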

Comment from dmandelin
Time: February 15, 2008, 1:47 pm

Matthew: Good points. IMO, ease of use for programmers really is the biggest reason to think about decreasing the number of encodings. Based on what I hear, it sounds like we would need some more help to make such a big change happen for Mozilla 2.

Kim: That’s the only problem I know of with UCS-2. I’m not sure how important Linear B and Phoenician support really are anyway, but it just sort of bugs me that UCS-2 isn’t truly a Unicode encoding any more.

Comment from blassey
Time: February 18, 2008, 1:01 pm

This is just a guess, but I would assume that the cost of the conversions in processing time, heap size, and code size would be more damaging to performance on mobile devices than the reduction in memory use would help.

Perhaps what is really needed is an abstraction layer that only converts the string when necessary. For instance, if you follow the startup of xulrunner on Windows, command line arguments are passed in as UTF-16 and immediately converted to UTF-8. Various string manipulations are performed, and then various system APIs are called, but only after the UTF-8 strings are converted back to UTF-16. The memory for the UTF-8 strings isn’t freed until shutdown. A sketch of what such a layer might look like follows.
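
[Here is a minimal sketch of that lazy-conversion idea. The class and method names are invented for illustration, and the conversion assumes well-formed input; the point is that the string stays in the encoding it arrived in, and the other form is materialized and cached only when a caller actually asks for it:

```cpp
#include <cstdint>
#include <optional>
#include <string>

class LazyString {
public:
    explicit LazyString(std::u16string utf16) : mUtf16(std::move(utf16)) {}

    // No conversion here: a command-line argument that is merely handed
    // back to a UTF-16 system API never pays for a round trip.
    const std::u16string& AsUtf16() const { return mUtf16; }

    // Convert on first request only; later calls reuse the cached copy.
    // The cache is freed with the LazyString, not held until shutdown.
    const std::string& AsUtf8() const {
        if (!mUtf8)
            mUtf8 = ToUtf8(mUtf16);
        return *mUtf8;
    }

private:
    // Plain UTF-16 to UTF-8 conversion (assumes well-formed input).
    static std::string ToUtf8(const std::u16string& s) {
        std::string out;
        for (size_t i = 0; i < s.size(); ++i) {
            uint32_t cp = s[i];
            if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < s.size())
                cp = 0x10000 + ((cp - 0xD800) << 10) + (s[++i] - 0xDC00);
            if (cp < 0x80) {                 // 1-byte sequence
                out += static_cast<char>(cp);
            } else if (cp < 0x800) {         // 2-byte sequence
                out += static_cast<char>(0xC0 | (cp >> 6));
                out += static_cast<char>(0x80 | (cp & 0x3F));
            } else if (cp < 0x10000) {       // 3-byte sequence
                out += static_cast<char>(0xE0 | (cp >> 12));
                out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
                out += static_cast<char>(0x80 | (cp & 0x3F));
            } else {                         // 4-byte sequence
                out += static_cast<char>(0xF0 | (cp >> 18));
                out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
                out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
                out += static_cast<char>(0x80 | (cp & 0x3F));
            }
        }
        return out;
    }

    std::u16string mUtf16;
    mutable std::optional<std::string> mUtf8;  // filled lazily, then reused
};
```
]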