I’ve been doing some speed and memory profiling of Firefox string usage to help figure out what to do with string encodings in Mozilla 2: should we go all UTF-8, all UTF-16, or keep the current mishmash of ASCII, ISO-8859-1, UTF-8, and the reviled UCS-2? The full profiling results are in the string analysis bug, but I wanted to mention the big things and a few side notes here.
Basically, it turns out that encodings don’t affect browser performance much and we might as well keep what we have, at least for the near future. Switching Firefox to all UTF-8 would reduce memory consumption for string data by about 25%, which sounds exciting until you discover that string data represents about 1-2% of Firefox memory consumption (I got 9 MB out of 400 MB in a torture test of opening 80 pages in tabs). On the speed side, there are a lot of calls to UTF-8/UTF-16 conversion functions, about 30,000 calls in a test of loading 20 pages, but the time taken by conversions seems to be around 0.05% of total time, although it’s hard to measure accurately.
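To make the memory arithmetic concrete, here is a small back-of-the-envelope sketch in Python. The 9 MB, 400 MB, and 25% figures come from the torture test described above; the function name `overall_saving` is only an illustrative helper, not anything from the Firefox code base.

```python
def overall_saving(string_mb, total_mb, string_saving_fraction):
    """Estimate how much total memory a string-only saving buys.

    Returns (megabytes saved, fraction of total memory saved).
    """
    saved_mb = string_mb * string_saving_fraction
    return saved_mb, saved_mb / total_mb


# Figures from the 80-tab torture test: 9 MB of string data out of
# 400 MB total, with about 25% of string data saved by going all UTF-8.
saved_mb, fraction = overall_saving(9, 400, 0.25)
print(f"saved about {saved_mb:.2f} MB, or {fraction * 100:.1f}% of total memory")
```

With those inputs the saving is a little over 2 MB, which is roughly half a percent of the whole process.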
Interesting fact: in my tests, even CJK web pages (a set of 20 pages from Alexa and a set of 24 Wikipedia articles) use a lot less memory in UTF-8. This is surprising to many (including me) because CJK characters all need at least 2 bytes in UTF-8. But it turns out that they usually don’t need more than 2 bytes, and more importantly, a lot of web content on CJK sites is ASCII style sheets, tags, and scripts.
[Update: This isn't quite right. CJK characters take 3 bytes in UTF-8. Memory usage is lower in UTF-8 because my CJK test pages (which may or may not be representative) generate even more ASCII string content than I thought. I left details in a comment, but WordPress keeps eating it, so I'll put it here:
On the Wikipedia test, nsTextFragments contained about 1/2 ASCII characters (HTML elements, scripts, small amounts of Latin-1 content, etc.) and 1/2 CJK characters, so it averages out to about 2 bytes per character in UTF-8 and uses only 1% more space than UTF-16. The current FF code uses 18% less memory than UTF-16 because it encodes strings that are entirely ISO-8859-1 in 1 byte per character, and 27% of strings were able to use that encoding.
nsStringBuffers contained 95% ASCII characters, so they are about 1/2 the size in UTF-8 compared to UTF-16. Current FF takes up 90% as much space as UTF-16 because it encodes 80% of the buffers in UCS-2.
nsStringBuffers took up 4x as much space as nsTextFragments, so they dominate the total and you still see a total space reduction of 26% by going from current FF to all UTF-8.]
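The per-character sizes behind those numbers are easy to check with a short Python sketch. The function `stored_bytes` and the two sample strings are made up for illustration, and the "mixed" figure is only a simplified model of the scheme described above (1 byte per character when the whole string fits in ISO-8859-1, 2 bytes per character otherwise), not the actual nsTextFragment logic.

```python
def stored_bytes(text):
    """Return the byte count of `text` under three storage schemes."""
    utf8 = len(text.encode("utf-8"))
    # "utf-16-le" avoids the 2-byte byte-order mark that plain "utf-16" adds.
    utf16 = len(text.encode("utf-16-le"))
    # Simplified model of the mixed scheme: 1 byte per character when every
    # character fits in ISO-8859-1, otherwise fall back to the UTF-16 size.
    fits_latin1 = all(ord(ch) <= 0xFF for ch in text)
    mixed = len(text) if fits_latin1 else utf16
    return {"utf8": utf8, "utf16": utf16, "mixed": mixed}


half_cjk = "a" * 50 + "\u65e5" * 50      # 50 ASCII + 50 CJK characters
mostly_ascii = "a" * 95 + "\u65e5" * 5   # 95% ASCII, 5% CJK

print(stored_bytes(half_cjk))
print(stored_bytes(mostly_ascii))
```

The half-and-half string comes out at 200 bytes in both UTF-8 and UTF-16 (1 byte per ASCII character plus 3 per CJK character averages to 2), while the 95% ASCII string takes 110 bytes in UTF-8 against 200 in UTF-16, or a bit over half the size.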
Even though performance gives us no real reason to change encodings, I think it would be nice if someday we could go all UTF-8. It’s not worth the huge effort today to reduce memory by 1%, but reducing memory by 1% is still a good thing, especially for mobile devices. And it would be nice if programmers never had to think about encoding decisions or conversions because everything was UTF-8. There is one problem, though: the Windows and Mac OS X platform APIs are UTF-16, so some conversions would still be necessary. So going all UTF-16 has some benefits too.