14
May 12

statement wrappers have been deprecated

It’s a lot of fun removing old code; it’s less fun when removing that old code breaks lots of things.  That’s what happened when bug 589032 was checked in to mozilla-central last week. We had deprecated mozIStorageStatementWrapper over a year ago; bug 589032 removed mozIStorageStatementWrapper in favor of using mozIStorageStatement directly.

As there have been a couple reports of breakage (which likely means there’s a lot more breakage pending), and I have claimed in the bug that fixing said breakage is a straightforward process, here’s the quick-n-dirty guide to doing so. (Warning: may contain bugs or be incomplete. Please note any oversights in the comments.)

Code that created wrappers:

var wrapper = new Components.Constructor("@mozilla.org/storage/statement-wrapper;1",
                                         Ci.mozIStorageStatementWrapper,
                                         "initialize");
function createStatement(db, aSQL) {
  return new wrapper(db.createStatement(aSQL));
}

or perhaps:

var wrapper = Cc["@mozilla.org/storage/statement-wrapper;1"]
                .createInstance(Ci.mozIStorageStatementWrapper);
wrapper.initialize(statement);

should be replaced with:

function createStatement(db, aSQL) {
  return db.createStatement(aSQL);
}

Of course, that’s almost not worth having a separate function for. :)

Nearly all of the methods that you were calling on the wrapper remain unchanged, with the exception of:

foo = wrapped_stmt.execute();

That should be replaced with:

foo = stmt.executeStep();

Statements have an execute() method, but it does something slightly different than the wrapper’s execute() method.

Finally, where you finalize your statements (you are finalizing your statements, right?):

wrapped_stmt.statement.finalize();

is now simply:

statement.finalize();

As I said earlier, please note any omissions or errors in the comments!


13
Mar 12

data gathering the old fashioned way

…by asking people.

We’re trying to track down a bug that has massively decreased our Telemetry gathering rates.  Our first suspects were bug 707320 (persistent telemetry) and bug 717061 (telemetry request compression), but neither of those are present in nightly builds when we start seeing the drop.  (And by all accounts, they test fine from the metrics side of things, too.)  One possible candidate is bug 720493, but we’d need some data to confirm that.

Hence this post.  If  you are running a nightly build, please leave a comment with:

  1. The value of idle.lastDailyNotification from about:config.
  2. The version of your nightly build (menubar -> Help -> About Nightly).
  3. How long your current browser session has been up for (approximately).

Please note that you do not need to be sending in telemetry data for the above data to be useful.  Thanks!


27
Jan 12

DataShrink work

Over the last month, I’ve been working on some patches to address the relocation issues I’ve blogged about and just general space wastage in libxul.  The upshot is that Firefox 13 will have shaved almost 100K of data and relocations for smaller binaries and slightly faster load times:

  • Bug 717311 was the big one; we were wasting 70K+ of space in some Unicode tables due to structure padding.  Fixing this was just a matter of using the right bitfield datatype so the compiler would pack things correctly.
  • Bug 717501 was much the same: use smaller datatypes when the data you have fits in them.
  • Bug 711563 was more in line with the invasive relocation changes discussed earlier.  The patches there rearranged the storage for the stub names to avoid relocations and generally packed things more efficiently.

Not going in this cycle, but worthy of mention is bug 704848 for rearranging the effective TLD name table to avoid relocations; that bug will save 40-50K of space in data and relocations.  Jason Duell and I talked on IRC yesterday and while he’s in favor of the idea, he’d like to see if gperf makes things any better.  No sense in constructing a table at runtime if you can construct it at compile time!

Are there more instances of things that could be compressed like this?  Yes, but the savings from them are likely to be much smaller on a case-by-case basis:

  • Anything that uses nsStaticCaseInsensitiveNameTable could be tweaked to use less space and relocations by providing the entry data in a different format.
  • nsElementTable.cpp:gHTMLElements could probably be rearranged for the same sort of savings.
  • Initialization of atoms could stand some rearrangement for relocation reduction, at the cost of some ugliness elsewhere.
  • YARR, used by the JS engine, has some fabulous bit vectors represented as character arrays; squashing those would shave 100K of data, but would likely be tricky.
  • Eliminating null entries from the tables in table-driven QueryInterface methods would help, possibly at the cost of some extra parameter traffic to NS_TableDrivenQI.
  • Actually, rewriting IIDs for XPCOM interfaces to fit in 32 bits could save a number of relocations in the tables for the aforementioned methods.
  • The tables required by PR_ErrorInstallTable are breeding grounds for relocations; changing them is unlikely to happen anytime soon.  (Those tables do account for a significant chunk of the relocations in libnss and libnspr, though.)
  • Any number of places where we use pointers in constant data structures; there are quite a few of them, but converting each one saves 30-50 bytes each.  Best to save these for when you are really bored.

All of the above might amount to 200K of data+relocation savings.

Really, though, there’s not that much to trim (or perhaps what is left to trim is decidedly non-trivial to trim).  If you do something like:

readelf -sW dist/bin/libxul.so | grep OBJECT | awk '$3 > 1000 { print }' | c++filt | less

in your objdir on a Linux-y system, you’ll see that quite a lot of the largest objects come from tables for character conversion or character detection.  However, the authors of said code were already conscious of the space required by these tables and they have tended to use the smallest datatypes necessary.  vtables make a number of appearances (hard to get rid of at the moment).  There are also some tables for media codecs, which presumably are difficult to trim down.


26
Jan 12

compressing strings in JS

As we kept increasing the amount of information we send in via Telemetry, we need to start thinking about how to keep the size of the ping packets containing that information as small as possible. The packets are just JSON, so the first thing to try is to compress the data with gzip prior to sending it.

This is how you compress a string in a language like Python:

compressed = zlib.compress(str)

(Yes, yes, this is not gzip compression.  Close enough for pedagogical purposes.)

Short and simple. Boy, I hope it’s that easy in JS. Hm, let’s see, there’s this nsIStreamConverter interface, that looks promising:

let converter = Cc["@mozilla.org/streamconv;1?from=uncompressed&to=gzip"].createInstance(Ci.nsIStreamConverter);
let stream = Cc["@mozilla.org/stringinputstream;1"].createInstance(Ci.nsIStringInputStream);
stream.data = string;
// Hm, having to respecify input/output types is a bit weird.
let gzipStream = converter.convert(stream, "uncompressed", "gzip", null);

OK, we wound up with a stream, rather than a string, but that’s OK, because nsIXMLHttpRequest.send will happily accept a stream. So, nothing to worry about. (This is a little white lie; please hold your comments until the end.)

Hm, that doesn’t seem to work. I get NS_ERROR_NOT_IMPLEMENTED. Oh, look, nsDeflateConverter doesn’t implement nsIStreamConverter.convert. In fact, none of the stream converters in the tree seem to implement convert. What a bummer.

Hey, here’s nsIStreamConverterService! Maybe he can help. His convert method just punts to nsIStreamConverter.convert, so that won’t work, though. Ah, nsIStreamConverter has an asyncConvertData method, let’s try that:

function Accumulator() {
  this.buffer = "";
}
Accumulator.prototype = {
  buffer: null,
  onRequestStart(request, context) {},
  onRequestStop(request, context, statusCode) {},
  onDataAvailable(request, context, inputStream, offset, count) {
    let stream = Cc["@mozilla.org/binaryinputstream;1"].createInstance(Ci.nsIBinaryInputStream);
    stream.setInputStream(inputStream);
    let input = stream.readByteArray(count);
    this.buffer += String.fromCharCode.apply(input);
  }
};

let accumulator = new Accumulator();
let converter = Cc["@mozilla.org/streamconv;1?from=uncompressed&to=gzip"].createInstance(Ci.nsIStreamConverter);
// More respecifying input/output types.
converter.asyncConvertData("uncompressed", "gzip", accumulator, null);
// Oh, that method doesn't actually convert anything, it just prepares
// the instance for doing conversion.
let stream = Cc["@mozilla.org/stringinputstream;1"].createInstance(Ci.nsIStringInputStream);
stream.data = string;
converter.onRequestStart(null, null);
converter.onDataAvailable(null, null, stream, 0, string.length);
converter.onRequestStop(null, null, 201 /* 417 */);
compressed = accumulator.buffer;

Well, it’s not as simple as I hoped for, but I guess it works.

FWIW, I do understand why the input/output types have to be respecified.  But I think this is about the best way to do things currently; that’s kind of frightening. The above is one of those instances where you start to understand why people complain about things being so crufty.


13
Dec 11

invasive changes for relocation reduction

We have a lot of relocations in our libraries; almost 240k in libxul alone. Many of these relocations are due to the massive number of vtables that we have; every function pointer in a vtable requires a runtime relocation. And you can find function pointers in XPCOM CIDEntrys, function tables for PLDHashes, and so forth, which are certainly convenient.

But for other kinds of data, such as tables of strings, or tables of structures that contain strings, relocations are entirely avoidable if you’re willing to restructure things. The idea is not new; it’s been described in Ulrich Drepper’s manifesto for writing shared libraries, it’s been used for data structures in compilers and otherwise, and it’s often used in pickling data to and from disk.

The basic idea is that instead of storing pointers to data, you store offsets into some table of data. For instance, if you had:

static const char *errors[] = {
  "no error",
  "on fire",
  "timed out",
  "pebkac"
};

const char *geterr(int i)
{
  return errors[i];
}

then errors itself would be 16 bytes (assuming 32-bit pointers), plus the actual string data (34 bytes), plus the relocations for each entry in errors (32 bytes on Linux/x86, more or less depending on your platform), for a total of 82 bytes.

If instead you had something like:

static const char errtable[] = { "no error\0" "on fire\0" "timed out\0" "pebkac\0" };
static const uint8_t erridx[] = { 0, 9, 17, 27 };

const char *geterr(int i)
{
  return &errtable[erridx[i]];
}

you’d only need 34 bytes for string data plus 4 bytes for the index table, or 38 bytes total, for a savings of more than 50%. You do pay an extra instruction on each call to geterr as well; maybe you want to count that in the byte count if you’re being pedantic.

If you have character pointers embedded in structures, the savings can be more dramatic, because offsets can be smaller than pointers, which may enable you to pack structures better. For instance, if you have:

struct foo {
  const char *s;
  bool b1;
  bool b2;
};

sizeof(struct foo) would be 8 on a typical 32-bit platform, due to padding after b2. Using an offset scheme as described above, you now have:

struct foo {
  uint16_t offset;
  bool b1;
  bool b2;
};

and now struct foo is 50% smaller because padding is no longer required.

Obviously, forming the erridx array above can be tedious; Drepper’s paper linked above describes how one might use the C preprocessor to avoid such manual construction. His construction relies on the C compiler supporting string constants of arbitrary size; MSVC limits us here, as the maximum size of a string constant (post-preprocessor concatenation) is 65K bytes. While there are ways around this limitation, the limitation is not so bad as one might think: using 32-bit indices doesn’t make us much better off than we were before (it does avoid the relocations, which is worth something) and the compiler can now warn us if we’re about to walk off the deep end with 16-bit indices.

There are a number of places in Mozilla that could benefit from such reorganization: the effective TLD table (the above example about structure packing is taken from there), various tables in the CSS and HTML parsers, and so forth. Shared bits like constructing AtomTables could also have their interfaces changed to not be quite so pointer-heavy. Quickstubs could benefit from string table reorganizations and less pointer-ness. The list goes on, I’m sure.

I have a proof-of-concept bug open for nsEffectiveTLDService.cpp; I have been hesistant to dive deeply into other places; for one, the above bug doesn’t seem to have been touched for several weeks, and for two…

Is all this really worth it? I think it is; smaller binaries and faster startup times are a good thing in my book. I realize we have elfhack on mobile for reducing the size and cost of relocations; I say it’d be better to not pay for them in the first place. We also support other platforms where tricks like elfhack don’t work. But what do you think?


11
Nov 11

linker hacking for improved code locality

Hm, haven’t posted in a while.  Having your tonsils out will do that!

Mike Hommey recently blogged on some binary layout issues he’d run into with Thumb-2 binaries.  For the last week or so, I’ve been working on one of those problems: Thumb-2/ARM trampolines.

memcpy calls, how do they work? After all, code in your executable or shared library can’t directly call memcpy, since the address of memcpy is unknown until runtime. You can handle this in various ways; the solution adopted for ELF systems is something called a procedure linkage table, or PLT. (The mechanisms on Windows and Mac are similar, but the details are probably somewhat different; I’m not familiar with them.)

The PLT is simply an extra layer of indirection. Each entry in the PLT has a corresponding entry in the global offset table (GOT) associated with it; the GOT entry is the address of the actual function the PLT entry should call. At load time, the GOT entries are set to the address of the entry point of the dynamic linker’s lookup function.

When you call memcpy, then, you’re actually calling the PLT entry associated with memcpy. The first time you make such a call, the dynamic linker is invoked to figure out where memcpy actually lives. Once it has done so, it updates the associated GOT entry with that address and calls the “real” memcpy. Subsequent calls to memcpy then find the actual address in the GOT and jump directly there, bypassing the dynamic linker.

(glibc on Linux plays elaborate games so that calls from libc to libc functions don’t go through the PLT; the details of doing so are best left for a separate blog post.)

OK, so onto the problem description. PLT entries on ARM Linux are ARM mode code. Of course, it’s perfectly valid for Thumb code to want to use these PLT entries; it’d be silly to have separate PLT entries for ARM calls and Thumb calls. For non-tail calls, Thumb code can just use blx, which switches to ARM mode and jumps to a given address. But for tail calls, there’s no jump-to-offset-and-switch-mode instruction (bx jumps to an address in a register, not an immediate offset). So some cleverness is required.

In the gold linker, the solution taken is general: whenever we have such a Thumb-to-ARM branch (whether to a PLT entry or otherwise), generate a small stub of code to do the mode switch and branch, then branch to that. Such stubs are generated for various other similar cases, so the linker was already doing such things anyway.

There’s two issues with this approach. The first is that the stubs take up extra space, then second is that they’re placed wherever the linker thinks is convenient, which might not be the best place for improving the layout of the compiled code. (The linker does put some effort into reusing stubs, so if you have several Thumb-to-ARM calls/jumps to the same address, the linker will reuse a stub, rather than duplicating it willy-nilly.) The second is the real dealbreaker here.

The GNU linker’s approach is more clever: it embeds the necessary stub into the PLT entry itself. Whereas the entry previously looked like:

 22c:	e28fc610 	add	ip, pc, #16777216	; 0x1000000
 230:	e28cca08 	add	ip, ip, #32768	; 0x8000
 234:	e5bcf080 	ldr	pc, [ip, #128]!	; 0x80

it now looks like:

 228:	4778      	bx	pc
 22a:	46c0      	nop			; (mov r8, r8)
 22c:	e28fc610 	add	ip, pc, #16777216	; 0x1000000
 230:	e28cca08 	add	ip, ip, #32768	; 0x8000
 234:	e5bcf080 	ldr	pc, [ip, #128]!	; 0x80

(This works because bx pc actually branches to the PC of the bx insn + 4: i.e. 0x22c.)

So direct jumps from Thumb code use the entry point at 0×228, whereas all other calls and jumps from ARM or Thumb code use the entry point at 0x22c. This method is smaller (4 bytes for the PLT entry stub versus 16-ish bytes for the out-of-line, imperfectly located stub) and plays nicely with section reordering.

We could use GNU ld for our linking needs, but gold has several very nice features (performance, identical code folding, section ordering, incremental linking, etc.) that make its use desirable. So the last week of my time has been spent learning gold well enough to add these small Thumb stubs to PLT entries when necessary. gold can now do this, woohoo!

I believe my code is suitable for using in Mozilla builds. Before being submitted upstream, however, there need to be some kinks worked out with regards to incremental linking: gold’s incremental linking support assumes that all PLT entries are the same size. The above modifications mean that some will be 16 bytes and some will be 12. (A possible cheesy hack would be to make all stubs 16 bytes when incremental linking…) It might also be worthwhile exploring a more general solution for stub positioning so that stubs play nicely with section reordering.


17
Oct 11

reordering plugins and edge profiles

Mike Hommey and I have been working with a linker plugin that uses the callgraph to reorder functions on Linux and Android. One feature of the linker plugin is that it dumps a simplified representation of the callgraph to a text file. The simplified representation is really simple, just edge counts for all the edges in the callgraph. It’s not perfect, since the ordering of caller and callee in the dump file is not always consistent. Also, you get a few spurious functions in there, like __builtin_expect and SSE builtins calling things.  (__builtin_expect just provides a hint to the compiler about how to order basic blocks for more cache-friendly control flow; SSE builtins should compile down to single/few instructions and never actually call something at runtime.)

Nevertheless, looking at the file for libxul.so can be illuminating. Actually, it’s probably more illuminating to narrow things down to the edges with 100k+ call counts and demangled function names. Doing that, we can see that:

Anyway, maybe people have already seen all this stuff, but I certainly hadn’t.


29
Sep 11

pgo startup times with syzygy

In an effort to confirm that we do want all this syzygy goodness in our release builds, I’ve been testing out syzygy on PGO builds (since we do PGO builds on Windows for releases). After removing the PEBKAC and getting a proper PGO build–which took depressingly long–I have mixed results.

First, the good news. On my laptop (Win7, Core i7, SSD), the about:startup numbers look like this:

Version main sessionRestored firstPaint
Base PGO build 265 3152 3012
Optimized PGO build 234 2778 2653

These numbers are really encouraging; they’re actually even better than the initial numbers I posted earlier.  (Though I note that they are universally slower than the earlier numbers…hm….)

There is one curious thing about these builds, though. When you look at the page fault numbers, they suggest a much different story. The (cold) numbers are what you get from starting Firefox just after a reboot; the (warm) numbers are from a second startup after the first.

Version Hard faults (cold) Soft faults (cold) Hard faults (warm) Soft faults (warm)
Base PGO build 2507 41219 8 26100
Optimized PGO build 2264 41488 14 23017

These numbers are totally contrary to what I saw with non-PGO builds. We’re not consistently lower in the optimized build on either measure. I honestly haven’t thought very hard about what this means yet.

Anyway, that’s the good news. The bad news is that on my desktop (Win XP, Core 2 Quad, mechanical drive), the about:startup numbers look like this:

Version main sessionRestored firstPaint
Base PGO build 1516 8984 8813
Optimized PGO build 1437 9187 8828

(I don’t have the necessary profiling tools on my XP box for doing page fault analysis. I shouldn’t think they’d differ dramatically between the two systems, though.)

This is a little discouraging. The syzygy-optimized build is a little faster off the line, but gets edged out by the base build in getting to the points that actually matter. I haven’t thought terribly hard about these numbers, either. One possibility is that I did turn off the XPCOM glue preloading bits, which IIUC correctly are helpful for encouraging XP to keep its hands off your binary’s startup time. Doing that was necessary for getting postlinking to work properly. If I made that runtime-configurable, then I could run tests with the preloading enabled and see if we win there.

Bottom line: We would win on leading-edge machines, but we wouldn’t see a lot of benefit on older machines.

Also, if Microsoft would add a drop_caches lookalike, that would be fantastic.


20
Sep 11

startup reduction times with syzygy, part 2

In my previous post, I presented some startup timings with syzygy-optimized Firefox binaries.  People asked whether those timings were on my work laptop (quad-core i7, Windows 7, SSD) and I confirmed they were.  Since those timings were encouraging, but not overwhelming, folks suggested that I might try timing things on a more conventional system.

Below are results from testing on my desktop (quad-core Core 2 Duo @ 2.6GHz, Windows XP, 7200ish-rpm hard drive):

Version main sessionRestored firstPaint
Trunk build 1484 11515 11125
Optimized build 2562 8812 8703

That’s quite a difference: about a 25% win just from reordering functions.  Much more exciting!


19
Sep 11

startup reduction times with syzygy

People that I’ve told about the work with syzygy that I’ve been doing have, almost universally, two reactions:

  1. That’s cool!  (Thanks; I did very little work for it!)
  2. How does that translate into startup time?

Assuming that a 40% reduction in page faults leads to a 40% reduction in startup time is not reasonable, but surely there should be some reduction in startup time, right?  I finally benchmarked this today using the about:startup extension; these numbers are from cold start, freshly rebooted both times:

Version main sessionRestored firstPaint
Trunk build 125 1733 1671
Optimized build 125 1639 1577

So a 40% reduction in page faults translates into ~6% reduction in startup time; not stellar, but not too bad either.

The next step is making sure this all works with PGO builds on Windows. Then we get to have a discussion about whether to incorporate this into the regular builds and getting all the infrastructure on build machines.