19
Jun 12

performant concurrency

The most important conclusion from this foray into history is that concurrency has always been employed for one purpose: to improve the performance of the system. This seems almost too obvious to make explicit—why else would we want concurrency if not to improve performance?—yet for all its obviousness, concurrency’s raison d’être is increasingly forgotten, as if the proliferation of concurrent hardware has awakened an anxiety that all software must use all available physical resources. Just as no programmer felt a moral obligation to eliminate pipeline stalls on a superscalar microprocessor, no software engineer should feel responsible for using concurrency simply because the hardware supports it. Rather, concurrency should be thought about and used for one reason and one reason only: because it is needs [sic] to yield an acceptably performing system.

–from Real-World Concurrency by Bryan Cantrill and Jeff Bonwick


14
Jun 12

lessons learned while filing bugs

When I tell people that I and my entire team at Mozilla work remotely, they often ask, “How do you keep track of everything people are doing?”  I tell them about Bugzilla: anything that needs fixing goes into it as a bug, and bugs aren’t just for software problems, but hardware requests, account requests, office maintenance; the list goes on.  A bug is anything that prevents you from getting your work done.

Despite extolling the virtues of filing things, large and small, in Bugzilla, I (re?)learned two lessons about filing bugs this week:

  • Even if you think something is totally obvious, file a bug.  I needed to modify the Android mozconfigs this week because they weren’t turning on telemetry support the way our desktop mozconfigs do.  In the process, I became disgusted with how much copy-and-paste there is between our mozconfigs and thought, “Surely, there is a bug out there already on this goop.”  I even asked one of my colleagues, Mike Hommey, whether there might be a bug on this already.  It turns out there wasn’t one, but there is now.  I shouldn’t have even asked Mike; I should have searched Bugzilla a bit and filed something if I couldn’t find a plausible match.  Worst case, somebody would mark my bug as a duplicate and I would be connected with the people and/or existing attempts at fixing the bug.
  • If something looks like a bug, file a bug.  Back in March, I made it harder to get certain kinds of Telemetry histograms wrong.  In the bug, I noted several histograms that were wrong, but didn’t bother to do anything about them.  “Too much work for too little gain,” I said.  It turns out that the networking team has been working with bad data on cache Telemetry because one of those histograms was wrong.  That problem could’ve been fixed several months ago!  My mistake; I should have filed each and every one of those histograms as a bug and let the relevant teams decide what to do about them.  Even if you think it’s harmless, file a bug.  Worst case, somebody tells you that it is harmless and you come away enlightened.  (And just in case you’re wondering, yes, there is a bug to make those histograms even harder to get wrong.)

And as always, please remember to follow Bugzilla’s rules of etiquette when filing bugs.


12
Jun 12

about:memory statistics improvements

The three major memory costs of rendering webpages in Gecko are layout, the DOM, and JavaScript.  Only the last of these has a detailed about:memory breakdown.  Layout has a few subcategories, but as likely as not, layout/arenas is one big opaque number, and the DOM just sits there as a big blob-o-stuff.

Over the past two weeks, I’ve been working to improve this state of affairs.  The DOM numbers have been made more transparent by splitting them out by the type of DOM node.  Admittedly, this isn’t perfect, as you now have several semi-opaque blobs instead of one, but it gives you a slightly better idea of where the memory is going.  Additionally, a few more things are getting counted in those DOM numbers, like some on-the-side data structures Gecko’s DOM engine keeps around.

Layout has received the most work, however.  We now display stats for individual frame types and common objects allocated in layout’s arenas, which can be helpful in diagnosing page performance problems (bug 686795 comment 10 and comment 23).  Please note that in the stats for these frames and objects, we’re not measuring any substructures contained in them, some of which could be quite significant.  That’s a (potentially tricky) job for later.


30
May 12

slow sql in xpcshell testing

One of the best things about performance monitoring is that it sometimes turns up problems in very unexpected areas.  Yesterday, while working on some improvements to telemetry, I noticed that the telemetry packets being sent contained information for slowSQLStartup.  That is, there were some SQL queries (4 of them, to be exact; 3 unique queries) that were taking more than 150 ms to run just during xpcshell testing.

These queries were run by the addons manager and already have a bug on file.

I’m guessing that this slow SQL does not occur during all xpcshell tests, but only those that create a profile (about 5% of tests), which would translate into ~1% of xpcshell testing time.  Still, speedups are speedups, and fixing these queries would be a nice win for Firefox users generally.


14
May 12

statement wrappers have been deprecated

It’s a lot of fun removing old code; it’s less fun when removing that old code breaks lots of things.  That’s what happened when bug 589032 was checked in to mozilla-central last week. We had deprecated mozIStorageStatementWrapper over a year ago; bug 589032 removed mozIStorageStatementWrapper in favor of using mozIStorageStatement directly.

As there have been a couple reports of breakage (which likely means there’s a lot more breakage pending), and I have claimed in the bug that fixing said breakage is a straightforward process, here’s the quick-n-dirty guide to doing so. (Warning: may contain bugs or be incomplete. Please note any oversights in the comments.)

Code that created wrappers:

var wrapper = new Components.Constructor("@mozilla.org/storage/statement-wrapper;1",
                                         Ci.mozIStorageStatementWrapper,
                                         "initialize");
function createStatement(db, aSQL) {
  return new wrapper(db.createStatement(aSQL));
}

or perhaps:

var wrapper = Cc["@mozilla.org/storage/statement-wrapper;1"]
                .createInstance(Ci.mozIStorageStatementWrapper);
wrapper.initialize(statement);

should be replaced with:

function createStatement(db, aSQL) {
  return db.createStatement(aSQL);
}

Of course, that’s almost not worth having a separate function for. 🙂

Nearly all of the methods that you were calling on the wrapper remain unchanged, with the exception of:

foo = wrapped_stmt.execute();

That should be replaced with:

foo = stmt.executeStep();

Statements have an execute() method, but it does something slightly different from the wrapper’s execute() method.

Finally, where you finalize your statements (you are finalizing your statements, right?):

wrapped_stmt.statement.finalize();

is now simply:

statement.finalize();

As I said earlier, please note any omissions or errors in the comments!


13
Mar 12

data gathering the old fashioned way

…by asking people.

We’re trying to track down a bug that has massively decreased our Telemetry gathering rates.  Our first suspects were bug 707320 (persistent telemetry) and bug 717061 (telemetry request compression), but neither of those is present in nightly builds when we start seeing the drop.  (And by all accounts, they test fine from the metrics side of things, too.)  One possible candidate is bug 720493, but we’d need some data to confirm that.

Hence this post.  If you are running a nightly build, please leave a comment with:

  1. The value of idle.lastDailyNotification from about:config.
  2. The version of your nightly build (menubar -> Help -> About Nightly).
  3. How long your current browser session has been up for (approximately).

Please note that you do not need to be sending in telemetry data for the above data to be useful.  Thanks!


27
Jan 12

DataShrink work

Over the last month, I’ve been working on some patches to address the relocation issues I’ve blogged about and just general space wastage in libxul.  The upshot is that Firefox 13 will have shaved almost 100K of data and relocations for smaller binaries and slightly faster load times:

  • Bug 717311 was the big one; we were wasting 70K+ of space in some Unicode tables due to structure padding.  Fixing this was just a matter of using the right bitfield datatype so the compiler would pack things correctly; see the sketch after this list.
  • Bug 717501 was much the same: use smaller datatypes when the data you have fits in them.
  • Bug 711563 was more in line with the invasive relocation changes discussed earlier.  The patches there rearranged the storage for the stub names to avoid relocations and generally packed things more efficiently.
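
To illustrate the bit-field issue, here is a minimal sketch; the field names and widths are invented, not the actual layout of the Unicode tables.  The point is that the declared type of a bit-field determines its allocation unit and the struct’s alignment, so int-typed bit-fields pad every entry out to four bytes:

#include <stdio.h>

/* With unsigned int bit-fields, 10 bits of data get padded out to
   sizeof(int); a table with thousands of entries wastes 22 bits each. */
struct PaddedProps {
  unsigned int category : 5;
  unsigned int numericValue : 4;
  unsigned int isCombining : 1;
};                                 /* sizeof == 4 */

/* With unsigned char bit-fields, the same 10 bits pack into 2 bytes. */
struct PackedProps {
  unsigned char category : 5;
  unsigned char numericValue : 4;  /* doesn't fit in byte 0; starts byte 1 */
  unsigned char isCombining : 1;
};                                 /* sizeof == 2 */

int main(void)
{
  printf("%zu vs. %zu\n", sizeof(struct PaddedProps), sizeof(struct PackedProps));
  return 0;
}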

Not going in this cycle, but worthy of mention is bug 704848 for rearranging the effective TLD name table to avoid relocations; that bug will save 40-50K of space in data and relocations.  Jason Duell and I talked on IRC yesterday and while he’s in favor of the idea, he’d like to see if gperf makes things any better.  No sense in constructing a table at runtime if you can construct it at compile time!

Are there more instances of things that could be compressed like this?  Yes, but the savings from them are likely to be much smaller on a case-by-case basis:

  • Anything that uses nsStaticCaseInsensitiveNameTable could be tweaked to use less space and relocations by providing the entry data in a different format.
  • nsElementTable.cpp:gHTMLElements could probably be rearranged for the same sort of savings.
  • Initialization of atoms could stand some rearrangement for relocation reduction, at the cost of some ugliness elsewhere.
  • YARR, used by the JS engine, has some fabulous bit vectors represented as character arrays; squashing those would shave 100K of data, but would likely be tricky.
  • Eliminating null entries from the tables in table-driven QueryInterface methods would help, possibly at the cost of some extra parameter traffic to NS_TableDrivenQI.
  • Actually, rewriting IIDs for XPCOM interfaces to fit in 32 bits could save a number of relocations in the tables for the aforementioned methods.
  • The tables required by PR_ErrorInstallTable are breeding grounds for relocations; changing them is unlikely to happen anytime soon.  (Those tables do account for a significant chunk of the relocations in libnss and libnspr, though.)
  • Any number of places where we use pointers in constant data structures; there are quite a few of them, but converting each one saves only 30-50 bytes.  Best to save these for when you are really bored.

All of the above might amount to 200K of data+relocation savings.

Really, though, there’s not that much to trim (or perhaps what is left to trim is decidedly non-trivial to trim).  If you do something like:

readelf -sW dist/bin/libxul.so | grep OBJECT | awk '$3 > 1000 { print }' | c++filt | less

in your objdir on a Linux-y system, you’ll see that quite a lot of the largest objects come from tables for character conversion or character detection.  However, the authors of said code were already conscious of the space required by these tables and they have tended to use the smallest datatypes necessary.  vtables make a number of appearances (hard to get rid of at the moment).  There are also some tables for media codecs, which presumably are difficult to trim down.


26
Jan 12

compressing strings in JS

As we keep increasing the amount of information we send in via Telemetry, we need to start thinking about how to keep the size of the ping packets containing that information as small as possible. The packets are just JSON, so the first thing to try is to compress the data with gzip prior to sending it.

This is how you compress a string in a language like Python:

compressed = zlib.compress(str)

(Yes, yes, this is not gzip compression.  Close enough for pedagogical purposes.)

Short and simple. Boy, I hope it’s that easy in JS. Hm, let’s see, there’s this nsIStreamConverter interface, that looks promising:

let converter = Cc["@mozilla.org/streamconv;1?from=uncompressed&to=gzip"].createInstance(Ci.nsIStreamConverter);
let stream = Cc["@mozilla.org/stringinputstream;1"].createInstance(Ci.nsIStringInputStream);
stream.data = string;
// Hm, having to respecify input/output types is a bit weird.
let gzipStream = converter.convert(stream, "uncompressed", "gzip", null);

OK, we wound up with a stream, rather than a string, but that’s OK, because nsIXMLHttpRequest.send will happily accept a stream. So, nothing to worry about. (This is a little white lie; please hold your comments until the end.)

Hm, that doesn’t seem to work. I get NS_ERROR_NOT_IMPLEMENTED. Oh, look, nsDeflateConverter doesn’t implement nsIStreamConverter.convert. In fact, none of the stream converters in the tree seem to implement convert. What a bummer.

Hey, here’s nsIStreamConverterService! Maybe he can help. His convert method just punts to nsIStreamConverter.convert, so that won’t work, though. Ah, nsIStreamConverter has an asyncConvertData method, let’s try that:

function Accumulator() {
  this.buffer = "";
}
Accumulator.prototype = {
  buffer: null,
  onStartRequest: function(request, context) {},
  onStopRequest: function(request, context, statusCode) {},
  onDataAvailable: function(request, context, inputStream, offset, count) {
    let stream = Cc["@mozilla.org/binaryinputstream;1"].createInstance(Ci.nsIBinaryInputStream);
    stream.setInputStream(inputStream);
    // readByteArray returns an array of octets; stringify them.
    let input = stream.readByteArray(count);
    this.buffer += String.fromCharCode.apply(null, input);
  }
};

let accumulator = new Accumulator();
let converter = Cc["@mozilla.org/streamconv;1?from=uncompressed&to=gzip"].createInstance(Ci.nsIStreamConverter);
// More respecifying input/output types.
converter.asyncConvertData("uncompressed", "gzip", accumulator, null);
// Oh, that method doesn't actually convert anything, it just prepares
// the instance for doing conversion.
let stream = Cc["@mozilla.org/stringinputstream;1"].createInstance(Ci.nsIStringInputStream);
stream.data = string;
converter.onStartRequest(null, null);
converter.onDataAvailable(null, null, stream, 0, string.length);
converter.onStopRequest(null, null, Cr.NS_OK);  // Cr is Components.results
compressed = accumulator.buffer;

Well, it’s not as simple as I hoped for, but I guess it works.

FWIW, I do understand why the input/output types have to be respecified.  But the fact that this is about the best way to do things currently is kind of frightening. The above is one of those instances where you start to understand why people complain about things being so crufty.


13
Dec 11

invasive changes for relocation reduction

We have a lot of relocations in our libraries; almost 240k in libxul alone. Many of these relocations are due to the massive number of vtables that we have; every function pointer in a vtable requires a runtime relocation. And you can find function pointers in XPCOM CIDEntrys, function tables for PLDHashes, and so forth, which are certainly convenient.

But for other kinds of data, such as tables of strings, or tables of structures that contain strings, relocations are entirely avoidable if you’re willing to restructure things. The idea is not new; it’s been described in Ulrich Drepper’s manifesto for writing shared libraries, it’s been used for data structures in compilers and otherwise, and it’s often used in pickling data to and from disk.

The basic idea is that instead of storing pointers to data, you store offsets into some table of data. For instance, if you had:

static const char *errors[] = {
  "no error",
  "on fire",
  "timed out",
  "pebkac"
};

const char *geterr(int i)
{
  return errors[i];
}

then errors itself would be 16 bytes (assuming 32-bit pointers), plus the actual string data (34 bytes), plus the relocations for each entry in errors (32 bytes on Linux/x86, more or less depending on your platform), for a total of 82 bytes.

If instead you had something like:

static const char errtable[] = { "no error\0" "on fire\0" "timed out\0" "pebkac\0" };
static const uint8_t erridx[] = { 0, 9, 17, 27 };

const char *geterr(int i)
{
  return &errtable[erridx[i]];
}

you’d only need 34 bytes for string data plus 4 bytes for the index table, or 38 bytes total, for a savings of more than 50%. You do pay an extra instruction on each call to geterr as well; maybe you want to count that in the byte count if you’re being pedantic.

If you have character pointers embedded in structures, the savings can be more dramatic, because offsets can be smaller than pointers, which may enable you to pack structures better. For instance, if you have:

struct foo {
  const char *s;
  bool b1;
  bool b2;
};

sizeof(struct foo) would be 8 on a typical 32-bit platform, due to padding after b2. Using an offset scheme as described above, you now have:

struct foo {
  uint16_t offset;
  bool b1;
  bool b2;
};

and now struct foo is 50% smaller because padding is no longer required.

Obviously, forming the erridx array above can be tedious; Drepper’s paper linked above describes how one might use the C preprocessor to avoid such manual construction. His construction relies on the C compiler supporting string constants of arbitrary size; MSVC limits us here, as the maximum size of a string constant (post-preprocessor concatenation) is 65K bytes. While there are ways around this limitation, the limitation is not so bad as one might think: using 32-bit indices doesn’t make us much better off than we were before (it does avoid the relocations, which is worth something) and the compiler can now warn us if we’re about to walk off the deep end with 16-bit indices.
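
For the curious, here is a minimal sketch of that style of construction, using an X-macro list rather than Drepper’s exact include-file trick (the macro and field names are my own invention).  Each string becomes a char array field whose sizeof includes the NUL, so offsetof() computes the index table at compile time and nothing gets counted by hand:

#include <stddef.h>
#include <stdint.h>

/* The single source of truth; everything below is generated from it. */
#define ERRORS(X)               \
  X(ERR_NONE,      "no error")  \
  X(ERR_ON_FIRE,   "on fire")   \
  X(ERR_TIMED_OUT, "timed out") \
  X(ERR_PEBKAC,    "pebkac")

/* One char array per string, laid out back-to-back like errtable above. */
static const struct errtable_t {
#define FIELD(name, str) char name[sizeof(str)];
  ERRORS(FIELD)
#undef FIELD
} errtable = {
#define STRING(name, str) str,
  ERRORS(STRING)
#undef STRING
};

/* offsetof() yields each string's position: 0, 9, 17, 27. */
static const uint8_t erridx[] = {
#define OFFSET(name, str) offsetof(struct errtable_t, name),
  ERRORS(OFFSET)
#undef OFFSET
};

const char *geterr(int i)
{
  return (const char *)&errtable + erridx[i];
}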

There are a number of places in Mozilla that could benefit from such reorganization: the effective TLD table (the above example about structure packing is taken from there), various tables in the CSS and HTML parsers, and so forth. Shared bits like constructing AtomTables could also have their interfaces changed to not be quite so pointer-heavy. Quickstubs could benefit from string table reorganizations and less pointer-ness. The list goes on, I’m sure.

I have a proof-of-concept bug open for nsEffectiveTLDService.cpp, but I have been hesitant to dive deeply into other places; for one, the above bug doesn’t seem to have been touched for several weeks, and for two…

Is all this really worth it? I think it is; smaller binaries and faster startup times are a good thing in my book. I realize we have elfhack on mobile for reducing the size and cost of relocations; I say it’d be better to not pay for them in the first place. We also support other platforms where tricks like elfhack don’t work. But what do you think?


11
Nov 11

linker hacking for improved code locality

Hm, haven’t posted in a while.  Having your tonsils out will do that!

Mike Hommey recently blogged on some binary layout issues he’d run into with Thumb-2 binaries.  For the last week or so, I’ve been working on one of those problems: Thumb-2/ARM trampolines.

memcpy calls, how do they work? After all, code in your executable or shared library can’t directly call memcpy, since the address of memcpy is unknown until runtime. You can handle this in various ways; the solution adopted for ELF systems is something called a procedure linkage table, or PLT. (The mechanisms on Windows and Mac are similar, but the details are probably somewhat different; I’m not familiar with them.)

The PLT is simply an extra layer of indirection. Each entry in the PLT has a corresponding entry in the global offset table (GOT) associated with it; the GOT entry is the address of the actual function the PLT entry should call. At load time, the GOT entries are set to the address of the entry point of the dynamic linker’s lookup function.

When you call memcpy, then, you’re actually calling the PLT entry associated with memcpy. The first time you make such a call, the dynamic linker is invoked to figure out where memcpy actually lives. Once it has done so, it updates the associated GOT entry with that address and calls the “real” memcpy. Subsequent calls to memcpy then find the actual address in the GOT and jump directly there, bypassing the dynamic linker.
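
As a rough sketch of that dance in C (the real PLT entries are a few instructions of assembly, and all of the names here are invented):

#include <stddef.h>
#include <string.h>

typedef void *(*memcpy_fn)(void *, const void *, size_t);

static void *first_call_resolver(void *dst, const void *src, size_t n);

/* The GOT slot for memcpy: at load time it points at the dynamic
   linker's resolver, not at memcpy itself. */
static memcpy_fn got_memcpy = first_call_resolver;

/* The PLT entry: calls to memcpy from this module really land here
   and jump through the GOT slot. */
void *plt_memcpy(void *dst, const void *src, size_t n)
{
  return got_memcpy(dst, src, n);
}

/* Runs on the first call only: look up the real symbol, patch the GOT
   slot so later calls bypass the resolver, and forward this call. */
static void *first_call_resolver(void *dst, const void *src, size_t n)
{
  got_memcpy = memcpy;  /* the real lookup searches the loaded libraries */
  return got_memcpy(dst, src, n);
}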

(glibc on Linux plays elaborate games so that calls from libc to libc functions don’t go through the PLT; the details of doing so are best left for a separate blog post.)

OK, so onto the problem description. PLT entries on ARM Linux are ARM mode code. Of course, it’s perfectly valid for Thumb code to want to use these PLT entries; it’d be silly to have separate PLT entries for ARM calls and Thumb calls. For non-tail calls, Thumb code can just use blx, which switches to ARM mode and jumps to a given address. But for tail calls, there’s no jump-to-offset-and-switch-mode instruction (bx jumps to an address in a register, not an immediate offset). So some cleverness is required.

In the gold linker, the solution taken is general: whenever we have such a Thumb-to-ARM branch (whether to a PLT entry or otherwise), generate a small stub of code to do the mode switch and branch, then branch to that. Such stubs are generated for various other similar cases, so the linker was already doing such things anyway.

There are two issues with this approach. The first is that the stubs take up extra space; the second is that they’re placed wherever the linker thinks is convenient, which might not be the best place for improving the layout of the compiled code. (The linker does put some effort into reusing stubs, so if you have several Thumb-to-ARM calls/jumps to the same address, the linker will reuse a stub rather than duplicating it willy-nilly.) The second is the real dealbreaker here.

The GNU linker’s approach is more clever: it embeds the necessary stub into the PLT entry itself. Whereas the entry previously looked like:

 22c:	e28fc610 	add	ip, pc, #16777216	; 0x1000000
 230:	e28cca08 	add	ip, ip, #32768	; 0x8000
 234:	e5bcf080 	ldr	pc, [ip, #128]!	; 0x80

it now looks like:

 228:	4778      	bx	pc
 22a:	46c0      	nop			; (mov r8, r8)
 22c:	e28fc610 	add	ip, pc, #16777216	; 0x1000000
 230:	e28cca08 	add	ip, ip, #32768	; 0x8000
 234:	e5bcf080 	ldr	pc, [ip, #128]!	; 0x80

(This works because bx pc actually branches to the PC of the bx insn + 4: i.e. 0x22c.)

So direct jumps from Thumb code use the entry point at 0x228, whereas all other calls and jumps from ARM or Thumb code use the entry point at 0x22c. This method is smaller (4 bytes for the PLT entry stub versus 16-ish bytes for the out-of-line, imperfectly located stub) and plays nicely with section reordering.

We could use GNU ld for our linking needs, but gold has several very nice features (performance, identical code folding, section ordering, incremental linking, etc.) that make its use desirable. So the last week of my time has been spent learning gold well enough to add these small Thumb stubs to PLT entries when necessary. gold can now do this, woohoo!

I believe my code is suitable for use in Mozilla builds. Before it is submitted upstream, however, some kinks need to be worked out with regard to incremental linking: gold’s incremental linking support assumes that all PLT entries are the same size, and the above modifications mean that some will be 16 bytes and some will be 12. (A possible cheesy hack would be to make all stubs 16 bytes when incremental linking…) It might also be worthwhile to explore a more general solution for stub positioning so that stubs play nicely with section reordering.