13 Dec 11

invasive changes for relocation reduction

We have a lot of relocations in our libraries; almost 240k in libxul alone. Many of these relocations are due to the massive number of vtables that we have; every function pointer in a vtable requires a runtime relocation. And you can find function pointers in XPCOM CIDEntrys, function tables for PLDHashes, and so forth, which are certainly convenient.

But for other kinds of data, such as tables of strings, or tables of structures that contain strings, relocations are entirely avoidable if you’re willing to restructure things. The idea is not new; it’s been described in Ulrich Drepper’s manifesto for writing shared libraries, it’s been used for data structures in compilers and otherwise, and it’s often used in pickling data to and from disk.

The basic idea is that instead of storing pointers to data, you store offsets into some table of data. For instance, if you had:

static const char *errors[] = {
  "no error",
  "on fire",
  "timed out",
  "pebkac"
};

const char *geterr(int i)
{
  return errors[i];
}

then errors itself would be 16 bytes (assuming 32-bit pointers), plus the actual string data (34 bytes), plus the relocations for each entry in errors (32 bytes on Linux/x86, more or less depending on your platform), for a total of 82 bytes.

If instead you had something like:

static const char errtable[] =
  "no error\0"
  "on fire\0"
  "timed out\0"
  "pebkac";
static const uint8_t erridx[] = { 0, 9, 17, 27 };

const char *geterr(int i)
{
  return &errtable[erridx[i]];
}

you’d only need 34 bytes for string data plus 4 bytes for the index table, or 38 bytes total, for a savings of more than 50%. You do pay an extra instruction on each call to geterr as well; maybe you want to count that in the byte count if you’re being pedantic.

If you have character pointers embedded in structures, the savings can be more dramatic, because offsets can be smaller than pointers, which may enable you to pack structures better. For instance, if you have:

struct foo {
  const char *s;
  bool b1;
  bool b2;
};

sizeof(struct foo) would be 8 on a typical 32-bit platform, due to padding after b2. Using an offset scheme as described above, you now have:

struct foo {
  uint16_t offset;
  bool b1;
  bool b2;
};

and now struct foo is 50% smaller because padding is no longer required.

Obviously, forming the erridx array above by hand is tedious; Drepper's paper mentioned above describes how one might use the C preprocessor to avoid such manual construction. His construction relies on the C compiler supporting string constants of arbitrary size; MSVC limits us here, as the maximum size of a string constant (after preprocessor concatenation) is 65,535 bytes. While there are ways around this limitation, the limitation is not so bad as one might think: using 32-bit indices doesn't make us much better off than we were before (it does avoid the relocations, which is worth something), and with 16-bit indices the compiler can now warn us if we're about to walk off the deep end.

There are a number of places in Mozilla that could benefit from such reorganization: the effective TLD table (the above example about structure packing is taken from there), various tables in the CSS and HTML parsers, and so forth. Shared bits like constructing AtomTables could also have their interfaces changed to not be quite so pointer-heavy. Quickstubs could benefit from string table reorganizations and less pointer-ness. The list goes on, I’m sure.

I have a proof-of-concept bug open for nsEffectiveTLDService.cpp; I have been hesitant to dive deeply into other places; for one, the above bug doesn’t seem to have been touched for several weeks, and for two…

Is all this really worth it? I think it is; smaller binaries and faster startup times are a good thing in my book. I realize we have elfhack on mobile for reducing the size and cost of relocations; I say it’d be better to not pay for them in the first place. We also support other platforms where tricks like elfhack don’t work. But what do you think?


11 Nov 11

linker hacking for improved code locality

Hm, haven’t posted in a while.  Having your tonsils out will do that!

Mike Hommey recently blogged on some binary layout issues he’d run into with Thumb-2 binaries.  For the last week or so, I’ve been working on one of those problems: Thumb-2/ARM trampolines.

memcpy calls, how do they work? After all, code in your executable or shared library can’t directly call memcpy, since the address of memcpy is unknown until runtime. You can handle this in various ways; the solution adopted for ELF systems is something called a procedure linkage table, or PLT. (The mechanisms on Windows and Mac are similar, but the details are probably somewhat different; I’m not familiar with them.)

The PLT is simply an extra layer of indirection. Each entry in the PLT has a corresponding entry in the global offset table (GOT) associated with it; the GOT entry is the address of the actual function the PLT entry should call. At load time, the GOT entries are set to the address of the entry point of the dynamic linker’s lookup function.

When you call memcpy, then, you’re actually calling the PLT entry associated with memcpy. The first time you make such a call, the dynamic linker is invoked to figure out where memcpy actually lives. Once it has done so, it updates the associated GOT entry with that address and calls the “real” memcpy. Subsequent calls to memcpy then find the actual address in the GOT and jump directly there, bypassing the dynamic linker.

(glibc on Linux plays elaborate games so that calls from libc to libc functions don’t go through the PLT; the details of doing so are best left for a separate blog post.)

OK, so onto the problem description. PLT entries on ARM Linux are ARM mode code. Of course, it’s perfectly valid for Thumb code to want to use these PLT entries; it’d be silly to have separate PLT entries for ARM calls and Thumb calls. For non-tail calls, Thumb code can just use blx, which switches to ARM mode and jumps to a given address. But for tail calls, there’s no jump-to-offset-and-switch-mode instruction (bx jumps to an address in a register, not an immediate offset). So some cleverness is required.

In the gold linker, the solution taken is general: whenever we have such a Thumb-to-ARM branch (whether to a PLT entry or otherwise), generate a small stub of code to do the mode switch and branch, then branch to that. Such stubs are generated for various other similar cases, so the linker was already doing such things anyway.

There are two issues with this approach. The first is that the stubs take up extra space; the second is that they’re placed wherever the linker thinks is convenient, which might not be the best place for improving the layout of the compiled code. (The linker does put some effort into reusing stubs, so if you have several Thumb-to-ARM calls/jumps to the same address, the linker will reuse a stub rather than duplicating it willy-nilly.) The second is the real dealbreaker here.

The GNU linker’s approach is more clever: it embeds the necessary stub into the PLT entry itself. Whereas the entry previously looked like:

 22c:	e28fc610 	add	ip, pc, #16777216	; 0x1000000
 230:	e28cca08 	add	ip, ip, #32768	; 0x8000
 234:	e5bcf080 	ldr	pc, [ip, #128]!	; 0x80

it now looks like:

 228:	4778      	bx	pc
 22a:	46c0      	nop			; (mov r8, r8)
 22c:	e28fc610 	add	ip, pc, #16777216	; 0x1000000
 230:	e28cca08 	add	ip, ip, #32768	; 0x8000
 234:	e5bcf080 	ldr	pc, [ip, #128]!	; 0x80

(This works because bx pc actually branches to the PC of the bx insn + 4: i.e. 0x22c.)

So direct jumps from Thumb code use the entry point at 0x228, whereas all other calls and jumps from ARM or Thumb code use the entry point at 0x22c. This method is smaller (4 bytes for the PLT entry stub versus 16-ish bytes for the out-of-line, imperfectly located stub) and plays nicely with section reordering.

We could use GNU ld for our linking needs, but gold has several very nice features (performance, identical code folding, section ordering, incremental linking, etc.) that make its use desirable. So the last week of my time has been spent learning gold well enough to add these small Thumb stubs to PLT entries when necessary. gold can now do this, woohoo!

I believe my code is suitable for use in Mozilla builds. Before being submitted upstream, however, some kinks need to be worked out with regard to incremental linking: gold’s incremental linking support assumes that all PLT entries are the same size, and the above modifications mean that some will be 16 bytes and some will be 12. (A possible cheesy hack would be to make all stubs 16 bytes when incremental linking…) It might also be worthwhile exploring a more general solution for stub positioning so that stubs play nicely with section reordering.


17 Oct 11

reordering plugins and edge profiles

Mike Hommey and I have been working with a linker plugin that uses the callgraph to reorder functions on Linux and Android. One feature of the linker plugin is that it dumps a simplified representation of the callgraph to a text file. The simplified representation is really simple, just edge counts for all the edges in the callgraph. It’s not perfect, since the ordering of caller and callee in the dump file is not always consistent. Also, you get a few spurious functions in there, like __builtin_expect and SSE builtins calling things.  (__builtin_expect just provides a hint to the compiler about how to order basic blocks for more cache-friendly control flow; SSE builtins should compile down to single/few instructions and never actually call something at runtime.)

Nevertheless, looking at the file for libxul.so can be illuminating. Actually, it’s probably more illuminating to narrow things down to the edges with 100k+ call counts and demangled function names. Doing that, we can see that:

Anyway, maybe people have already seen all this stuff, but I certainly hadn’t.


29 Sep 11

pgo startup times with syzygy

In an effort to confirm that we do want all this syzygy goodness in our release builds, I’ve been testing out syzygy on PGO builds (since we do PGO builds on Windows for releases). After removing the PEBKAC and getting a proper PGO build–which took depressingly long–I have mixed results.

First, the good news. On my laptop (Win7, Core i7, SSD), the about:startup numbers look like this:

Version              main  sessionRestored  firstPaint
Base PGO build        265             3152        3012
Optimized PGO build   234             2778        2653

These numbers are really encouraging; they’re actually even better than the initial numbers I posted earlier.  (Though I note that they are universally slower than the earlier numbers…hm….)

There is one curious thing about these builds, though. When you look at the page fault numbers, they suggest a much different story. The (cold) numbers are what you get from starting Firefox just after a reboot; the (warm) numbers are from a second startup after the first.

Version              Hard faults (cold)  Soft faults (cold)  Hard faults (warm)  Soft faults (warm)
Base PGO build                     2507               41219                   8               26100
Optimized PGO build                2264               41488                  14               23017

These numbers are totally contrary to what I saw with non-PGO builds. We’re not consistently lower in the optimized build on either measure. I honestly haven’t thought very hard about what this means yet.

Anyway, that’s the good news. The bad news is that on my desktop (Win XP, Core 2 Quad, mechanical drive), the about:startup numbers look like this:

Version              main  sessionRestored  firstPaint
Base PGO build       1516             8984        8813
Optimized PGO build  1437             9187        8828

(I don’t have the necessary profiling tools on my XP box for doing page fault analysis. I shouldn’t think they’d differ dramatically between the two systems, though.)

This is a little discouraging. The syzygy-optimized build is a little faster off the line, but gets edged out by the base build in getting to the points that actually matter. I haven’t thought terribly hard about these numbers, either. One possibility is that I turned off the XPCOM glue preloading bits, which IIUC are helpful for encouraging XP to keep its hands off your binary during startup. Doing that was necessary for getting postlinking to work properly. If I made that runtime-configurable, then I could run tests with the preloading enabled and see if we win there.

Bottom line: We would win on leading-edge machines, but we wouldn’t see a lot of benefit on older machines.

Also, if Microsoft would add a drop_caches lookalike, that would be fantastic.


20 Sep 11

startup reduction times with syzygy, part 2

In my previous post, I presented some startup timings with syzygy-optimized Firefox binaries.  People asked whether those timings were on my work laptop (quad-core i7, Windows 7, SSD) and I confirmed they were.  Since those timings were encouraging, but not overwhelming, folks suggested that I might try timing things on a more conventional system.

Below are results from testing on my desktop (Core 2 Quad @ 2.6GHz, Windows XP, 7200ish-rpm hard drive):

Version          main  sessionRestored  firstPaint
Trunk build      1484            11515       11125
Optimized build  2562             8812        8703

That’s quite a difference: about a 25% win just from reordering functions.  Much more exciting!


19 Sep 11

startup reduction times with syzygy

People that I’ve told about the work with syzygy that I’ve been doing have, almost universally, two reactions:

  1. That’s cool!  (Thanks; I did very little work for it!)
  2. How does that translate into startup time?

Assuming that a 40% reduction in page faults leads to a 40% reduction in startup time is not reasonable, but surely there should be some reduction in startup time, right?  I finally benchmarked this today using the about:startup extension; these numbers are from cold start, freshly rebooted both times:

Version          main  sessionRestored  firstPaint
Trunk build       125             1733        1671
Optimized build   125             1639        1577

So a 40% reduction in page faults translates into ~6% reduction in startup time; not stellar, but not too bad either.

The next step is making sure this all works with PGO builds on Windows. Then we get to have a discussion about whether to incorporate this into the regular builds and getting all the infrastructure on build machines.


17 Sep 11

notes from the all-hands profiling bof

After talking to a couple people at all-hands, it became clear that writing your own profiler was a popular activity.  (Jeff Muizelaar informed me that last year, the pet project was heap analyzers.  What’s next for 2012?)  A short, non-exhaustive list:

…and so forth.  So it seemed like getting interested parties together to talk about One Profiler to Rule Them All would be good.  And it worked; we had probably 20 people come to the BoF.  Below is my summary/recollection of what we discussed.  All the omissions and/or misrepresentations in the notes are my own; please leave a comment if you feel I left something out or need to describe something better.

What does the ideal profiler look like?

  • Cross-platform
  • Low overhead, both when profiling and not
  • Collects more-or-less complete call stacks (this is fairly easy everywhere except x86/Linux)
  • Understands compiled/interpreted JavaScript and C/C++ code
  • Built with the browser itself, not an external process or loaded via LD_PRELOAD or similar; this means we can ship it with the browser and diagnose problems in the field
  • Pretty pictures for viewing collected data.  Who doesn’t love pretty pictures, right?

Bas Schouten also pointed out that it might be much more efficient to just buy suitable profiling technology from a vendor; there was some skepticism that a profiler fulfilling the desiderata existed, though.

Since Benoit gave a demo of his profiler and his profiler seems likely to get into the tree, most of the discussion centered around that or variations thereof.  Benoit’s profiler works in the usual way via periodic signaling; however, instead of unwinding the call stack, Benoit chose to require annotations (via RAII objects) placed in various parts of the code to indicate what your “callstack” was when a sample was taken.  I believe Steve said his profiler does something similar for JavaScript.  I put “callstack” in quotes because you only get to know whether functions were on the call stack if annotations had been placed in them.

Sprinkling annotations all over the tree sounded like a tedious process.  Somebody pointed out that Chrome does this, though they only place annotations at “entry points” for modules, so you might have one entry point for layout, one for graphics, etc. etc.  That way, given a profile on some random performance bug, you can at least tell who should be exploring the bug further with minimal overhead, since you’re not unwinding and a handful of RAII objects isn’t going to cost much.  Granted, this doesn’t do much for the folks who need to dig deeper, but perhaps we can have other tools for that.

There was some discussion of unwinding instead of annotations.  Unwinding is reasonably cheap when using libunwind and caching decoding of the unwind information; it’s even cheaper when you can just walk frame pointers.  The only wrinkle is you aren’t guaranteed to have frame pointers or unwind information on x86/Linux, so unwinding is not generally doable there.  Sometimes assembly coders also forget they need to insert unwind annotations, though most if not all of the problematic code in glibc, at least, has been so annotated.  Taras Glek suggested that we could insert RAII objects before calling out to code that we don’t control and to make those objects record something about frame/stack pointers so that we could unwind around third-party code if necessary.  I don’t believe we came to a consensus on using unwinding instead of or in addition to annotations.

I can’t recall if we talked extensively about unwinding through JavaScript code.

We didn’t talk about displaying the collected data.  Drawing pretty, understandable pictures is hard.

Benoit and Steve agreed to talk further offline about modifying Benoit’s profiler to understand JavaScript; if you’d like to help with profilers, talk to either of them.  It’d be great to have something in the tree that we can all work with and improve.


08 Sep 11

fewer page faults with syzygy

In my last post, I explained what Syzygy was and discussed some preliminary results from using it with Firefox.  I finally have results of running the whole Syzygy toolchain (instrumentation/profiling/reordering) and some numbers to share.

First off, I mentioned that I couldn’t get the call tracing to work right.  That was because I hadn’t installed Syzygy’s counterpart, the Sawbuck log viewer.  That’s the bit that contains the trace provider for the call tracer; you can download Sawbuck or you can install the src/sawbuck/Debug/sawbuck.msi module from your own build.

Secondly, these numbers need to be collected with a change to browser/app/nsBrowserApp.cpp to turn preloading off; otherwise the preloading of xul.dll will defeat the tracing mechanism.  With an eye to eventually getting this into the Mozilla build process, I’m not sure how well that bit will work out; perhaps preloading could be made dependent on an environment variable.  We can then turn off preloading during the build + postlink and get the benefits of a properly laid-out xul.dll without hacks in the source.

With my laptop (Windows 7, Intel SSD), a cold startup of a trunk build of Firefox results in about 2300 hard page faults and 43000 soft page faults.  These terms are what Windows uses; other operating systems call them major and minor page faults, respectively.  In any event, hard page faults are what we care about, because those result in talking to the hard drive.

After running the optimize program that comes with Syzygy–which automates the running of the instrumenter, the collection of profile data from the application, and reordering the library(s) in question–cold startup results in about 1400 hard page faults and 27000 soft page faults.  Like the Chrome folks saw, this is about a 40% reduction in hard page faults; soft page faults in this scenario are reduced to about what you’d see with warm startup.


26
Aug 11

making firefox work with syzygy

The good folks at Google have written a clever tool called Syzygy, which is part of the Sawbuck project.  The best summary of Syzygy comes from its design document:

Syzygy is a suite of tools to perform profile-guided, function-level reordering of 32-bit Windows PE executables, to optimize their layout for improved performance, notably for improved paging patterns.

Google wrote Syzygy for use with Chrome, but the tool is equally applicable to any large application where you want to improve performance…like Firefox.  In this case, we’re concerned with improving the layout of libxul, as that’s where the bulk of the Firefox code lives.  Working with Syzygy involves four major steps:

  1. Instrumenting the application binary in question;
  2. Running the application to collect profile data (function addresses along with invocation time);
  3. Passing the profile data through an ordering generator, which comes up with the order in which functions should be laid out in the optimized binary; and finally
  4. Relinking the application binary using the ordering from step 3.

Step 1 is pretty easy; Firefox just needs to be compiled with Visual Studio’s /PROFILE switch to ensure that the instrumenter has all the information it needs.  Steps 3 and 4 are likewise straightforward.

Step 2 appears to be the tricky part.  Being good–lazy–computer programmers, the Google folks wrote a number of scripts and programs to automate this process, as well as some benchmarking infrastructure.  However, the scripts are written with Chrome in mind; many places have Chrome-specific bits hardcoded.  This is, of course, totally understandable, but it makes using those scripts with other programs difficult.

Over the past couple of weeks, I’ve been working at making Syzygy cooperate with Firefox.  If you’re interested, you can see the modifications I’ve made in my sawbuck github project.  Things are working well enough that I can now run:

Debug/py/Scripts/benchmark --user-data-dir flobbity --no-preload --no-prefetch ~/ff-build/dist/bin/firefox.exe

(The --no-{preload,prefetch} options are required to work around Chrome-specific code that didn’t seem worth ripping out; the --user-data-dir specifies what profile to use when launching Firefox.)  After waiting for a minute or two, the benchmark script reports:

RESULT firefox.exe: SoftPageFaults= [23495, 32139, 23356, 23343, 23299, 23167, 23063, 23141, 23113, 23267]
RESULT firefox.exe: HardPageFaults= [1158, 10, 3, 3, 4, 2, 2, 2, 3, 2]

This is for an unoptimized binary, of course.  You can clearly see the OS’s page cache at work in runs after the first.

The scripts are not quite perfect yet.  In particular, the call traces necessary to perform reordering don’t seem to be generated, for some peculiar reason that I haven’t ferreted out.  Also, the script will indiscriminately kill any Mozilla-related apps running along with the Firefox instances being benchmarked; I couldn’t find any good way to limit the killing to windows associated with a particular profile.  (IIUC the Chrome code correctly, it sets the window text of a hidden window to the full path to the profile directory in use.)  But a good bit seems to work; hopefully progress will come faster now that the groundwork has been laid.


24 Aug 11

Hello world!

My name is Nathan Froyd and I work in Taras Glek’s group on performance-related things.  I’m working remote; I live in the wonderful city of Indianapolis, Indiana.

I started working at Mozilla back in mid-June, but took quite some time off due to cardiac arrest from streptococcal myocarditis.  Fortunately I came through that OK, and folks at Mozilla have been very helpful and understanding during my hospitalization and recovery.  I’ve been back to work for a couple of weeks now and finally feel like a Mozilla employee. 🙂

In my previous work, I worked on the GNU toolchain at CodeSourcery: GCC optimizations, both general and PowerPC-related; maintaining the PowerPC ports internally; C++ frontend diagnostic improvements, like function overload resolution failure explanation and better missing semicolon diagnostics; GDB porting; and supporting the toolchain on popular embedded platforms (across the x86, ARM, PowerPC, MIPS, SPARC, and SH architectures).

I also worked on some of the software architecture bits of GCC.  One of the patches I’m happiest with was a patch series to slim down how GCC represents expressions and constants internally.  I’m looking forward to working on similar software architecture patches at Mozilla.

Outside of work, I enjoy spending time with my family; I have three daughters and they keep my wife and me busy!  (I like to say that I am thoroughly outnumbered in my house–even the cat is a girl.)  I love to read: sci-fi, fantasy, history, philosophy, and theology books all grace my bookshelves.  When I do feel like programming outside of work, I enjoy working with Common Lisp and SBCL.  And finally, I recently started playing World of Warcraft once again.