28
Jan 14

finding space savings

My last post talked about trimming down JSJitInfo.  But it left unanswered (and even unasked!) the question of how one might find things like this.  Herein follows a short list of my strategies, with a large amount of explanation, for finding things that take up too much space.

The first thing to look at is what objects (as distinct from functions, see below) are taking up a lot of space (on Linux/Android):

readelf -sW dist/bin/libxul.so | awk '$4 == "OBJECT" { print }' | sort -k 3 -n -r | head -n 50 | c++filt

(I don’t know of a good way of doing this on Mac or Windows. Mach-O symbol tables on Mac don’t contain the sizes of objects like ELF symbol tables do. The sizes are derivable from the addresses, of course, but that’s non-trivial code to write and insert into the pipeline above. dumpbin on Windows appears to print sizes of symbols with /symbols, but its output is quite verbose, and there appears to be some indirection between a symbol and the actual data associated with it. If anybody has non-Linux/Android insight here, I would appreciate it!)

That pipeline is a mouthful, so breaking it down:

  • readelf -sW dist/bin/libxul.so: The -s argument tells readelf to print the symbol table from libxul.so. The -W argument says to format it for a wide display, so the long symbol names that are so common with C++ programming don’t get cut off.
  • awk '$4 == "OBJECT" { print }': The lines readelf prints out look like this:

    3: 00000000021ef9c0 0 FUNC GLOBAL DEFAULT 13 _fini@@xul29a1

    The interesting ones for our purposes are the third field (the size of the object), the fourth field (the kind of object), and the last field (the name of the object). The awk invocation is selecting all the symbols with a type of OBJECT.

  • sort -k 3 -n -r: Now that we have all of the information about the symbols, let’s sort them by their size. -k 3 selects the third field, which is the size, and -n says that we have a numeric field to sort on. -r is just for convenience to list the biggest symbols first.
  • head -n 50 selects the top fifty biggest symbols.
  • c++filt converts all the mangled C++ symbol names to something more human-readable. So instead of _Z25js_GetterOnlyPropertyStubP9JSContextN2JS6HandleIP8JSObjectEENS2_I4jsidEEbNS1_13MutableHandleINS1_5ValueEEE@@xul29a1, you get js_GetterOnlyPropertyStub(JSContext*, JS::Handle<JSObject*>, JS::Handle<jsid>, bool, JS::MutableHandle<JS::Value>)@@xul29a1, which is more understandable.

You don’t have to look at libxul, of course: you can look at whatever objects or executables are of interest to you.

Once you have this information in hand, you can start investigating. Maybe the object is an array of structures that need better bitfield packing.  Maybe the object is large and doesn’t get used.  Maybe string data could be packed more efficiently, particularly if embedded pointers require relocations.  Maybe we could use smaller datatypes to store values (saved nearly a megabyte on 32-bit platforms!).  Maybe a structure’s fields just need to be reordered to pack better, as in the example below.  Maybe we have large, duplicated tables lying around.
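
To illustrate the reordering case, here’s a made-up structure (not one from Gecko) on a typical 64-bit platform:

#include <stdint.h>

// 24 bytes: each bool is followed by padding so the next field lands
// on its natural alignment boundary.
struct Wasteful {
    bool mFlag1;      // 1 byte + 7 bytes of padding
    void* mPointer;   // 8 bytes
    bool mFlag2;      // 1 byte + 3 bytes of padding
    uint32_t mCount;  // 4 bytes
};

// 16 bytes: the same fields, ordered largest-first, so most of the
// padding disappears.
struct Packed {
    void* mPointer;   // 8 bytes
    uint32_t mCount;  // 4 bytes
    bool mFlag1;      // 1 byte
    bool mFlag2;      // 1 byte + 2 bytes of tail padding
};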

For discovering whether structures are wasting space, I recommend pahole.  Specifically, I recommend using my fork of pahole, which contains some build fixes for non-Fedora systems and some improvements so pahole copes with modern C++ debug information better. My experience with pahole suggests that its options for only selecting structures with holes are buggy and unhelpful, so you’re going to have to examine the full output. Using the -M option to avoid printing out all the methods on individual objects is tremendously useful.

The second, but less fruitful, strategy is to look for large functions, rather than large objects:

readelf -sW dist/bin/libxul.so | awk '$4 == "FUNC" { print }' | sort -k 3 -n -r | head -n 50 | c++filt

This is less effective, because functions are often harder to trim down: there’s some amount of work that needs to be done and reducing that is difficult. The biggest functions also tend to be quite a bit smaller than the biggest objects in the program, so there’s less room for winning space back.

Even so, maybe you could find cases of excessive inlining; that’s a little more difficult than just looking at the symbol table, but the principles are the same. Or you could find cases where lookup tables should be used instead of long if-else chains, as sketched below.  Or macro-constructed case statements could be made smaller.
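
To sketch the lookup-table idea with a contrived example (not code from the tree):

// Before: a long chain of compare-and-branch sequences.
const char* StatusName(int status)
{
    if (status == 0) return "ok";
    else if (status == 1) return "warning";
    else if (status == 2) return "error";
    else if (status == 3) return "fatal";
    else return "unknown";
}

// After: the same mapping as a small data table plus a bounds check,
// which is usually more compact.
const char* StatusNameFromTable(int status)
{
    static const char* const kNames[] = { "ok", "warning", "error", "fatal" };
    if (status >= 0 && status < int(sizeof(kNames) / sizeof(kNames[0])))
        return kNames[status];
    return "unknown";
}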

The third strategy recognizes that sometimes individual functions or objects aren’t that large, but there are so many of them that they wind up costing you a significant amount of space.  For instance, if you do:

readelf -sW dist/bin/libxul.so | c++filt | grep '::Read' | grep StandardURLParams

you’ll notice that IPDL codegen has created several copies of the exact same function.  Perhaps you could find a way to teach the IPDL compiler to not do that.  Or maybe there are several copies of functions that differ only slightly; find a way to merge those functions together. Templates are great places for finding code duplication; maybe they could be restructured to duplicate less.

A word of caution: link-time optimization, which gets used on a majority of our major platforms, does essentially the same thing as some of the aforementioned by-hand merging strategies. So just because you get some huge win locally doesn’t mean that a release build is going to exhibit the same win. (I would argue that the by-hand merging is beneficial in any event, but that is a topic for some other time.)

Finally, you need to decide what is a reasonable amount of space savings for you to try and hunt down. There are likely several thousand places in Gecko where we could win back ten or twenty bytes. But finding them all is liable to be difficult. Changing them to effect the savings may require significant amounts of code gymnastics that not all reviewers would accept. My personal threshold is somewhere around 10K and varies with how difficult the patch is to write. If I can save 5K with a two-line, fairly obvious patch, I’ll do it. If I can save 12K, but the patch is more involved, it goes on the todo list for a rainy day.


17
Jan 14

packing structures for fun and profit

When the DOM bindings code first started generating information for the JavaScript JIT about getters and setters, the generated information was rather sparse. JSJitInfo, the container for such information, looked like this:

struct JSJitInfo {
    JSJitPropertyOp op;
    uint32_t protoID;
    uint32_t depth;
    bool isInfallible;
    bool isConstant;
};

On a 32-bit platform, sizeof(JSJitInfo) was 16. You could shrink some of the fields, and even pack things into bitfields, but there weren’t that many getters and setters, so it probably wasn’t worth it.

Of course, nothing ever gets smaller: the bindings code started generating these structures for methods, the amount of information the bindings code wanted to communicate to the JIT expanded, and some of the parallel JS work started using the structure for its own purposes. And a bunch more interfaces got converted to WebIDL, so many more of these “little” structures were generated. Eventually, we wound up with:

struct JSJitInfo {
    enum OpType { ... };
    enum AliasSet { ... };
    union {
        JSJitGetterOp getter;
        JSJitSetterOp setter;
        JSJitMethodOp method;
    };
    uint32_t protoID;
    uint32_t depth;
    OpType type;
    bool isInfallible;
    bool isMovable;
    AliasSet aliasSet;
    bool isInSlot;
    size_t slotIndex;
    JSValueType returnType;
    const ArgType* const argTypes;
    JSParallelNative parallelNative;
};

This structure has several issues:

  • There’s wasted space after the isMovable and isInSlot fields, since the aliasSet and slotIndex fields need to be aligned on 4-byte boundaries (for 32-bit platforms). There’s even more wasted space on 64-bit platforms.
  • It’s less obvious, but JSValueType is only a single byte, so there’s wasted space after the returnType field in order for argTypes to be properly aligned.
  • OpType and AliasSet both have only a few values, yet take up four bytes each.
  • The argTypes field is only used for DOM methods, and even then, it isn’t used for very many.
  • The parallelNative field is never used for any of the DOM bindings code.
  • We’re unlikely to have four billion prototypes in the bindings. The web platform is large, but not that large. Similarly, we are unlikely to have an inheritance graph that’s four billion layers deep, or even several thousand layers deep.

This definition weighed in at 44 bytes on a 32-bit system. With 7k+ of these being generated by the bindings code, and more being added every release cycle, now seemed like a worthwhile time to slim these structures down.

This work has gone on in bug 952777, bug 960653, and yet-to-be-landed bug 960109. After all of those bugs land, we’ll have something that looks like:

struct JSJitInfo {
    union {
        JSJitGetterOp getter;
        JSJitSetterOp setter;
        JSJitMethodOp method;
        JSParallelNative parallelNative;
    };
    uint16_t protoID;
    uint16_t depth;
    uint32_t type_ : 4;
    uint32_t aliasSet_ : 4;
    uint32_t returnType_ : 8;
    uint32_t isInfallible : 1;
    uint32_t isMovable : 1;
    uint32_t isInSlot : 1;
    uint32_t isTypedMethod : 1;
    uint32_t slotIndex : 12;
};

Compared to our earlier version, we’ve addressed every complaint:

  • No internally wasted space due to field alignment, on 32-bit or 64-bit platforms.
  • Enum fields don’t take up a full four bytes of space anymore; they take much closer to the minimum amount of space needed.
  • DOM methods with type information available now have a separate subclass, JSTypedMethodJitInfo, so the argTypes field is only present when needed.
  • The parallelNative field has been moved into the union, so we’re not wasting that field anymore.
  • The protoID and depth fields are now a more reasonable size.

It’s worth noting that several of these fields could be smaller, but there’s no reason for them to be currently, given that shrinking them wouldn’t make the overall structure smaller.

The final size of the structure is 12 bytes on 32-bit platforms, and 16 bytes on 64-bit platforms. With 7k+ JSJitInfo structures, each 32 bytes smaller than before on 32-bit platforms, that works out to ~220K of space saved in a 32-bit libxul.  That’s almost 1% of the entire size of libxul, which is an extremely pleasant win for a size optimization.  Smaller downloads, less space on your device, and less memory at runtime!

If you’re interested in examining how much space particular structures take up, you can use pahole for Linux or struct_layout for Mac. I’m not aware of any similar program for Windows, though I imagine such a beast does exist. These programs work by examining the debugging information generated by the compiler and displaying the structure layout, along with any “holes” in the structure. pahole will also tell you about cacheline boundaries, so that you can pack frequently-accessed members in the same cacheline to minimize cache misses.

Given the above (and that there have been other wins, somewhat less spectacular in nature), I’ve wondered whether it’s worth adding some sort of annotations, either in-source or off to the side, about how big we expect certain structures to be or how much internal waste we expect them to have. Then people modifying those structures (and the reviewers as well) would be aware that they were modifying something space-critical, and could take appropriate steps to minimize the space increase or even trim things down a bit.
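
One low-tech version of such an annotation (my speculation; I’m not aware of anything like this in the tree today) is a compile-time assertion next to the structure definition:

// Anybody growing JSJitInfo now has to consciously update this
// assertion, which makes size changes visible to both the patch
// author and the reviewer.  (The expected sizes here are the ones
// from above: 16 bytes on 64-bit platforms, 12 bytes on 32-bit.)
static_assert(sizeof(JSJitInfo) == (sizeof(void*) == 8 ? 16 : 12),
              "JSJitInfo changed size; was that intentional?");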


09
Jan 14

on old releases

Gregory Szorc recently wrote a laundry list of reasons for ditching support for old Python releases.  I think this list of reasons to upgrade misses the larger point of providing software for other people: You do not get to tell your users what to do.

Maybe those users don’t have sufficient control over their working environments to install a new version of Python.  (Webhosts and university computers are the two examples that spring to mind immediately.  Enterprise environments have similar constraints.)  Maybe those users rely on certain APIs only available in older versions of Python and don’t wish to take an indeterminate amount of time to rewrite (retest, recertify, etc. etc.) their software.  Maybe those users rely on certain APIs that were changed to operate differently in newer releases and don’t want to engage in an extensive audit of their codebase to fix those incompatibilities.  Maybe those users are using other packages that are incompatible with later Python releases and cannot upgrade.  Maybe those users are just rationally lazy and don’t want to deal with downloading, configuring, and installing a new version of Python, plus dealing with inevitable fallout, when the old version has worked Just Fine for everything else.  The list goes on and on.  (Of course, these reasons are not applicable to just Python; feel free to substitute your favorite language X or favorite extensible program X or what have you.)

Microsoft is the best example I can think of for backwards compatibility.  New Windows releases have gone to significant lengths to make it possible to run applications for older versions of Windows, whatever faults those applications may have.  Raymond Chen’s blog documents a number of extraordinary things Windows does under the hood to make outright buggy and/or undocumented-internals-groveling programs that worked under previous versions of Windows still work under newer ones.  You can, of course, argue that this has taken significant engineering effort that could have been put to use doing “better” things.  But Microsoft’s evaluation of “better” clearly includes “how much pain will this inflict on our customers?”

And that’s the point: you are trading off some perceived (and make no mistake, it is perceived) benefit to yourself as a developer of software X against the agony of some unknown number of users as your changes break their world.  Maybe you’ve decided that this agony is small enough: I’ve seen some great examples of this from the DOM/Content folks as they install Telemetry probes to measure how many users might be impacted by backwards-incompatible changes.  Maybe you’ve decided that you have a small enough community of users and they are all enthusiastic enough to adopt whatever you decided to do, even if it breaks things.  (You might be surprised at just how “small” your community is, though.)  Maybe you’ve decided that ditching the old stuff really is necessary and appropriate (hi Australis!).  Maybe you’ve decided that you simply don’t care about the agony of your users, or that the sharp spike in curses uttered against your household doesn’t matter.  (If you take this last approach, please don’t write any software that I use.)

I realize that using the new shiny stuff in Python, or C++, or whatever generally makes life nicer as a developer.  But I think developers (myself included) tend to systematically underestimate the amount of agony that user-facing changes cause, even when we know we are prone to doing so.


06
Dec 13

reading binary structures with python

Last week, I wanted to parse some Mach-O files with Python.  “Oh sure,” you think, “just use the struct module and this will be a breeze.”  I have, however, tried to do that:

class MyBinaryBlob:
    def __init__(self, buf, offset):
        self.f1, self.f2 = struct.unpack_from("BB", buf, offset)

and such an approach involves a great deal of copy-and-pasted code.  And if you have some variable-length fields mixed in with fixed-length fields, using struct breaks down very quickly. And if you have to write out the fields to a file, things get even more messy. For this experiment, I wanted to do things in a more declarative style.

The desire was that I could say something like:

class MyBinaryBlob:
    field_names = ["f1", "f2"]
    field_kinds = ["uint8_t", "uint8_t"]

and all the necessary code to parse the appropriate fields out of a binary buffer would spring into existence. (Automagically having the code to write these objects to a buffer would be great, too.) And if a binary object contained something that would be naturally interpreted as a Python list, then I could write a minimal amount of code to do that during initialization of the object as well. I also wanted inheritance to work correctly, so that if I wrote:

class ExtendedBlob(MyBinaryBlob):
    field_names = ["f3", "f4"]
    field_kinds = ["int32_t", "int32_t"]

ExtendedBlob should wind up with four fields once it is initialized.

At first, I wrote things like:

def field_reader(fmt):
    size = struct.calcsize(fmt)
    def reader_sub(buf, offset):
        return struct.unpack_from(fmt, buf, offset)[0], size
    return reader_sub

fi = field_reader("i")
fI = field_reader("I")
fB = field_reader("B")

def initialize_slots(obj, buf, offset, slot_names, field_specs):
    total = 0
    for slot, reader in zip(slot_names, field_specs):
        x, size = reader(buf, offset + total)
        setattr(obj, slot, x)
        total += size

class MyBinaryBlob:
    field_names = ["f1", "f2"]
    field_specs = [fB, fB]

    def __init__(self, buf, offset):
        initialize_slots(self, buf, offset, self.field_names, self.field_specs)

Field readers return the size of what they read, to make it straightforward to add variable-sized fields, not just fixed-width fields that can be parsed by struct.unpack_from. This worked out OK, but I was writing out a lot of copy-and-paste constructors, which was undesirable. Inheritance was also a little weird, since the natural implementation looked like:

class ExtendedBlob(MyBinaryBlob):
    field_names = ["f3", "f4"]
    field_specs = [fi, fi]

    def __init__(self, buf, offset):
        super(ExtendedBlob, self).__init__(buf, offset)
        initialize_slots(self, buf, offset, self.field_names, self.field_specs)

but that second initialize_slots call needs to start reading at the offset resulting from reading MyBinaryBlob's fields. I fixed this by storing a _total_size member in the objects and modifying initialize_slots:

def initialize_slots(obj, buf, offset, slot_names, field_specs):
    total = obj._total_size
    for slot, reader in zip(slot_names, field_specs):
        x, size = reader(buf, offset + total)
        setattr(obj, slot, x)
        total += size
    obj._total_size = total

which worked out well enough.

I realized that if I wanted to use this framework for writing binary blobs, I’d need to construct “bare” objects without an existing buffer to read them from. To do this, there had to be some static method on the class for parsing things out of a buffer. @staticmethod couldn’t be used in this case, because the code inside the method didn’t know what class it was being invoked on. But @classmethod, which received the invoking class as its first argument, seemed to fit the bill.

After some more experimentation, I wound up with a base class, BinaryObject:

class BinaryObject(object):
    field_names = []
    field_specs = []

    def __init__(self):
        self._total_size = 0

    def initialize_slots(self, buf, offset, slot_names, field_specs):
        total = self._total_size
        for slot, reader in zip(slot_names, field_specs):
            x, size = reader(buf, offset + total)
            setattr(self, slot, x)
            total += size
        self._total_size = total

    @classmethod
    def from_buf(cls, buf, offset):
        # Determine our inheritance path back to BinaryObject
        inheritance_chain = []
        pos = cls
        while pos != BinaryObject:
            inheritance_chain.append(pos)
            bases = pos.__bases__
            assert len(bases) == 1
            pos = bases[0]
        inheritance_chain.reverse()

        # Determine all the field names and specs that we need to read.
        all_field_names = itertools.chain(*[c.field_names
                                            for c in inheritance_chain])
        all_field_specs = itertools.chain(*[c.field_specs
                                            for c in inheritance_chain])

        # Create the actual object and populate its fields.
        obj = cls()
        obj.initialize_slots(buf, offset, all_field_names, all_field_specs)
        return obj

Inspecting the inheritance hierarchy at runtime makes for some very compact code. (The single-inheritance assertion could probably be relaxed to an assertion that all superclasses except the first do not have field_names or field_specs class members; such a relaxation would make behavior-modifying mixins work well with this scheme.) Now my classes all looked like:

class MyBinaryBlob(BinaryObject):
    field_names = ["f1", "f2"]
    field_specs = [fB, fB]

class ExtendedBlob(MyBinaryBlob):
    field_names = ["f3", "f4"]
    field_specs = [fi, fi]

blob1 = MyBinaryBlob.from_buf(buf, offset)
blob2 = ExtendedBlob.from_buf(buf, offset)

with a pleasing lack of code duplication.  Any code for writing can be written once in the BinaryObject class using a similar inspection of the inheritance chain.

But how does parsing additional things during construction work? Well, subclasses can define their own from_buf methods:

class ExtendedBlobWithList(BinaryObject):
    field_names = ["n_objs"]
    field_specs = [fI]

    @classmethod
    def from_buf(cls, buf, offset):
        obj = BinaryObject.from_buf.__func__(cls, buf, offset)
        # do extra initialization here
        for i in range(obj.n_objs):
            ...
        return obj

The trick here is that calling obj = BinaryObject.from_buf(buf, offset) wouldn’t do the right thing: that would only parse any members that BinaryObject had, and return an object of type BinaryObject instead of one of type ExtendedBlobWithList. Instead, we call BinaryObject.from_buf.__func__, which is the original, undecorated function, and pass the cls with which we were invoked, which is ExtendedBlobWithList, to do basic parsing of the fields. After that’s done, we can do our own specialized parsing, probably with SomeOtherBlob.from_buf or similar. (The _total_size member also comes in handy here, since you know exactly where to start parsing additional members.) You can even define from_buf methods that parse a bit, determine what class they should really be constructing, and construct an object of that type instead:

R_SCATTERED = 0x80000000

class Relocation(BinaryObject):
    field_names = ["_bits1", "_bits2"]
    field_specs = [fI, fI];
    __slots__ = field_names

    @classmethod
    def from_buf(cls, buf, offset):
        obj = BinaryObject.from_buf.__func__(Relocation, buf, offset)

        # OK, now for the decoding of what we just got back.
        if obj._bits1 & R_SCATTERED:
            return ScatteredRelocationInfo.from_buf(buf, offset)
        else:
            return RelocationInfo.from_buf(buf, offset)

This hides any detail about file formats in the parsing code, where it belongs.

Overall, I’m pretty happy with this scheme; it’s a lot more pleasant than bare struct.unpack_from calls scattered about.


21
Nov 13

my git workflow

Mark Hammond recently started an etherpad about how people work with git. Rather than commenting there, I thought I’d blog about my workflow instead.

First piece: magit.  If you use emacs and git, and you don’t use magit, you are missing out.  Highly recommended.  I don’t use the git command line for common operations anymore; I do everything through magit.  magit’s interactive staging is a big improvement over git add -i: you can stage files, hunks, or individual regions selectable by normal Emacs point-and-mark.  I also really like magit’s rebasing support, as I use rebase a lot.

Second piece: git-bz-moz.  I was reluctant to use this at first, but it’s been a huge boon in posting patches directly from my editor.  Setup is pretty straightforward; I have:

[bz]
        firefox-profile = other
        default-tracker = bugzilla.mozilla.org
        default-product = General
        default-component = Core
        default-assigned-to = nfroyd@mozilla.com
        add-url-method = subject-prepend:Bug %d -

in my ~/.gitconfig, and git-bz is smart enough to go grovel through my Firefox profile to get my Bugzilla login information. Having it auto-mark patches with the appropriate bug numbers during export is also helpful. (It’s smart enough to remove them when adding descriptions for the patches in Bugzilla.) My only complaint is that attaching patches to a bug doesn’t auto-assign the bug to you, like hg bzexport does.

Third piece: I wrote a script I call export-patches for sending stuff to try, committing to inbound, and exporting patches for uplift.  (I used to use it for formatting patches for bugzilla, but stopped doing that after learning git-bz.)  I can push things to try:

export-patches -h ${mc_repo} -t '-b do -p all -u all -t none' ${start}..${end}

or push things to inbound:

export-patches -h ${mi_repo} -r ehsan -b 1 -c ${start}..${end}

It supports per-patch reviewers, too (along with a couple of other things I won’t demonstrate here):

export-patches -h ${mi_repo} -r bz:glandium -b 1 -c ${start}..${end}

The -b 1 convention is a leftover from when I wasn’t tagging my patches with bug numbers until commit.  (The script complains if bug numbers aren’t specified on the command line for commits.)  git-bz absolved me of the need to do that, so I should probably fix the script.

Third-and-a-half piece: export-patches takes some pains (not as many as it could) to ensure that whatever repo I’m using gets its patch queue wiped if things fail.  Less monkeying around with mercurial commands is a win in my book.

Fourth piece: One big branch of work. I used to use separate branches for bugs. However, I found that I was working on enough things simultaneously that switching between branches, rebasing if necessary, clobbering if necessary (often), and so forth was just too disruptive for day-to-day stuff. I’ll use branches if I have really disruptive things that I can’t integrate piecemeal into my one big branch, but generally everything goes into one branch. I ensure things build locally and I make occasional efforts to ensure appropriate tests still work locally, but try is where most of my testing gets done nowadays.

Fourth-and-a-half piece: I never check out master.  I always fetch origin, and then rebase off of origin/master.  My branches all track origin/master, so magit will tell me exactly what commits I have remaining to go upstream.

Annoyances: If I commit patches and those patches later get backed out, then when I next pull from mozilla-central and rebase, the patches I pushed disappear from my branch. I haven’t looked too deeply into why this happens, but I’d really like to fix it.


05
Nov 13

ipdl syntax changes for types coming from C++

Over the weekend, I landed bug 918651, which changes the syntax for how you inform the IPDL compiler about types defined in C++.  Previously, you did:

include "mozilla/HeaderFile.h";
...
using typeFromHeaderFile;
...

The using declaration informs the IPDL compiler that typeFromHeaderFile may appear in places types can normally appear.  The include directive is so the generated headers know what to #include for the C++ compiler to be informed about typeFromHeaderFile.

This scheme has a couple of drawbacks:

  • The header files from the include directives aren’t connected to the using declarations in any way.  Those headers might only include the relevant type(s) incidentally, which doesn’t help in unraveling Gecko’s include dependencies.
  • The generated IPDL headers don’t necessarily need the full definition of typeFromHeaderFile.  For structs or classes, the generated headers can get by with a simple forward declaration.  The full definition is only needed in the generated source files.  The above syntax, however, doesn’t enable any sort of forward declaration magic.

To address both of these issues, the syntax for using declarations was changed.  For structs, you should say:

using struct structFromHeaderFile from "mozilla/HeaderFile.h";

The syntax for classes is similar:

using class classFromHeaderFile from "mozilla/HeaderFile.h";

In these cases, the IPDL compiler will forward-declare the types where appropriate and only #include the header in the generated source files.  Additionally, the compiler is intelligent enough to #include the header in the generated headers if it is required. For instance, if there is a struct or a union defined in the header file that requires a struct or a class from a using declaration, the relevant header will be included in the generated header instead of the generated source file.
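
Concretely, the effect on the generated code looks something like this (a hand-written sketch, not actual IPDL compiler output):

// In the generated protocol header, a forward declaration suffices:
struct structFromHeaderFile;

// Only the generated .cpp file, which needs the full definition,
// pays for the #include:
#include "mozilla/HeaderFile.h"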

Finally, if you need an enum type or a typedef, you should say:

using typeFromHeaderFile from "mozilla/HeaderFile.h";

This case functions similarly to what we had before, except that the header file is now closely associated with the type; ideally, that will encourage people to use the correct header (i.e. the one that defines the type).  While you are able to use this syntax with struct or class types, you should use the using struct or using class syntax, as appropriate, so that forward declarations are generated.

There are still a few instances of include directives for C++ headers in IPDL files; those should be considered a bug, and the include directive for C++ headers should not normally be needed going forward.

This change didn’t completely address the original issue of the bug (touching headers in gfx/ causes source files in netwerk/ to rebuild), but it moved us a lot closer to fixing those sorts of issues.


05
Nov 13

the performance implications of strncpy

Last week, I was working on making Firefox compile for an OS X target on a Linux host.  As part of this effort, I ported Apple’s open-sourced ld64 linker to compile and run on a Linux host.  Since OS X is a BSD-derived operating system, ld64 made use of the strlcpy and strlcat functions, designed to be safer than the strcpy/strncpy/strcat/strncat functions.  Linux’s libc doesn’t implement strlcpy and strlcat, so I had to find replacements.  strlcpy seemed easy enough, as a presentation on maloader suggested:

size_t strlcpy(char* dst, const char* src, size_t size)
{
    dst[size - 1] = '\0';
    strncpy(dst, src, size - 1);
    return strlen(dst);
}

I cribbed strlcat from someplace else and went about my merry way.

After I got ld64 to compile, then link simple programs, and worked my way through some configure bugs for non-Android cross-compiles, I ran into a problem: the JavaScript shell was taking 8 minutes to link.  This was unacceptable; it meant libxul was going to take over an hour, maybe two, to link, which nobody would be happy about.  The equivalent link of the JavaScript shell on my Mac mini took about two seconds.

I started investigating what was going on with perf, just checking into what ld64 was doing for parts of those 8 minutes.  99%+ of the time was being spent inside strncpy.  Hm, that’s strange.

I fiddled around with a couple different things, none of which had much impact.  Then I took a close look at the code calling strlcpy (yes, all the time in strlcpy was through this function, which should have been a red flag in the first place):

int32_t StringPoolAtom::add(const char* str)
{
    int32_t offset = kBufferSize * _fullBuffers.size() + _currentBufferUsed;
    int lenNeeded = strlcpy(&_currentBuffer[_currentBufferUsed], str, kBufferSize-_currentBufferUsed)+1;
    if ( (_currentBufferUsed+lenNeeded) < kBufferSize ) {
        _currentBufferUsed += lenNeeded;
    }
    else {
        int copied = kBufferSize-_currentBufferUsed-1;
        // change trailing '\0' that strlcpy added to real char
        _currentBuffer[kBufferSize-1] = str[copied];
        // alloc next buffer
        _fullBuffers.push_back(_currentBuffer);
        _currentBuffer = new char[kBufferSize];
        _currentBufferUsed = 0;
        // append rest of string
        this->add(&str[copied+1]);
    }
    return offset;
}

In this code, kBufferSize is 16MB, so the size parameter passed to strlcpy can be rather large compared to the size of the string being copied to the destination.

I forget exactly where I read it, but I saw some small blurb about glibc’s strncpy having the crazy behavior of null-padding the entire destination buffer, rather than just appending a single null terminator. Since my strlcpy was implemented by calling out to strncpy, just that one function above would be writing hundreds or even thousands of megabytes of zeros more than required. That would definitely slow things down!

(I later discovered that this “crazy behavior” is documented in the strncpy man page and is actually required by standards.  Indeed, the original strlcpy paper cites this problem of strncpy as a motivating factor for strlcpy.  It is the only way the performance figures they give in the paper are actually relevant to their point.  But silly me, I just assumed I knew how strncpy works rather than actually reading documentation. I am really curious how this behavior of strncpy came to be and why folks thought it was worth preserving.)

Once I fixed the strlcpy implementation to do things properly, cross link times became comparable to native link times.  And then I could think about linking libxul in a reasonable amount of time. (And I did link libxul in a reasonable amount of time, if you read through the bug. And it even runs on a Mac!)
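
For reference, a strlcpy that avoids strncpy looks something like the following (a sketch along the lines of the BSD implementation, not necessarily the exact replacement I used):

#include <string.h>

size_t strlcpy(char* dst, const char* src, size_t size)
{
    size_t srclen = strlen(src);
    if (size != 0) {
        // Copy only the bytes that fit, then terminate.  Crucially, the
        // rest of the destination buffer is never touched, so a huge
        // size argument paired with a short string costs nothing extra.
        size_t tocopy = (srclen < size - 1) ? srclen : size - 1;
        memcpy(dst, src, tocopy);
        dst[tocopy] = '\0';
    }
    // Return the length of the string we tried to create, which is how
    // callers (like StringPoolAtom::add above) detect truncation.
    return srclen;
}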

Lesson learned: don’t use strncpy!


22
Oct 13

I got 99 problems, but they’re all due to template over-instantiation

TL;DR: Small C++ code change with templates has large impact (2% libxul codesize reduction).

nsTArray has an inheritance structure that looks like this:

template<class E>
class nsTArray : public nsTArray_Impl<E, nsTArrayInfallibleAllocator>
{ ... };

template<class E, class Alloc>
class nsTArray_Impl : public nsTArray_base<Alloc, nsTArray_CopyElements<E> >,
                      public nsTArray_TypedBase<E, nsTArray_Impl<E, Alloc> >
{ ... };

// Most classes are copied with memmove and friends.
// nsTArray_CopyElements can be specialized, but we will ignore those cases here.
template<class E>
struct nsTArray_CopyElements : public nsTArray_CopyWithMemutils {};

…and so forth. The separation into classes and helper classes is there to commonify code where possible, and to let the magic of template specialization select appropriate definitions of things at compile time.

The problem is that this worked a little too well. nsTArray_CopyElements<uint32_t> is a different class from nsTArray_CopyElements<int32_t>, even though both of them share the same base class and neither one adds extra functionality. This means that nsTArray_base must be instantiated separately for each element type, even though the behavior of nsTArray_base is completely independent of the element type.  And so methods were being unnecessarily duplicated, which impacted compile times, download size, startup and runtime performance, and so on.

[A sufficiently smart toolchain could make this problem go away: the linker can recognize duplicated methods and functions at the assembly level, discard all but one instance, and then redirect all the calls to the lone instance. (Bonus question: why does it have to be done by the linker, and not the compiler?  It’s certainly more effective, but there is a correctness issue as well.) MSVC calls this “identical COMDAT folding” and the gold linker on Linux implements a similar optimization called “identical code folding”. Indeed, we enable this optimization in the linker when it’s available on our release builds, precisely because it delivers a significant code size improvement. But in a cross-platform project, you can’t necessarily rely on the linker to fix up these sorts of issues.]

In our case, however, fixing the problem is straightforward. Instead of creating new classes to describe copying behavior, we’ll use template specialization to pick the appropriate class at compile time (the class that would have been the subclass of nsTArray_CopyElements in the above scheme) and use that class directly. Then we’ll have either nsTArray_base<Alloc, nsTArray_CopyWithMemutils> (the overwhelmingly common case), or some other specialization when array elements need special treatment:

template<class E>
struct nsTArray_CopyChooser {
  typedef nsTArray_CopyWithMemutils Type;
};

// Other specializations of nsTArray_CopyChooser possible...

template<class E, class Alloc>
class nsTArray_Impl : public nsTArray_base<Alloc, typename nsTArray_CopyChooser<E>::Type >,
                      public nsTArray_TypedBase<E, nsTArray_Impl<E, Alloc> >
{ ... };

Implementing this in bug 929494 reduced libxul’s executable code size by over 2% on Android, which is a hefty size win for such a simple change.


05
Oct 13

faster c++ builds by building bigger groups of code

There have been a lot of spectacular build changes going into the tree lately; my personal builds have gotten about 20% faster, which is no mean feat.  One smaller change that I’ve implemented in the last couple of weeks is compiling the DOM bindings and the IPDL IPC code in what we’ve been calling “unity” mode.

The idea behind unity mode is to compile a C++ file that #includes your actual C++ source files.  What’s the win from this?

  • Fewer compiler processes launched.  This is a good thing on Windows, where processes are expensive; it’s even a good thing on platforms where process creation is faster.
  • Less disk I/O.  The best case is if the original C++ source files include a lot of the same files.  Compiling the single C++ file then includes those headers only once, rather than once per original C++ source file.
  • Smaller debug information.  On Linux, at least, every Gecko object file compiled with debug information is going to include information about basic types like uint32_t, FILE, and so forth.  Compiling several files together means that you cut down on multiple definitions of things in the debug information, which is good.
  • Better optimization.  The compiler is seeing more source code at once, which means it has more information to make decisions about things like inlining.  This often leads to things not getting inlined (perhaps because the compiler can see that a function is called several times across several different files rather than one time in each of several source files).
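
Concretely, a unity-mode source file is nothing more than a list of #include directives (a hypothetical sketch; the real files are generated by the build machinery, and the names here are made up):

// UnifiedBindings0.cpp: compiles sixteen generated binding files with
// a single compiler invocation.
#include "AudioContextBinding.cpp"
#include "BlobBinding.cpp"
#include "CanvasRenderingContext2DBinding.cpp"
// ...and thirteen more.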

It’s a little like link-time code optimization, except that your compiler doesn’t need to support LTO.  SQLite, in-tree and otherwise, already provides an option to compile everything as one big source file and claims ~5% speedup on benchmarks.

The concrete wins are that the DOM bindings compile roughly 5x faster, the IPC IPDL bindings compile roughly 2x faster, libxul got 100k+ smaller on Android, and that the Windows PGO memory required went down by over 4%.  (The PGO memory decrease was just from building DOM bindings in unity mode; the effect from the IPC code appears to have been negligible.)  The downside is that incremental builds when WebIDL or IPDL files are modified get slightly slower.  We tried to minimize this effect by compiling files in groups of 16, which appeared to provide the best balance between full builds and incremental builds.

The code is in moz.build and it’s not specific to the DOM bindings or IPC IPDL code; it will work on any collection of C++ source files, modulo issues with headers being included in unexpected places.  The wins are probably highest on generated code, but I’d certainly be interested in hearing what happens if other bits of the tree are compiled in unity mode.


16
Aug 13

recent android startup performance work

We’re still whittling away at the startup time on Android.  Shilpan Bhagat took a crack at speeding up the parsing of profiles.ini.  Brian Nicholson debugged a crash-tastic patch to initialize the C++ side of Firefox for Android sooner, and seems to have fixed the crashiness that plagued such approaches before.  Starting the initialization of Gecko itself sooner is important because that can happen in parallel with the work happening in Java on multi-core devices.  And when I chatted with Sriram Ramasubramanian yesterday about some menu initialization that was showing up in profiles, he assured me that startup time is definitely being taken into account when new features are added, which is great!

Before I get to the other work that’s been going on, I need to talk about how cross-language communication works in Firefox for Android.

Firefox for Android uses three primary languages: C++ for Gecko, Java for the Android UI and related bits, and JavaScript for all sorts of things.  Java and C++ play nicely via JNI, and JavaScript and C++ talk to each other via XPCOM.  But what happens when Java needs to talk to JS or vice versa?  The solution that’s been chosen here is that Java constructs and serializes a JSON message and sends it to C++, which then notifies observers that a particular kind of message has been received.  Registered JavaScript observers (you could also do this in C++, of course, but it’s just easier to do it in JavaScript) are then invoked, parse the JSON, and do their thing.  And of course there’s a reverse path for sending things from JavaScript to Java.

With that being said, you might not be surprised that Chris Kitching identified JSON parsing and serialization as slowing down startup. We spend nearly 200ms doing JSON operations on a Galaxy Nexus, and that’s just in Java, never mind JavaScript (though the JavaScript JSON bits are quite fast because they’re backed by C++ code).  Part of this is because the interface that Android uses for JSON is not well-designed, and part of this is because the implementation could be improved (Android does not use the json.org implementation, though both implementations could be improved).  Part of this is also because we do things like JSON-wrap single integers when there are much more efficient ways to communicate such values.  I’ve been fixing up some small hotspots (warmspots?), starting with sending telemetry information and Java’s requests for preferences information.

For some things, however, we do have to deal with JSON.  Firefox for Android twiddles with the sessionstore information before forwarding it to JavaScript, and sending the sessionstore information in any other format than JSON would be…tedious.  The twiddling itself is not particularly expensive, but the reading/parsing/serialization required around it is.  For a smallish session of six tabs, where we only have ~2KB of sessionstore JSON to deal with, reading/parsing/serializing the data takes almost 250ms on a Galaxy Nexus.  This is expensive enough that techniques like parsing it on a background thread might be worthwhile.  It’s not entirely clear whether it helps (different profiling tools say different things, single-core devices may not benefit from it, etc.), though.