29
Jan 14

space saving miscellany

Yesterday’s post on space saving techniques generated a few comments.  It seemed worthwhile to highlight some of them for a wider audience.

  • Various people have pointed out that clang and GCC support a -Wpadded option to warn when padding is necessary inside a structure.  Visual C++ supports warning C4820, which does the same thing; you can enable it by passing /we4820 on the compiler command line.  I’m fairly certain this warning would generate a lot of output, but it might be worthwhile to comb through it and see if anything interesting turns up.  (A small demonstration of -Wpadded appears after this list.)
  • David Major pointed out the /d1reportAllClassLayout switch for Visual C++, which prints the class layout of all the classes in the compilation unit.  If you’re only interested in a single class, you can use /d1reportSingleClassLayout$NAME to narrow the report down to the class named $NAME.  GCC has something similar in -fdump-class-hierarchy.
  • Benoit Girard asked if he could see a list of the 50 largest things on a Linux build.  Forthwith, I present the 50 largest objects and the 50 largest functions in libxul for an x86-64 Linux optimized build.  One thing to note about the objects is that they’re not all in the same section; the seventh field in readelf output tells you the section index.  So for the list of objects linked above, section 15 is .rodata (read-only, shareable data), section 22 is .data (read-write non-shareable data), section 27 is .data.rel.ro (data that needs to have relocations applied at load time, but can be read-only thereafter, e.g. virtual function tables), and section 29 is .bss (zero-initialized memory).  Unsurprisingly, string encoding/decoding tables make up the bulk of the large objects, with various bits from WebRTC, JavaScript, and the Gecko profiler also making an appearance.  Media codec-related functions appear to take up a large amount of space, along with some JavaScript functions and a few things related to the HTML parser.
  • A commenter by the name of “AnimalFriend” correctly pointed out that what you really want to know is which structures both have a lot of instances hanging around and have holes that you could fill.  I don’t know of a good way to answer the first part without adding a lot of instrumentation (though perhaps you could catch a lot of cases by augmenting the MOZ_COUNT_CTOR macro to tell you which structures get allocated a lot). The second part can be answered by something like pahole.
  • Alternatively, you could use something like access_profiler to tell you which fields in your objects get accessed and how often, and then carefully pack those fields into the same cache line.  The techniques access_profiler uses are also applicable to counting allocations of individual objects.  Maybe we should start using something more access_profiler-like instead of MOZ_COUNT_CTOR and friends!  It would definitely be more C++-ish and more flexible, and it eliminates the need to write the corresponding MOZ_COUNT_DTOR.  (A rough sketch of this idea also appears after the list.)
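
To illustrate the -Wpadded warning mentioned above, here is a minimal sketch; the structure is made up for demonstration:

#include <stdint.h>

struct Example {
    char tag;        // 1 byte; the compiler inserts 3 bytes of padding here...
    uint32_t value;  // ...so that this 4-byte field is suitably aligned
};

Compiling this with g++ -Wpadded -c example.cpp (or the clang equivalent) warns about the 3 bytes of padding after tag.  Note that merely swapping the two fields doesn’t help: the structure would then get 3 bytes of tail padding so that its size stays a multiple of 4.  The space only comes back if you fill the hole with other small members or shrink value.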
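
And here is a rough sketch of the construction-counting idea from the last two items.  This is not access_profiler itself, and all the names are hypothetical; it just shows the shape of an approach where a type opts in by inheriting from a counting mixin:

#include <atomic>
#include <cstdint>

template <typename T>
struct InstanceCounter {
    static std::atomic<std::uint64_t> sLiveInstances;
    InstanceCounter() { ++sLiveInstances; }
    InstanceCounter(const InstanceCounter&) { ++sLiveInstances; }
    ~InstanceCounter() { --sLiveInstances; }
};

template <typename T>
std::atomic<std::uint64_t> InstanceCounter<T>::sLiveInstances(0);

// A hypothetical client type; sLiveInstances can be dumped at shutdown
// to see which structures get allocated a lot.
struct MyNode : InstanceCounter<MyNode> {
    MyNode* next;
    std::uint32_t flags;
};

Because the destructor does the decrement automatically, there is no MOZ_COUNT_DTOR-style counterpart to forget.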

28
Jan 14

finding space savings

My last post talked about trimming down JSJitInfo.  But it left unanswered (and even unasked!) the question of how one might find things like this.  Herein follows a short list of my strategies for finding things that take up too much space, with a large amount of explanation.

The first thing to look at is what objects (as distinct from functions, see below) are taking up a lot of space (on Linux/Android):

readelf -sW dist/bin/libxul.so | awk '$4 == "OBJECT" { print }' | sort -k 3 -n -r | head -n 50 | c++filt

(I don’t know of a good way of doing this on Mac or Windows.  Mach-O symbol tables on Mac don’t contain the sizes of objects like ELF symbol tables do.  The sizes are derivable from the addresses, of course, but that’s non-trivial code to write and insert into the pipeline above.  dumpbin on Windows appears to print sizes of symbols with /symbols, but its output is quite verbose, and there appears to be some sort of indirection between the symbol and the actual data associated with it.  If anybody has non-Linux/Android insight here, I would appreciate it!)

That pipeline is a mouthful, so breaking it down:

  • readelf -sW dist/bin/libxul.so: The -s argument tells readelf to print the symbol table from libxul.so. The -W argument says to format it for a wide display, so the long symbol names that are so common with C++ programming don’t get cut off.
  • awk '$4 == "OBJECT" { print }': The lines readelf prints out look like this:

    3: 00000000021ef9c0 0 FUNC GLOBAL DEFAULT 13 _fini@@xul29a1

    The interesting fields for our purposes are the third (the size of the symbol), the fourth (the kind of symbol), and the last (the name of the symbol).  The awk invocation selects all the symbols with a type of OBJECT.

  • sort -k 3 -n -r: Now that we have all of the information about the symbols, let’s sort them by their size. -k 3 selects the third field, which is the size, and -n says that we have a numeric field to sort on. -r is just for convenience to list the biggest symbols first.
  • head -n 50 selects the top fifty biggest symbols.
  • c++filt converts all the mangled C++ symbol names to something more human-readable.  So instead of _Z25js_GetterOnlyPropertyStubP9JSContextN2JS6HandleIP8JSObjectEENS2_I4jsidEEbNS1_13MutableHandleINS1_5ValueEEE@@xul29a1, you get js_GetterOnlyPropertyStub(JSContext*, JS::Handle<JSObject*>, JS::Handle<jsid>, bool, JS::MutableHandle<JS::Value>)@@xul29a1, which is more understandable.

You don’t have to look at libxul, of course: you can look at whatever objects or executables are of interest to you.

Once you have this information in hand, you can start investigating. Maybe the object is an array of structures that need better bitfield packing.  Maybe the object is large and doesn’t get used.  Maybe we could do a better job of packing string data if there are embedded pointers which require relocations.  Maybe we could use smaller datatypes to store values (saved nearly a megabyte on 32-bit platforms!).  Maybe a structure’s fields just need to be reordered to pack better.  Maybe we have large, duplicated tables lying around.

For discovering whether structures are wasting space, I recommend pahole.  Specifically, I recommend using my fork of pahole, which contains some build fixes for non-Fedora systems and some improvements so that pahole copes better with modern C++ debug information.  My experience with pahole suggests that its options for selecting only structures with holes are buggy and unhelpful, so you’re going to have to examine the full output.  Using the -M option to avoid printing out all the methods on individual objects is tremendously useful.
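
For reference, pahole’s report looks roughly like this.  The structure here is made up, but the hole annotations and the trailing summary are the parts to scan for:

struct Example {
        void *                     ptr;                  /*     0     4 */
        char                       tag;                  /*     4     1 */

        /* XXX 3 bytes hole, try to pack */

        uint32_t                   value;                /*     8     4 */

        /* size: 12, cachelines: 1, members: 3 */
        /* sum members: 9, holes: 1, sum holes: 3 */
        /* last cacheline: 12 bytes */
};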

The second, but less fruitful, strategy is to look for large functions, rather than large objects:

readelf -sW dist/bin/libxul.so | awk '$4 == "FUNC" { print }' | sort -k 3 -n -r | head -n 50 | c++filt

This is less effective, because functions are often harder to trim down: there’s some amount of work that needs to be done and reducing that is difficult. The biggest functions also tend to be quite a bit smaller than the biggest objects in the program, so there’s less room for winning space back.

Even so, maybe you could find cases of excessive inlining; tracking those down takes a little more work than just looking at the symbol table, but the principles are the same.  Or you could find cases where lookup tables should be used instead of long if-else chains.  Or macro-constructed case statements could be made smaller.
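
Here is a sketch of that lookup-table transformation, with hypothetical names:

// Before: every comparison is code in .text.
const char* StatusNameChained(int aStatus) {
    if (aStatus == 0) return "ok";
    else if (aStatus == 1) return "warning";
    else if (aStatus == 2) return "error";
    else return "unknown";
}

// After: the mapping lives in a small read-only table, and the code
// shrinks to a bounds check and a load.
static const char* const kStatusNames[] = { "ok", "warning", "error" };

const char* StatusNameTable(int aStatus) {
    if (aStatus < 0 || aStatus >= 3) return "unknown";
    return kStatusNames[aStatus];
}

With three cases the difference is noise, but over dozens of cases (especially macro-generated ones) the table version can be substantially smaller.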

The third strategy recognizes that sometimes individual functions or objects aren’t that large, but there are so many of them that they wind up costing you a significant amount of space.  For instance, if you do:

readelf -sW dist/bin/libxul.so | c++filt | grep '::Read' | grep StandardURLParams

you’ll notice that IPDL codegen has created several copies of the exact same function.  Perhaps you could find a way to teach the IPDL compiler not to do that.  Or maybe there are several copies of functions that differ only slightly; find a way to merge those functions together.  Templates are great places to look for code duplication; maybe they could be restructured to duplicate less.
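
One sketch of such a restructuring, with hypothetical types rather than actual IPDL code: hoist the type-independent work out of the template so the compiler emits it once, instead of once per instantiation.

#include <string.h>

// A made-up stand-in for a deserialization buffer.
struct Message {
    const char* cursor;
    size_t remaining;
};

// The type-independent part: defined once, out of line.
bool ReadBytes(Message* aMsg, void* aOut, size_t aLen) {
    if (aMsg->remaining < aLen)
        return false;
    memcpy(aOut, aMsg->cursor, aLen);
    aMsg->cursor += aLen;
    aMsg->remaining -= aLen;
    return true;
}

// The template shrinks to a thin wrapper; instantiating it for dozens of
// types no longer duplicates the bounds checking and copying.
template <typename T>
bool ReadParam(Message* aMsg, T* aResult) {
    return ReadBytes(aMsg, aResult, sizeof(T));
}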

A word of caution: link-time optimization, which is used on most of our major platforms, does essentially the same thing as some of the aforementioned by-hand merging strategies.  So just because you get some huge win locally doesn’t mean that a release build is going to exhibit the same win.  (I would argue that the by-hand merging is beneficial in any event, but that is a topic for some other time.)

Finally, you need to decide what is a reasonable amount of space savings for you to try to hunt down.  There are likely several thousand places in Gecko where we could win back ten or twenty bytes.  But finding them all is liable to be difficult, and changing them to effect the savings may require significant amounts of code gymnastics that not all reviewers would accept.  My personal threshold is somewhere around 10K and varies with how difficult the patch is to write.  If I can save 5K with a two-line, fairly obvious patch, I’ll do it.  If I can save 12K, but the patch is more involved, it goes on the todo list for a rainy day.


17
Jan 14

packing structures for fun and profit

When the DOM bindings code first started generating information for the JavaScript JIT about getters and setters, the generated information was rather sparse. JSJitInfo, the container for such information, looked like this:

struct JSJitInfo {
    JSJitPropertyOp op;
    uint32_t protoID;
    uint32_t depth;
    bool isInfallible;
    bool isConstant;
};

On a 32-bit platform, sizeof(JSJitInfo) was 16. You could shrink some of the fields, and even pack things into bitfields, but there weren’t that many getters and setters, so it probably wasn’t worth it.

Of course, nothing ever gets smaller: the bindings code started generating these structures for methods, the amount of information the bindings code wanted to communicate to the JIT expanded, and some of the parallel JS work started using the structure for its own purposes. And a bunch more interfaces got converted to WebIDL, so many more of these “little” structures were generated. Eventually, we wound up with:

struct JSJitInfo {
    enum OpType { ... };
    enum AliasSet { ... };
    union {
        JSJitGetterOp getter;
        JSJitSetterOp setter;
        JSJitMethodOp method;
    };
    uint32_t protoID;
    uint32_t depth;
    OpType type;
    bool isInfallible;
    bool isMovable;
    AliasSet aliasSet;
    bool isInSlot;
    size_t slotIndex;
    JSValueType returnType;
    const ArgType* const argTypes;
    JSParallelNative parallelNative;
};

This structure has several issues:

  • There’s wasted space after the isMovable and isInSlot fields, since the aliasSet and slotIndex fields need to be aligned on 4-byte boundaries (for 32-bit platforms). There’s even more wasted space on 64-bit platforms.
  • It’s less obvious, but JSValueType is only a single byte, so there’s wasted space after the returnType field so argTypes can be properly aligned.
  • OpType and AliasSet both have only a few values, yet take up four bytes each.
  • The argTypes field is only used for DOM methods, and even then, it isn’t used for very many.
  • The parallelNative field is never used for any of the DOM bindings code.
  • We’re unlikely to have four billion prototypes in the bindings. The web platform is large, but not that large. Similarly, we are unlikely to have an inheritance graph that’s four billion layers deep, or even several thousand layers deep.

This definition weighed in at 44 bytes on a 32-bit system. With 7k+ of these being generated by the bindings code, and more being added every release cycle, now seemed like a worthwhile time to slim these structures down.
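
To see where those 44 bytes come from, here is the 32-bit layout spelled out, assuming 4-byte pointers, 4-byte enums, and a 4-byte size_t:

union { getter; setter; method; }  /* offset  0, size 4 */
protoID                            /* offset  4, size 4 */
depth                              /* offset  8, size 4 */
type                               /* offset 12, size 4 */
isInfallible                       /* offset 16, size 1 */
isMovable                          /* offset 17, size 1 */
/* 2-byte hole: aliasSet needs 4-byte alignment */
aliasSet                           /* offset 20, size 4 */
isInSlot                           /* offset 24, size 1 */
/* 3-byte hole: slotIndex needs 4-byte alignment */
slotIndex                          /* offset 28, size 4 */
returnType                         /* offset 32, size 1 */
/* 3-byte hole: argTypes needs 4-byte alignment */
argTypes                           /* offset 36, size 4 */
parallelNative                     /* offset 40, size 4 */

That is 36 bytes of actual fields plus 8 bytes of holes, for 44 bytes total.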

This work has gone on in bug 952777, bug 960653, and yet-to-be-landed bug 960109. After all of those bugs land, we’ll have something that looks like:

struct JSJitInfo {
    union {
        JSJitGetterOp getter;
        JSJitSetterOp setter;
        JSJitMethodOp method;
        JSParallelNative parallelNative;
    };
    uint16_t protoID;
    uint16_t depth;
    uint32_t type_ : 4;
    uint32_t aliasSet_ : 4;
    uint32_t returnType_ : 8;
    uint32_t isInfallible : 1;
    uint32_t isMovable : 1;
    uint32_t isInSlot : 1;
    uint32_t isTypedMethod : 1;
    uint32_t slotIndex : 12;
};

Compared to our earlier version, we’ve addressed every complaint:

  • No internally wasted space due to field alignment, on 32-bit or 64-bit platforms.
  • Enum fields don’t take up a full four bytes of space anymore; they take much closer to the minimum amount of space needed.
  • DOM methods with type information available now have a separate subclass, JSTypedMethodJitInfo, so the argTypes field is only present when needed.
  • The parallelNative field has been moved into the union, so we’re not wasting that field anymore.
  • The protoID and depth fields are now a more reasonable size.

It’s worth noting that several of these fields could be smaller, but there’s currently no reason for them to be, given that shrinking them wouldn’t make the overall structure smaller.

The final size of the structure is 12 bytes on 32-bit platforms, and 16 bytes on 64-bit platforms.  With 7k+ JSJitInfo structures, each 32 bytes smaller than before, that means we’ve saved ~220K of space in a 32-bit libxul, which is almost 1% of its entire size: an extremely pleasant win for a size optimization.  Smaller downloads, less space on your device, and less memory at runtime!

If you’re interested in examining how much space particular structures take up, you can use pahole for Linux or struct_layout for Mac. I’m not aware of any similar program for Windows, though I imagine such a beast does exist. These programs work by examining the debugging information generated by the compiler and displaying the structure layout, along with any “holes” in the structure. pahole will also tell you about cacheline boundaries, so that you can pack frequently-accessed members in the same cacheline to minimize cache misses.

Given the above (and that there have been other wins, somewhat less spectacular in nature), I’ve wondered if it isn’t worth adding some sort of annotations, either in-source or off to the side, about how big we expect certain structures to be or how much internal waste we expect them to have.  Then people modifying those structures (and the reviewers as well) would be aware that they were modifying something space-critical, and could take appropriate steps to minimize the space increase or even trim things down a bit.
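
One possible shape for such an in-source annotation, assuming C++11’s static_assert is available; this is a sketch, not established practice:

// The build breaks if JSJitInfo grows past its expected size, forcing
// whoever grew it to repack the structure or consciously update the
// expectation below.
static_assert(sizeof(JSJitInfo) <= (sizeof(void*) == 8 ? 16 : 12),
              "JSJitInfo got bigger; check for new holes before bumping this");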


09
Jan 14

on old releases

Gregory Szorc recently wrote a laundry list of reasons for ditching support for old Python releases.  I think this list of reasons to upgrade misses the larger point of providing software for other people: You do not get to tell your users what to do.

Maybe those users don’t have sufficient control over their working environments to install a new version of Python.  (Webhosts and university computers are the two examples that spring to mind immediately.  Enterprise environments have similar constraints.)  Maybe those users rely on certain APIs only available in older versions of Python and don’t wish to take an indeterminate amount of time to rewrite (retest, recertify, etc. etc.) their software.  Maybe those users rely on certain APIs that were changed to operate differently in newer releases and don’t want to engage in an extensive audit of their codebase to fix those incompatibilities.  Maybe those users are using other packages that are incompatible with later Python releases and cannot upgrade.  Maybe those users are just rationally lazy and don’t want to deal with downloading, configuring, and installing a new version of Python, plus dealing with inevitable fallout, when the old version has worked Just Fine for everything else.  The list goes on and on.  (Of course, these reasons are not applicable to just Python; feel free to substitute your favorite language X or favorite extensible program X or what have you.)

Microsoft is the best example I can think of for backwards compatibility.  New Windows releases have gone to significant lengths to make it possible to run applications for older versions of Windows, whatever faults those applications may have.  Raymond Chen’s blog documents a number of extraordinary things Windows does under the hood to make outright buggy and/or undocumented-internals-groveling programs that worked under previous versions of Windows still work under newer ones.  You can, of course, argue that this has taken significant engineering effort that could have been put to use doing “better” things.  But Microsoft’s evaluation of “better” clearly includes “how much pain will this inflict on our customers?”

And that’s the point: you are trading off some perceived (and make no mistake, it is perceived) benefit to yourself as a developer of software X against the agony of some unknown number of users as your changes break their world.  Maybe you’ve decided that this agony is small enough: I’ve seen some great examples of this from the DOM/Content folks as they install Telemetry probes to measure how many users might be impacted by backwards-incompatible changes.  Maybe you’ve decided that you have a small enough community of users and they are all enthusiastic enough to adopt whatever you decide to do, even if it breaks things.  (You might be surprised at just how “small” your community is, though.)  Maybe you’ve decided that ditching the old stuff really is necessary and appropriate (hi Australis!).  Maybe you’ve decided that you simply don’t care about the agony of your users, or that the sharp spike in curses uttered against your household doesn’t matter.  (If you take this last approach, please don’t write any software that I use.)

I realize that using the new shiny stuff in Python, or C++, or whatever generally makes life nicer as a developer.  But I think developers (myself included) tend to systematically underestimate the amount of agony that user-facing changes cause, even when we know we are prone to doing so.