packing structures for fun and profit

When the DOM bindings code first started generating information for the JavaScript JIT about getters and setters, the generated information was rather sparse. JSJitInfo, the container for such information, looked like this:

struct JSJitInfo {
    JSJitPropertyOp op;
    uint32_t protoID;
    uint32_t depth;
    bool isInfallible;
    bool isConstant;
};

On a 32-bit platform, sizeof(JSJitInfo) was 16. You could shrink some of the fields, and even pack things into bitfields, but there weren’t that many getters and setters, so it probably wasn’t worth it.

Of course, nothing ever gets smaller: the bindings code started generating these structures for methods, the amount of information the bindings code wanted to communicate to the JIT expanded, and some of the parallel JS work started using the structure for its own purposes. And a bunch more interfaces got converted to WebIDL, so many more of these “little” structures were generated. Eventually, we wound up with:

struct JSJitInfo {
    enum OpType { ... };
    enum AliasSet { ... };
    union {
        JSJitGetterOp getter;
        JSJitSetterOp setter;
        JSJitMethodOp method;
    };
    uint32_t protoID;
    uint32_t depth;
    OpType type;
    bool isInfallible;
    bool isMovable;
    AliasSet aliasSet;
    bool isInSlot;
    size_t slotIndex;
    JSValueType returnType;
    const ArgType* const argTypes;
    JSParallelNative parallelNative;
};

This structure has several issues:

  • There’s wasted space after the isMovable and isInSlot fields, since the aliasSet and slotIndex fields need to be aligned on 4-byte boundaries (for 32-bit platforms). There’s even more wasted space on 64-bit platforms.
  • It’s less obvious, but JSValueType is only a single byte, so there’s wasted space after the returnType field so argTypes can be properly aligned.
  • OpType and AliasSet both have only a few values, yet take up four bytes each.
  • The argTypes field is only used for DOM methods, and even then, it isn’t used for very many.
  • The parallelNative field is never used for any of the DOM bindings code.
  • We’re unlikely to have four billion prototypes in the bindings. The web platform is large, but not that large. Similarly, we are unlikely to have an inheritance graph that’s four billion layers deep, or even several thousand layers deep.

This definition weighed in at 44 bytes on a 32-bit system. With 7k+ of these being generated by the bindings code, and more being added every release cycle, now seemed like a worthwhile time to slim these structures down.

This work has gone on in bug 952777, bug 960653, and yet-to-be-landed bug 960109. After all of those bugs land, we’ll have something that looks like:

struct JSJitInfo {
    union {
        JSJitGetterOp getter;
        JSJitSetterOp setter;
        JSJitMethodOp method;
        JSParallelNative parallelNative;
    };
    uint16_t protoID;
    uint16_t depth;
    uint32_t type_ : 4;
    uint32_t aliasSet_ : 4;
    uint32_t returnType_ : 8;
    uint32_t isInfallible : 1;
    uint32_t isMovable : 1;
    uint32_t isInSlot : 1;
    uint32_t isTypedMethod : 1;
    uint32_t slotIndex : 12;
};

Compared to our earlier version, we’ve addressed every complaint:

  • No internally wasted space due to field alignment, on 32-bit or 64-bit platforms.
  • Enum fields don’t take up a full four bytes of space anymore; they take much closer to the minimum amount of space needed.
  • DOM methods with type information available now have a separate subclass, JSTypedMethodJitInfo, so the argTypes field is only present when needed.
  • The parallelNative field has been moved into the union, so we’re not wasting that field anymore.
  • The protoID and depth fields are now a more reasonable size.

It’s worth noting that several of these fields could be smaller, but there’s no reason for them to be currently, given that shrinking them wouldn’t make the overall structure smaller.

The final size of the structure is 12 bytes on 32-bit platforms, and 16 bytes on 64-bit platforms. With 7k+ JSJitInfo structures, that means we’ve saved ~220K of space in a 32-bit libxul.ย  For a 32-bit libxul, this is almost 1% of the entire size of libxul, which is an extremely pleasant win for size optimizations.ย  Smaller downloads, less space on your device, and less memory at runtime!

If you’re interested in examining how much space particular structures take up, you can use pahole for Linux or struct_layout for Mac. I’m not aware of any similar program for Windows, though I imagine such a beast does exist. These programs work by examining the debugging information generated by the compiler and displaying the structure layout, along with any “holes” in the structure. pahole will also tell you about cacheline boundaries, so that you can pack frequently-accessed members in the same cacheline to minimize cache misses.

Given the above (and that there have been other wins, somewhat less spectacular in nature), I’ve wondered if it isn’t worth adding some sort of annotations, either in-source or off to the side, about how big we expect certain structures to be or how much internal waste we expect them to have. Then people modifying those structures–and the reviewers as well–would be aware that they were modifying something space-critical, and take appropriate steps to minimize the space increase or even trim things down a bit.

15 comments

  1. Arpad Borsos

    Wow, this is very nice to read!
    Back in the days I had a dehyra analysis for this since I had some issues with pahole back then. Is it working reliably now?

    • Nathan Froyd

      pahole seems to work fine for me. I did have to write two small patches to make it compile on Debian and support debug information from newer GCCs, which I’ve sent to the maintainer.

  2. Benoit Girard

    Is there a process for finding object that are worth repacking. I just learned about this a few weeks ago from Hacker News and I’ve been trying to be more cautious. I’d like to know if there’s a good way to identify structure in GFX or Gecko where we can get another measurable win from restructuring.

    • Nathan Froyd

      pahole supposedly has options that will only print structures with holes or with holes greater than some amount. Neither one seems to work well: the former doesn’t (e.g. it says nsRefPtr has holes!) and the latter only prints out the names of structures, not their actual data layout (and also misidentifies e.g. nsRefPtr). You might be able to filter the output with some clever shell scripting.

      I am less familiar with struct_layout, but it looks like it’s pretty limited in that respect. Hacking python might be easier than hacking pahole’s C.

      I don’t know of any good process here. I’ve found stuff in the past by just grepping through pahole output for holes.

  3. “Iโ€™ve wondered if it isnโ€™t worth adding some sort of annotations, either in-source or off to the side, about how big we expect certain structures to be or how much internal waste we expect them to have.”

    static_assert(sizeof(Foo) == N, “This is a common struct. Try not to make it bigger.”);

    • You could do that, but that can quickly get irritating depending on how much stuff is in the structure and what datatypes you’re using. I guess if your structure is really size critical, you probably don’t have all that much in it in the first place…

      • Nicholas Nethercote

        I’m not sure we’re on the same wavelength here… one static_assert (or maybe two: one for 32-bit and one for 64-bit) for each size-critical struct doesn’t seem irritating.

  4. Nicholas Nethercote

    I tried running pahole on Firefox on my Linux box and got this:

    die__process_unit: DW_TAG_rvalue_reference_type (0x42) @ not handled!

    ๐Ÿ™

    • I’m guessing this is for GCC 4.8? I’ll poke at it this week, shouldn’t be hard to fix.

  5. Arpad Borsos

    Is this responsible for the ~3M win on AWSY? If so, thats wonderful. ๐Ÿ™‚

    • Nathan Froyd

      Whoa, that’d be nice if it was! But looking at the AWSY graphs, it looks like the wins are post-page load? (I’m not sure which line on which graph you’re looking at, so just guessing.) So that would be dynamic memory allocated during a run. Somebody else might be fixing some bugs too. ๐Ÿ™‚

      • Arpad Borsos

        Well yes, there is a lot of noise in AWSY, and i did refer to the post pageload misc/js measurements. Other graphs improved by even more, but its hard to tell what exactly caused the win there.

  6. Arpad Borsos

    Oh, I just found out that clang has a warning `-Wpadded` for this. Might be interesting to run the build with that. But as far as I read it, it might be a bit too verbose at times.