I got 99 problems, but they’re all due to template over-instantiation

TL;DR: Small C++ code change with templates has large impact (2% libxul codesize reduction).

nsTArray has an inheritance structure that looks like this:

template<class E>
class nsTArray : public nsTArray_Impl<E, nsTArrayInfallibleAllocator>
{ ... };

template<class E, class Alloc>
class nsTArray_Impl : public nsTArray_base<Alloc, nsTArray_CopyElements<E> >,
                      public nsTArray_TypedBase<E, nsTArray_Impl<E, Alloc> >
{ ... };

// Most classes are copied with memmove and friends.
// nsTArray_CopyElements can be specialized, but we will ignore those cases here.
template<class E>
struct nsTArray_CopyElements : public nsTArray_CopyWithMemutils {};

…and so forth. The separation into classes and helper classes are to commonify code, when possible, and to let the magic of template specialization select appropriate definitions of things at compile time.

The problem is that this worked a little too well. nsTArray_CopyElements<uint32_t> is a different class from nsTArray_CopyElements<int32_t>, even though both of them share the same base class and neither one adds extra functionality. This means that nsTArray_base must be instantiated separately for each element type, even though the behavior of nsTArray_base is completely independent of the element type.  And so methods were being unnecessarily duplicated, which impacted compile times, download size, startup and runtime performance, and so on.

[A sufficiently smart toolchain could make this problem go away: the linker can recognize duplicated methods and functions at the assembly level, discard all but one instance, and then redirect all the calls to the lone instance. (Bonus question: why does it have to be done by the linker, and not the compiler?  It's certainly more effective, but there is a correctness issue as well.) MSVC calls this "identical COMDAT folding" and the gold linker on Linux implements a similar optimization called "identical code folding". Indeed, we enable this optimization in the linker when it's available on our release builds, precisely because it delivers a significant code size improvement. But in a cross-platform project, you can't necessarily rely on the linker to fix up these sorts of issues.]

In our case, however, fixing the problem is straightforward. Instead of creating new classes to describe copying behavior, we’ll use template specialization to pick the appropriate class at compile time (the class that would have been the subclass of nsTArray_CopyElements in the above scheme) and use that class directly. Then we’ll have either nsTArray_base<Alloc, nsTArray_CopyWithMemutils> (the overwhelmingly common case), or some other specialization when array elements need special treatment:

template<class E>
struct nsTArray_CopyChooser {
  typedef nsTArray_CopyWithMemutils Type;
};

// Other specializations of nsTArray_CopyChooser possible...

template<class E, class Alloc>
class nsTArray_Impl : public nsTArray_base<Alloc, typename nsTArray_CopyChooser<E>::Type >,
                      public nsTArray_TypedBase<E, nsTArray_Impl<E, Alloc> >
{ ... };

Implementing this in bug 929494 reduced libxul’s executable code size by over 2% on Android, which is a hefty size win for such a simple change.

6 comments

  1. So was this only a win on old gcc versions (because MSVC/clang are doing this automatically)?

    • Nathan Froyd

      Where did I say MSVC/clang are doing this automatically? MSVC and clang have the same problem with duplicating the methods (unless their inliners are even more aggressive than GCC’s, which I doubt, especially MSVC); MSVC just has a linker to cover up for it.

      • It’s also possible their inliners are doing something to be clever: not unifying the functions, because the functions have to compare different as function pointers, but making them near-unified: put one function’s code at an address A, then put the second function’s code at address A – 1, with a nop at the very start of the function. (This generalizes beyond two functions, of course.) Then you get code size improvements even if the linker’s not quite as smart, and the normal linking mechanisms don’t need any change at all, and you probably get icache improvements due to better locality and all.

        I’m pretty sure I’ve read about some compilers doing this (not MSVC), and that it solves the problems MSVC’s over-zealous ICF may induce (wrt non-compliant function pointer comparison behavior).

        • Nathan Froyd

          That would be sufficiently clever, yes. I know GCC wants to do something similar for virtual function thunks, but there’ve been complications (“you want to point two names into the same function body? hahaha!”).

  2. Arpad Borsos

    Cool!
    You talked about compile time improvements; any measures?
    Also, are there any more low hanging fruit?

  3. Nathan Froyd

    This change was probably also responsible for a 2% reduction in PGO linker memory on Windows. So the change is beneficial even if your linker is sufficiently smart.