22 Oct 13

I got 99 problems, but they’re all due to template over-instantiation

TL;DR: A small C++ code change involving templates has a large impact (a 2% reduction in libxul code size).

nsTArray has an inheritance structure that looks like this:

template<class E>
class nsTArray : public nsTArray_Impl<E, nsTArrayInfallibleAllocator>
{ ... };

template<class E, class Alloc>
class nsTArray_Impl : public nsTArray_base<Alloc, nsTArray_CopyElements<E> >,
                      public nsTArray_TypedBase<E, nsTArray_Impl<E, Alloc> >
{ ... };

// Elements of most types are copied with memmove and friends.
// nsTArray_CopyElements can be specialized, but we will ignore those cases here.
template<class E>
struct nsTArray_CopyElements : public nsTArray_CopyWithMemutils {};

…and so forth. The separation into classes and helper classes is there to share code where possible and to let the magic of template specialization select the appropriate definitions of things at compile time.

The problem is that this worked a little too well. nsTArray_CopyElements<uint32_t> is a different class from nsTArray_CopyElements<int32_t>, even though both of them share the same base class and neither one adds extra functionality. This means that nsTArray_base must be instantiated separately for each element type, even though the behavior of nsTArray_base is completely independent of the element type.  And so methods were being unnecessarily duplicated, which impacted compile times, download size, startup and runtime performance, and so on.
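
To see the duplication concretely, here is a minimal, self-contained sketch of the same structure; the names are simplified stand-ins for the nsTArray helpers above (not the real classes), and it assumes C++11 for static_assert and <type_traits>:

#include <cstdint>
#include <type_traits>

struct CopyWithMemutils {};  // stand-in for nsTArray_CopyWithMemutils

// One empty derived class per element type...
template<class E>
struct CopyElements : public CopyWithMemutils {};

// ...and a base template keyed on the copier, standing in for nsTArray_base.
template<class Copier>
struct ArrayBase { /* element-independent bookkeeping and methods */ };

// The copier classes are distinct types, even though both are empty and
// share the same base...
static_assert(!std::is_same<CopyElements<uint32_t>,
                            CopyElements<int32_t> >::value, "distinct types");

// ...so the element-independent base is instantiated separately as well.
static_assert(!std::is_same<ArrayBase<CopyElements<uint32_t> >,
                            ArrayBase<CopyElements<int32_t> > >::value, "duplicated base");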

[A sufficiently smart toolchain could make this problem go away: the linker can recognize duplicated methods and functions at the assembly level, discard all but one instance, and then redirect all the calls to the lone instance. (Bonus question: why does it have to be done by the linker, and not the compiler?  The linker is certainly more effective at it, but there is a correctness issue as well.) MSVC calls this “identical COMDAT folding” and the gold linker on Linux implements a similar optimization called “identical code folding”. Indeed, we enable this optimization in the linker on our release builds when it’s available, precisely because it delivers a significant code size improvement. But in a cross-platform project, you can’t necessarily rely on the linker to fix up these sorts of issues.]
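
As a concrete illustration of what such folding buys you (the template below is made up for the example), the two explicit instantiations here typically compile to byte-for-byte identical machine code; gold’s --icf=all (passed as -Wl,--icf=all through the compiler driver) or MSVC’s /OPT:ICF can then keep a single copy at link time:

#include <cstddef>
#include <cstdint>
#include <cstring>

// Both instantiations do exactly the same thing at the machine level:
// copy aCount * 4 bytes.  The compiler still emits each one separately.
template<class T>
void CopyN(T* aDst, const T* aSrc, std::size_t aCount)
{
  std::memcpy(aDst, aSrc, aCount * sizeof(T));
}

// Force both instantiations into this translation unit; an ICF-capable
// linker can fold the two identical function bodies into one.
template void CopyN<uint32_t>(uint32_t*, const uint32_t*, std::size_t);
template void CopyN<int32_t>(int32_t*, const int32_t*, std::size_t);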

In our case, however, fixing the problem is straightforward. Instead of creating new classes to describe copying behavior, we’ll use template specialization to pick the appropriate class at compile time (the class that nsTArray_CopyElements would have derived from in the above scheme) and use that class directly. Then we’ll have either nsTArray_base<Alloc, nsTArray_CopyWithMemutils> (the overwhelmingly common case), or some other specialization when array elements need special treatment:

template<class E>
struct nsTArray_CopyChooser {
  typedef nsTArray_CopyWithMemutils Type;
};

// Other specializations of nsTArray_CopyChooser possible...

template<class E, class Alloc>
class nsTArray_Impl : public nsTArray_base<Alloc, typename nsTArray_CopyChooser<E>::Type >,
                      public nsTArray_TypedBase<E, nsTArray_Impl<E, Alloc> >
{ ... };
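
For illustration, a specialization for an element type that must not be moved with memmove could look like the following; both the element type and the copy helper named here are hypothetical, not the actual Gecko classes:

// Hypothetical copy helper that copies and moves elements with their
// constructors instead of memmove (illustrative only).
struct CopyWithConstructorsExample { ... };

// Hypothetical element type that keeps interior pointers and therefore
// cannot safely be relocated with memmove.
class SelfReferentialThing { ... };

template<>
struct nsTArray_CopyChooser<SelfReferentialThing> {
  typedef CopyWithConstructorsExample Type;
};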

Implementing this in bug 929494 reduced libxul’s executable code size by over 2% on Android, which is a hefty size win for such a simple change.


05 Oct 13

faster c++ builds by building bigger groups of code

There have been a lot of spectacular build changes going into the tree lately; my personal builds have gotten about 20% faster, which is no mean feat.  One smaller change that I’ve implemented in the last couple of weeks is compiling the DOM bindings and the IPDL IPC code in what we’ve been calling “unity” mode.

The idea behind unity mode is to compile a single C++ file that #includes your actual C++ source files (see the sketch after the list below).  What’s the win from this?

  • Fewer compiler processes launched.  This is a good thing on Windows, where processes are expensive; it’s even a good thing on platforms where process creation is faster.
  • Less disk I/O.  The best case is if the original C++ source files include a lot of the same files.  Compiling the single C++ file then includes those headers only once, rather than once per original C++ source file.
  • Smaller debug information.  On Linux, at least, every Gecko object file compiled with debug information is going to include information about basic types like uint32_t, FILE, and so forth.  Compiling several files together means that you cut down on multiple definitions of things in the debug information, which is good.
  • Better optimization.  The compiler is seeing more source code at once, which means it has more information to make decisions about things like inlining.  This often leads to things not getting inlined (perhaps because the compiler can see that a function is called several times across several different files, rather than one time in each of several source files), which in turn helps keep code size down.
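
As a sketch of what the unity file itself looks like (the file and source names below are hypothetical), it is nothing more than a list of #includes of the real source files:

// UnifiedExample0.cpp -- hypothetical generated unity file.
// Compiling this one file replaces separate compilations of each included
// source file, so shared headers are parsed once for the whole group and
// only one compiler process is launched.
#include "FirstGeneratedBinding.cpp"
#include "SecondGeneratedBinding.cpp"
#include "ThirdGeneratedBinding.cpp"
// ... and so on, up to one group's worth of files ...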

It’s a little like link-time optimization (LTO), except that your compiler doesn’t need to support it.  SQLite, in-tree and otherwise, already provides an option to compile everything as one big source file and claims ~5% speedup on benchmarks.

The concrete wins: the DOM bindings compile roughly 5x faster, the IPC IPDL bindings compile roughly 2x faster, libxul got 100k+ smaller on Android, and the memory required for Windows PGO builds went down by over 4%.  (The PGO memory decrease came just from building the DOM bindings in unity mode; the effect from the IPC code appears to have been negligible.)  The downside is that incremental builds get slightly slower when WebIDL or IPDL files are modified.  We tried to minimize this effect by compiling files in groups of 16, which appeared to provide the best balance between full builds and incremental builds.

The code is in moz.build and it’s not specific to the DOM bindings or IPC IPDL code; it will work on any collection of C++ source files, modulo issues with headers being included in unexpected places.  The wins are probably highest on generated code, but I’d certainly be interested in hearing what happens if other bits of the tree are compiled in unity mode.