06
Dec 13

reading binary structures with python

Last week, I wanted to parse some Mach-O files with Python.  “Oh sure,” you think, “just use the struct module and this will be a breeze.”  I have, however, tried to do that:

class MyBinaryBlob:
    def __init__(self, buf, offset):
        self.f1, self.f2 = struct.unpack_from("BB", buf, offset)

and such an approach involves a great deal of copy-and-pasted code.  And if you have some variable-length fields mixed in with fixed-length fields, using struct breaks down very quickly. And if you have to write out the fields to a file, things get even more messy. For this experiment, I wanted to do things in a more declarative style.

I wanted to be able to say something like:

class MyBinaryBlob:
    field_names = ["f1", "f2"]
    field_kinds = ["uint8_t", "uint8_t"]

and all the necessary code to parse the appropriate fields out of a binary buffer would spring into existence. (Automagically having the code to write these objects to a buffer would be great, too.) And if a binary object contained something that would be naturally interpreted as a Python list, then I could write a minimal amount of code to do that during initialization of the object as well. I also wanted inheritance to work correctly, so that if I wrote:

class ExtendedBlob(MyBinaryBlob):
    field_names = ["f3", "f4"]
    field_kinds = ["int32_t", "int32_t"]

ExtendedBlob should wind up with four fields once it is initialized.

At first, I wrote things like:

def field_reader(fmt):
    size = struct.calcsize(fmt)
    def reader_sub(buf, offset):
        return struct.unpack_from(fmt, buf, offset)[0], size
    return reader_sub

fi = field_reader("i")
fI = field_reader("I")
fB = field_reader("B")

def initialize_slots(obj, buf, offset, slot_names, field_specs):
    total = 0
    for slot, reader in zip(slot_names, field_specs):
        x, size = reader(buf, offset + total)
        setattr(obj, slot, x)
        total += size

class MyBinaryBlob:
    field_names = ["f1", "f2"]
    field_specs = [fB, fB]

    def __init__(self, buf, offset):
        initialize_slots(self, buf, offset, self.field_names, self.field_specs)

Fields return their size along with their value to make it straightforward to add variable-sized fields, not just fixed-width fields that can be parsed by struct.unpack_from (a sketch of one such reader appears below).  This worked out OK, but I was writing a lot of copy-and-pasted constructors, which was undesirable.  Inheritance was also a little weird, since the natural implementation looked like:

class ExtendedBlob(MyBinaryBlob):
    field_names = ["f3", "f4"]
    field_specs = [fi, fi]

    def __init__(self, buf, offset):
        super(ExtendedBlob, self).__init__(buf, offset)
        initialize_slots(self, buf, offset, self.field_names, self.field_specs)

but that second initialize_slots call needs to start reading at the offset resulting from reading MyBinaryBlob’s fields. I fixed this by storing a _total_size member in the objects and modifying initialize_slots:

def initialize_slots(obj, buf, offset, slot_names, field_specs):
    total = obj._total_size
    for slot, reader in zip(slot_names, field_specs):
        x, size = reader(buf, offset + total)
        setattr(obj, slot, x)
        total += size
    obj._total_size = total

which worked out well enough.
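
As an example of what the size-returning convention buys, here is a sketch of a hypothetical variable-length reader for a length-prefixed string (the name and format are invented for illustration):

def read_pascal_string(buf, offset):
    # one length byte, followed by that many bytes of string data
    n = struct.unpack_from("B", buf, offset)[0]
    value = buf[offset + 1:offset + 1 + n]
    # return the value and the total number of bytes consumed
    return value, 1 + n

Because it has the same (buf, offset) -> (value, size) signature as the readers built by field_reader, it can sit in a field_specs list right alongside fi, fI, and fB.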

I realized that if I wanted to use this framework for writing binary blobs, I’d need to construct “bare” objects without an existing buffer to read them from. That meant the parsing had to live in some sort of static method on the class, rather than in the constructor. @staticmethod couldn’t be used in this case, because the code inside the method doesn’t know what class it is being invoked on. But @classmethod, which receives the invoking class as its first argument, seemed to fit the bill.

After some more experimentation, I wound up with a base class, BinaryObject:

class BinaryObject(object):
    field_names = []
    field_specs = []

    def __init__(self):
        self._total_size = 0

    def initialize_slots(self, buf, offset, slot_names, field_specs):
        total = self._total_size
        for slot, reader in zip(slot_names, field_specs):
            x, size = reader(buf, offset + total)
            setattr(self, slot, x)
            total += size
        self._total_size = total

    @classmethod
    def from_buf(cls, buf, offset):
        # Determine our inheritance path back to BinaryObject
        inheritance_chain = []
        pos = cls
        while pos != BinaryObject:
            inheritance_chain.append(pos)
            bases = pos.__bases__
            assert len(bases) == 1
            pos = bases[0]
        inheritance_chain.reverse()

        # Determine all the field names and specs that we need to read.
        all_field_names = itertools.chain(*[c.field_names
                                            for c in inheritance_chain])
        all_field_specs = itertools.chain(*[c.field_specs
                                            for c in inheritance_chain])

        # Create the actual object and populate its fields.
        obj = cls()
        obj.initialize_slots(buf, offset, all_field_names, all_field_specs)
        return obj

Inspecting the inheritance hierarchy at runtime makes for some very compact code. (The single-inheritance assertion could probably be relaxed to an assertion that all superclasses except the first do not have field_names or field_specs class members; such a relaxation would make behavior-modifying mixins work well with this scheme.) Now my classes all looked like:

class MyBinaryBlob(BinaryObject):
    field_names = ["f1", "f2"]
    field_specs = [fB, fB]

class ExtendedBlob(MyBinaryBlob):
    field_names = ["f3", "f4"]
    field_specs = [fi, fi]

blob1 = MyBinaryBlob.from_buf(buf, offset)
blob2 = ExtendedBlob.from_buf(buf, offset)

with a pleasing lack of code duplication.  Any code for writing can be written once in the BinaryObject class using a similar inspection of the inheritance chain.
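
As a sketch of what that writing support could look like (a minimal sketch, assuming each subclass grows a field_writers list parallel to its field_specs; field_writer, the w* helpers, and to_buf are hypothetical names, not code from the framework above):

def field_writer(fmt):
    size = struct.calcsize(fmt)
    def writer_sub(buf, offset, value):
        struct.pack_into(fmt, buf, offset, value)
        return size
    return writer_sub

wi = field_writer("i")
wI = field_writer("I")
wB = field_writer("B")

class BinaryObject(object):
    # ...from_buf and friends as above, plus a field_writers list on
    # each subclass, parallel to its field_specs...
    field_writers = []

    def to_buf(self, buf, offset):
        # walk the inheritance chain exactly as from_buf does
        inheritance_chain = []
        pos = type(self)
        while pos != BinaryObject:
            inheritance_chain.append(pos)
            pos = pos.__bases__[0]
        inheritance_chain.reverse()

        total = 0
        for c in inheritance_chain:
            for name, writer in zip(c.field_names, c.field_writers):
                total += writer(buf, offset + total, getattr(self, name))
        return total

MyBinaryBlob would then carry field_writers = [wB, wB], and the destination buffer needs to be writable (a bytearray, say), since struct.pack_into cannot write into a plain string.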

But how does parsing additional things during construction work? Well, subclasses can define their own from_buf methods:

class ExtendedBlobWithList(BinaryObject):
    field_names = ["n_objs"]
    field_specs = [fI]

    @classmethod
    def from_buf(cls, buf, offset):
        obj = BinaryObject.from_buf.__func__(cls, buf, offset)
        # do extra initialization here
        for i in range(obj.n_objs):
            ...
        return obj

The trick here is that calling obj = BinaryObject.from_buf(buf, offset) wouldn’t do the right thing: that would only parse any members that BinaryObject had, and return an object of type BinaryObject instead of one of type ExtendedBlobWithList. Instead, we call BinaryObject.from_buf.__func__, which is the original, undecorated function, and pass the cls with which we were invoked, which is ExtendedBlobWithList, to do basic parsing of the fields. After that’s done, we can do our own specialized parsing, probably with SomeOtherBlob.from_buf or similar. (The _total_size member also comes in handy here, since you know exactly where to start parsing additional members.) You can even define from_buf methods that parse a bit, determine what class they should really be constructing, and construct an object of that type instead:

R_SCATTERED = 0x80000000

class Relocation(BinaryObject):
    field_names = ["_bits1", "_bits2"]
    field_specs = [fI, fI]
    __slots__ = field_names

    @classmethod
    def from_buf(cls, buf, offset):
        obj = BinaryObject.from_buf.__func__(Relocation, buf, offset)

        # OK, now for the decoding of what we just got back.
        if obj._bits1 & R_SCATTERED:
            return ScatteredRelocationInfo.from_buf(buf, offset)
        else:
            return RelocationInfo.from_buf(buf, offset)

This hides any detail about file formats in the parsing code, where it belongs.

Overall, I’m pretty happy with this scheme; it’s a lot more pleasant than bare struct.unpack_from calls scattered about.


21
Nov 13

my git workflow

Mark Hammond recently started an etherpad about how people work with git. Rather than commenting there, I thought I’d blog about my workflow instead.

First piece: magit.  If you use emacs and git, and you don’t use magit, you are missing out.  Highly recommended.  I don’t use the git command line for common operations anymore; I do everything through magit.  magit’s interactive staging is a big improvement over git add -i: you can stage files, hunks, or individual regions selectable by normal Emacs point-and-mark.  I also really like magit’s rebasing support, as I use rebase a lot.

Second piece: git-bz-moz.  I was reluctant to use this at first, but it’s been a huge boon in posting patches directly from my editor.  Setup is pretty straightforward; I have:

[bz]
	firefox-profile = other
	default-tracker = bugzilla.mozilla.org
	default-product = General
	default-component = Core
	default-assigned-to = nfroyd@mozilla.com
	add-url-method = subject-prepend:Bug %d -

in my ~/.gitconfig, and git-bz is smart enough to go grovel through my Firefox profile to get my Bugzilla login information. Having it auto-mark bugs with the appropriate bug numbers during export is also helpful. (It’s smart enough to remove them when adding descriptions for the patches in Bugzilla.) My only complaint is that attaching patches to a bug doesn’t auto-assign the bug to you, like hg bzexport does.

Third piece: I wrote a script I call export-patches for sending stuff to try, committing to inbound, and exporting patches for uplift.  (I used to use it for formatting patches for bugzilla, but stopped doing that after learning git-bz.)  I can push things to try:

export-patches -h ${mc_repo} -t '-b do -p all -u all -t none' ${start}..${end}

or push things to inbound:

export-patches -h ${mi_repo} -r ehsan -b 1 -c ${start}..${end}

It supports per-patch reviewers, too (along with a couple of other things I won’t demonstrate here):

export-patches -h ${mi_repo} -r bz:glandium -b 1 -c ${start}..${end}

The -b 1 convention is a leftover from when I wasn’t tagging my patches with bug numbers until committing.  (The script complains if bug numbers aren’t specified on the command line for commits.)  git-bz absolved me of doing that, so I should probably fix the script.

Third-and-a-half piece: export-patches takes some pains (not as many as it could) to ensure that whatever repo I’m using gets its patch queue wiped if things fail.  Less monkeying around with mercurial commands is a win in my book.

Fourth piece: One big branch of work. I used to use separate branches for bugs. However, I found that I was working on enough things simultaneously that switching between branches, rebasing if necessary, clobbering if necessary (often), and so forth was just too disruptive for day-to-day stuff. I’ll use branches if I have really disruptive things that I can’t integrate piecemeal into my one big branch, but generally everything goes into one branch. I ensure things build locally and I make occasional efforts to ensure appropriate tests still work locally, but try is where most of my testing gets done nowadays.

Fourth-and-a-half piece: I never check out master.  I always fetch origin, and then rebase off of origin/master.  My branches all track origin/master, so magit will tell me exactly what commits I have remaining to go upstream.

Annoyances: If I commit patches and those patches then get backed out, when I next pull from mozilla-central and rebase, the patches that I pushed disappear from my branch. I haven’t looked too deeply into why this happens, but I’d really like to fix it.


05
Nov 13

ipdl syntax changes for types coming from C++

Over the weekend, I landed bug 918651, which changes the syntax for how you inform the IPDL compiler about types defined in C++.  Previously, you did:

include "mozilla/HeaderFile.h";
...
using typeFromHeaderFile;
...

The using declaration informs the IPDL compiler that typeFromHeaderFile may appear in places where types can normally appear.  The include directive tells the generated headers what to #include so that the C++ compiler knows about typeFromHeaderFile.

This scheme has a couple of drawbacks:

  • The header files from the include directives aren’t connected to the using declarations in any way.  Those headers might only include the relevant type(s) incidentally, which doesn’t help in unraveling Gecko’s include dependencies.
  • The generated IPDL headers don’t necessarily need the full definition of typeFromHeaderFile.  For structs or classes, the generated headers can get by with a simple forward declaration.  The full definition is only needed in the generated source files.  The above syntax, however, doesn’t enable any sort of forward declaration magic.

To address both of these issues, the syntax for using declarations was changed.  For structs, you should say:

using struct structFromHeaderFile from "mozilla/HeaderFile.h";

The syntax for classes is similar:

using class classFromHeaderFile from "mozilla/HeaderFile.h";

In these cases, the IPDL compiler will forward-declare the types where appropriate and only #include the header in the generated source files.  Additionally, the compiler is intelligent enough to #include the header in the generated headers if it is required. For instance, if there is a struct or a union defined in the header file that requires a struct or a class from a using declaration, the relevant header will be included in the generated header instead of the generated source file.

Finally, if you need an enum type or a typedef, you should say:

using typeFromHeaderFile from "mozilla/HeaderFile.h";

This case functions similarly to what we had before, except that the header file is now closely associated with the type; ideally, that will encourage people to use the correct header (i.e. the one that defines the type).  While you are able to use this syntax with struct or class types, you should use the using struct or using class syntax, as appropriate, so that forward declarations are generated.

There are still a few instances of include directives for C++ headers in IPDL files; those should be considered a bug, and the include directive for C++ headers should not normally be needed going forward.

This change didn’t completely address the original issue of the bug (touching headers in gfx/ causes source files in netwerk/ to rebuild), but it moved us a lot closer to fixing those sorts of issues.


05
Nov 13

the performance implications of strncpy

Last week, I was working on making Firefox compile for an OS X target on a Linux host.  As part of this effort, I ported Apple’s open-sourced ld64 linker to compile and run on a Linux host.  Since OS X is a BSD-derived operating system, ld64 made use of the strlcpy and strlcat functions, designed to be safer than the strcpy/strncpy/strcat/strncat functions.  Linux’s libc doesn’t implement strlcpy and strlcat, so I had to find replacements.  strlcpy seemed easy enough, as a presentation on maloader suggested:

size_t strlcpy(char* dst, const char* src, size_t size)
{
    dst[size - 1] = '\0';
    strncpy(dst, src, size - 1);
    return strlen(dst);
}

I cribbed strlcat from someplace else and went about my merry way.

After I got ld64 to compile, then link simple programs, and worked my way through some configure bugs for non-Android cross-compiles, I ran into a problem: the JavaScript shell was taking 8 minutes to link.  This was unacceptable; it meant libxul was going to take over an hour, maybe over two, to link, which nobody would be happy about.  The equivalent link of the JavaScript shell on my Mac mini took about two seconds.

I started investigating what was going on with perf, just checking into what ld64 was doing for parts of those 8 minutes.  99%+ of the time was being spent inside strncpy.  Hm, that’s strange.

I fiddled around with a couple different things, none of which had much impact.  Then I took a close look at the code calling strlcpy (yes, all the time in strlcpy was through this function, which should have been a red flag in the first place):

int32_t StringPoolAtom::add(const char* str)
{
	int32_t offset = kBufferSize * _fullBuffers.size() + _currentBufferUsed;
	int lenNeeded = strlcpy(&_currentBuffer[_currentBufferUsed], str, kBufferSize-_currentBufferUsed)+1;
	if ( (_currentBufferUsed+lenNeeded) < kBufferSize ) {
		_currentBufferUsed += lenNeeded;
	}
	else {
		int copied = kBufferSize-_currentBufferUsed-1;
		// change trailing '\0' that strlcpy added to real char
		_currentBuffer[kBufferSize-1] = str[copied];
		// alloc next buffer
		_fullBuffers.push_back(_currentBuffer);
		_currentBuffer = new char[kBufferSize];
		_currentBufferUsed = 0;
		// append rest of string
		this->add(&str[copied+1]);
	}
	return offset;
}

In this code, kBufferSize is 16MB, so the size parameter passed to strlcpy can be rather large compared to the size of the string being copied to the destination.

I forget exactly where I read it, but I saw some small blurb about glibc’s strncpy having the crazy behavior of null padding the destination string, rather than just appending a null terminator. If strlcpy was implemented by calling out to strncpy, then just that function above would be writing hundreds or even thousands of megabytes of zeros more than required. That would definitely slow things down!

(I later discovered that this “crazy behavior” is documented in the strncpy man page and is actually required by standards.  Indeed, the original strlcpy paper cites this problem of strncpy as a motivating factor for strlcpy.  It is the only way the performance figures they give in the paper are actually relevant to their point.  But silly me, I just assumed I knew how strncpy works rather than actually reading documentation. I am really curious how this behavior of strncpy came to be and why folks thought it was worth preserving.)

Once I fixed the strlcpy implementation to do things properly, cross link times became comparable to native link times.  And then I could think about linking libxul in a reasonable amount of time. (And I did link libxul in a reasonable amount of time, if you read through the bug. And it even runs on a Mac!)
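
For reference, here is a minimal sketch of a conforming strlcpy that skips strncpy entirely (an illustration, not necessarily the exact code I ended up with; note that the standard strlcpy contract returns the full length of src, so callers can detect truncation):

#include <string.h>

size_t strlcpy(char* dst, const char* src, size_t size)
{
    size_t srclen = strlen(src);
    if (size > 0) {
        /* copy only the bytes that fit, leaving room for the
           terminator; crucially, do not zero-pad the rest of dst */
        size_t tocopy = srclen < size ? srclen : size - 1;
        memcpy(dst, src, tocopy);
        dst[tocopy] = '\0';
    }
    return srclen;
}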

Lesson learned: don’t use strncpy!


22
Oct 13

I got 99 problems, but they’re all due to template over-instantiation

TL;DR: Small C++ code change with templates has large impact (2% libxul codesize reduction).

nsTArray has an inheritance structure that looks like this:

template<class E>
class nsTArray : public nsTArray_Impl<E, nsTArrayInfallibleAllocator>
{ ... };

template<class E, class Alloc>
class nsTArray_Impl : public nsTArray_base<Alloc, nsTArray_CopyElements<E> >,
                      public nsTArray_TypedBase<E, nsTArray_Impl<E, Alloc> >
{ ... };

// Most classes are copied with memmove and friends.
// nsTArray_CopyElements can be specialized, but we will ignore those cases here.
template<class E>
struct nsTArray_CopyElements : public nsTArray_CopyWithMemutils {};

…and so forth. The separation into classes and helper classes is there to commonify code, when possible, and to let the magic of template specialization select appropriate definitions of things at compile time.

The problem is that this worked a little too well. nsTArray_CopyElements<uint32_t> is a different class from nsTArray_CopyElements<int32_t>, even though both of them share the same base class and neither one adds extra functionality. This means that nsTArray_base must be instantiated separately for each element type, even though the behavior of nsTArray_base is completely independent of the element type.  And so methods were being unnecessarily duplicated, which impacted compile times, download size, startup and runtime performance, and so on.

[A sufficiently smart toolchain could make this problem go away: the linker can recognize duplicated methods and functions at the assembly level, discard all but one instance, and then redirect all the calls to the lone instance. (Bonus question: why does it have to be done by the linker, and not the compiler?  It’s certainly more effective, but there is a correctness issue as well.) MSVC calls this “identical COMDAT folding” and the gold linker on Linux implements a similar optimization called “identical code folding”. Indeed, we enable this optimization in the linker when it’s available on our release builds, precisely because it delivers a significant code size improvement. But in a cross-platform project, you can’t necessarily rely on the linker to fix up these sorts of issues.]

In our case, however, fixing the problem is straightforward. Instead of creating new classes to describe copying behavior, we’ll use template specialization to pick the appropriate class at compile time (the class that would have been the subclass of nsTArray_CopyElements in the above scheme) and use that class directly. Then we’ll have either nsTArray_base<Alloc, nsTArray_CopyWithMemutils> (the overwhelmingly common case), or some other specialization when array elements need special treatment:

template<class E>
struct nsTArray_CopyChooser {
  typedef nsTArray_CopyWithMemutils Type;
};

// Other specializations of nsTArray_CopyChooser possible...

template<class E, class Alloc>
class nsTArray_Impl : public nsTArray_base<Alloc, typename nsTArray_CopyChooser<E>::Type >,
                      public nsTArray_TypedBase<E, nsTArray_Impl<E, Alloc> >
{ ... };

Implementing this in bug 929494 reduced libxul’s executable code size by over 2% on Android, which is a hefty size win for such a simple change.


05
Oct 13

faster c++ builds by building bigger groups of code

There have been a lot of spectacular build changes going into the tree lately; my personal builds have gotten about 20% faster, which is no mean feat.  One smaller change that I’ve implemented in the last couple of weeks is compiling the DOM bindings and the IPDL IPC code in what we’ve been calling “unity” mode.

The idea behind unity mode is to compile a C++ file that #includes your actual C++ source files.  What’s the win from this?

  • Fewer compiler processes launched.  This is a good thing on Windows, where processes are expensive; it’s even a good thing on platforms where process creation is faster.
  • Less disk I/O.  The best case is if the original C++ source files include a lot of the same files.  Compiling the single C++ file then includes those headers only once, rather than once per original C++ source file.
  • Smaller debug information.  On Linux, at least, every Gecko object file compiled with debug information is going to include information about basic types like uint32_t, FILE, and so forth.  Compiling several files together means that you cut down on multiple definitions of things in the debug information, which is good.
  • Better optimization.  The compiler is seeing more source code at once, which means it has more information to make decisions about things like inlining.  This often leads to things not getting inlined (perhaps because the compiler can see that a function is called several times across several different files rather than one time in each of several source files).

It’s a little like link-time code optimization, except that your compiler doesn’t need to support LTO.  SQLite, in-tree and otherwise, already provides an option to compile everything as one big source file and claims ~5% speedup on benchmarks.

The concrete wins are that the DOM bindings compile roughly 5x faster, the IPC IPDL bindings compile roughly 2x faster, libxul got 100k+ smaller on Android, and that the Windows PGO memory required went down by over 4%.  (The PGO memory decrease was just from building DOM bindings in unity mode; the effect from the IPC code appears to have been negligible.)  The downside is that incremental builds when WebIDL or IPDL files are modified get slightly slower.  We tried to minimize this effect by compiling files in groups of 16, which appeared to provide the best balance between full builds and incremental builds.
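
The chunking itself is straightforward; here is a minimal sketch of the generation step (function and file names invented for illustration; the real logic lives in the build system):

def write_unified_files(sources, files_per_unified_file=16):
    # emit one generated file per chunk of sources; each generated
    # file does nothing but #include its chunk's real source files
    for i in range(0, len(sources), files_per_unified_file):
        chunk = sources[i:i + files_per_unified_file]
        name = "Unified_cpp%d.cpp" % (i // files_per_unified_file)
        with open(name, "w") as f:
            for source in chunk:
                f.write('#include "%s"\n' % source)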

The code is in moz.build and it’s not specific to the DOM bindings or IPC IPDL code; it will work on any collection of C++ source files, modulo issues with headers being included in unexpected places.  The wins are probably highest on generated code, but I’d certainly be interested in hearing what happens if other bits of the tree are compiled in unity mode.


16
Aug 13

recent android startup performance work

We’re still whittling away at the startup time on Android.  Shilpan Bhagat took a crack at speeding up the parsing of profiles.ini.  Brian Nicholson debugged a crash-tastic patch to initialize the C++ side of Firefox for Android sooner, and seems to have fixed the crashiness that plagued such approaches before.  Starting the initialization of Gecko itself sooner is important because that work can happen in parallel with the work happening in Java on multi-core devices.  And when I chatted with Sriram Ramasubramanian yesterday about some menu initialization that was showing up in profiles, he assured me that startup time is definitely being taken into account when new features are added, which is great!

Before I get to the other work that’s been going on, I need to talk about how cross-language communication works in Firefox for Android.

Firefox for Android uses three primary languages: C++ for Gecko, Java for the Android UI and related bits, and JavaScript for all sorts of things.  Java and C++ play nicely via JNI, and JavaScript and C++ talk to each other via XPCOM.  But what happens when Java needs to talk to JS or vice versa?  The solution that’s been chosen here is that Java constructs and serializes a JSON message, sends that to C++, which then notifies observers that a particular kind of message has been received.  Registered JavaScript observers (you could also do this in C++, of course, but it’s just easier to do it in JavaScript) are then invoked, parse the JSON, and do their thing.  And of course there’s a reverse path for sending things from JavaScript to Java.
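
In pseudo-Python, the round trip looks something like the following sketch (all names invented; the real plumbing is spread across Java, C++, and JavaScript):

import json

observers = {}

def register_observer(kind, callback):
    observers.setdefault(kind, []).append(callback)

def send_message(kind, data):
    # the sending side serializes the whole message...
    wire = json.dumps({"type": kind, "data": data})
    # ...the C++ layer relays the string to registered observers...
    for callback in observers.get(kind, []):
        # ...and each receiving observer parses it all over again
        callback(json.loads(wire)["data"])

Every message pays for a serialization on one side and a parse on the other, which is where the costs described below come from.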

With that being said, you might not be surprised that Chris Kitching identified JSON parsing and serialization as things slowing down startup. We spend nearly 200ms doing JSON operations on a Galaxy Nexus, just in Java, never mind JavaScript (though the JavaScript JSON bits are quite fast, because they’re backed by C++ code).  Part of this is because the interface that Android uses for JSON is not well-designed, and part of this is because the implementation could be improved (Android does not use the json.org implementation, but both could be improved).  Part of this is also because we do things like JSON-wrap single integers when there are much more efficient ways to communicate such values.  I’ve been fixing up some small hotspots (warmspots?), starting with sending telemetry information and Java’s requests for preferences information.

For some things, however, we do have to deal with JSON.  Firefox for Android twiddles with the sessionstore information before forwarding it to JavaScript, and sending the sessionstore information in any other format than JSON would be…tedious.  The twiddling itself is not particularly expensive, but the reading/parsing/serialization required around it is.  For a smallish session of six tabs, where we only have ~2KB of sessionstore JSON to deal with, reading/parsing/serializing the data takes almost 250ms on a Galaxy Nexus.  This is expensive enough that techniques like parsing it on a background thread might be worthwhile.  It’s not entirely clear whether it helps (different profiling tools say different things, single-core devices may not benefit from it, etc.), though.


15
Aug 13

better build times through configury

There’s been a flurry of activity lately on making Firefox build faster.  One thing that’s not talked about much is how tailoring which features you build, or how you build them, can make things go faster.  Below are a couple of configure options that may make your life better, depending on what platform you develop for and what your focus is.  None of these are going to help as much as glandium’s xulrunner SDK development trickery, but for those who can’t do that, for whatever reason, these may be helpful.  Add them to your mozconfig with ac_add_options, or to your manual configure invocation if you are oldschool.  (A sample mozconfig follows the list below.)

  • --disable-debug-symbols: This option isn’t appropriate if you plan on doing any C/C++ debugging. Really. But if you work more-or-less exclusively in JavaScript (or Java, for Android), this option can make your builds go much faster by making your object files smaller and link times faster. My local Android builds shave 30 minutes of total CPU time (!) with this enabled; the actual wall-clock win is somewhere on the order of 3 minutes, thanks to the wonders of parallel make. Still, 15% improvements are nothing to sneeze at.
  • --disable-crashreporter: This configure option builds less code, which is always a good thing.  Needing the crashreporter would be pretty unusual for local development, so this is probably pretty safe.
  • --disable-webrtc: Again, this option is for not building a (somewhat larger) swath of code.  May not be appropriate if you’re doing Web Audio or trying to measure startup improvements.
  • --disable-icf: This option is only useful if you’re building on Linux and you know you’re using the gold linker.  This option turns off merging of identical functions, which makes the linker run slightly faster.  (May also work on Windows with opt builds, not sure.)
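
For example, a JavaScript-focused local build might use a mozconfig like this (illustrative only; pick the options that match your own work):

# hypothetical mozconfig for a JavaScript-focused local build
ac_add_options --disable-debug-symbols
ac_add_options --disable-crashreporter
ac_add_options --disable-webrtc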

For development on Linux with GCC specifically, there are a few compiler options that may make your life better also:

  • -fno-var-tracking: This option makes the debug information for C/C++ slightly less accurate.  Variable tracking is a relatively new thing (GCC 4.6+) and tends to suck up quite a bit of compile time.  If you were debugging happily with GCC 4.5 and earlier, you may not notice any problems with turning variable tracking off.
  • -gsplit-dwarf: This separates debug information into a completely different file from the actual compiled code.  That, in turn, makes the object files smaller, primarily for the benefit of the linker.  You can read more about the motivation for splitting debug information into separate files on the GCC Wiki.  There’s also a bug about adding this by default to local developer builds.  (Recent SVN builds of clang also support this option, I think.)

Finally, if you want to dive really deep, you can play around with the options passed to the gold linker on Linux.  (And if you do, please post some numbers to that bug; we’d love to get more data on how those options work for people!)


12
Aug 13

the language lawyer: the curious constexpr conundrum

Last week, in a push that purported to eliminate a number of static constructors from the tree, I burned the Mac debug build:

In file included from SVGAnimateMotionElementBinding.cpp:13:
In file included from ../../dist/include/mozilla/dom/SVGAnimateMotionElement.h:11:
In file included from ../../../content/svg/content/src/SVGMotionSMILAnimationFunction.h:11:
In file included from ../../dist/include/nsSMILAnimationFunction.h:15:
In file included from ../../dist/include/nsSMILValue.h:10:
../../dist/include/nsSMILNullType.h:47:17: error: constexpr constructor never produces a constant expression [-Winvalid-constexpr]
MOZ_CONSTEXPR nsSMILNullType() {}
^
../../dist/include/nsSMILNullType.h:47:17: note: non-literal type 'nsISMILType' cannot be used in a constant expression

This error message caused some confusion:

18:41 < froydnj> "constexpr constructor never produces a constant expression"? what does *that* mean?
18:42 < philor> sweet, now you know what *every* compiler message sounds like to me

Searching the web for the error was also unhelpful. There was a Qt bug report from a year ago that wasn’t relevant and a smattering of pages discussing just how smart the compiler needs to be to diagnose invalid constant expressions. The closest hit was from a GCC bug on constexpr diagnostics. But even that didn’t quite tell the whole story. So what’s going on here?

I wasn’t sure what a literal type was, so I looked it up. The epsilon-close-to-the-standard draft, N3337, defines the term literal type in [basic.types] paragraph 10 (you’ll usually see this abbreviated to just [basic.types]p10 in discussions or compiler front-end comments; I’ll be abbreviating standard references from here on out):

A type is a literal type if it is:

  • a scalar type; or
  • a reference type referring to a literal type; or
  • an array of literal type; or
  • a class type (Clause 9) that has all of the following properties:
    • it has a trivial destructor,
    • every constructor call and full-expression in the brace-or-equal-initializers for non-static data
      members (if any) is a constant expression (5.19),
    • it is an aggregate type (8.5.1) or has at least one constexpr constructor or constructor template
      that is not a copy or move constructor, and
    • all of its non-static data members and base classes are of literal types.

That bit explains how we have a non-literal type in debug builds: debug builds define a no-op protected destructor. Since that destructor is user-provided and not defaulted, it is non-trivial (in the language of the standard; you can read [class.dtor]p5 for the definition of trivial destructors and [class.ctor]p5 for the definition of trivial constructors). The destructor therefore fails the first property of a literal class type, above. (In debug builds, we define a protected destructor to ensure that nobody is improperly deleting our type. But in opt builds, we don’t define the destructor, so that it becomes compiler-defaulted and optimized away properly by intelligent frontends. And that means that any static instances of nsSMILNullType don’t need static constructors.)

OK, but why is having a non-literal type a problem? The compiler ought to simply execute the constructor normally and not worry about the constexpr annotation, right? Except that there is this bit in the definition of constant expressions, [expr.const]p2: “A conditional-expression is a constant-expression unless it involves one of the following as a potentially evaluated subexpression…: …an invocation of a function other than a constexpr constructor for a literal class or a constexpr function”

Aha! Here is our real problem. We are claiming that the constructor of nsSMILNullType is constexpr, but actually executing that constructor in an expression is by definition not a constant expression. And so Clang issues an error, as it should.

Moral of the story: don’t define destructors for classes that you give constexpr constructors to…unless you can default them. Guess we might have to add MOZ_DEFAULT to MFBT after all!


02
Aug 13

I got 99 problems…and compilation time is one of them

Over the past week or so, there have been a few conversations on #developers about the ever-increasing time it takes to compile Firefox.  Compiling Firefox used to be a relatively quick affair (well, on my machine, at least) and I’ve grown dissatisfied recently with builds taking ever-longer.  All of the aforementioned discussions have boiled down to “well, we’re adding a lot of code and all that code takes longer to compile.”

Surely that can’t be the whole story.  A mozilla-central build from a day or two ago took about 24 minutes on my machine (Linux x86-64 hyperthreaded 8-core @ 2.6GHz, srcdir on SSD, objdir on mechanical hard drive).  We are not compiling 2.4 times more code than we were a year ago…or are we?  Let’s look at some numbers.

I looked at my mozilla-central repository for some convenient tags to pick from and found the FIREFOX_AURORA_${REV}_BASE set of tags.  I started compiling with mozilla-central tip, and then with FIREFOX_AURORA_24_BASE, and resolved to go as far back as I could without undue fiddling with things that wouldn’t compile.  I made it to FIREFOX_AURORA_14_BASE before hitting compilation errors with moz_free.  (19 and previous required small fiddling with js/src to set the right PYTHON, but that was easily scriptable.)  That gives me 11 markers covering over a year of development.

I didn’t try compiling things multiple times to see how much the numbers fluctuated.  I didn’t attempt to control for operating system disk caches or anything like that, just hg up -r $TAG, generated configure, ran configure --enable-optimize --disable-debug --disable-gstreamer (I haven’t bothered to get my gstreamer dependencies right on my development box) in a newly-created objdir, and recorded time for make -srj20.  The compiler was GCC 4.7 from Debian stable.

m-c tag   real         user          sys
15        13m2.025s    83m36.229s    5m51.238s
16        13m40.359s   87m56.302s    6m15.435s
17        12m58.448s   94m21.514s    6m35.121s
18        14m21.257s   112m42.839s   7m36.213s
19        15m8.415s    120m48.789s   8m8.151s
20        16m26.337s   137m34.680s   9m11.878s
21        18m55.455s   165m29.433s   10m45.208s
22        19m27.887s   185m3.242s    11m53.853s
23        21m9.002s    203m46.708s   12m51.740s
24        21m51.242s   206m49.012s   12m55.356s
tip       24m37.489s   219m27.903s   13m47.080s

Those 10-minute clobber build times were apparently the result of warm disk cache or my dreams. Or perhaps they’re due to an older compiler; I only started using GCC 4.7 recently and I’ve done some tests that show it’s a fair amount slower than GCC 4.4. Or maybe I was using a different build config then. Whatever the reason, we can see that compilation times have been increasing steadily, but can we correlate that to anything?

Let’s look at the number of C++, C, WebIDL, and IPDL source files. We include the last two because the conversion to WebIDL has been cited as a reason for increasing compilation times, and IPDL because the IPDL compiler generates a number of source files as well (roughly three source files for every IPDL file). The C++ source files are split by extension; .cpp is the usual extension inside Gecko, so separating .cpp and .cc gives some idea of how much outside C++ source code is being compiled. (Just counting source files is subject to issues, since we don’t compile everything in widget/ for every platform, but we’re counting all the source files therein as if we did. This is a rough estimate.)
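
The counts below were gathered with something along the lines of this rough sketch (hypothetical; it ignores the per-platform exclusions noted above):

import os

def count_sources(srcdir):
    counts = {".cpp": 0, ".cc": 0, ".c": 0, ".webidl": 0, ".ipdl": 0}
    for dirpath, dirnames, filenames in os.walk(srcdir):
        for name in filenames:
            ext = os.path.splitext(name)[1]
            if ext in counts:
                counts[ext] += 1
    return counts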

m-c tag   cpp    cc     c      webidl   ipdl
15        4060   577    2503   32       206
16        4117   1273   2893   34       213
17        4173   1278   2895   38       221
18        4468   1405   3123   61       227
19        4498   1408   3118   78       228
20        4586   1418   2779   169      228
21        4605   1392   2831   231      228
22        4979   1392   2980   305      228
23        5072   1393   2982   344      231
24        5101   1330   2983   413      234
tip       5211   1427   3029   436      234

First things first: we have a lot of code in the tree.

Now, there are a couple of interesting things to point out. While there are a lot of IPDL source files, the increase in IPDL can hardly be held responsible for the increase in compile time. So that’s one suspect out of the way. From 16 to 17, we barely increased the amount of code, but build times went down. That might point to the recorded times being subject to quite a bit of variability.

We added a lot of WebIDL files from 20 to 21 while holding everything else constant and build times went up quite a bit! But we added roughly the same number of WebIDL files from 21 to 22 along with a lot more code and while build times went up, they didn’t go up by the same amount: 28 minutes of user time going from 20 to 21 and ~20 minutes of user time going from 21 to 22. There are a number of reasons why this could be so: we might not be compiling all those 500 C/C++ source files we added in 22 (ICU landed off by default in 22, so that’s ~400 files right there), the WebIDL files added in 21 might have had more of an impact, compilation-wise, the timings might fluctuate quite a bit, and so on and so forth. And notice that we added the same number of WebIDL files again from 23 to 24 and build times hardly changed. Sure, WebIDL means more files to compile. But I don’t think they are adding a disproportionate amount of time.

And finally, source code file counts. We had about 7800 source code files to compile back at Aurora 15 (remember, IPDL counts as three). We have about 10600 source files to compile now on trunk. That’s about a 40% increase that corresponds to roughly a doubling in time to compile. And yes, there’s also a lot of JS and other things that have been added that increase our time to compile. But I don’t think the amount of source code is the whole story.