Feb 13

gcc version comparison, part 1.5/n: corrections

In my previous post, I discussed the size of libxul as compiled by various versions of GCC.  Due to some configuration quirks, it turns out that the comparison was flawed.

To recap: libxul as compiled by GCC versions 4.5-4.7 contained, among other things, vtables and their associated relocations for classes that were never instantiated.  I theorized that some compiler optimization was responsible for removing them, and that this optimization must have gotten disabled in those versions.  Thinking about it afterwards, I realized there was a simple way to check this theory: examine the object files for the vtables.  If the theory held, some objects compiled by versions 4.5-4.7 would have the vtables, and no objects from versions 4.4 and 4.8 would contain them.  So let’s check, using nsIDOMSVGTextElement as an example:

[froydnj@cerebro froydnj]$ for d in 4 5 6 7 8; do
  for o in $(find build-mozilla-gcc-4${d}/ -name '*.o'); do
    if readelf -sW "$o" 2>/dev/null | c++filt | grep -q 'vtable for nsIDOMSVGTextElement'; then
      echo "$o"; readelf -sW "$o" | c++filt | grep 'vtable for nsIDOMSVGTextElement'
      break  # the first matching object per build is enough
    fi
  done
done
   971: 0000000000000000   856 OBJECT  WEAK   HIDDEN  450 vtable for nsIDOMSVGTextElement
  1241: 0000000000000000   856 OBJECT  WEAK   HIDDEN  676 vtable for nsIDOMSVGTextElement
  1021: 0000000000000000   856 OBJECT  WEAK   HIDDEN  498 vtable for nsIDOMSVGTextElement
  1075: 0000000000000000   856 OBJECT  WEAK   HIDDEN  533 vtable for nsIDOMSVGTextElement
   831: 0000000000000000   856 OBJECT  WEAK   HIDDEN  532 vtable for nsIDOMSVGTextElement

So all versions of the compiler are generating the vtables that are sometimes present and sometimes not in the compiled libxul.  Why do the vtables sometimes disappear?

The linker on Linux systems has a --gc-sections option that eliminates unused sections from the final output file, using a form of mark and sweep garbage collection.  Normally, this is not terribly effective, since all of your program code goes into .text (resp. data into .data and so forth), and something in .text ought to be getting used.  But Mozilla is compiled with the options -ffunction-sections and -fdata-sections; -ffunction-sections gives each function its own uniquely named section and -fdata-sections does a similar thing for variables.  Using --gc-sections with the linker, then, effectively eliminates unused functions and/or variables that the compiler can’t prove are unused.  (The compiler can eliminate unused static variables from a compilation unit, for instance, but eliminating unused variables that are visible outside of a compilation unit requires the linker’s help.)  And indeed, the linking process on Linux uses this --gc-sections option.

…most of the time.  Depending on the vagaries of the GCC compiler version and the version of the linker being used, using --gc-sections can impede the debugging experience.  So bug 670659 added a check to disable --gc-sections if using that option altered the debugging information in unhelpful ways.

You can probably see where this is going: on my machine, GCC versions 4.5-4.7 failed this check and so the --gc-sections option was not used with those versions.  (GCC 4.8 actually wound up bypassing the check altogether.)  Unfortunately, compiling things so the --gc-sections option is consistently used is difficult because of how configure.in is structured.
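The interaction between -ffunction-sections and --gc-sections described above can be reproduced on a toy program. This is only a sketch, assuming gcc, GNU ld and nm are installed; the function name gc_sections_demo, the file demo.c and the symbols used() and unused() are all made up for illustration and have nothing to do with the Mozilla build.

```shell
#!/bin/sh
# Hypothetical demo: every name here (demo.c, used, unused) is invented.
gc_sections_demo() {
    workdir=$(mktemp -d) || return 1
    cat > "$workdir/demo.c" <<'EOF'
int used(void) { return 1; }
int unused(void) { return 2; }   /* visible outside the file, never called */
int main(void) { return used() - 1; }
EOF
    # Give every function and variable its own section (.text.used, ...).
    gcc -O0 -ffunction-sections -fdata-sections \
        -c "$workdir/demo.c" -o "$workdir/demo.o" || return 1
    # Link once normally and once with linker garbage collection.
    gcc "$workdir/demo.o" -o "$workdir/plain" || return 1
    gcc -Wl,--gc-sections "$workdir/demo.o" -o "$workdir/gc" || return 1
    # Report whether the never-called function survived each link.
    for bin in plain gc; do
        if nm "$workdir/$bin" | grep -q ' T unused$'; then
            echo "$bin: unused() kept"
        else
            echo "$bin: unused() dropped"
        fi
    done
    rm -rf "$workdir"
}
```

On a typical GNU toolchain, calling gc_sections_demo should report that unused() is kept in the plain link and dropped in the --gc-sections link: the compiler cannot remove an externally visible function by itself, but the linker can once the function lives in its own section.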

Lesson learned: double-check your experimental setup before analyzing your results!  Make sure everything’s being done consistently between your test configurations so your measurements accurately reflect differences in what you’re measuring.

Feb 13

gcc version comparison, part 1/n: libxul sizes

Some questions on #perf this morning got me wondering about how different versions of GCC compared in terms of the size of libxul.  (I have a lot of versions of GCC lying about for a sekret performance comparison project, so merely comparing sizes was pretty straightforward.)

The GCC versions I used are listed in the table below.  The Mozilla sources were from mozilla-central r120478, compiled with --disable-debug --disable-debug-symbols --enable-optimize.  The target was x86-64 Linux, the build did not use PGO, and the system linker, GNU ld 2.20.1, was used. Here’s what the size command has to say about libxul in each case; all sizes are in bytes:

GCC version  Text size  Data size  Bss size  .text section size  .eh_frame section size
4.4.7         39120354    3410456   1611420            22969414                 4212924
4.5.4         44833935    3791400   1625996            23449960                 7481052
4.6.3         42819600    3774272   1625996            22970408                 6467652
4.7.2         42103108    3769576   1631244            22297992                 6519596
4.8 HEAD      39638390    3415424   1617260            21300806                 6209220

The terms “text”, “data”, and “bss” aren’t just the similarly-named sections in binaries. “text” encompasses all code and read-only sections, so things like string constants (!) would be included in this number. “data” is everything non-constant that’s stored on disk: tables of function pointers, tables of non-constant data, and so forth. “bss” is everything that’s initialized to zero and can therefore be allocated by the system at program run time. I’ve provided the .text section sizes as a more useful (IMHO) number for the purposes of this comparison.
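The post does not say how the per-section columns were collected. One plausible way, assuming GNU binutils, is size -A (SysV format), which prints one line per section with its name and size. The helper name section_size is made up for this sketch; it only filters that listing.

```shell
#!/bin/sh
# Hypothetical helper: reads `size -A` output on stdin and prints the size
# (second column) of the section whose name is given as the first argument.
section_size() {
    awk -v name="$1" '$1 == name { print $2 }'
}
```

Assuming the build directory layout used elsewhere in this post, usage would look something like size -A build-mozilla-gcc-45/dist/bin/libxul.so | section_size .eh_frame.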

If you look at just the size of the .text section (that is, actual compiled code), there’s not much variation between the compiler versions. 4.5 is the outlier here, with a ~2% increase over 4.4, but 4.6 is back to 4.4’s codesize and 4.8 is smaller still. Whether and how these differences in code size translate into a difference in performance will have to wait until another blog post. So what’s with size reporting such a huge jump in “text” size for 4.5 through 4.7?

The .eh_frame section sizes help explain this increase. The corresponding .eh_frame_hdr sections show similar percentage-wise increases, but the absolute increases are somewhat smaller there, so I opted to not show data for those. GCC 4.5 started emitting unwind data for function epilogues and does so unconditionally whenever unwind data is emitted. This data is not needed for normal operation, since you never have to unwind the stack from epilogues. However, for unwinding the stack from arbitrary points in the program (e.g. as a sampling profiler similar to oprofile or perf might do), such data is absolutely necessary. (You could fake it by parsing the instruction stream, but that gets very messy very quickly. Been there, done that, don’t want to do it again.) So, extra unwind data leads to bigger section sizes.  No surprises there.
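To put numbers on the two section-size columns, here is a small sketch that recomputes the change in .text and .eh_frame relative to GCC 4.4.7. The byte counts are copied from the table above; the function name section_growth is made up, and the label 4.8 stands for the 4.8 HEAD row.

```shell
#!/bin/sh
# Columns in the embedded data: GCC version, .text bytes, .eh_frame bytes.
# The first row (4.4.7) is the baseline for the percentages.
section_growth() {
    awk '
        NR == 1 { text0 = $2; eh0 = $3 }
        { printf "%s %+.2f%% %+.2f%%\n", $1, ($2 - text0) * 100 / text0, ($3 - eh0) * 100 / eh0 }
    ' <<'EOF'
4.4.7 22969414 4212924
4.5.4 23449960 7481052
4.6.3 22970408 6467652
4.7.2 22297992 6519596
4.8 21300806 6209220
EOF
}
```

For 4.5.4 this prints +2.09% for .text against +77.57% for .eh_frame, which shows how much more the unwind data moved than the code did.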

Before getting to other sources of the “text” size increase, we need to examine another interesting statistic: the data size increase seen in 4.5-4.7.  Why should different versions of the compiler differ so much in an essentially static figure?  I generated easy-to-compare lists of symbols from each version:

readelf --syms -W build-mozilla-gcc-${version}/dist/bin/libxul.so \
  | gawk '$4 == "OBJECT" && $7 != "UND" && $5 != "WEAK" {printf("% 6d %s\n", $3, $8); }' \
  | sort -n -k 1 -r > gcc-${version}-all-syms.txt

and diff'd the 4.4 and the 4.5 version. (Comparing 4.4 to 4.6 or 4.7 provides roughly the same data, and starting with a base of 4.7 and comparing to 4.8 provides the same information in the reverse direction.) While there were a few instances of user-specified variables that the compiler didn’t eliminate, the bulk of the hunks of the diff looked like this:

@@ -721,6 +791,10 @@
   1080 vtable for nsPrintSettings
   1080 keywords
   1080 sip_tcp_conn_tab
+  1072 vtable for js::ion::LIRGeneratorX86Shared
+  1072 vtable for js::ion::MInstructionVisitor
+  1072 vtable for js::ion::LIRGeneratorShared
+  1072 vtable for js::ion::LIRGeneratorX64
   1072 vtable for js::ion::LIRGenerator
   1072 vtable for nsBox
   1064 vtable for nsBaseWidget


@@ -842,6 +934,10 @@
    864 vtable for imgRequestProxy
    864 mozilla::dom::NodeBinding::sAttributes_specs
    864 g_sip_table
+   856 vtable for nsIDOMSVGTextPositioningElement
+   856 vtable for nsIDOMSVGFECompositeElement
+   856 vtable for nsIDOMSVGTSpanElement
+   856 vtable for nsIDOMSVGTextElement
    855 sLayerMaskVS
    848 vtable for nsJARURI
    848 vtable for nsXMLHttpRequestUpload

And if you add up all the sizes of the vtables we’re now retaining:

diff -u gcc-44-all-syms.txt gcc-45-all-syms.txt \
  | c++filt | egrep '^\+' \
  | grep vtable | awk '{ sum += $2 } END { print sum }'

you get a total of about 325K, which accounts for a good chunk of the 375K difference between GCC 4.4’s generated data and GCC 4.5’s generated data.
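For reference, the data-size difference itself can be recomputed from the table. This is just arithmetic on the “data” column for 4.4.7 and 4.5.4; the function name data_growth is made up for the sketch.

```shell
#!/bin/sh
# "data" column from the table: GCC 4.5.4 minus GCC 4.4.7, in bytes.
data_growth() {
    bytes=$(( 3791400 - 3410456 ))
    echo "$bytes bytes, about $(( bytes / 1024 )) KiB"
}
```

That works out to 380944 bytes, or roughly 372 KiB, which is the figure rounded to 375K above.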

How did GCC 4.4 make the vtables go away? [UPDATE: There’s a simple explanation for what happened here.] I haven’t analyzed the code, but I can see two possibilities. The first is that the compiler devirtualizes all the function calls associated with those classes and can tell that instances of the classes never escape outside of the library. And if you don’t have virtual function calls, you don’t need a vtable. As a second possibility the compiler can see that instances of the associated classes are never created. This is probably what happened for all the nsIDOM* vtables in the example hunk above. So the vtables are never referenced and discarded at link time, or never generated for any compilation unit in the first place. Whether these suspicions are correct or there’s some other mechanism at work, the key point is that 4.5-4.7 lost the ability to do this in some (all?) cases and dramatically increased data sizes as a result.

Also, since 4.5-4.7 are generating spurious vtables, there are a lot of unnecessary relocations associated with those tables: the values of the function pointers in the vtables can’t be known until the binary is loaded, so relocations are necessary. This increase in relocations can be partially seen in the “text” numbers in the table above (relocations are constant data…). Going from 4.4 to 4.5 added about 1MB of relocation data and 4.8 benefited by eliminating the need for those extra relocations.

Between the changes in .text section size, the extra relocations, and the extra .eh_frame information, we’ve accounted for a good chunk of the fluctuations seen in the “text” and “data” numbers between compiler versions.  There’s other nickel-and-dime stuff that accounts for the remainder of the fluctuations, but I’m not going to cover those bits here. This post is already long enough! Ideally, the next post will have some Talos performance comparisons.
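As a rough check on that accounting, the sketch below splits the 4.4.7 to 4.5.4 growth in the “text” number into the pieces the table exposes. All byte counts come from the table; the function name text_growth_breakdown is made up, and the "rest" bucket simply lumps together everything the table has no column for (relocations, .eh_frame_hdr, and the nickel-and-dime stuff).

```shell
#!/bin/sh
# Break down the growth of the "text" figure between GCC 4.4.7 and 4.5.4.
text_growth_breakdown() {
    total=$(( 44833935 - 39120354 ))    # "text" column
    code=$(( 23449960 - 22969414 ))     # .text section
    unwind=$(( 7481052 - 4212924 ))     # .eh_frame section
    rest=$(( total - code - unwind ))   # relocations, .eh_frame_hdr, other
    echo "total=$total code=$code unwind=$unwind rest=$rest"
}
```

Of the roughly 5.7MB of growth, about 3.3MB is .eh_frame and under 0.5MB is code, leaving a bit under 2MB for relocations and everything else.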