18
Apr 13

cold page load, OS X 10.7, and talos

I’ve been looking at some Talos pageload benchmark data lately. The intention is gleaning enough information to decide whether having a separate cold page load benchmark is worthwhile. Our Talos pageload benchmark loads a number of sites a specified number of times (currently 25), records times for all of those, discards the highest time per page (usually the first page load), and averages everything else.  The working hypothesis is that the 1st and/or 2nd iteration of each site load might be worth tracking separately.

To that end, I’ve been writing a lot of shell scripts to munge benchmark numbers, running those numbers through gnuplot, and staring at the output. One of the graph styles I’ve been using is the Q-Q plot, which I use for comparing data like so:

  1. Sort the page load times for the 1st iteration;
  2. Sort the page load times for the nth iteration (usually 2nd, 5th, or 10th, the exact number doesn’t seem to matter a great deal);
  3. Plot numbers from step 1 vs. numbers from step 2 and eyeball the graph.

There’s a fair amount of data to crunch (8000 runs times 100 sites times 25 runs divided across 8 operating systems), but the most interesting thing to come out of this experiment so far is consistently slower cold page loads on OS X 10.7 on certain sites.  Graphing most sites across OS X 10.6, 10.7, and 10.8, the graphs look like this:

Q-Q plot for stackoverflow.com

Q-Q plot for reuters.com

These sorts of graphs are exactly what you’d want to see across OS versions: improving load times along warm page load, cold page load, or both.

Before pointing out the sites, it’s worth noting that the pages used in the benchmark are representative of these sites as they appeared ~1 year ago: things may have changed and the pages may have been altered slightly to avoid fetching resources off the network, etc.  The sites are:

  • 56.com
  • alibaba.com
  • bild.de
  • cnet.com
  • deviantart.com
  • etsy.com
  • en.wikipedia.org
  • filestube.com
  • foxnews.com
  • guardian.co.uk
  • huffingtonpost.com
  • hatena.ne.jp
  • icanhascheezburger.com
  • linkedin.com
  • mashable.com
  • mail.ru
  • nicovideo.jp
  • noimpactman.typepad.com
  • orange.fr
  • spiegel.de
  • thepiratebay.org
  • wsj.com
  • whois.domaintools.com

I’ll show a few representative graphs here; you can examine graphs for all the sites in the benchmark if you wish.

Q-Q plot for spiegel.de

Q-Q plot for orange.fr

Q-Q plot for huffingtonpost.com

Q-Q plot for linkedin.com

Q-Q plot for thepiratebay.org

What’s so interesting about these graphs is that the maximum cold page load time for OS X 10.6 and 10.8 is barely the minimum cold page load time for OS X 10.7.  The warm page load times are similar, too.

mozilla.com doesn’t appear in the above list, but is notable for exhibiting significantly worse cold load performance on OS 10.8:

Q-Q plot for mozilla.com

I’m probably not going to dive any deeper on this issue right now; instead, this discrepancy will get filed away under “interesting things that turn up when you look at cold load performance specifically”.  Once I reach some sort of conclusion on whether benchmarking cold page load separately is worthwhile, then I’ll come back to the interesting issues found along the way.  If anybody has any theories about why these pages exhibit this discrepancy, I’d love to hear them!


11
Apr 13

introducing mozilla/Endian.h

In the continuing effort to eliminate usage of prtypes.h from the tree, a significant obstacle is the usage of IS_LITTLE_ENDIAN and IS_BIG_ENDIAN in various places.  These macros are defined if the target platform is little-endian or big-endian, respectively.  (All of our tier-1 platforms are little-endian platforms.  Pay no attention to the big-endian ARM variants; there be dragons.)  If you search for these identifiers, you’ll find that their uses fall into three broad categories:

  1. Defining byte-swapping macros.  Various amounts of attention are paid to using compiler intrinsics for the byte swaps.
  2. Conditionally compiling byte-swapping functionality.
  3. Taking slightly different actions or using different data according to the endianness of the target.

Point 1 is bad because we’re not always using the most efficient code possible to perform the swap.  Using functions would be preferable to gain the benefit of type checking and defined argument evaluation.  Depending on where you looked in the tree, sometimes the argument was modified in-place and sometimes it was returned as a value, so consistency suffers.  And IMHO, stumbling upon:

SWAP(value);

in code is not that informative.  Am I swapping to little-endian or from big-endian or something else?  More explicit names would be good.  Point 2 is bad because #ifdef-ery clutters the code and we may not be compiling the #ifdef‘d code all the time, which may lead to bitrot.

mfbt/Endian.h, which landed last week in bug 798172, is a significant step towards addressing the first two issues above.  Endian.h provides faster, clearer functions for byte-swapping functionality and also enables the byte-swapping to be compiled away depending on the target platform.  While it doesn’t address point 3 directly, it does provide MOZ_LITTLE_ENDIAN and MOZ_BIG_ENDIAN macros as an alternative to IS_LITTLE_ENDIAN and IS_BIG_ENDIAN.  Since MOZ_LITTLE_ENDIAN and MOZ_BIG_ENDIAN are always defined, Endian.h means that previously #ifdef‘d code can now be written (where possible) as straight C++ code, making things more readable.  And there are ideas for how to address point 3 more directly.

Enough talk; what about the bits and bytes?

As previously mentioned, Endian.h #defines MOZ_LITTLE_ENDIAN and MOZ_BIG_ENDIANMOZ_LITTLE_ENDIAN is equal to 1 if we’re targeting a little-endian platform and 0 otherwise.  Likewise, MOZ_BIG_ENDIAN is equal to 1 if we’re targeting a big-endian platform and 0 otherwise.  The intent is these are legacy macros.  You shouldn’t have to use them in newly written Mozilla code, though they may come in handy for interfacing with external libraries that need endianness information.

The next major piece of functionality is a family of functions that read 16-, 32-, or 64-bit signed or unsigned quantities in a given endianness.  The intent here is to replace code written like:

v1 = SWAP(*(uint32_t*)pointer);
*(int64_t*)other_pointer = SWAP(v2);

only with clearer code that’s free of aliasing and (mis-)alignment issues:

v1 = mozilla::BigEndian::readUint32(pointer);
mozilla::BigEndian::writeInt64(other_pointer, v2);

The other read and write functions are named similarly. And of course there’s mozilla::LittleEndian::readUint32 and so forth as well. As a concession to readability (no code uses this yet, so we’re not sure how useful it is), there’s also mozilla::NetworkOrder which functions exactly the same as mozilla::BigEndian.

In an ideal world, those are all the functions that you’d need.  But looking through the code that needed to do byte-swapping, it often seemed that some sort of swap primitive was more convenient than reading or writing in defined endiannesses.  Who knows?  Maybe when the whole tree has been converted over to Endian.h, we’ll find that the swap primitives are completely unnecessary and eliminate them. However, in a partial-converted and not-quite-so-ideal world, we have byte swaps all over.

Accordingly, the last major piece of functionality deals with byte swap primitives.  But these swap primitives specify the direction in which you’re swapping, so as to make the code more self-documenting.  For instance, maybe you had:

struct Header {
  uint32_t magic;
  uint32_t total_length;
  uint64_t checksum;
} header;
fread(&header, sizeof(Header), 1, file);
header.magic = SWAP(header.magic);
header.total_length = SWAP(header.total_length);
header.checksum = SWAP64(header.checksum);

Assuming that the header was stored in little-endian order, you’d use Endian.h functions thusly:

struct Header {
  uint32_t magic;
  uint32_t total_length;
  uint64_t checksum;
} header;
fread(&header, sizeof(Header), 1, file);
header.magic = mozilla::NativeEndian::swapFromLittleEndian(header.magic);
header.total_length = mozilla::NativeEndian::swapFromLittleEndian(header.total_length);
header.checksum = mozilla::NativeEndian::swapFromLittleEndian(header.checksum);

You could write this using LittleEndian::readUint{32,64}. But it’s a little more straightforward to write it with swaps instead. In a similar fashion, there’s NativeEndian::swapToLittleEndian.

You can replace LittleEndian with BigEndian or NetworkOrder in these single-element swap functions and all the functions below with the obvious change to the behavior.

Single-element swaps solve a lot of problems. But the following coding pattern was semi-common:

void* pointer2;
...
memcpy(pointer1, pointer2, n_elements * sizeof(*pointer1));
#if defined(IS_BIG_ENDIAN)
for (size_t i = 0; i < n_elements; i++) {
  pointer1[i] = SWAP(pointer1[i]);
}
#endif

Again, this could be written with LittleEndian::readUint32 or similar. But that loses the benefits of memcpy on little-endian platforms (which are the common case for us). Depending on the type of pointer2, there might be some ugly casting and pointer arithmetic involved too. So Endian.h also includes “bulk swap” primitives:

mozilla::NativeEndian::copyAndSwapFromLittleEndian(pointer1, pointer2, n_elements);

which will do a straight memcpy on a little-endian platform and whatever copying + swapping is necessary on a big-endian platform. As you might expect by now, there’s also NativeEndian::copyAndSwapToLittleEndian. And since the related but slightly different:

uint32_t* pointer = new uint32_t[length];
...
fread(pointer, sizeof(*pointer), length, file);
#if defined(IS_BIG_ENDIAN)
for (size_t = 0; i < length; ++i) {
  pointer[i] = SWAP(pointer[i]);
}
#endif

was also semi-common, the functions NativeEndian::swapFromLittleEndianInPlace and NativeEndian::swapToLittleEndianInPlace were also provided:

uint32_t* pointer = new uint32_t[length];
...
fread(pointer, sizeof(*pointer), length, file);
mozilla::NativeEndian::swapFromLittleEndianInPlace(pointer, length);

All the NativeEndian functions are actually templates, so they’ll work with 16-, 32-, or 64-bit signed or unsigned variables. They’ll also byteswap things like wchar_t and PRUnichar, though compilation will fail if you attempt to byteswap non-integer things like doubles or pointers.

Let the converting begin!  Makoto Kato has already begun by eliminating NS_SWAP{16,32,64} and replacing them with their Endian.h equivalents.


10
Apr 13

mozIStorageService, the main thread, and you

Bug 836493 landed on inbound today.  An additional constraint is now enforced on mozIStorageService: the initial reference to it must be obtained on the main thread.  However, all references after the first can be obtained on any thread.

Seems awfully complicated; what do we gain from that change?  Two things.  The first is that there was a race in the initialization of the storage service.  Some bits of storage service initialization, like accessing preferences, could only be done on the main thread.  If the storage service was initialized on a non-main thread, it dispatched an event to the main thread to perform those initialization tasks.  Therefore, you might end up with a sequence of events like:

  1. Non-main thread requests the storage service.
  2. Storage service starts initialization, dispatches event to the main thread.  This event can’t run until after step 4, for various reasons (thread scheduling, backed-up event queue on the main thread, etc. etc.).
  3. Storage service initialization returns, handing an (incompletely initialized) object back to the caller.
  4. Caller uses the not-yet initialized storage service, leading to possible problems.

Additionally, the storage service builds an SQLite VFS for handling things like quota management.  This building happens on the calling thread and accesses prefs to make profiles that live on networked storage robust.  That’s a non-main thread preferences use which could lead to crashes.  That needs to go away so we can enforce main thread-only preferences usage.

Even though this change might make programming slightly more inconvenient, the end result is safer code for our users and (eventually) better enforcement of good coding practices.


25
Mar 13

perf workweek talos discussion

Inspired by Avi’s recent work on the Talos tscoll test, we had an extended discussion about things to address in Talos tests.  I took a picture of the whiteboard for posterity:

talos discussion whiteboard

Concerns driving our discussion:

  • Lack of confidence in Talos.  Whether through one too many dev.tree-management emails or inconsistent results on developer machines vs. Try, people don’t have confidence in Talos providing consistent, understandable results.
  • Wrong pushes identified as the cause of performance regressionsWe’ve covered this topic before.  Some people had a hard time believing this happens.  It shouldn’t, and we’d like to fix it.

Some ways to address those concerns:

    • Cold cases are just as important as warm cases.  On some tests, we might discount the numbers from the first few runs because they aren’t as smooth as the numbers from subsequent runs; we do this for our pageloader tests, for instance.  The differences come from things being initialized lazily, caches of various kinds filling up, and so forth.  But those “cold” cases are just as important as “warm” cases: In scenarios where the user restarts the web browser repeatedly throughout the day or memory pressure causes the browser to be killed on mobile devices, necessitating restarts when switching to it, the cold cases are the only ones that matter.  We shouldn’t be discounting those numbers.  Maybe the first N runs should constitute an entirely separate test: Tp5cold or similar.  Then we can compare numbers across cold tests, which should provide more insight than comparing cold vs. warm tests.
    • Criteria for performance tests.  We have extensive documentation on the tests themselves.  The documentation on the what that’s being measured, the how the test goes about fulfilling that goal, and the why one should believe that goal is actually achieved is in short supply.  As shown above, tscroll doesn’t measure the right things.  We don’t know if tscroll is an isolated incident or symptomatic of a larger problem.  We’d like to find out.
    • In-tree performance tests.  Whether recording frames during an animation and comparing them to some known quantity, measuring overdraws on the UI, or verifying telemetry numbers collected during a test, this would help alleviate difficulties with running the tests.  Presumably, it would also make adding new, focused tests easier.  And the tools for comparing performance tests would also exist in the tree so that comparing test results would be a snap.

Better real-world testing

      .  Tp5 was held up as a reasonable performance test.  Tp5 iterates over “the top 100 pages of the internet”, loading each one 25 times and measuring page load times before moving on to the next one.  Hurrah for

testing the browser on real-world data

      rather than

synthetic benchmarks

      .  But is this necessarily the best way to conduct a test with real-world data?  Doesn’t sound very reflective of real-world usage.  What about loading 10 pages simultaneously, as one might do clicking through morning news links?  Why load each page 25 times?  Why not hold some pages open while browsing to other ones (though admittedly without network access this may not matter very much)?  Tp5 is always run on a clean profile; clean profiles are not indicative of real-world performance (similarly for disk caches).  And so on.

 

Any one of these things is a decent-sized project on its own.  Avi and I are tackling the criteria for performance tests bit first by reviewing all the existing tests, trying to figure out what they’re measuring, and comparing that with what they purport to measure.


03
Feb 13

gcc version comparison, part 1.5/n: corrections

In my previous post, I discussed the size of libxul as compiled by various versions of GCC.  Due to some configuration quirks, it turns out that the comparison was flawed.

To recap: GCC versions 4.5-4.7 contained, among other things, vtables and their associated relocations for classes that were never instantiated.  I theorized that some compiler optimization must have been responsible for this, and that this compiler optimization must have gotten disabled in those versions.  Thinking about it afterwards, it turned out that there was a simple way to check this theory: examine the object files for the vtables.  Some objects compiled by versions 4.5-4.7 must have the vtables, and no objects from versions 4.4 and 4.8 should contain the vtables.  So let’s check, using nsIDOMSVGTextElement as an example:

[froydnj@cerebro froydnj]$ for d in 4 5 6 7 8; do
  for o in $(find build-mozilla-gcc-4${d}/ -name '*.o'); do
    if readelf -sW $o |c++filt| grep -q 'vtable for nsIDOMSVGTextElement' 2>/dev/null; then
      echo $o; readelf -sW $o |c++filt|grep 'vtable for nsIDOMSVGTextElement'
    fi
  done
done
build-mozilla-gcc-44/content/svg/content/src/SVGTextElement.o
   971: 0000000000000000   856 OBJECT  WEAK   HIDDEN  450 vtable for nsIDOMSVGTextElement
build-mozilla-gcc-45/content/svg/content/src/SVGTextElement.o
  1241: 0000000000000000   856 OBJECT  WEAK   HIDDEN  676 vtable for nsIDOMSVGTextElement
build-mozilla-gcc-46/content/svg/content/src/SVGTextElement.o
  1021: 0000000000000000   856 OBJECT  WEAK   HIDDEN  498 vtable for nsIDOMSVGTextElement
build-mozilla-gcc-47/content/svg/content/src/SVGTextElement.o
  1075: 0000000000000000   856 OBJECT  WEAK   HIDDEN  533 vtable for nsIDOMSVGTextElement
build-mozilla-gcc-48/content/svg/content/src/SVGTextElement.o
   831: 0000000000000000   856 OBJECT  WEAK   HIDDEN  532 vtable for nsIDOMSVGTextElement

So all versions of the compiler are generating the vtables that are sometimes present and sometimes not in the compiled libxul.  Why do the vtables sometimes disappear?

The linker on Linux systems has a --gc-sections option that eliminates unused sections from the final output file, using a form of mark and sweep garbage collection.  Normally, this is not terribly effective, since all of your program code goes into .text (resp. data into .data and so forth), and something in .text ought to be getting used.  But Mozilla is compiled with the options -ffunction-sections and -fdata-sections; -ffunction-sections gives each function its own uniquely named section and -fdata-sections does a similar thing for variables.  Using --gc-sections with the linker, then, effectively eliminates unused functions and/or variables that the compiler can’t prove are unused.  (The compiler can eliminate unused static variables from a compilation unit, for instance, but eliminating unused variables that are visible outside of a compilation unit requires the linker’s help.)  And indeed, the linking process on Linux uses this --gc-sections option.

…most of the time.  Depending on the vagaries of the GCC compiler version and the version of the linker being used, using --gc-sections can impede the debugging experience.  So bug 670659 added a check to disable --gc-sections if using that option altered the debugging information in unhelpful ways.

You can probably see where this is going: on my machine, GCC versions 4.5-4.7 failed this check and so the --gc-sections option was not used with those versions.  (GCC 4.8 actually wound up bypassing the check altogether.)  Unfortunately, compiling things so the --gc-sections option is consistently used is difficult because of how configure.in is structured.

Lesson learned: double-check your experimental setup before analyzing your results!  Make sure everything’s being done consistently between your test configurations so your measurements accurately reflect differences in what you’re measuring.


01
Feb 13

gcc version comparison, part 1/n: libxul sizes

Some questions on #perf this morning got me wondering about how different versions of GCC compared in terms of the size of libxul.  (I have a lot of versions of GCC lying about for a sekret performance comparison project, so merely comparing sizes was pretty straightforward.)

The GCC versions I used are listed in the table below.  The Mozilla sources were from mozilla-central r120478, compiled with –disable-debug –disable-debug-symbols –enable-optimize.  The target was x86-64 Linux, the build did not use PGO, and the system linker, GNU ld 2.20.1, was used. Here’s what the size command has to say about libxul in each case; all sizes are in bytes:

GCC version Text size Data size Bss size .text section size .eh_frame section size
4.4.7 39120354 3410456 1611420 22969414 4212924
4.5.4 44833935 3791400 1625996 23449960 7481052
4.6.3 42819600 3774272 1625996 22970408 6467652
4.7.2 42103108 3769576 1631244 22297992 6519596
4.8 HEAD 39638390 3415424 1617260 21300806 6209220

The terms “text”, “data”, and “bss” aren’t just the similarly-named sections in binaries. “text” encompasses all code and read-only sections, so things like string constants (!) would be included in this number. “data” is everything non-constant that’s stored on disk: tables of function pointers, tables of non-constant data, and so forth. “bss” is everything that’s initialized to zero and can therefore be allocated by the system at program run time. I’ve provided the .text section sizes as a more useful (IMHO) number for the purposes of this comparison.

If you look at just the size of the .text section–that is, actual compiled code–there’s not much variation between the compiler versions. 4.5 is the outlier here, with a ~2% increase over 4.4, but 4.6 is back to 4.4′s codesize and 4.8 is smaller still. How or if these differences in code size translate into a difference in performance will have to wait until another blog post. So what’s with size reporting such a huge jump in “text” size for 4.5 through 4.7?

The .eh_frame section sizes help explain this increase. The corresponding .eh_frame_hdr sections show similar percentage-wise increases, but the absolute increases are somewhat smaller there, so I opted to not show data for those. GCC 4.5 started emitting unwind data for function epilogues and does so unconditionally whenever unwind data is emitted. This data is not needed for normal operation, since you never have to unwind the stack from epilogues. However, for unwinding the stack from arbitrary points in the program (e.g. as a sampling profiler similar to oprofile or perf might do), such data is absolutely necessary. (You could fake it by parsing the instruction stream, but that gets very messy very quickly. Been there, done that, don’t want to do it again.) So, extra unwind data leads to bigger section sizes.  No surprises there.

Before getting to other sources of the “text” size increase, we need to examine another interesting statistic: the data size increase seen in 4.5-4.7.  Why should different versions of the compiler differ so much in an essentially static figure?  I generated easy-to-compare lists of symbols from each version:

readelf --syms -W build-mozilla-gcc-${version}/dist/bin/libxul.so \
  | gawk '$4 == "OBJECT" && $7 != "UND" && $5 != "WEAK" {printf("% 6d %s\n", $3, $8); }' \
  | sort -n -k 1 -r > gcc-${version}-all-syms.txt

and diff‘d the 4.4 and the 4.5 version. (Comparing to 4.5 or 4.6 provides roughly the same data, and starting with a base of 4.7 provides the same information in the reverse direction.) While there were a few instances of user-specified variables that the compiler didn’t eliminate, the bulk of the hunks of the diff looked like this:

@@ -721,6 +791,10 @@
   1080 vtable for nsPrintSettings
   1080 keywords
   1080 sip_tcp_conn_tab
+  1072 vtable for js::ion::LIRGeneratorX86Shared
+  1072 vtable for js::ion::MInstructionVisitor
+  1072 vtable for js::ion::LIRGeneratorShared
+  1072 vtable for js::ion::LIRGeneratorX64
   1072 vtable for js::ion::LIRGenerator
   1072 vtable for nsBox
   1064 vtable for nsBaseWidget

or:

@@ -842,6 +934,10 @@
    864 vtable for imgRequestProxy
    864 mozilla::dom::NodeBinding::sAttributes_specs
    864 g_sip_table
+   856 vtable for nsIDOMSVGTextPositioningElement
+   856 vtable for nsIDOMSVGFECompositeElement
+   856 vtable for nsIDOMSVGTSpanElement
+   856 vtable for nsIDOMSVGTextElement
    855 sLayerMaskVS
    848 vtable for nsJARURI
    848 vtable for nsXMLHttpRequestUpload

And if you add up all the sizes of the vtables we’re now retaining:

diff -u gcc-44-all-syms.txt gcc-45-all-syms.txt \
  | c++filt | egrep '^\+' \
  | grep vtable | awk '{ sum += $2 } END { print sum }'

you get a total of about 325K, which accounts for a good chunk of the 375K difference between GCC 4.4′s generated data and GCC 4.5′s generated data.

How did GCC 4.4 make the vtables go away? [UPDATE: There's a simple explanation for what happened here.] I haven’t analyzed the code, but I can see two possibilities. The first is that the compiler devirtualizes all the function calls associated with those classes and can tell that instances of the classes never escape outside of the library. And if you don’t have virtual function calls, you don’t need a vtable. As a second possibility the compiler can see that instances of the associated classes are never created. This is probably what happened for all the nsIDOM* vtables in the example hunk above. So the vtables are never referenced and discarded at link time, or never generated for any compilation unit in the first place. Whether these suspicions are correct or there’s some other mechanism at work, the key point is that 4.5-4.7 lost the ability to do this in some (all?) cases and dramatically increased data sizes as a result.

Also, since 4.5-4.7 are generating spurious vtables, there’s a lot of unnecessary relocations associated with those tables: the values of the function pointers in the vtables can’t be known until the binary is loaded, so relocations are necessary. This increase in relocations can be partially seen in the “text” numbers in the table above (relocations are constant data…). Going from 4.4 to 4.5 added about 1MB of relocation data and 4.8 benefited by eliminating the need for those extra relocations.

Between the changes in .text section size, the extra relocations, and the extra .eh_frame information, we’ve accounted for a good chunk of the fluctuations seen in the “text” and “data” numbers between compiler versions.  There’s other nickel-and-dime stuff that accounts for the remainder of the fluctuations, but I’m not going to cover those bits here. This post is already long enough! Ideally, the next post will have some Talos performance comparisons.


22
Jan 13

analyzing linker max vsize

mozilla-inbound is currently approval-only due to issues with Windows PGO builds.  The short explanation is that we turn on aggressive code optimization for our Windows builds.  This aggressive code optimization causes the linker than comes with Visual Studio to run out of virtual memory.  The current situation is especially problematic because we can’t increase the amount of virtual memory the linker can access (unlike last time, where we “just” moved the builds to 64-bit machines).

We don’t really have a good handle on what causes these issues (other than the obvious “more code”), but at least we are tracking the linker’s vsize and we’ll soon have pretty pictures of the same.  We hadn’t expected to have to deal with this problem for several more months.  The graph below helps explain why we’re hitting this problem a little sooner than before.  The data for this graph was taken from the Windows nightly build logs.

Win32 Linker max vsize

Notice the massive spike in October, as well as the ~100MB worth of growth in early January.  While the data is not especially fine-grained (nightly builds can include tens of changesets, and we’d really like information on the vsize growth on a per-changeset basis), looking at the biggest increases over the last ten months might prove helpful.  There have been ~300 nightly builds since we started recording data; below is a list of the top 20 daily increases in linker max vsize.  The date in the table is the date the nightly build was done; the newly-included changeset range is linked to for your perusal.

Nightly build date vsize increase (MB)
2012-05-18 282.363281
2012-10-06 103.609375
2012-10-08 90.769531
2013-01-10 49.699219
2012-06-02 49.199219
2012-10-19 32.976562
2012-12-25 32.332031
2013-01-06 32.015625
2013-01-20 30.144531
2013-01-22 27.222656
2012-10-04 19.273438
2012-05-10 18.234375
2012-11-23 17.937500
2012-08-03 17.738281
2013-01-07 17.671875
2012-09-08 17.386719
2012-12-23 17.269531
2012-12-27 17.156250
2012-11-11 17.085938
2012-12-06 17.003906

Mike Hommey suggested that trying to divine the whys and hows of extra memory usage would be a fruitless endeavor. Looking at the above pushlogs, I am inclined to agree with him. There’s nothing in any of them that jumps out. I didn’t try clicking through to individual changesets to figure out what might have added large chunks of code, though.


14
Jan 13

64-bit multiplication pitfalls

I’ve seen several instances of code recently that look something like this:

void madd(int64_t *sum, int32_t x, int32_t y)
{
  *sum += x * y;
}

Or this:

void func(int64_t);
...
  int32_t x, y = ...;
  ...
  func(x * y);

Unfortunately, neither of these cases do what the programmer intended.  The intended result was to compute the full 64-bit product from multiplying two 32-bit numbers.  What gets computed instead is the lower 32-bits of the desired product, sign-extended to 64-bits, which is quite different! The assembly produced by x86-64 GCC at -O2 for the first example looks like:

    imull   %edx, %esi    # int32_t multmp = x * y
    movslq  %esi, %rsi    # int64_t exttmp = static_cast<int64_t>(multmp)
    addq    %rsi, (%rdi)  # *sum += exttmp
    ret

If the full 64-bit product is desired, one of the arguments needs to be cast to a 64-bit value first.

void madd(int64_t *sum, int32_t x, int32_t y)
{
  *sum += static_cast<int64_t>(x) * y;
}

(The standard-ese for this is that operands are automatically promoted based on the types of the operands, not on the type of the result.  Integers smaller than int are promoted to int, which is what you want most of the time. Of course, here we’re dealing with things that are already int-sized[*], so we have to explicitly ask for promotion.)

which produces the desired:

    movslq  %esi, %rsi    # int64_t xtmp = static_cast<int64_t>(x)
    movslq  %edx, %rdx    # int64_t ytmp = static_cast<int64_t>(y)
    imulq   %rdx, %rsi    # int64_t multmp = xtmp * ytmp
    addq    %rsi, (%rdi)  # *sum += multmp
    ret

The above examples are semi-obvious instances, but when dealing with types whose sizes are not specified, similar problems occur. Consider replacing int64_t with off_t and int32_t with size_t in the example above. While such code will mostly work (most files are well under 2GB or 4GB in size), off_t and size_t do not need to be the same size: try compiling sizeof off_t with -D_FILE_OFFSET_BITS=64 on your favorite 32-bit Linux sometime.

[*] Assuming we’re on a fairly standard 32-bit or 64-bit machine, of course.


01
Nov 12

running js tests faster

I burned the tree today.  To verify my proposed fix, I had to run make check, which I don’t normally do.  Excuse me, make -srj20 check, because that’s what my development machine can handle.  Which is all fine until I got to check-jit-tests in js/src.  Which requires running nearly 30 thousand tests.  (N tests x 10 different JS engine option combinations starts to add up rather quickly…)

In serial.  A load average of 1 has never looked so pitiful.

Surely there must be a better way.  While my machine churned through these tests, one at a time, I had a chance to take a look around, and found bug 686240.  Said bug is mostly about reducing overhead by cutting down the number of processes spawned, but it does give a nod to running the tests in parallel.  But I wanted to improve the way the test harness runs first; fixing the innards of the test harness on top of that would just be gravy.

I considered modifying jit_test.py to handle spawning multiple processes and managing them.  But that’s just silly; we already have a tool for doing that: it’s called make.  So let’s use it:

diff --git a/js/src/Makefile.in b/js/src/Makefile.in
index 31dc5fb..7264cc5 100644
--- a/js/src/Makefile.in
+++ b/js/src/Makefile.in
@@ -602,7 +602,7 @@ endif
 endif

 check-jit-test::
-	$(wildcard $(RUN_TEST_PROGRAM)) $(PYTHON) -u $(srcdir)/jit-test/jit_test.py \
+	+$(wildcard $(RUN_TEST_PROGRAM)) $(PYTHON) -u $(srcdir)/jit-test/jit_test.py \
 	        --no-slow --no-progress --tinderbox --tbpl $(JITTEST_VALGRIND_FLAG) \
 	        $(DIST)/bin/js$(BIN_SUFFIX)

diff --git a/js/src/jit-test/jit_test.py b/js/src/jit-test/jit_test.py
index 7690548..e7d323b 100755
--- a/js/src/jit-test/jit_test.py
+++ b/js/src/jit-test/jit_test.py
@@ -502,6 +502,33 @@ def main(argv):
     if not OPTIONS.run_slow:
         test_list = [ _ for _ in test_list if not _.slow ]

+    if 'JS_PARALLELIZE' in os.environ and len(test_list) > 1:
+        makefile_path = tmppath('jsmakefile')
+        with open(makefile_path, 'w') as f:
+            makefile_targets = []
+            for (i, test) in enumerate(test_list):
+                target = "test%d" % i
+                subcmd = [sys.executable, sys.argv[0]]
+                if OPTIONS.tinderbox:
+                    subcmd.append('--tinderbox')
+                if OPTIONS.tbpl:
+                    subcmd.append('--tbpl')
+                if OPTIONS.ion:
+                    subcmd.append('--ion')
+                if OPTIONS.hide_progress:
+                    subcmd.append('--no-progress')
+                subcmd.append(JS)
+                subcmd.append(os.path.basename(test.path))
+                makefile_targets.append((target, subcmd))
+            f.write(".PHONY: %s\n" % ' '.join([x for (x, y) in makefile_targets]))
+            f.write("check: %s\n" % ' '.join([x for (x, y) in makefile_targets]))
+            for (target, cmdline) in makefile_targets:
+                f.write("%s:\n\t@%s\n" % (target, subprocess.list2cmdline(cmdline)))
+        make_cmd = ['make', '-f', makefile_path, 'check']
+        env = os.environ.copy()
+        del env['JS_PARALLELIZE']
+        os.execvpe(make_cmd[0], make_cmd, env)
+
     # The full test list is ready. Now create copies for each JIT configuration.
     job_list = []
     if OPTIONS.tbpl:

What the above code is doing is writing out a makefile that handles executing all the tests found by the test harness in parallel. There’s one target for each test, and then the check target depends on all of them being run. The script then runs a copy of make that executes all the commands found in this new makefile. The modification to Makefile.in is so make knows that you’re going to be executing recursive makes with that command. Credit where credit is due: I cribbed this idea from GCC, which uses it to run bits of LTO compilation in parallel.

Anyway, how fast does it go? JS_PARALLELIZE=1 time -p make -srj20 check-jit-test says:

real 539.33
user 5492.89
sys 653.84

Those times are in seconds. That’s for a --disable-optimize --enable-debug build, by the way.  An --enable-optimize --enable-debug build looks like:

real 144.20
user 1389.44
sys 308.25

So about a 10x speedup either way. Not too shabby for 30 extra lines of code!

The above patch is obviously only a proof-of-concept:

  • We leave our temporary makefile lying around;
  • Propagation of options to the sub-make needs to be expanded and cleaned up;
  • Output from the sub-make may need to be handled sanely;
  • I have no idea whether this works properly with pymake;
  • …and probably a few other things I haven’t thought about.

06
Oct 12

looking at talos differently, part 3

A few thoughts from several days of staring at these charts; I’m going to focus on the tests that generate the most email.  Because Talos generates so much email, developers are prone to ignore it.  And that’s not what we want them to do: getting a Talos email should be cause for consternation or celebration, not callous indifference.  By focusing on the tests that generate the most email at first, we’ll weed out the bulk of the redundancy in Talos emails.

Having said that:

  • The Dromaeo tests are very noisy. I understand they’re important, and we ought to have some way of testing JS/DOM/CSS performance. But consider the FF17 Dromaeo DOM tests: we can see three different changesets causing 6-7% swings in these tests. And those changesets don’t touch code anywhere near the JS/DOM bits being tested! If we’re going to keep these tests, we need to find some way of making them more stable.
  • Trace Malloc Allocs/MaxHeap/Leaks all send too much email for trivial changes: they’re often telling you about changes of tenths or even hundredths of a percent (to four significant digits!). I understand that many small changes eventually accumulate into big ones, but this seems a little excessive. There should be some sort of threshold, say a ~2% change either way, before warning emails get sent.
  • The Number of Constructors…numbers have been identical across x86 and x86-64 Linux the last three release cycles and the numbers keep going up. In the abstract, sure, x86 and x86-64 can have different behavior, but in practice, we just don’t add static constructors dependent on the word size of the platform. We should cut this back to one platform at the very least, and consider setting a threshold here as well.
  • The Tp5 No Network Row Major MozAfterPaint tests all generate a lot of email; it’s a bit of a tossup as to which one is going to stand out. Some of the numbers may be skewed due to DLBI’s cycles of landing and backouts, too. I will say that the more detailed tests, measuring Private Bytes, Main RSS, Content RSS, and %CPU, don’t identify regression candidates that our other tests don’t catch and are therefore not that useful.
  • a11y Row Major MozAfterPaint has the same problem: it’s meant to identify issues in the a11y implementation, but more often than not, winds up complaining about regressions that other tests have already caught for us.

All that is to say we could cut down the amount of email significantly with a couple of simple changes:

  • Set thresholds before email alerts are sent for the Trace Malloc tests;
  • Pick x86-64 or x86 Linux for Number of Constructors, possibly set a threshold here too;
  • Remove the specific measurements in the Tp5 test.

Other ideas:

  • The above analysis was only for Mozilla-Inbound; there are of course statistics from other trees that are sent to dev-tree-management. Maybe it’s worth splitting dev-tree-management up? Must compute statistics on what trees generate the most mail to the list.
  • The graphserver links sent in the emails are helpful; it would be even better if they featured multiple platforms. That way developers would have an easy(ier) way of assessing the usefulness of pursuing a given regression.
  • It’d be even better if regression emails weren’t sent unless there were regressions on multiple platforms. This would be a little tricky.