18 Apr 13

cold page load, OS X 10.7, and talos

I’ve been looking at some Talos pageload benchmark data lately. The intention is to glean enough information to decide whether having a separate cold page load benchmark is worthwhile. Our Talos pageload benchmark loads a number of sites a specified number of times (currently 25), records times for all of those loads, discards the highest time per page (usually the first page load), and averages everything else.  The working hypothesis is that the 1st and/or 2nd iteration of each site load might be worth tracking separately.
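
For concreteness, here’s roughly what that per-page scoring amounts to (a minimal sketch in C++; the real Talos harness differs in its details):

#include <algorithm>
#include <numeric>
#include <vector>

// Score one page: discard the highest of its recorded load times (usually
// the first, cold load) and average the remainder.  Assumes at least two
// recorded times.
double PageScore(std::vector<double> times)
{
  std::sort(times.begin(), times.end());
  times.pop_back();  // drop the highest time
  return std::accumulate(times.begin(), times.end(), 0.0) / times.size();
}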

To that end, I’ve been writing a lot of shell scripts to munge benchmark numbers, running those numbers through gnuplot, and staring at the output. One of the graph styles I’ve been using is the Q-Q plot, which I use for comparing data like so:

  1. Sort the page load times for the 1st iteration;
  2. Sort the page load times for the nth iteration (usually the 2nd, 5th, or 10th; the exact choice doesn’t seem to matter a great deal);
  3. Plot the numbers from step 1 vs. the numbers from step 2 and eyeball the graph (see the sketch after this list).
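
Here’s a minimal sketch of that pairing step (in C++ for consistency with the rest of this post; my actual munging was done with shell scripts, and the one-number-per-line input format is assumed):

#include <algorithm>
#include <fstream>
#include <iostream>
#include <vector>

// Read one load time per line from |path|.
static std::vector<double> ReadTimes(const char* path)
{
  std::vector<double> times;
  std::ifstream in(path);
  for (double t; in >> t; ) {
    times.push_back(t);
  }
  return times;
}

int main(int argc, char** argv)
{
  if (argc != 3) {
    return 1;
  }
  std::vector<double> first = ReadTimes(argv[1]);  // 1st-iteration times
  std::vector<double> nth = ReadTimes(argv[2]);    // nth-iteration times
  std::sort(first.begin(), first.end());
  std::sort(nth.begin(), nth.end());
  // Pair off equal ranks and emit "x y" lines that gnuplot can plot directly.
  size_t n = std::min(first.size(), nth.size());
  for (size_t i = 0; i < n; i++) {
    std::cout << first[i] << ' ' << nth[i] << '\n';
  }
  return 0;
}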

There’s a fair amount of data to crunch (8000 runs times 100 sites times 25 loads per site, divided across 8 operating systems), but the most interesting thing to come out of this experiment so far is consistently slower cold page loads on OS X 10.7 on certain sites.  For most sites, graphed across OS X 10.6, 10.7, and 10.8, the plots look like this:

Q-Q plot for stackoverflow.com

Q-Q plot for reuters.com

These sorts of graphs are exactly what you’d want to see across OS versions: load times improving from version to version for warm page loads, cold page loads, or both.

Before pointing out the affected sites, it’s worth noting that the pages used in the benchmark are snapshots of these sites as they appeared roughly a year ago: things may have changed since then, and the pages may have been altered slightly to avoid fetching resources off the network, and so forth.  The affected sites are:

  • 56.com
  • alibaba.com
  • bild.de
  • cnet.com
  • deviantart.com
  • etsy.com
  • en.wikipedia.org
  • filestube.com
  • foxnews.com
  • guardian.co.uk
  • huffingtonpost.com
  • hatena.ne.jp
  • icanhascheezburger.com
  • linkedin.com
  • mashable.com
  • mail.ru
  • nicovideo.jp
  • noimpactman.typepad.com
  • orange.fr
  • spiegel.de
  • thepiratebay.org
  • wsj.com
  • whois.domaintools.com

I’ll show a few representative graphs here; you can examine graphs for all the sites in the benchmark if you wish.

Q-Q plot for spiegel.de

Q-Q plot for orange.fr

Q-Q plot for huffingtonpost.com

Q-Q plot for linkedin.com

Q-Q plot for thepiratebay.org

What’s so interesting about these graphs is that the maximum cold page load time on OS X 10.6 and 10.8 barely reaches the minimum cold page load time on OS X 10.7.  The warm page load times are similar, too.

mozilla.com doesn’t appear in the above list, but it’s notable for exhibiting significantly worse cold load performance on OS X 10.8:

Q-Q plot for mozilla.com

I’m probably not going to dive any deeper into this issue right now; instead, this discrepancy will get filed away under “interesting things that turn up when you look at cold load performance specifically”.  Once I reach some sort of conclusion on whether benchmarking cold page load separately is worthwhile, I’ll come back to the interesting issues found along the way.  If anybody has any theories about why these pages exhibit this discrepancy, I’d love to hear them!


11 Apr 13

introducing mozilla/Endian.h

In the continuing effort to eliminate usage of prtypes.h from the tree, a significant obstacle is the usage of IS_LITTLE_ENDIAN and IS_BIG_ENDIAN in various places.  These macros are defined if the target platform is little-endian or big-endian, respectively.  (All of our tier-1 platforms are little-endian platforms.  Pay no attention to the big-endian ARM variants; there be dragons.)  If you search for these identifiers, you’ll find that their uses fall into three broad categories:

  1. Defining byte-swapping macros.  Various amounts of attention are paid to using compiler intrinsics for the byte swaps.
  2. Conditionally compiling byte-swapping functionality.
  3. Taking slightly different actions or using different data according to the endianness of the target.

Point 1 is bad because we’re not always using the most efficient code possible to perform the swap.  Functions would be preferable to macros here, for the benefits of type checking and well-defined argument evaluation.  And depending on where you looked in the tree, sometimes the swap macro modified its argument in place and sometimes it returned the swapped value, so consistency suffered as well.  Finally, IMHO, stumbling upon:

SWAP(value);

in code is not that informative.  Am I swapping to little-endian, or from big-endian, or something else?  More explicit names would be good.  Point 2 is bad because #ifdef-ery clutters the code, and because we may not be compiling the #ifdef‘d code all the time, which may lead to bitrot.

mfbt/Endian.h, which landed last week in bug 798172, is a significant step towards addressing the first two issues above.  Endian.h provides faster, clearer functions for byte-swapping functionality and also enables the byte-swapping to be compiled away depending on the target platform.  While it doesn’t address point 3 directly, it does provide MOZ_LITTLE_ENDIAN and MOZ_BIG_ENDIAN macros as an alternative to IS_LITTLE_ENDIAN and IS_BIG_ENDIAN.  Since MOZ_LITTLE_ENDIAN and MOZ_BIG_ENDIAN are always defined, Endian.h means that previously #ifdef‘d code can now be written (where possible) as straight C++ code, making things more readable.  And there are ideas for how to address point 3 more directly.

Enough talk; what about the bits and bytes?

As previously mentioned, Endian.h #defines MOZ_LITTLE_ENDIAN and MOZ_BIG_ENDIAN.  MOZ_LITTLE_ENDIAN is equal to 1 if we’re targeting a little-endian platform and 0 otherwise; likewise, MOZ_BIG_ENDIAN is equal to 1 if we’re targeting a big-endian platform and 0 otherwise.  The intent is that these be treated as legacy macros: you shouldn’t need to use them in newly written Mozilla code, though they may come in handy for interfacing with external libraries that need endianness information.
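
Because both macros are always defined, to either 0 or 1, endianness checks can be written as ordinary C++ conditions rather than #ifdef blocks.  A minimal, hypothetical sketch:

#include "mozilla/Endian.h"

// No #ifdef-ery required: both arms are compiled and type-checked, and the
// constant condition is trivially optimized away.
const char* TargetEndianness()
{
  return MOZ_LITTLE_ENDIAN ? "little-endian" : "big-endian";
}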

The next major piece of functionality is a family of functions that read 16-, 32-, or 64-bit signed or unsigned quantities in a given endianness.  The intent here is to replace code written like:

v1 = SWAP(*(uint32_t*)pointer);
*(int64_t*)other_pointer = SWAP(v2);

with clearer code that’s free of aliasing and (mis-)alignment issues:

v1 = mozilla::BigEndian::readUint32(pointer);
mozilla::BigEndian::writeInt64(other_pointer, v2);

The other read and write functions are named similarly. And of course there’s mozilla::LittleEndian::readUint32 and so forth as well. As a concession to readability (no code uses this yet, so we’re not sure how useful it is), there’s also mozilla::NetworkOrder which functions exactly the same as mozilla::BigEndian.
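
As an illustration, parsing a small big-endian header out of a byte buffer might look like the following.  (This is a hypothetical sketch: the field layout is invented, and readUint16 is one of the similarly-named functions mentioned above.)

#include <stdint.h>
#include "mozilla/Endian.h"

// A made-up 6-byte header: a 16-bit tag followed by a 32-bit length, both
// stored big-endian.  The reads work regardless of how |buffer| is aligned.
void ParseHeader(const uint8_t* buffer, uint16_t& tag, uint32_t& length)
{
  tag = mozilla::BigEndian::readUint16(buffer);
  length = mozilla::BigEndian::readUint32(buffer + 2);
}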

In an ideal world, those are all the functions that you’d need.  But looking through the code that needed to do byte-swapping, it often seemed that some sort of swap primitive was more convenient than reading or writing in defined endiannesses.  Who knows?  Maybe when the whole tree has been converted over to Endian.h, we’ll find that the swap primitives are completely unnecessary and eliminate them.  However, in a partially-converted and not-quite-so-ideal world, we have byte swaps all over.

Accordingly, the last major piece of functionality deals with byte swap primitives.  Unlike the old SWAP macros, though, these primitives specify the direction in which you’re swapping, so as to make the code more self-documenting.  For instance, maybe you had:

struct Header {
  uint32_t magic;
  uint32_t total_length;
  uint64_t checksum;
} header;
fread(&header, sizeof(Header), 1, file);
header.magic = SWAP(header.magic);
header.total_length = SWAP(header.total_length);
header.checksum = SWAP64(header.checksum);

Assuming that the header was stored in little-endian order, you’d use Endian.h functions thusly:

struct Header {
  uint32_t magic;
  uint32_t total_length;
  uint64_t checksum;
} header;
fread(&header, sizeof(Header), 1, file);
header.magic = mozilla::NativeEndian::swapFromLittleEndian(header.magic);
header.total_length = mozilla::NativeEndian::swapFromLittleEndian(header.total_length);
header.checksum = mozilla::NativeEndian::swapFromLittleEndian(header.checksum);

You could write this using LittleEndian::readUint{32,64}. But it’s a little more straightforward to write it with swaps instead. In a similar fashion, there’s NativeEndian::swapToLittleEndian.
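
For instance, the write-side counterpart of the example above might look like this (a sketch reusing the Header struct and file from before):

// Convert the header back to little-endian order before writing it out.
header.magic = mozilla::NativeEndian::swapToLittleEndian(header.magic);
header.total_length = mozilla::NativeEndian::swapToLittleEndian(header.total_length);
header.checksum = mozilla::NativeEndian::swapToLittleEndian(header.checksum);
fwrite(&header, sizeof(Header), 1, file);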

You can replace LittleEndian with BigEndian or NetworkOrder in the names of these single-element swap functions, and of all the functions below, with the obvious change in behavior.

Single-element swaps solve a lot of problems. But the following coding pattern was semi-common:

uint32_t* pointer1;
void* pointer2;
...
memcpy(pointer1, pointer2, n_elements * sizeof(*pointer1));
#if defined(IS_BIG_ENDIAN)
for (size_t i = 0; i < n_elements; i++) {
  pointer1[i] = SWAP(pointer1[i]);
}
#endif

Again, this could be written with LittleEndian::readUint32 or similar. But that loses the benefits of memcpy on little-endian platforms (which are the common case for us). Depending on the type of pointer2, there might be some ugly casting and pointer arithmetic involved too. So Endian.h also includes “bulk swap” primitives:

mozilla::NativeEndian::copyAndSwapFromLittleEndian(pointer1, pointer2, n_elements);

which will do a straight memcpy on a little-endian platform and whatever copying + swapping is necessary on a big-endian platform. As you might expect by now, there’s also NativeEndian::copyAndSwapToLittleEndian. And since the related but slightly different:

uint32_t* pointer = new uint32_t[length];
...
fread(pointer, sizeof(*pointer), length, file);
#if defined(IS_BIG_ENDIAN)
for (size_t i = 0; i < length; ++i) {
  pointer[i] = SWAP(pointer[i]);
}
#endif

was also semi-common, the functions NativeEndian::swapFromLittleEndianInPlace and NativeEndian::swapToLittleEndianInPlace are provided as well:

uint32_t* pointer = new uint32_t[length];
...
fread(pointer, sizeof(*pointer), length, file);
mozilla::NativeEndian::swapFromLittleEndianInPlace(pointer, length);

All the NativeEndian functions are actually templates, so they’ll work with 16-, 32-, or 64-bit signed or unsigned variables. They’ll also byteswap things like wchar_t and PRUnichar, though compilation will fail if you attempt to byteswap non-integer things like doubles or pointers.
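
To make that concrete, here’s a hypothetical illustration (swapToBigEndian follows the substitution rule described above):

// The same function template handles any 16-, 32-, or 64-bit integer type;
// swapping a double, say, fails at compile time.
uint16_t codepoint = mozilla::NativeEndian::swapToBigEndian(uint16_t(0x3042));
uint64_t checksum = mozilla::NativeEndian::swapToBigEndian(uint64_t(0x0123456789abcdefULL));
// double d = mozilla::NativeEndian::swapToBigEndian(3.14);  // would not compile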

Let the converting begin!  Makoto Kato has already begun by eliminating NS_SWAP{16,32,64} and replacing them with their Endian.h equivalents.


10 Apr 13

mozIStorageService, the main thread, and you

Bug 836493 landed on inbound today.  An additional constraint is now enforced on mozIStorageService: the initial reference to it must be obtained on the main thread.  However, all references after the first can be obtained on any thread.
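
In practice, code that might otherwise touch the storage service off the main thread first should arrange for an initial main-thread acquisition, along these lines (a hypothetical sketch; WarmUpStorageService is an invented name):

#include "mozIStorageService.h"
#include "nsCOMPtr.h"
#include "nsServiceManagerUtils.h"

// Call on the main thread (e.g. during startup) so the storage service is
// fully initialized before any other thread asks for it.
void WarmUpStorageService()
{
  nsCOMPtr<mozIStorageService> storage =
    do_GetService("@mozilla.org/storage/service;1");
}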

Seems awfully complicated; what do we gain from that change?  Two things.  The first is that there was a race in the initialization of the storage service.  Some bits of storage service initialization, like accessing preferences, could only be done on the main thread.  If the storage service was initialized on a non-main thread, it dispatched an event to the main thread to perform those initialization tasks.  Therefore, you might end up with a sequence of events like:

  1. Non-main thread requests the storage service.
  2. Storage service starts initialization, dispatches event to the main thread.  This event can’t run until after step 4, for various reasons (thread scheduling, backed-up event queue on the main thread, etc. etc.).
  3. Storage service initialization returns, handing an (incompletely initialized) object back to the caller.
  4. Caller uses the not-yet-initialized storage service, leading to possible problems.

The second is that the storage service builds an SQLite VFS for handling things like quota management.  That building happens on the calling thread and accesses preferences (to make profiles that live on networked storage robust), so initializing the service on a non-main thread meant non-main-thread preferences use, which could lead to crashes.  That use needs to go away so we can enforce main thread-only preferences usage.

Even though this change might make programming slightly more inconvenient, the end result is safer code for our users and (eventually) better enforcement of good coding practices.