15
Aug 13

better build times through configury

There’s been a flurry of activity lately on making Firefox build faster.  One thing that’s not talked about much is that tailoring which features you build, or how you build them, can make things go faster.  Below are a couple of configure options that may make your life better, depending on what platform you develop for and what your focus is.  None of these are going to help as much as glandium’s xulrunner SDK development trickery, but for those who can’t do that, for whatever reason, these may be helpful.  Add them to your mozconfig with ac_add_options (see the sample mozconfig snippet after the list) or to your manual configure invocation if you are oldschool.

  • --disable-debug-symbols: This option isn’t appropriate if you plan on doing any C/C++ debugging. Really. But if you work more-or-less exclusively in JavaScript (or Java, for Android), this option can make your builds go much faster by making your object files smaller and link times faster. My local Android builds shave 30 minutes of total runtime (!) by enabling this. The actual win is somewhere on the order of 3 minutes due to the wonders of parallel make. Still, 15% improvements are nothing to sneeze at.
  • --disable-crashreporter: This configure option builds less code, which is always a good thing.  Needing the crashreporter would be pretty unusual for local development, so this is probably pretty safe.
  • --disable-webrtc: Again, this option skips building a (somewhat larger) swath of code.  May not be appropriate if you’re doing Web Audio or trying to measure startup improvements.
  • --disable-icf: This option is only useful if you’re building on Linux and you know you’re using the gold linker.  This option turns off merging of identical functions, which makes the linker run slightly faster.  (May also work on Windows with opt builds, not sure.)
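
For example, a minimal mozconfig using these options might look like the following (illustrative only; remember that --disable-debug-symbols is only appropriate if you don’t need C/C++ debugging, and --disable-icf only applies to Linux with gold):

# in your mozconfig
ac_add_options --disable-debug-symbols
ac_add_options --disable-crashreporter
ac_add_options --disable-webrtc
ac_add_options --disable-icf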

For development on Linux with GCC specifically, there are a few compiler options that may also make your life better (a sample mozconfig snippet follows the list):

  • -fno-var-tracking: This option makes the debug information for C/C++ slightly less accurate.  Variable tracking is a relatively new thing (GCC 4.6+) and tends to suck up quite a bit of compile time.  If you were debugging happily with GCC 4.5 and earlier, you may not notice any problems with turning variable tracking off.
  • -gsplit-dwarf: What this does is it separates debug information into a completely different file from the actual compiled code.  This change, in turn, makes the object files smaller, primarily for the benefit of the linker.  You can read more about the motivation for splitting debug information into separate files on the GCC Wiki.  There’s also a bug about adding this by default to local developer builds.  (Recent SVN builds of clang also support this option, I think.)
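
One way to pass these is via CFLAGS/CXXFLAGS in your mozconfig.  This is only a sketch: it assumes a toolchain new enough to understand both flags and that nothing else in your build configuration overrides them.

# in your mozconfig
export CFLAGS="-fno-var-tracking -gsplit-dwarf"
export CXXFLAGS="-fno-var-tracking -gsplit-dwarf"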

Finally, if you want to dive really deep, you can play around with the options passed to the gold linker on Linux.  (And if you do, please post some numbers to that bug; we’d love to get more data on how those options work for people!)


12
Aug 13

the language lawyer: the curious constexpr conundrum

Last week, in a push that purported to eliminate a number of static constructors from the tree, I burned the Mac debug build:

In file included from SVGAnimateMotionElementBinding.cpp:13:
In file included from ../../dist/include/mozilla/dom/SVGAnimateMotionElement.h:11:
In file included from ../../../content/svg/content/src/SVGMotionSMILAnimationFunction.h:11:
In file included from ../../dist/include/nsSMILAnimationFunction.h:15:
In file included from ../../dist/include/nsSMILValue.h:10:
../../dist/include/nsSMILNullType.h:47:17: error: constexpr constructor never produces a constant expression [-Winvalid-constexpr]
MOZ_CONSTEXPR nsSMILNullType() {}
^
../../dist/include/nsSMILNullType.h:47:17: note: non-literal type 'nsISMILType' cannot be used in a constant expression

This error message caused some confusion:

18:41 < froydnj> "constexpr constructor never produces a constant expression"? what does *that* mean?
18:42 < philor> sweet, now you know what *every* compiler message sounds like to me

Searching the web for the error was also unhelpful. There was a Qt bug report from a year ago that wasn’t relevant and a smattering of pages discussing just how smart the compiler needs to be to diagnose invalid constant expressions. The closest hit was from a GCC bug on constexpr diagnostics. But even that didn’t quite tell the whole story. So what’s going on here?

I wasn’t sure what a literal type was, so I looked it up. The epsilon-close-to-the-standard draft, N3337, defines the term literal type in [basic.types] paragraph 10 (you’ll usually see this abbreviated to just [basic.types]p10 in discussions or compiler front-end comments; I’ll be abbreviating standard references from here on out):

A type is a literal type if it is:

  • a scalar type; or
  • a reference type referring to a literal type; or
  • an array of literal type; or
  • a class type (Clause 9) that has all of the following properties:
    • it has a trivial destructor,
    • every constructor call and full-expression in the brace-or-equal-initializers for non-static data
      members (if any) is a constant expression (5.19),
    • it is an aggregate type (8.5.1) or has at least one constexpr constructor or constructor template
      that is not a copy or move constructor, and
    • all of its non-static data members and base classes are of literal types.

That bit explains how we have a non-literal type in debug builds: debug builds define a no-op protected destructor. Since that destructor is user-provided and not defaulted, it is non-trivial (in the language of the standard; you can read [class.dtor]p5 for the definition of trivial destructors and [class.ctor]p5 for the definition of trivial constructors). The destructor therefore fails the first property of a literal class type, above. (In debug builds, we define a protected destructor to ensure that nobody is improperly deleting our type. But in opt builds, we don’t define the destructor so that it becomes compiler-defaulted and optimized away properly by intelligent frontends. And that means that any static instances of nsSMILNullType don’t need static constructors.)
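
In code form, the difference looks something like this (a hand-written sketch, not the actual SMIL classes):

// Debug-build shape: the user-provided destructor is non-trivial, so the
// class fails the first bullet above and is therefore not a literal type.
class NonLiteral {
public:
  constexpr NonLiteral() {}
protected:
  ~NonLiteral() {}  // user-provided => non-trivial
};

// Opt-build shape: with no user-provided destructor, the implicitly-declared
// (or explicitly defaulted) destructor is trivial and the class can be literal.
class Literal {
public:
  constexpr Literal() {}
  ~Literal() = default;
};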

OK, but why is having a non-literal type a problem? The compiler ought to simply execute the constructor normally and not worry about the constexpr annotation, right? Except that there is this bit in the definition of constant expressions, [expr.const]p2: “A conditional-expression is a constant-expression unless it involves one of the following as a potentially evaluated subexpression…: …an invocation of a function other than a constexpr constructor for a literal class or a constexpr function”

Aha! Here is our real problem. We are claiming that the constructor of nsSMILNullType is constexpr, but actually executing that constructor in an expression is by definition not a constant expression. And so Clang issues an error, as it should.

Moral of the story: don’t define destructors for classes that you give constexpr constructors to…unless you can default them. Guess we might have to add MOZ_DEFAULT to MFBT after all!


02
Aug 13

I got 99 problems…and compilation time is one of them

Over the past week or so, there have been a few conversations on #developers about the ever-increasing time it takes to compile Firefox.  Compiling Firefox used to be a relatively quick affair (well, on my machine, at least) and I’ve grown dissatisfied recently with builds taking ever-longer.  All of the aforementioned discussions have boiled down to “well, we’re adding a lot of code and all that code takes longer to compile.”

Surely that can’t be the whole story.  A mozilla-central build from a day or two ago took about 24 minutes on my machine (Linux x86-64 hyperthreaded 8-core @ 2.6GHz, srcdir on SSD, objdir on mechanical hard drive).  We are not compiling 2.4x more code than we were a year ago…or are we?  Let’s look at some numbers.

I looked at my mozilla-central repository for some convenient tags to pick from and found the FIREFOX_AURORA_${REV}_BASE set of tags.  I started compiling with mozilla-central tip, and then with FIREFOX_AURORA_24_BASE, and resolved to go as far back as I could without undue fiddling with things that wouldn’t compile.  I made it to FIREFOX_AURORA_14_BASE before hitting compilation errors with moz_free.  (19 and previous required small fiddling with js/src to set the right PYTHON, but that was easily scriptable.)  That gives me 11 markers covering over a year of development.

I didn’t try compiling things multiple times to see how much the numbers fluctuated.  I didn’t attempt to control for operating system disk caches or anything like that, just hg up -r $TAG, generated configure, ran configure --enable-optimize --disable-debug --disable-gstreamer (I haven’t bothered to get my gstreamer dependencies right on my development box) in a newly-created objdir, and recorded time for make -srj20.  The compiler was GCC 4.7 from Debian stable.

m-c tag real user sys
15 13m2.025s 83m36.229s 5m51.238s
16 13m40.359s 87m56.302s 6m15.435s
17 12m58.448s 94m21.514s 6m35.121s
18 14m21.257s 112m42.839s 7m36.213s
19 15m8.415s 120m48.789s 8m8.151s
20 16m26.337s 137m34.680s 9m11.878s
21 18m55.455s 165m29.433s 10m45.208s
22 19m27.887s 185m3.242s 11m53.853s
23 21m9.002s 203m46.708s 12m51.740s
24 21m51.242s 206m49.012s 12m55.356s
tip 24m37.489s 219m27.903s 13m47.080s

Those 10-minute clobber build times were apparently the result of warm disk cache or my dreams. Or perhaps they’re due to an older compiler; I only started using GCC 4.7 recently and I’ve done some tests that show it’s a fair amount slower than GCC 4.4. Or maybe I was using a different build config then. Whatever the reason, we can see that compilation times have been increasing steadily, but can we correlate that to anything?

Let’s look at the number of C++, C, WebIDL, and IPDL source files. We include WebIDL because the conversion to WebIDL has been cited as a reason for increasing compilation times, and IPDL because the IPDL compiler generates a number of source files as well (roughly three source files for every IPDL file). The C++ source files are split by extension; .cpp is the usual suffix inside Gecko, so separating .cpp and .cc gives some idea of how much outside C++ source code is being compiled. (Just counting source files is subject to issues, since we don’t compile everything in widget/ for every platform, but we’re counting all the source files therein as if they were. This is a rough estimate.)
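
(For the curious, counts like these can be approximated with something along the lines of the following, run from the top of a source checkout; it overcounts for exactly the reasons just mentioned.)

find . -name '*.cpp' | wc -l
find . -name '*.cc' | wc -l
find . -name '*.c' | wc -l
find . -name '*.webidl' | wc -l
find . -name '*.ipdl' | wc -l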

m-c tag cpp cc c webidl ipdl
15 4060 577 2503 32 206
16 4117 1273 2893 34 213
17 4173 1278 2895 38 221
18 4468 1405 3123 61 227
19 4498 1408 3118 78 228
20 4586 1418 2779 169 228
21 4605 1392 2831 231 228
22 4979 1392 2980 305 228
23 5072 1393 2982 344 231
24 5101 1330 2983 413 234
tip 5211 1427 3029 436 234

First things first: we have a lot of code in the tree.

Now, there are a couple of interesting things to point out. While there are a lot of IPDL source files, the increase in IPDL can hardly be held responsible for the increase in compile time. So that’s one suspect out of the way. From 16 to 17, we barely increased the amount of code, but build times went down. This might point to the recorded times being subject to quite a bit of variability.

We added a lot of WebIDL files from 20 to 21 while holding everything else constant and build times went up quite a bit! But we added roughly the same number of WebIDL files from 21 to 22 along with a lot more code and while build times went up, they didn’t go up by the same amount: 28 minutes of user time going from 20 to 21 and ~20 minutes of user time going from 21 to 22. There are a number of reasons why this could be so: we might not be compiling all those 500 C/C++ source files we added in 22 (ICU landed off by default in 22, so that’s ~400 files right there), the WebIDL files added in 21 might have had more of an impact, compilation-wise, the timings might fluctuate quite a bit, and so on and so forth. And notice that we added the same number of WebIDL files again from 23 to 24 and build times hardly changed. Sure, WebIDL means more files to compile. But I don’t think they are adding a disproportionate amount of time.

And finally, source code file counts. We had about 7800 source code files to compile back at Aurora 15 (remember, IPDL counts as three). We have about 10600 source files to compile now on trunk. That’s about a 40% increase that corresponds to roughly a doubling in time to compile. And yes, there’s also a lot of JS and other things that have been added that increase our time to compile. But I don’t think the amount of source code is the whole story.


01
Aug 13

visual event tracing on android and eideticker

Over the last couple of weeks, I’ve been working with Jan Bambas’s visual event tracer and using it to analyze why we’re slower than the stock Android browser on Eideticker’s New York Times pageload test.

The first order of business was to add trace-saving capabilities to the tracer.  With this change, I could capture traces on my Android device and view them on my desktop, which is much more convenient than looking at them on my phone.  (Especially because Simile Timeline, which the about:timeline extension uses, is difficult to use on mobile and slow in Firefox, so it needs all the CPU power it can get.  If any JS hackers out there have a replacement timeline widget recommendation, I am all ears!)

The second order of business was to modify Firefox to start and stop tracing automatically, rather than requiring the user to push buttons in about:timeline.  Pushing buttons works fine on desktop, not so much on automated tests on mobile devices.  I experimented with way too many fully general solutions before settling on the straightforward path: modifying nsDocShell::LoadURI to start and stop the tracer when certain URIs were fetched.

Now the tracer would automatically record profiles, on to viewing them!  I modified Jan’s about:timeline extension to function properly in a browser not built with event tracing support.  I also added the ability to use a trace saved in a file, rather than recording one in a browser.  You can see the changes I made in my about:timeline github repository.

With all those changes in hand, we can view pretty pictures of what’s going on when the browser runs Eideticker tests.  First, a normal Firefox for Android.

android-cached

That almost-4-second delay you see there is the cache getting opened.  Yes, that’s happening on the main thread and yes, it’s blocking everything else from moving forward.  This is a great illustration of why we are rewriting the network cache.  If the cache is so slow, what happens if we disable the cache?

android-uncached

It’s hard to tell from the picture, but if you looked at the Eideticker log of network requests, you’d see that everything completes somewhere around 0.5-1 second faster (!).  Why doesn’t it speed up more than that, since we’ve removed all that cache blocking by disabling the cache?  Not sure yet, although you can see that cache entries are still getting opened.  Also, you can see highlighted in the picture that the first network request takes about 250ms to start executing from the time that it’s fired off.  That also needs to be investigated (and fixed!).

What about this new cache rewrite, how does it fare?  Thought you’d never ask.

android-newcache

Again, you have to check out the HTTP request log, but this is still another 0.3 seconds or so faster than the uncached case that we see above.  So the new network cache is a clear win in this instance.  It’s worth pointing out that the network request delays are still present here and are taking slightly longer than the cache-disabled case.  Variance in profiles?  Effects of the new cache?  Not sure yet!

I’m still figuring out my way around the tracer, but it’s a useful tool to peer into the Firefox network stack.  It would be great if other subsystems (images, CSS, layout) would hook into the event tracer so we could have visibility into what they’re doing, too.  The API is pretty simple and causes no runtime overhead in our usual release builds.


28
Jun 13

chasing down android performance problems

In the last week or two, I’ve started looking at speeding up our startup and page render speed on Android.  We want to be significantly faster on things like eideticker’s New York Times loading benchmark.  Notable things that have happened this week:

Want to contribute?  Know of something that affects startup or page load time on Android?  We have an open bug for all issues connected with Android startup.  Start looking there!


18
Apr 13

cold page load, OS X 10.7, and talos

I’ve been looking at some Talos pageload benchmark data lately. The intention is gleaning enough information to decide whether having a separate cold page load benchmark is worthwhile. Our Talos pageload benchmark loads a number of sites a specified number of times (currently 25), records times for all of those, discards the highest time per page (usually the first page load), and averages everything else.  The working hypothesis is that the 1st and/or 2nd iteration of each site load might be worth tracking separately.

To that end, I’ve been writing a lot of shell scripts to munge benchmark numbers, running those numbers through gnuplot, and staring at the output. One of the graph styles I’ve been using is the Q-Q plot, which I use for comparing data like so (a rough sketch of the procedure follows the list):

  1. Sort the page load times for the 1st iteration;
  2. Sort the page load times for the nth iteration (usually 2nd, 5th, or 10th, the exact number doesn’t seem to matter a great deal);
  3. Plot numbers from step 1 vs. numbers from step 2 and eyeball the graph.
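
Concretely, the comparison boils down to something like this rough sketch, assuming one page load time per line in run1.txt and runN.txt (hypothetical file names) and a gnuplot build with PNG output:

sort -n run1.txt > run1.sorted
sort -n runN.txt > runN.sorted
paste run1.sorted runN.sorted > qq.dat
gnuplot -e 'set term png; set output "qq.png"; set xlabel "1st iteration (ms)"; set ylabel "nth iteration (ms)"; plot "qq.dat" using 1:2 with points, x title "y = x"'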

There’s a fair amount of data to crunch (8000 runs times 100 sites times 25 runs divided across 8 operating systems), but the most interesting thing to come out of this experiment so far is consistently slower cold page loads on OS X 10.7 on certain sites.  Graphing most sites across OS X 10.6, 10.7, and 10.8, the graphs look like this:

Q-Q plot for stackoverflow.com

Q-Q plot for reuters.com

These sorts of graphs are exactly what you’d want to see across OS versions: improving load times along warm page load, cold page load, or both.

Before pointing out the sites, it’s worth noting that the pages used in the benchmark are representative of these sites as they appeared ~1 year ago: things may have changed and the pages may have been altered slightly to avoid fetching resources off the network, etc.  The sites are:

  • 56.com
  • alibaba.com
  • bild.de
  • cnet.com
  • deviantart.com
  • etsy.com
  • en.wikipedia.org
  • filestube.com
  • foxnews.com
  • guardian.co.uk
  • huffingtonpost.com
  • hatena.ne.jp
  • icanhascheezburger.com
  • linkedin.com
  • mashable.com
  • mail.ru
  • nicovideo.jp
  • noimpactman.typepad.com
  • orange.fr
  • spiegel.de
  • thepiratebay.org
  • wsj.com
  • whois.domaintools.com

I’ll show a few representative graphs here; you can examine graphs for all the sites in the benchmark if you wish.

Q-Q plot for spiegel.de

Q-Q plot for orange.fr

Q-Q plot for huffingtonpost.com

Q-Q plot for linkedin.com

Q-Q plot for thepiratebay.org

What’s so interesting about these graphs is that the maximum cold page load time for OS X 10.6 and 10.8 is barely the minimum cold page load time for OS X 10.7.  The warm page load times are similar, too.

mozilla.com doesn’t appear in the above list, but is notable for exhibiting significantly worse cold load performance on OS X 10.8:

Q-Q plot for mozilla.com

I’m probably not going to dive any deeper on this issue right now; instead, this discrepancy will get filed away under “interesting things that turn up when you look at cold load performance specifically”.  Once I reach some sort of conclusion on whether benchmarking cold page load separately is worthwhile, then I’ll come back to the interesting issues found along the way.  If anybody has any theories about why these pages exhibit this discrepancy, I’d love to hear them!


11
Apr 13

introducing mozilla/Endian.h

In the continuing effort to eliminate usage of prtypes.h from the tree, a significant obstacle is the usage of IS_LITTLE_ENDIAN and IS_BIG_ENDIAN in various places.  These macros are defined if the target platform is little-endian or big-endian, respectively.  (All of our tier-1 platforms are little-endian platforms.  Pay no attention to the big-endian ARM variants; there be dragons.)  If you search for these identifiers, you’ll find that their uses fall into three broad categories:

  1. Defining byte-swapping macros.  Various amounts of attention are paid to using compiler intrinsics for the byte swaps.
  2. Conditionally compiling byte-swapping functionality.
  3. Taking slightly different actions or using different data according to the endianness of the target.

Point 1 is bad because we’re not always using the most efficient code possible to perform the swap.  Using functions would be preferable to gain the benefit of type checking and defined argument evaluation.  Depending on where you looked in the tree, sometimes the argument was modified in-place and sometimes it was returned as a value, so consistency suffers.  And IMHO, stumbling upon:

SWAP(value);

in code is not that informative.  Am I swapping to little-endian or from big-endian or something else?  More explicit names would be good.  Point 2 is bad because #ifdef-ery clutters the code and we may not be compiling the #ifdef‘d code all the time, which may lead to bitrot.

mfbt/Endian.h, which landed last week in bug 798172, is a significant step towards addressing the first two issues above.  Endian.h provides faster, clearer functions for byte-swapping functionality and also enables the byte-swapping to be compiled away depending on the target platform.  While it doesn’t address point 3 directly, it does provide MOZ_LITTLE_ENDIAN and MOZ_BIG_ENDIAN macros as an alternative to IS_LITTLE_ENDIAN and IS_BIG_ENDIAN.  Since MOZ_LITTLE_ENDIAN and MOZ_BIG_ENDIAN are always defined, Endian.h means that previously #ifdef‘d code can now be written (where possible) as straight C++ code, making things more readable.  And there are ideas for how to address point 3 more directly.

Enough talk; what about the bits and bytes?

As previously mentioned, Endian.h #defines MOZ_LITTLE_ENDIAN and MOZ_BIG_ENDIAN.  MOZ_LITTLE_ENDIAN is equal to 1 if we’re targeting a little-endian platform and 0 otherwise.  Likewise, MOZ_BIG_ENDIAN is equal to 1 if we’re targeting a big-endian platform and 0 otherwise.  The intent is that these are legacy macros: you shouldn’t have to use them in newly written Mozilla code, though they may come in handy for interfacing with external libraries that need endianness information.
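
Because both macros are always defined (to 0 or 1), conditional code can use plain #if rather than #ifdef.  A trivial illustration:

#if MOZ_LITTLE_ENDIAN
  // code for little-endian targets
#else
  // code for big-endian targets
#endif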

The next major piece of functionality is a family of functions that read 16-, 32-, or 64-bit signed or unsigned quantities in a given endianness.  The intent here is to replace code written like:

v1 = SWAP(*(uint32_t*)pointer);
*(int64_t*)other_pointer = SWAP(v2);

with clearer code that’s free of aliasing and (mis-)alignment issues:

v1 = mozilla::BigEndian::readUint32(pointer);
mozilla::BigEndian::writeInt64(other_pointer, v2);

The other read and write functions are named similarly. And of course there’s mozilla::LittleEndian::readUint32 and so forth as well. As a concession to readability (no code uses this yet, so we’re not sure how useful it is), there’s also mozilla::NetworkOrder which functions exactly the same as mozilla::BigEndian.

In an ideal world, those are all the functions that you’d need.  But looking through the code that needed to do byte-swapping, it often seemed that some sort of swap primitive was more convenient than reading or writing in defined endiannesses.  Who knows?  Maybe when the whole tree has been converted over to Endian.h, we’ll find that the swap primitives are completely unnecessary and eliminate them. However, in a partially-converted and not-quite-so-ideal world, we have byte swaps all over.

Accordingly, the last major piece of functionality deals with byte swap primitives.  Unlike a bare SWAP, though, these primitives specify the direction in which you’re swapping, which makes the code more self-documenting.  For instance, maybe you had:

struct Header {
  uint32_t magic;
  uint32_t total_length;
  uint64_t checksum;
} header;
fread(&header, sizeof(Header), 1, file);
header.magic = SWAP(header.magic);
header.total_length = SWAP(header.total_length);
header.checksum = SWAP64(header.checksum);

Assuming that the header was stored in little-endian order, you’d use Endian.h functions thusly:

struct Header {
  uint32_t magic;
  uint32_t total_length;
  uint64_t checksum;
} header;
fread(&header, sizeof(Header), 1, file);
header.magic = mozilla::NativeEndian::swapFromLittleEndian(header.magic);
header.total_length = mozilla::NativeEndian::swapFromLittleEndian(header.total_length);
header.checksum = mozilla::NativeEndian::swapFromLittleEndian(header.checksum);

You could write this using LittleEndian::readUint{32,64}. But it’s a little more straightforward to write it with swaps instead. In a similar fashion, there’s NativeEndian::swapToLittleEndian.

You can replace LittleEndian with BigEndian or NetworkOrder in these single-element swap functions and all the functions below with the obvious change to the behavior.

Single-element swaps solve a lot of problems. But the following coding pattern was semi-common:

void* pointer2;
...
memcpy(pointer1, pointer2, n_elements * sizeof(*pointer1));
#if defined(IS_BIG_ENDIAN)
for (size_t i = 0; i < n_elements; i++) {
  pointer1[i] = SWAP(pointer1[i]);
}
#endif

Again, this could be written with LittleEndian::readUint32 or similar. But that loses the benefits of memcpy on little-endian platforms (which are the common case for us). Depending on the type of pointer2, there might be some ugly casting and pointer arithmetic involved too. So Endian.h also includes “bulk swap” primitives:

mozilla::NativeEndian::copyAndSwapFromLittleEndian(pointer1, pointer2, n_elements);

which will do a straight memcpy on a little-endian platform and whatever copying + swapping is necessary on a big-endian platform. As you might expect by now, there’s also NativeEndian::copyAndSwapToLittleEndian. And since the related but slightly different:

uint32_t* pointer = new uint32_t[length];
...
fread(pointer, sizeof(*pointer), length, file);
#if defined(IS_BIG_ENDIAN)
for (size_t i = 0; i < length; ++i) {
  pointer[i] = SWAP(pointer[i]);
}
#endif

was also semi-common, the functions NativeEndian::swapFromLittleEndianInPlace and NativeEndian::swapToLittleEndianInPlace were also provided:

uint32_t* pointer = new uint32_t[length];
...
fread(pointer, sizeof(*pointer), length, file);
mozilla::NativeEndian::swapFromLittleEndianInPlace(pointer, length);

All the NativeEndian functions are actually templates, so they’ll work with 16-, 32-, or 64-bit signed or unsigned variables. They’ll also byteswap things like wchar_t and PRUnichar, though compilation will fail if you attempt to byteswap non-integer things like doubles or pointers.
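
For instance (illustrative only), the element size is inferred from the argument type, and non-integral types are rejected at compile time:

uint16_t swapped = mozilla::NativeEndian::swapToBigEndian(uint16_t(0x1234));
// mozilla::NativeEndian::swapToBigEndian(3.14);  // error: double is not an integer type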

Let the converting begin!  Makoto Kato has already begun by eliminating NS_SWAP{16,32,64} and replacing them with their Endian.h equivalents.


10
Apr 13

mozIStorageService, the main thread, and you

Bug 836493 landed on inbound today.  An additional constraint is now enforced on mozIStorageService: the initial reference to it must be obtained on the main thread.  However, all references after the first can be obtained on any thread.
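
In practice, that just means making sure the very first lookup happens on the main thread, e.g. somewhere during startup; a hedged sketch, with error handling omitted:

// First acquisition: must happen on the main thread.
MOZ_ASSERT(NS_IsMainThread());
nsCOMPtr<mozIStorageService> storage =
  do_GetService("@mozilla.org/storage/service;1");
// Later acquisitions, from any thread, are fine once this has happened.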

Seems awfully complicated; what do we gain from that change?  Two things.  The first is that there was a race in the initialization of the storage service.  Some bits of storage service initialization, like accessing preferences, could only be done on the main thread.  If the storage service was initialized on a non-main thread, it dispatched an event to the main thread to perform those initialization tasks.  Therefore, you might end up with a sequence of events like:

  1. Non-main thread requests the storage service.
  2. Storage service starts initialization, dispatches event to the main thread.  This event can’t run until after step 4, for various reasons (thread scheduling, backed-up event queue on the main thread, etc. etc.).
  3. Storage service initialization returns, handing an (incompletely initialized) object back to the caller.
  4. Caller uses the not-yet initialized storage service, leading to possible problems.

Additionally, the storage service builds an SQLite VFS for handling things like quota management.  This building happens on the calling thread and accesses prefs to make profiles that live on networked storage robust.  That’s a non-main thread preferences use which could lead to crashes.  That needs to go away so we can enforce main thread-only preferences usage.

Even though this change might make programming slightly more inconvenient, the end result is safer code for our users and (eventually) better enforcement of good coding practices.


25
Mar 13

perf workweek talos discussion

Inspired by Avi’s recent work on the Talos tscroll test, we had an extended discussion about things to address in Talos tests.  I took a picture of the whiteboard for posterity:

talos discussion whiteboard

Concerns driving our discussion:

  • Lack of confidence in Talos.  Whether through one too many dev.tree-management emails or inconsistent results on developer machines vs. Try, people don’t have confidence in Talos providing consistent, understandable results.
  • Wrong pushes identified as the cause of performance regressions.  We’ve covered this topic before.  Some people had a hard time believing this happens.  It shouldn’t, and we’d like to fix it.

Some ways to address those concerns:

    • Cold cases are just as important as warm cases.  On some tests, we might discount the numbers from the first few runs because they aren’t as smooth as the numbers from subsequent runs; we do this for our pageloader tests, for instance.  The differences come from things being initialized lazily, caches of various kinds filling up, and so forth.  But those “cold” cases are just as important as “warm” cases: In scenarios where the user restarts the web browser repeatedly throughout the day or memory pressure causes the browser to be killed on mobile devices, necessitating restarts when switching to it, the cold cases are the only ones that matter.  We shouldn’t be discounting those numbers.  Maybe the first N runs should constitute an entirely separate test: Tp5cold or similar.  Then we can compare numbers across cold tests, which should provide more insight than comparing cold vs. warm tests.
    • Criteria for performance tests.  We have extensive documentation on the tests themselves.  The documentation on the what that’s being measured, the how the test goes about fulfilling that goal, and the why one should believe that goal is actually achieved is in short supply.  As shown above, tscroll doesn’t measure the right things.  We don’t know if tscroll is an isolated incident or symptomatic of a larger problem.  We’d like to find out.
    • In-tree performance tests.  Whether recording frames during an animation and comparing them to some known quantity, measuring overdraws on the UI, or verifying telemetry numbers collected during a test, this would help alleviate difficulties with running the tests.  Presumably, it would also make adding new, focused tests easier.  And the tools for comparing performance tests would also exist in the tree so that comparing test results would be a snap.

    • Better real-world testing.  Tp5 was held up as a reasonable performance test.  Tp5 iterates over “the top 100 pages of the internet”, loading each one 25 times and measuring page load times before moving on to the next one.  Hurrah for testing the browser on real-world data rather than synthetic benchmarks.  But is this necessarily the best way to conduct a test with real-world data?  Doesn’t sound very reflective of real-world usage.  What about loading 10 pages simultaneously, as one might do clicking through morning news links?  Why load each page 25 times?  Why not hold some pages open while browsing to other ones (though admittedly without network access this may not matter very much)?  Tp5 is always run on a clean profile; clean profiles are not indicative of real-world performance (similarly for disk caches).  And so on.

Any one of these things is a decent-sized project on its own.  Avi and I are tackling the criteria for performance tests bit first by reviewing all the existing tests, trying to figure out what they’re measuring, and comparing that with what they purport to measure.


03
Feb 13

gcc version comparison, part 1.5/n: corrections

In my previous post, I discussed the size of libxul as compiled by various versions of GCC.  Due to some configuration quirks, it turns out that the comparison was flawed.

To recap: GCC versions 4.5-4.7 contained, among other things, vtables and their associated relocations for classes that were never instantiated.  I theorized that some compiler optimization must have been responsible for this, and that this compiler optimization must have gotten disabled in those versions.  Thinking about it afterwards, it turned out that there was a simple way to check this theory: examine the object files for the vtables.  Some objects compiled by versions 4.5-4.7 must have the vtables, and no objects from versions 4.4 and 4.8 should contain the vtables.  So let’s check, using nsIDOMSVGTextElement as an example:

[froydnj@cerebro froydnj]$ for d in 4 5 6 7 8; do
  for o in $(find build-mozilla-gcc-4${d}/ -name '*.o'); do
    if readelf -sW $o |c++filt| grep -q 'vtable for nsIDOMSVGTextElement' 2>/dev/null; then
      echo $o; readelf -sW $o |c++filt|grep 'vtable for nsIDOMSVGTextElement'
    fi
  done
done
build-mozilla-gcc-44/content/svg/content/src/SVGTextElement.o
   971: 0000000000000000   856 OBJECT  WEAK   HIDDEN  450 vtable for nsIDOMSVGTextElement
build-mozilla-gcc-45/content/svg/content/src/SVGTextElement.o
  1241: 0000000000000000   856 OBJECT  WEAK   HIDDEN  676 vtable for nsIDOMSVGTextElement
build-mozilla-gcc-46/content/svg/content/src/SVGTextElement.o
  1021: 0000000000000000   856 OBJECT  WEAK   HIDDEN  498 vtable for nsIDOMSVGTextElement
build-mozilla-gcc-47/content/svg/content/src/SVGTextElement.o
  1075: 0000000000000000   856 OBJECT  WEAK   HIDDEN  533 vtable for nsIDOMSVGTextElement
build-mozilla-gcc-48/content/svg/content/src/SVGTextElement.o
   831: 0000000000000000   856 OBJECT  WEAK   HIDDEN  532 vtable for nsIDOMSVGTextElement

So all versions of the compiler are generating the vtables that are sometimes present and sometimes not in the compiled libxul.  Why do the vtables sometimes disappear?

The linker on Linux systems has a --gc-sections option that eliminates unused sections from the final output file, using a form of mark and sweep garbage collection.  Normally, this is not terribly effective, since all of your program code goes into .text (resp. data into .data and so forth), and something in .text ought to be getting used.  But Mozilla is compiled with the options -ffunction-sections and -fdata-sections; -ffunction-sections gives each function its own uniquely named section and -fdata-sections does a similar thing for variables.  Using --gc-sections with the linker, then, effectively eliminates unused functions and/or variables that the compiler can’t prove are unused.  (The compiler can eliminate unused static variables from a compilation unit, for instance, but eliminating unused variables that are visible outside of a compilation unit requires the linker’s help.)  And indeed, the linking process on Linux uses this --gc-sections option.
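
The mechanism in miniature (hypothetical file names, just to show how the flags combine):

# give each function and variable its own section…
gcc -ffunction-sections -fdata-sections -c foo.c
# …then let the linker discard any sections nothing references
gcc -Wl,--gc-sections -o prog foo.o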

…most of the time.  Depending on the vagaries of the GCC compiler version and the version of the linker being used, using --gc-sections can impede the debugging experience.  So bug 670659 added a check to disable --gc-sections if using that option altered the debugging information in unhelpful ways.

You can probably see where this is going: on my machine, GCC versions 4.5-4.7 failed this check and so the --gc-sections option was not used with those versions.  (GCC 4.8 actually wound up bypassing the check altogether.)  Unfortunately, compiling things so the --gc-sections option is consistently used is difficult because of how configure.in is structured.

Lesson learned: double-check your experimental setup before analyzing your results!  Make sure everything’s being done consistently between your test configurations so your measurements accurately reflect differences in what you’re measuring.