29 May 18

when an implementation monoculture might be the right thing

It’s looking increasingly likely that Firefox will, in the not-too-distant future, build with a single C++ compiler across the four major platforms we support.  I’m uneasy with this, but I think I’ve made my peace with it, partly as a result of writing the piece below.

Firefox currently builds with three major C++ compilers across four platforms: Microsoft’s Visual C++ compiler (MSVC), GCC, and Clang.  A fair amount of work has been done to deal with peculiar bugs in all three compilers: you can go search the source code and/or Bugzilla to find hacks that were needed for one reason or another.  A fair amount of work has also been stalled or shelved because one or two compilers don’t quite measure up in some required area (e.g. standards support).  As you might imagine, many a Firefox engineer has bemoaned the need for cross-compiler compatibility.

Cross-implementation compatibility is something that Mozilla expends a lot of effort on in a different context.  We have a Tech Evangelism bugzilla component for outreach to sites that use techniques that don’t translate across browsers.  When new sites appear that deliberately block Firefox (whether because the launch team took the time to test with Firefox and determined the user experience wouldn’t be acceptable, or because cross-browser compatibility was an explicit non-goal), Firefox engineers go find the performance cliffs and fix them.  Mozilla has a long history of promoting the benefits of multiple implementations of the web platform; some of the old guard might remember “Works best in all browsers” campaigns and the like.  If you squint properly, you can even see this promotion in the manifesto (principles 2, 5, 6, 7, and 9, by my reckoning).

So as nice as a single implementation might be, dealing with multiple implementations was a fact of life in building a high-quality open-source browser.  We dealt with it, because it seemed like we would always need to support MSVC; who would invest the time to create an open-source, MSVC-compatible compiler?

Well, Google, mostly, and a host of other people, because the past several releases of Clang have included an MSVC-compatible frontend, clang-cl.  (Indeed, Firefox has been using clang-cl for Windows static analysis builds for some time.)  And now that we have a usable non-MSVC compiler on Windows, we can contemplate using an open-source compiler to create our release Windows builds.  And once we have that, we can consider using (and potentially only supporting) a single compiler (Clang) for all of the major platforms we support; Linux would be the remaining holdout.  (Chrome already ships on Windows with clang and requires clang everywhere, FWIW.)

We might continue to require that things build with MSVC and GCC on relevant platforms, even if we’re not shipping these builds; even if this happened, such builds seem unlikely to last for very long, for all the reasons that we wanted them dropped in the first place.  I imagine we’d probably continue to accept patches to make things build with non-Clang compilers, as long as the patches were not intrusive, just like we accept patches for non-tier 1 platforms.

Supporting a single compiler has a number of advantages:

  • Cross-language LTO (i.e. inlining) between Rust and C++ (we could, of course, do this today, but we wouldn’t get the win on all platforms);
  • Mozilla engineers can fix bugs in Clang/LLVM if need be;
  • Fixes can be more easily backported from the Clang/LLVM development tree;
  • Contributors have fewer compiler quirks to hold up their patches;
  • Integrating and/or upgrading local copies of upstream projects becomes easier;
  • Performance tuning becomes somewhat more straightforward when you have a single compiler to worry about.

I am probably forgetting some along the way.  (I don’t think it’s true that we’ll be able to entirely eliminate hacks to pacify the compiler; you push on C++ hard enough and long enough, and you find yourself doing all manner of unusual things.  We might even find ourselves doing more hacks, since we can justify it via, “Since we can/can’t rely on the compiler to do X…”)

I can see all the advantages.  I can even admire the sheer coolness of some of them; cross-language inlining sounds fantastic!  But the analogy between the Web situation and the C++ compiler situation makes me uneasy: we ask web developers to write cross-browser compatible websites, with all the time and energy that requires.  We tout the goodness of supporting multiple implementations of the web platform.  However, in the implementation of that web platform, we are in the process of deciding that the benefits of supporting a single C++ implementation are greater than whatever benefits (engineering, philosophical, etc.) might accrue from supporting multiple implementations.

To be explicit: we are making the exact style of decision that we ask web development teams not to make.

After having proposed this and thought about it for a while, I think the analogy is a bit strained.  We make the argument that websites should be cross-browser compatible because we support the freedom of users to access those sites with whatever browser they like, whereas Firefox engineering is the only “consumer” of the compiler(s), so we should optimize for that single consumer.  Indeed, we don’t really concern ourselves with cross-engine compatibility for the JavaScript that lies behind our UI.  Firefox users (generally) don’t care too much what compiler gets used to build Firefox, and they’d probably support a switch to a compiler monoculture if that meant the browser got faster!

(I’m not completely at ease with calling the two situations dissimilar; it’d be all too easy for a website to say they only care about a single “user”, viz. users of $BROWSER, and dispense with cross-browser support.  I want to have a stronger argument for this case, but I don’t at the moment…)

At the end of the day, I think I’m mostly in support (0.6 on the Apache voting scale?).  I think it will be cool when it’s done, and I will probably wind up doing some work in support of the project.  But I can’t completely shake my uneasiness.  What do you think?


31 May 16

why gecko data structures should be preferred to std:: ones

In light of the recent announcement that all of our Tier-1 platforms now have a C++11-supporting standard library, I received some questions about whether we should continue encouraging the use of Gecko-specific data structures. My answer was “yes”, and as I was writing the justification for said answer, I felt that the justification was worth broadcasting to a wider audience. Here are the reasons I came up with; feel free to agree or disagree in the comments.

  • Gecko’s data structures can be customized extensively for our purposes, whereas we don’t have the same control over the standard library.  Our string classes, for instance, permit sharing structure between strings (whether via something like nsDependentString or reference-counted string buffers); that functionality isn’t currently supported in the standard library.  While the default behavior on allocation failure in Gecko is to crash, our data structures provide interfaces for failing gracefully when allocations fail (see the sketch after this list).  Allocation failures in standard library data structures are reported via exceptions, which we don’t use; if you’re not using exceptions, allocation failures in those data structures simply crash, which isn’t acceptable in a number of places throughout Gecko.
  • Gecko data structures can assume things about the environment that the standard library can’t.  We ship the same memory allocator on all our platforms, so our hashtables and our arrays can attempt to make their allocation behavior line up with what the memory allocator efficiently supports.  It’s possible that the standard library implementations we’re using do things like this, but it’s not guaranteed by the standard.
  • Along similar lines as the first two, Gecko data structures provide better visibility for things like debug checks and memory reporting.  Some standard libraries we support come with built-in debug modes, but not all of them, and not all debug modes are equally complete. Where possible, we should have consistent support for these sorts of things across all our platforms.
  • Custom data structures may provide better behavior than standard data structures by relaxing the specifications provided by the standard.  The WebKit team had a great blog post on their new mutex implementation, which optimizes for cases that OS-provided mutexes aren’t optimized for, either because of compatibility constraints or because of outside specifications.  Chandler Carruth has a CppCon talk where he mentions the non-ideal interfaces in many of the standard library data structures.  We can do better with custom data structures.
  • Data structures in the standard library may provide inconsistent performance across platforms, or disagree on the finer points of the standard.  Love them or hate them, Gecko’s data structures at least provide consistent behavior everywhere.
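
As an illustration of the first point above, here is a minimal sketch of the fallible allocation pattern.  It assumes nsTArray’s fallible overloads as they exist in the tree; the helper function itself is hypothetical:

#include "nsTArray.h"
#include "mozilla/fallible.h"

// Hypothetical helper, not actual Gecko code.
bool AppendAll(const nsTArray<uint32_t>& aInput, nsTArray<uint32_t>& aOutput)
{
  // The default AppendElements is infallible: if the underlying allocation
  // fails, Gecko deliberately crashes rather than throwing.  Passing
  // mozilla::fallible selects the fallible overload, which reports failure
  // to the caller instead.
  if (!aOutput.AppendElements(aInput, mozilla::fallible)) {
    return false;  // out of memory; the caller decides how to recover
  }
  return true;
}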

Most of these arguments are not new; if you look at the documentation for Facebook’s open-source Folly library, for instance, you’ll find a number of these arguments, if not expressed in quite the same way.  Browsing through WebKit’s WTF library shows they have a number of the same things that we do in xpcom/ or mfbt/ as well, presumably for some of the same reasons.

All of this is not to say that our data structures are perfect: the APIs for our hashtables could use some improvements, our strings and nsTArray do a poor job of separating “data structure” from “algorithm”, nsDeque serves as an excellent excuse to go use the standard library instead, and XPCOM’s synchronization primitives should stop going through NSPR and use the underlying OS’s primitives directly (or simply be rewritten to use something like WebKit’s locking primitives, above).  This is a non-exhaustive list; I have more ideas if people are interested.

Having a C++11 standard library on all platforms brings opportunities to remove dead polyfills; MFBT contains a number of these (Atomics.h, Tuple.h, TypeTraits.h, UniquePtr.h, etc.)  But we shouldn’t flock to the standard library’s functionality just because it’s the standard.  If the standard library’s functionality doesn’t fit our use cases, we should definitely write our own replacement(s) and use them widely.


20 Jan 16

gecko and c++ onboarding presentation

One of the things the Firefox team has been doing recently is having onboarding sessions for new hires. This onboarding currently covers:

  • 1st day setup
  • Bugzilla
  • Building Firefox
  • Desktop Firefox Architecture / Product
  • Communication and Community
  • JavaScript and the DOM
  • C++ and Gecko
  • Shipping Software
  • Telemetry
  • Org structure and career development

My first day consisted of some useful HR presentations and then I was given my laptop and a pointer to a wiki page on building Firefox.  Needless to say, it took me a while to get started!  It would have been super convenient to have an introduction to all the stuff above.

I’ve been asked to do the C++ and Gecko session three times.  All of the sessions are open to whoever wants to come, not just the new hires, and I think yesterday’s session was easily the most well-attended yet: somewhere between 10 and 20 people showed up.  Yesterday’s session was the first session where I made the slides available to attendees (should have been doing that from the start…) and it seemed equally useful to make the slides available to a broader audience as well. The Gecko and C++ Onboarding slides are up now!

This presentation is a “living” presentation; it will get updated for future sessions with feedback and as I think of things that should have been in the presentation or better ways to set things up (some diagrams would be nice…).  If you have feedback (good, bad, or ugly) on particular things in the slides or you have suggestions on what other things should be covered, please contact me!  Next time I do this I’ll try to record the presentation so folks can watch that if they prefer.


09 Oct 15

gecko include file statistics

I was inspired to poke at which files were most heavily #include’d and which files contributed the most text as a result of their #include’ing after seeing the simplicity of LibreOffice’s script for doing so. I had to rewrite it in Python, as the obvious modifications to the awk script weren’t working, and I had no taste for debugging awk code. I’ve put the script up as a gist.

It’s intended to be run from a newly built objdir on Linux like so:

python includebloat.py .

The ability to pick a subdirectory of interest:

python includebloat.py dom/bindings/

was useful to me when I was testing the script, so I wasn’t groveling through several thousand files at a time.

The output lines are formatted like so:

total_size file_size num_of_includes filename

and are intended to be manipulated further via sort, etc. The script might work on Mac and Windows, but I make no promises.

The results were…interesting, if not especially helpful at suggesting modifications for future work. I won’t show the entirety of the script’s output, but here are the top twenty files by total size included (size of the file on disk multiplied by number of times it appears as a dependency), done by filtering the script’s output through sort -n -k 1 -r | head -n 20 | cut -f 1,4 -d ' ':

332478924 /usr/lib/gcc/x86_64-linux-gnu/4.9/include/avx512fintrin.h
189877260 /home/froydnj/src/gecko-dev.git/js/src/jsapi.h
161543424 /usr/include/c++/4.9/bits/stl_algo.h
141264528 /usr/include/c++/4.9/bits/random.h
113475040 /home/froydnj/src/gecko-dev.git/xpcom/glue/nsTArray.h
105880002 /usr/include/c++/4.9/bits/basic_string.h
92449760 /home/froydnj/src/gecko-dev.git/xpcom/glue/nsISupportsImpl.h
86975736 /usr/include/c++/4.9/bits/random.tcc
76991387 /usr/include/c++/4.9/type_traits
72934768 /home/froydnj/src/gecko-dev.git/mfbt/TypeTraits.h
68956018 /usr/include/c++/4.9/bits/locale_facets.h
68422130 /home/froydnj/src/gecko-dev.git/js/src/jsfriendapi.h
66917730 /usr/include/c++/4.9/limits
66625614 /home/froydnj/src/gecko-dev.git/xpcom/glue/nsCOMPtr.h
66284625 /usr/include/x86_64-linux-gnu/c++/4.9/bits/c++config.h
63730800 /home/froydnj/src/gecko-dev.git/js/public/Value.h
62968512 /usr/include/stdlib.h
57095874 /home/froydnj/src/gecko-dev.git/js/public/HashTable.h
56752164 /home/froydnj/src/gecko-dev.git/mfbt/Attributes.h
56126246 /usr/include/wchar.h

How does avx512fintrin.h get included so much? It turns out <algorithm> drags in a lot of code, despite people usually only needing min, max, or swap. In this case, <algorithm> includes <random> because std::shuffle requires std::uniform_int_distribution from <random>. This include chain is responsible for essentially all of the /usr/include/c++/4.9-related files in the above list.

If you are compiling with SSE2 enabled (as is the default on x86-64 Linux), then <random> includes <x86intrin.h> because <random> contains a SIMD Mersenne Twister implementation. And <x86intrin.h> is a clearinghouse for all sorts of x86 intrinsics, even though all we need is a few typedefs and intrinsics for SSE2 code. Minus points for GCC header cleanliness here.
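
For the curious, the chain described above is easy to reproduce; a minimal translation unit like this one (compiled against the GCC 4.9-era libstdc++) pulls in everything mentioned:

// All most files want from <algorithm> is min/max/swap, but <algorithm>
// transitively includes <random> (std::shuffle needs
// std::uniform_int_distribution), and <random> includes <x86intrin.h> for
// its SIMD Mersenne Twister.
#include <algorithm>

int smaller(int a, int b) {
  return std::min(a, b);
}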

What about the top twenty files by number of times included (filter the script’s output through sort -n -k 3 -r | head -n 20 | cut -f 3,4 -d ' ')?

2773 /home/froydnj/src/gecko-dev.git/mfbt/Char16.h
2268 /home/froydnj/src/gecko-dev.git/mfbt/Attributes.h
2243 /home/froydnj/src/gecko-dev.git/mfbt/Compiler.h
2234 /home/froydnj/src/gecko-dev.git/mfbt/Types.h
2204 /home/froydnj/src/gecko-dev.git/mfbt/TypeTraits.h
2132 /home/froydnj/src/gecko-dev.git/mfbt/Likely.h
2123 /home/froydnj/src/gecko-dev.git/memory/mozalloc/mozalloc.h
2108 /home/froydnj/src/gecko-dev.git/mfbt/Assertions.h
2079 /home/froydnj/src/gecko-dev.git/mfbt/MacroArgs.h
2002 /home/froydnj/src/gecko-dev.git/xpcom/base/nscore.h
1973 /usr/include/stdc-predef.h
1955 /usr/include/x86_64-linux-gnu/gnu/stubs.h
1955 /usr/include/x86_64-linux-gnu/bits/wordsize.h
1955 /usr/include/x86_64-linux-gnu/sys/cdefs.h
1955 /usr/include/x86_64-linux-gnu/gnu/stubs-64.h
1944 /usr/lib/gcc/x86_64-linux-gnu/4.9/include/stddef.h
1942 /home/froydnj/src/gecko-dev.git/mfbt/Move.h
1941 /usr/include/features.h
1921 /opt/build/froydnj/build-mc/js/src/js-config.h
1918 /usr/lib/gcc/x86_64-linux-gnu/4.9/include/stdint.h

Not a lot of surprises here. A lot of these are basic definitions for C++ and/or Gecko (<stdint.h>, mfbt/Move.h).

There don’t seem to be very many obvious wins, aside from getting GCC to clean up its header files a bit. Getting us to the point where we can use <type_traits> instead of our own homegrown mfbt/TypeTraits.h would be a welcome development. Making js/src/jsapi.h less of a mega-header might help some, but brings a burden of “did I remember to include the correct JS header files”, which probably devolves into people cutting-and-pasting complete lists, which isn’t a win. Splitting up nsISupportsImpl.h seems like it could help a little bit, though with unified compilation, I suspect we’d likely wind up including all the split-up files at once anyway.


17 Sep 15

compiler-enforced locked accesses

If you’ve done any amount of threaded programming, you’ve probably run across code that looked like:

  // Only accessed with the mutex held.
  uint32_t mFlags;
  bool mConnected;
  nsTArray<int32_t> mData;

  // Only called with the mutex held.
  void DoSomething();

Perhaps you’ve even gotten to debug code which inadvertently violated the locking requirements of the members.

Several months ago, I reviewed a patch by David Keeler that addressed the second half of the above example. Methods that had locking requirements looked like:

  void DoSomething(MutexAutoLock& aProofOfLock);

which ensures (at compile time!) that you can’t call the function without locking the mutex first. I thought this was a nice technique, said as much in my review, and have been looking for places to apply it ever since.

The explicitness and the requirement to constantly pass MutexAutoLock& variables around is a feature, not a bug. Doing so encourages you to limit the amount of code that needs to be executed with locks held, keeping the concurrent parts of the code and the synchronized parts of the code clearly delimited. In this respect, this technique is similar to the IO monad in Haskell. I don’t know whether the extra verbosity would make the code more difficult to read or not, especially if the techniques for guarding members suggested below were applied as well.

This coding style also came in handy a week or so ago investigating overly high CPU usage when playing YouTube videos. Our event queue for nsIRunnables did its own locking internally, and exposed its internal reentrant monitor “for power users”. This led to code like:

{
  ReentrantMonitorAutoEnter mon(mEvents.GetReentrantMonitor());
  ...
  mEvents.PutEvent(event);
}

where the PutEvent call would do an extra round of locking (“just to be sure”), which was wasted work. Having data structures like this do their own locking internally typically isn’t a great idea, so part of the work in the above bug was to separate the locking requirements of the queue from who actually needs to do the locking. Or, in other words, we can have the class that owns the event queue do the locking, and have the event queue’s methods enforce the locking programmatically:

{
  MonitorAutoLock mon(mMonitor);
  ...
  mEvents.PutEvent(event, mon);
}

Now there’s no wasted work in PutEvent, because it already knows the appropriate locking has been done. The ability to use non-reentrant monitors—which are more efficient—was a nice bonus resulting from the separation of concerns here.
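
For concreteness, here is a hedged sketch of the queue’s post-separation interface; the actual class in the tree differs in its details, but the shape is the point: the queue does no locking of its own, and every method demands proof that the owner’s monitor is held.

// Hypothetical signatures, not the actual tree code.
class EventQueue
{
public:
  void PutEvent(nsIRunnable* aEvent, MonitorAutoLock& aProofOfLock);
  already_AddRefed<nsIRunnable> GetEvent(MonitorAutoLock& aProofOfLock);
};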

This technique can also help solve the first half of the problem we presented at the beginning of this post: ensuring members are only accessed with locks held.

template<typename T>
class Guarded;

template<>
class Guarded<uint32_t>
{
public:
  Guarded() : mValue(0) {}

  uint32_t Value(MutexAutoLock& aProofOfLock)
  {
    return mValue;
  }

  void Assign(MutexAutoLock& aProofOfLock, uint32_t aNewValue)
  {
    mValue = aNewValue;
  }

  // Since accesses should only be done under the lock, and copying
  // and moving would therefore require locks, we require the user
  // to ensure those constraints are met with explicit calls to the
  // above rather than the compiler sneaking unlocked accesses in.
  Guarded(const Guarded&) = delete;
  ...

private:
  uint32_t mValue;
};

The above class isn’t production quality code; it’s intended to sketch out how explicit locking requirements might work. A more robust version might require a Mutex& reference to be passed to the constructor, and member functions assert that the MutexAutoLock& parameters actually lock the specified mutex. Specializing for each of the integer types would also get tiresome, so we’d need to do a better job there. Handling types with methods could be done with something like the following, I think:

template<typename T>
class GuardedAggregate
{
public:
  GuardedAggregate() : mValue() {}

  // The core idea here is that the user would write:
  //
  // GuardedAggregrate<nsTArray> mArray;
  //
  // and then accesses would be done via:
  //
  // mArray.Value(lock).Method(...);
  //
  // This means that we don't have to proxy every single one of
  // the aggregate's methods, but the locking requirements are
  // still explicit.
  class Proxy
  {
  public:
    Proxy(MutexAutoLock& aProofOfLock, T* aValue) : mValue(aValue)
    {}

    T* operator->()
    {
      return mValue;
    }

  private:
    T* mValue;
  };

  Proxy Value(MutexAutoLock& aProofOfLock)
  {
    return Proxy(aProofOfLock, &mValue);
  }

  ...
private:
  T mValue;
};
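
Returning to the “more robust version” mentioned earlier: here is a minimal sketch of what checking the lock might look like.  Mutex::AssertCurrentThreadOwns() is real Gecko API; the class name and the rest of the shape are illustrative only.

template<typename T>
class CheckedGuarded
{
public:
  explicit CheckedGuarded(Mutex& aMutex) : mMutex(aMutex), mValue() {}

  T Value(MutexAutoLock& aProofOfLock)
  {
    // Debug-only check that the guarding mutex is held by this thread.
    // A stronger version would verify that aProofOfLock locks mMutex itself.
    mMutex.AssertCurrentThreadOwns();
    return mValue;
  }

  void Assign(MutexAutoLock& aProofOfLock, T aNewValue)
  {
    mMutex.AssertCurrentThreadOwns();
    mValue = aNewValue;
  }

  CheckedGuarded(const CheckedGuarded&) = delete;

private:
  Mutex& mMutex;
  T mValue;
};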

This can also be thought of as a compiler-independent, but less flexible, version of clang’s Thread Safety Analysis. Folks have been asking about bringing the annotations that analysis requires into Gecko; I wonder if it might work just as well to apply this technique more liberally throughout the codebase.


20 Aug 15

explicit is better than implicit: c++ implicitly defined member functions

In the tradition of The Zen of Python, I’ve been thinking about pushing for explicit declarations of otherwise implicitly-defined member functions in C++, both in code that I write and in code that I review:

// Instances of this class should not be copied.
MyClass(const MyClass&) = delete;
MyClass& operator=(const MyClass&) = delete;

// We are OK with the default semantics.
OtherClass(const OtherClass&) = default;
OtherClass& operator=(const OtherClass&) = default;
OtherClass(OtherClass&&) = default;
OtherClass& operator=(OtherClass&&) = default;

[Background: C++ specifies several member functions that the compiler will implicitly define for you in any class: the default constructor, the copy/move constructor(s), and the copy/move assignment operator(s). I say “implicitly define”, as though that always happens, but there are a number of constraints on when the compiler will do this. For the purposes of the discussion below, I’ll ignore the default constructor bit and focus on the copy/move constructor and assignment operator. (I will also happily ignore all the different variants thereof that can occur, e.g. when the compiler defines MyClass(MyClass&) for you.) I think the arguments apply equally well to the default constructor case, but most classes I deal with tend to either declare their own default constructor or have several user-defined constructors anyway, which prohibit the compiler from implicitly declaring the default constructor.]

I think the argument for = delete is more obvious and less controversial, so I’ll start there.  = delete‘ing functions you don’t want used is part of the API contract of the class.  Functions that shouldn’t be used shouldn’t be exposed to the user, and = delete ensures that the compiler won’t implicitly define part of your API surface (and users thereby unknowingly violate API guarantees).  The copy constructor/assignment operator are the obvious candidates for = delete, but using = delete for the move constructor/assignment operator makes sense in some cases (e.g. RAII classes). Using = delete gives you pleasant compiler error messages, and it’s clearer than:

private:
  MyClass(const MyClass&);
  MyClass& operator=(const MyClass&);

If you’re lucky, there might be a comment to the effect of // Deliberately not defined.  I know which code I’d prefer to read. (Using = delete also ensures you don’t accidentally use the not-defined members inside the class itself, then spend a while waiting for the linker errors to tell you about your screw-up.)

= default appears to be a little harder to argue for.  “Experienced” programmers always know which functions are provided by the compiler, right?

Understanding whether the compiler implicitly defines something requires looking at the entire class definition (including superclasses) and running a non-trivial decision algorithm. I sure don’t want each reader of the code to do that for two or four different member functions (times superclasses, too), all of which are reasonably important in understanding how a class is intended to be used.

Explicitly declaring what you intend can also avoid performance pitfalls. In reading through the C++ specification to understand when things were implicitly declared, I discovered that the same functions can also be implicitly deleted, including this great note: “When the move constructor is not implicitly declared or explicitly supplied, expressions that otherwise would have invoked the move constructor may instead invoke a copy constructor.” So, if the move constructor was implicitly declared at some point, but then was implicitly deleted through some change, expressions that were previously efficient due to moving would become somewhat less so due to copying. Isn’t C++ great?
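
To make that note concrete, here is a small self-contained example (mine, not the specification’s) in which a “move” silently degrades to a copy because a user-declared destructor suppresses the implicit move constructor:

#include <cstdio>
#include <utility>
#include <vector>

struct Holder {
  std::vector<int> data;
  // The user-declared destructor means the move constructor is never
  // implicitly declared, so std::move below binds to the copy constructor.
  ~Holder() {}
};

int main() {
  Holder a;
  a.data.resize(1000);
  Holder b = std::move(a);              // quietly copies; nothing is moved
  std::printf("%zu\n", a.data.size());  // prints 1000: |a| was left intact
  return 0;
}

Declaring Holder(Holder&&) = default; restores the move and documents the intent, which is precisely the argument for being explicit.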

Being explicit also avoids the possibility of meaning to define something, but getting tripped up by the finer points of the language:

template<typename T>
class MyClass
{
public:
  // This does not define a copy constructor for MyClass<T>.
  template<typename U>
  MyClass(const MyClass<U>& aOther) : ... { ... }
  ...
};

Comments could serve to notify the reader that we’re OK with the default definition, but if I could choose between encoding something in a place solely intended for humans, or a place both humans and the compiler will understand, I know which one I’d pick.


17 Feb 15

multiple return values in C++

I’d like to think that I know a fair amount about C++, but I keep discovering new things on a weekly or daily basis.  One of my recent sources of new information is the presentations from CppCon 2014.  And the most recent presentation I’ve looked at is Herb Sutter’s Back to the Basics: Essentials of Modern C++ Style.

In the presentation, Herb mentions a feature of tuple that enables returning multiple values from a function.  Of course, one can already return a pair<T1, T2> of values, but accessing the fields of a pair is suboptimal and not very readable:

pair<...> p = f(...);
if (p.second) {
  // do something with p.first
}

The designers of tuple must have listened, because there’s a function, std::tie, that lets you destructure a tuple:

typename container::iterator position;
bool already_existed;
std::tie(position, already_existed) = mMap.insert(...);

It’s not quite as convenient as destructuring multiple values in other languages, since you need to declare the variables prior to std::tie‘ing them, but at least you can assign them sensible names. And since pair implicitly converts to tuple, you can use tie with functions in the standard library that return pairs, like the insertion functions of associative containers.
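
Returning multiple values from your own functions works the same way; a minimal sketch (all names are mine):

#include <string>
#include <tuple>

// Returns whether parsing succeeded along with the first token found.
std::tuple<bool, std::string> Parse(const std::string& input)
{
  if (input.empty()) {
    return std::make_tuple(false, std::string());
  }
  return std::make_tuple(true, input.substr(0, 1));
}

int main()
{
  bool ok;
  std::string token;
  std::tie(ok, token) = Parse("hello");
  return ok ? 0 : 1;
}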

Sadly, we’re somewhat limited in our ability to use shiny new concepts from the standard library because of our C++ standard library situation on Android (we use stlport there, and it doesn’t feature useful things like <tuple>, <functional>, or <thread>). We could, of course, polyfill some of these (and other) headers, and indeed we’ve done some of that in MFBT already. But using our own implementations limits our ability to share code with other projects, and it also takes up time to produce the polyfills and make them appropriately high quality. I’ve seen several people complain about this, and I think it’s something I’d like to fix in the next several months.


27 Jan 15

examples of poor API design, 1/N – pldhash functions

The other day in the #content IRC channel:

<bz> I have learned so many things about how to not define APIs in my work with Mozilla code ;)
<bz> (probably lots more to learn, though)

I, too, am still learning a lot about what makes a good API. Like a lot of other things, it’s easier to point out poor API design than to describe examples of good API design, and that’s what this blog post is about. In particular, the venerable XPCOM data structure PLDHashTable has been undergoing a number of changes lately, all aimed at bringing it up to date. (The question of why we have our own implementations of things that exist in the C++ standard library is for a separate blog post.)

The whole effort started with noticing that PL_DHashTableOperate is not a well-structured API. It’s necessary to quote some more of the API surface to fully understand what’s going on here:

typedef enum PLDHashOperator {
    PL_DHASH_LOOKUP = 0,        /* lookup entry */
    PL_DHASH_ADD = 1,           /* add entry */
    PL_DHASH_REMOVE = 2,        /* remove entry, or enumerator says remove */
    PL_DHASH_NEXT = 0,          /* enumerator says continue */
    PL_DHASH_STOP = 1           /* enumerator says stop */
} PLDHashOperator;

typedef PLDHashOperator
(* PLDHashEnumerator)(PLDHashTable *table, PLDHashEntryHdr *hdr, uint32_t number,
                      void *arg);

uint32_t
PL_DHashTableEnumerate(PLDHashTable *table, PLDHashEnumerator etor, void *arg);

PLDHashEntryHdr*
PL_DHashTableOperate(PLDHashTable* table, const void* key, PLDHashOperator op);

(PL_DHashTableOperate no longer exists in the tree due to other cleanup bugs; the above is approximately what it looked like at the end of 2014.)

There are several problems with the above slice of the API:

  • PL_DHashTableOperate(table, key, PL_DHASH_ADD) is a long way to spell what should have been named PL_DHashTableAdd(table, key)
  • There’s another problem with the above: it’s making a runtime decision (based on the value of op) about what should have been a compile-time decision: this particular call will always and forever be an add operation. We shouldn’t have the (admittedly small) runtime overhead of dispatching on op. It’s worth noting that compiling with LTO and a quality inliner will remove that runtime overhead, but we might as well structure the code so non-LTO compiles benefit and the code at callsites reads better.
  • Given the above definitions, you can say PL_DHashTableOperate(table, key, PL_DHASH_STOP) and nothing will complain. The PL_DHASH_NEXT and PL_DHASH_STOP values are really only for a function of type PLDHashEnumerator to return, but nothing about the above definition enforces that in any way. Similarly, you can return PL_DHASH_LOOKUP from a PLDHashEnumerator function, which is nonsensical.
  • The requirement to always return a PLDHashEntryHdr* from PL_DHashTableOperate means doing a PL_DHASH_REMOVE has to return something; it happens to return nullptr always, but it really should return void. In a similar fashion, PL_DHASH_LOOKUP always returns a non-nullptr pointer (!); one has to check PL_DHASH_ENTRY_IS_{FREE,BUSY} on the returned value. The typical style for an API like this would be to return nullptr if an entry for the given key didn’t exist, and a non-nullptr pointer if such an entry did. The free-ness or busy-ness of a given entry should be a property entirely internal to the hashtable implementation (it’s possible that some scenarios could be slightly more efficient with direct access to the busy-ness of an entry).

We might infer corresponding properties of a good API from each of the above issues (a sketch applying them follows the list):

  • Entry points for the API produce readable code.
  • The API doesn’t enforce unnecessary overhead.
  • The API makes it impossible to talk about nonsensical things.
  • It is always reasonably clear what return values from API functions describe.
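
Concretely, here is a sketch of what this slice of the API might look like with those properties applied (hypothetical signatures, not what actually landed in the tree):

// Each operation is its own entry point, so callsites read naturally and
// there is no runtime dispatch on an |op| argument.
PLDHashEntryHdr* PL_DHashTableAdd(PLDHashTable* aTable, const void* aKey);
void PL_DHashTableRemove(PLDHashTable* aTable, const void* aKey);

// Returns nullptr if no entry for |aKey| exists and a valid entry otherwise;
// no PL_DHASH_ENTRY_IS_{FREE,BUSY} checking required.
PLDHashEntryHdr* PL_DHashTableLookup(PLDHashTable* aTable, const void* aKey);

The enumerator-only values (PL_DHASH_NEXT, PL_DHASH_STOP) would then live in a type reserved for PLDHashEnumerator’s return value, so nonsensical combinations stop being expressible.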

Fixing the first two bulleted issues, above, was the subject of bug 1118024, done by Michael Pruett. Once that was done, we really didn’t need PL_DHashTableOperate, and removing PL_DHashTableOperate and related code was done in bug 1121202 and bug 1124920 by Michael Pruett and Nicholas Nethercote, respectively. Fixing the unusual return convention of PL_DHashTableLookup is being done in bug 1124973 by Nicholas Nethercote. Maybe once all this gets done, we can move away from C-style PL_DHashTable* functions to C++ methods on PLDHashTable itself!

Next time we’ll talk about the actual contents of a PL_DHashTable and how improvements have been made there, too.


03 Nov 14

table-driven register reading in rr

Many of the changes done for porting rr to x86-64 were straightforward, mechanical changes: change this type or format directive, add a template parameter to this function, search-and-replace all SYS_* occurrences and so forth. But the way register reading for rr’s GDB stub fell out was actually a marked improvement on what went before and is worth describing in a little more detail.

At first, if rr’s GDB stub wanted to know register values, it pulled them directly out of the Registers class:

enum DbgRegisterName {
  DREG_EAX,
  DREG_ECX,
  // ...and so on.
};

// A description of a register for the GDB stub.
struct DbgRegister {
  DbgRegisterName name;
  long value;
  bool defined;
};

static DbgRegister get_reg(const Registers* regs, DbgRegisterName which) {
  DbgRegister reg;
  reg.name = which;
  reg.defined = true;
  switch (which) {
  case DREG_EAX:
    reg.value = regs->eax;
    return reg;
  case DREG_ECX:
    reg.value = regs->ecx;
    return reg;
  // ...and so on.
  default:
    reg.defined = false;
    return reg;
  }
}

That wasn’t going to work well if Registers supported several different architectures at once, so we needed to push that reading into the Registers class itself. Not all registers on a target fit into a long, either, so we needed to indicate how many bytes wide each register was. Finally, we also needed to indicate whether the target knew about the register or not: GDB is perfectly capable of asking for SSE (or even AVX) registers, but the chip running the debuggee might not support those registers, for instance. So we wound up with this:

struct DbgRegister {
  unsigned int name;
  uint8_t value[DBG_MAX_REG_SIZE];
  size_t size;
  bool defined;
};

static DbgRegister get_reg(const Registers* regs, unsigned int which) {
  DbgRegister reg;
  memset(&reg, 0, sizeof(reg));
  reg.size = regs->read_register(&reg.value[0], which, &reg.defined);
  return reg;
}
...
template<typename T>
static size_t copy_register_value(uint8_t* buf, T src) {
  memcpy(buf, &src, sizeof(src));
  return sizeof(src);
}

// GDBRegister is an enum defined in Registers.
size_t Registers::read_register(uint8_t* buf, GDBRegister regno, bool* defined) const {
  *defined = true;
  switch (regno) {
  case DREG_EAX:
    return copy_register_value(buf, eax);
  case DREG_ECX:
    return copy_register_value(buf, ecx);
  // ...many more integer and segment registers...
  case DREG_XMM0:
    // XMM registers aren't supplied by Registers, but they are supplied someplace else.
    *defined = false;
    return 16;
  // ...and so on
  }
}

The above changes were serviceable, if a little repetitive. When it came time to add x86-64 support to Registers, however, writing out that same basic switch statement for x86-64 registers sounded unappealing.

And there were other things that would require the same duplicated code treatment, too: when replaying traces, rr compares the replayed registers to the recorded registers at syscalls and halts the replay if the registers don’t match (rr calls this “divergence” and it indicates some kind of bug in rr). The x86-only comparison routine looked like:

// |name1| and |name2| are textual descriptions of |reg1| and |reg2|, respectively,
// for error messages.
bool Registers::compare_register_files(const char* name1, const Registers* reg1,
                                       const char* name2, const Registers* reg2,
                                       int mismatch_behavior) {
  bool match = true;

#define REGCMP(_reg)                                      \
  do {                                                    \
    if (reg1->_reg != reg2->_reg) {                       \
      maybe_print_reg_mismatch(mismatch_behavior, #_reg,  \
                               name1, reg1->_reg,         \
                               name2, reg2->_reg);        \
      match = false;                                      \
    }                                                     \
  } while (0)

  REGCMP(eax);
  REGCMP(ebx);
  // ...and so forth.

  return match;
}

Duplicating the comparison logic for twice as many registers on x86-64 didn’t sound exciting, either.

I didn’t have to solve both problems at once, so I focused solely on getting the registers for GDB. A switch statement with almost-identical cases indicates that you’re doing too much in code, when you should be doing things with data instead. In this case, what we really want is this: given a GDB register number, we want to know whether we know about this register, and if so, how many bytes to copy into the outparam buffer and where to copy those bytes from. A lookup table is really the data structure we should be using here:

// A data structure describing a register.
struct RegisterValue {
  // The offset of the register in user_regs_struct.
  size_t offset;

  // The size of the register.  0 means we cannot read it.
  size_t nbytes;

  // Returns a pointer to the register in |regs| represented by |offset|.
  // |regs| is assumed to be a pointer to |struct user_regs_struct| of the appropriate
  // architecture.
  void* pointer_into(void* regs) {
    return static_cast<char*>(regs) + offset;
  }

  const void* pointer_into(const void* regs) const {
    return static_cast<const char*>(regs) + offset;
  }
};

template<typename Arch>
struct RegisterInfo;

template<>
struct RegisterInfo<rr::X86Arch> {
  static struct RegisterValue registers[DREG_NUM_LINUX_I386];
};
struct RegisterValue RegisterInfo<rr::X86Arch>::registers[DREG_NUM_LINUX_I386];

static void initialize_register_tables() {
  static bool initialized = false;

  if (initialized) {
    return;
  }

  initialized = true;
#define RV_X86(gdb_suffix, name) \
  RegisterInfo<rr::X86Arch>::registers[DREG_##gdb_suffix] = { \
    offsetof(rr::X86Arch::user_regs_struct, name), \
    sizeof(((rr::X86Arch::user_regs_struct*)0)->name) \
  }
  RV_X86(EAX, eax);
  RV_X86(ECX, ecx);
  // ...and so forth.  Registers not explicitly initialized here, e.g. DREG_XMM0, will
  // be zero-initialized and therefore have a size of 0.
}

template<typename Arch>
size_t Registers::read_register_arch(uint8_t* buf, GDBRegister regno, bool* defined) const {
  initialize_register_tables();
  RegisterValue& rv = RegisterInfo<Arch>::registers[regno];
  if (rv.nbytes == 0) {
    *defined = false;
  } else {
    *defined = true;
    memcpy(buf, rv.pointer_into(ptrace_registers()), rv.nbytes);
  }

  return rv.nbytes;
}

size_t Registers::read_register(uint8_t* buf, GDBRegister regno, bool* defined) const {
  // RR_ARCH_FUNCTION performs the return for us.
  RR_ARCH_FUNCTION(read_register_arch, arch(), buf, regno, defined);
}

Much better, and templating enables us to (re)use the RR_ARCH_FUNCTION macro we use elsewhere in the rr codebase.

Having to initialize the tables before using them is annoying, though. And after debugging a failure or two because I didn’t ensure the tables were initialized prior to using them, I wanted something more foolproof. Carefully writing out the array definition with zero-initialized members in the appropriate places sounded long-winded, difficult to read, and prone to failures that were tedious to debug. Ideally, I’d just specify the array indices to initialize along with the corresponding values, and let zero-initialization take care of the rest.

Initially, I wanted to use designated initializers, which are a great C99 feature that provided exactly what I was looking for, readable initialization of specific array indices:

#define RV_X86(gdb_suffix, name) \
  [DREG_##gdb_suffix] = { \
    offsetof(rr::X86Arch::user_regs_struct, name), \
    sizeof(((rr::X86Arch::user_regs_struct*)0)->name) \
  }

struct RegisterValue RegisterInfo<rr::X86Arch>::registers[DREG_NUM_LINUX_I386] = {
  RV_X86(EAX, eax),
  RV_X86(ECX, ecx),
};

I assumed that C++11 (which is the dialect of C++ used in rr) would support something like this, and indeed, my first attempts worked just fine. Except when I added more initializers:

struct RegisterValue RegisterInfo<rr::X86Arch>::registers[DREG_NUM_LINUX_I386] = {
  RV_X86(EAX, eax),
  RV_X86(ECX, ecx),
  // ...other integer registers, eflags, segment registers...
  RV_X86(ORIG_EAX, orig_eax),
};

I got this curious error message from g++: “Sorry, designated initializers not supported.”

But that’s a lie! You clearly do support designated initializers, because I’ve been able to compile code with them!

Well, yes, g++ supports designated initializers for arrays. Except that the indices have to be in consecutive order from the beginning of the array. And all the integer registers, EIP, and the segment registers appeared in consecutive order in the enumeration, starting with DREG_EAX = 0. But DREG_ORIG_EAX came some 20 members after DREG_GS (the last segment register), and so there we had initialization of non-consecutive array indices, and g++ complained. In other words, designated initializers in g++ are useful for documentation and/or readability purposes, but they’re not useful for sparse arrays. Hence the explicit initialization approach described above.

Turns out there’s a nicer, C++11-ier way to handle this kind of sparse array, at the cost of some extra infrastructure. Instead of a raw array variable, you can have a class with an operator[] overload and a constructor that takes std::initializer_list<std::pair<size_t, YourActualClass>>. The first member of the pair is the index into your sparse array, and the second member is the value you want placed there. The constructor takes care of putting everything in its place; the actual code looks like this:

typedef std::pair<size_t, RegisterValue> RegisterInit;

// Templated over N because different architectures have differently-sized register files.
template<size_t N>
struct RegisterTable : std::array<RegisterValue, N> {
  RegisterTable(std::initializer_list<RegisterInit> list)
    // Value-initialize the base array so slots not named in |list| are
    // zeroed even for non-static instances.
    : std::array<RegisterValue, N>()
  {
    for (auto& ri : list) {
      (*this)[ri.first] = ri.second;
    }
  }
};

template<>
struct RegisterInfo<rr::X86Arch> {
  static const size_t num_registers = DREG_NUM_LINUX_I386;
  typedef RegisterTable<num_registers> Table;
  static Table registers;
};

RegisterInfo<rr::X86Arch>::Table RegisterInfo<rr::X86Arch>::registers = {
  // RV_X86 macro defined similarly to the previous version.
  RV_X86(EAX, eax),
  RV_X86(ECX, ecx),
  // ...many more register definitions
};

// x86-64 versions defined similarly...

There might be a better way to do this that’s even more C++11-ier. Comments welcome!

Once this table-driven scheme was in place, using it for comparing registers was straightforward: each RegisterValue structure adds a name field, for error messages, and a comparison_mask field, for applying to register values prior to comparison. Registers that we don’t care about comparing can have a comparison mask of zero, just like registers that we don’t want to expose to GDB can have a size of zero. A mask is more flexible than a boolean compare/don’t-compare field, because eflags in particular is not always deterministic between record and replay, so we mask off particular bits of eflags prior to comparison. The core comparison routine then looks like:

template<typename Arch>
static bool compare_registers_core(const char* name1, const Registers* reg1,
                                   const char* name2, const Registers* reg2,
                                   int mismatch_behavior) {
  bool match = true;

  for (auto& rv : RegisterInfo<Arch>::registers) {
    if (rv.nbytes == 0) {
      continue;
    }

    // Disregard registers that will trivially compare equal.
    if (rv.comparison_mask == 0) {
      continue;
    }

    // XXX correct but oddly displayed for big-endian processors.
    uint64_t val1 = 0, val2 = 0;
    memcpy(&val1, rv.pointer_into(reg1->ptrace_registers()), rv.nbytes);
    memcpy(&val2, rv.pointer_into(reg2->ptrace_registers()), rv.nbytes);
    val1 &= rv.comparison_mask;
    val2 &= rv.comparison_mask;

    if (val1 != val2) {
      maybe_print_reg_mismatch(mismatch_behavior, rv.name, name1, val1, name2,
                               val2);
      match = false;
    }
  }

  return match;
}

which I thought was a definite improvement on the status quo. Additional, architecture-specific comparisons can be layered on top of this (see, for instance, the x86-specialized versions in rr’s Registers.cc).

Eliminating or reducing code by writing data structures instead is one of those coding tasks that gives me a warm, fuzzy feeling after I’ve done it. In addition to the above, I was also able to use the RegisterValue structures to provide architecture-independent dumping of register files, which was another big win in making Registers multi-arch aware. I was quite pleased that everything worked out so well.


08 Sep 14

xpcom and move constructors

Benjamin Smedberg recently announced that he was handing over XPCOM module ownership duties to me.  XPCOM contains basic data structures used throughout the Mozilla codebase, so changes to its code can have wide-ranging effects.  I’m honored to have been given responsibility for a core piece of the Gecko platform.

One issue that’s come up recently and I’m sure will continue to come up is changing XPCOM data structures to support two new C++11 features, rvalue references and their killer app, move constructors.  If you aren’t familiar with C++11’s new rvalue references feature, I highly recommend C++ Rvalue References Explained.  Move constructors are already being put to good use elsewhere in the codebase, notably mozilla::UniquePtr, which can be used to replace XPCOM’s nsAutoPtr and nsAutoRef (bug 1055035).  And some XPCOM data structures have received the move constructor treatment, notably nsRefPtr (bug 980753) and nsTArray (bug 982212).

A recent discussion and the associated bug, however, decided that the core reference-counted smart pointer class in XPCOM, nsCOMPtr, shouldn’t support move constructors.  While move constructors could have replaced the already_AddRefed usage associated with nsCOMPtr, such as:

already_AddRefed<nsIMyInterface>
NS_DoSomething(...)
{
  nsCOMPtr<nsIMyInterface> interface = ...;
  // do some initialization stuff
  return interface.forget();
}

with the slightly shorter:

nsCOMPtr<nsIMyInterface>
NS_DoSomething(...)
{
  nsCOMPtr<nsIMyInterface> interface = ...;
  // do some initialization stuff
  return interface;
}

There were two primary arguments against move constructor support.  The first argument was that the explicitness of having to call .forget() on an nsCOMPtr (along with the explicitness of the already_AddRefed type), rather than returning it, is valuable for the code author, the patch reviewer, and subsequent readers of the code.  When dealing with ownership issues in C++, it pays to be more explicit, rather than less.  The second argument was that due to the implicit conversion of nsCOMPtr<T> to a bare T* pointer (a common pattern in smart pointer classes), returning nsCOMPtr<T> from functions makes it potentially easy to write buggy code:

// What really happens in the below piece of code is something like:
//
// nsIMyInterface* p;
// {
//   nsCOMPtr<nsIMyInterface> tmp(NS_DoSomething(...));
//   p = tmp.get();
// }
//
// which is bad if NS_DoSomething is returning the only ref to the object.
// p now points to deleted memory, which is a security risk.
nsIMyInterface* p = NS_DoSomething(...);

(I should note that we can return nsCOMPtr<T> from functions today, and in most cases, thanks to compiler optimizations, it will be as efficient as returning already_AddRefed.  But Gecko culture is such that a function returning nsCOMPtr<T> would be quite unusual, and therefore unlikely to pass code review.)

The changes to add move constructors to nsRefPtr and nsTArray?  They were reviewed by me.  And the nixing of move constructors for nsCOMPtr?  That was also done by me (with a lot of feedback from other people).

I accept the charge of inconsistency.  However, I offer the following defense.  In the case of nsTArray, there are no ownership issues like there are with nsCOMPtr: you either own the array, or you don’t, so many of the issues raised about nsCOMPtr don’t apply in that case.

For the case of nsRefPtr, it is true that I didn’t seek out as much input from other people before approving the patch.  But the nsRefPtr patch was also done without the explicit goal of removing already_AddRefed from the code base, which made it both smaller in scope and more palatable.  Also, my hunch is that nsRefPtr is used somewhat less than nsCOMPtr (although this may be changing somewhat given other improvements in the codebase, like WebIDL), and so it provides an interesting testbed for whether move constructors and/or less explicit transfers of ownership are as much of a problem as argued above.