Skip to content

multiple return values in C++


I’d like to think that I know a fair amount about C++, but I keep discovering new things on a weekly or daily basis.  One of my recent sources of new information is the presentations from CppCon 2014.  And the most recent presentation I’ve looked at is Herb Sutter’s Back to the Basics: Essentials of Modern C++ Style.

In the presentation, Herb mentions a feature of tuple that enables returning multiple values from a function.  Of course, one can already return a pair<T1, T2> of values, but accessing the fields of a pair is suboptimal and not very readable:

pair<...> p = f(...);
if (p.second) {
  // do something with p.first

The designers of tuple must have listened, because of the function std::tie, which lets you destructure a tuple:

typename container::iterator position;
bool already_existed;
std::tie(position, already_existed) = mMap.insert(...);

It’s not quite as convenient as destructuring multiple values in other languages, since you need to declare the variables prior to std::tie‘ing them, but at least you can assign them sensible names. And since pair implicitly converts to tuple, you can use tie with functions in the standard library that return pairs, like the insertion functions of associative containers.

Sadly, we’re somewhat limited in our ability to use shiny new concepts from the standard library because of our C++ standard library situation on Android (we use stlport there, and it doesn’t feature useful things like <tuple>, <function>, or <thread_local>. We could, of course, polyfill some of these (and other) headers, and indeed we’ve done some of that in MFBT already. But using our own implementations limits our ability to share code with other projects, and it also takes up time to produce the polyfills and make them appropriately high quality. I’ve seen several people complain about this, and I think it’s something I’d like to fix in the next several months.

examples of poor API design, 1/N – pldhash functions


The other day in the #content IRC channel:

<bz> I have learned so many things about how to not define APIs in my work with Mozilla code ;)
<bz> (probably lots more to learn, though)

I, too, am still learning a lot about what makes a good API. Like a lot of other things, it’s easier to point out poor API design than to describe examples of good API design, and that’s what this blog post is about. In particular, the venerable XPCOM datastructure PLDHashTable has been undergoing a number of changes lately, all aimed at bringing it up to date. (The question of why we have our own implementations of things that exist in the C++ standard library is for a separate blog post.)

The whole effort started with noticing that PL_DHashTableOperate is not a well-structured API. It’s necessary to quote some more of the API surface to fully understand what’s going on here:

typedef enum PLDHashOperator {
    PL_DHASH_LOOKUP = 0,        /* lookup entry */
    PL_DHASH_ADD = 1,           /* add entry */
    PL_DHASH_REMOVE = 2,        /* remove entry, or enumerator says remove */
    PL_DHASH_NEXT = 0,          /* enumerator says continue */
    PL_DHASH_STOP = 1           /* enumerator says stop */
} PLDHashOperator;

typedef PLDHashOperator
(* PLDHashEnumerator)(PLDHashTable *table, PLDHashEntryHdr *hdr, uint32_t number,
                      void *arg);

PL_DHashTableEnumerate(PLDHashTable *table, PLDHashEnumerator etor, void *arg);

PL_DHashTableOperate(PLDHashTable* table, const void* key, PLDHashOperator op);

(PL_DHashTableOperate no longer exists in the tree due to other cleanup bugs; the above is approximately what it looked like at the end of 2014.)

There are several problems with the above slice of the API:

  • PL_DHashTableOperate(table, key, PL_DHASH_ADD) is a long way to spell what should have been named PL_DHashTableAdd(table, key)
  • There’s another problem with the above: it’s making a runtime decision (based on the value of op) about what should have been a compile-time decision: this particular call will always and forever be an add operation. We shouldn’t have the (admittedly small) runtime overhead of dispatching on op. It’s worth noting that compiling with LTO and a quality inliner will remove that runtime overhead, but we might as well structure the code so non-LTO compiles benefit and the code at callsites reads better.
  • Given the above definitions, you can say PL_DHashTableOperate(table, key, PL_DHASH_STOP) and nothing will complain. The PL_DHASH_NEXT and PL_DHASH_STOP values are really only for a function of type PLDHashEnumerator to return, but nothing about the above definition enforces that in any way. Similarly, you can return PL_DHASH_LOOKUP from a PLDHashEnumerator function, which is nonsensical.
  • The requirement to always return a PLDHashEntryHdr* from PL_DHashTableOperate means doing a PL_DHASH_REMOVE has to return something; it happens to return nullptr always, but it really should return void. In a similar fashion, PL_DHASH_LOOKUP always returns a non-nullptr pointer (!); one has to check PL_DHASH_ENTRY_IS_{FREE,BUSY} on the returned value. The typical style for an API like this would be to return nullptr if an entry for the given key didn’t exist, and a non-nullptr pointer if such an entry did. The free-ness or busy-ness of a given entry should be a property entirely internal to the hashtable implementation (it’s possible that some scenarios could be slightly more efficient with direct access to the busy-ness of an entry).

We might infer corresponding properties of a good API from each of the above issues:

  • Entry points for the API produce readable code.
  • The API doesn’t enforce unnecessary overhead.
  • The API makes it impossible to talk about nonsensical things.
  • It is always reasonably clear what return values from API functions describe.

Fixing the first two bulleted issues, above, was the subject of bug 1118024, done by Michael Pruett. Once that was done, we really didn’t need PL_DHashTableOperate, and removing PL_DHashTableOperate and related code was done in bug 1121202 and bug 1124920 by Michael Pruett and Nicholas Nethercote, respectively. Fixing the unusual return convention of PL_DHashTableLookup is being done in bug 1124973 by Nicholas Nethercote. Maybe once all this gets done, we can move away from C-style PL_DHashTable* functions to C++ methods on PLDHashTable itself!

Next time we’ll talk about the actual contents of a PL_DHashTable and how improvements have been made there, too.

profiling for wakeups on osx


One of my Q1 goals is to investigate what can be done for cross-platform power profiling, in service of the Firefox Platform’s Project Candle.  Roberto Vitillo already did a bit of exploration in this space last year.  Ideally, what would come out of my research would be some sort of event counter that we could hook into the Gecko Profiler.  Failing that, it would at least be good to document what is possible for power profiling on each of our supported platforms.

The only power profiling tool I was at all familiar with was powertop. I thought I’d start by figuring out if there was any rough equivalent on OS X.  Turns out there is; it’s called powermetrics.  You can simply run:

$ sudo powermetrics

at a Terminal prompt and you will periodically (every five seconds by default) see a large amount of information concerning power usage printed.  The default data sources that powermetrics draws from includes things like detailed C-state usage for all available processors.  For my purposes, I’m just interested in what applications are waking up most frequently.  powermetrics supports limiting its data sources with the –samplers switch:

$ sudo powermetrics --samplers tasks

will just display interesting information on all currently running tasks in the system.  You might think of it as a more detailed version of Activity Monitor’s Energy tab, and indeed, you can convince powermetrics to output the “Energy Impact” number found in Activity Monitor:

$ sudo powermetrics --samplers tasks --show-process-energy

The numbers from powermetrics don’t match up exactly with the numbers from Activity Monitor; I’m guessing that the impact of kernel_task (seen in powermetrics) is somehow spread among applications calling into the kernel by Activity Monitor. Or Activity Monitor is aggregating energy usage over a longer time period than every five seconds.

Knowing that your application is waking up rather often is helpful (Firefox is waking up ~once a second while I’m writing this), but even more helpful is where the wakeups are coming from.  The easiest way I know of for finding that information is to use dtrace:

$ sudo dtrace -n 'mach_kernel::wakeup { @[ustack()] = count(); }'

It’s worth breaking down what’s going on in the above.

  1. The mach_kernel::wakeup bit selects a probe point in dtrace parlance.  mach_kernel is the “module name” and wakeup is the “probe name”.  You can see a complete list of probes by running sudo dtrace -l.
  2. The bits between the braces are the actions to run when the probe point is hit.  In our example, we’re counting unique stack traces when wakeups occur; ustack is short for “user stack”, i.e. the stack of the userspace program executing.

(dtrace is a pretty snazzy tool; it’s worth checking out Brendan Gregg’s DTrace Tools or FreeBSD’s DTrace tutorial if you’re at all interested by the above script.)

Running the above on my system for several seconds gives stack traces like the following:

              XUL`non-virtual thunk to nsBaseAppShell::OnDispatchedEvent(nsIThreadInternal*)+0x26
              XUL`nsThread::PutEvent(nsIRunnable*, nsThread::nsNestedEventTarget*)+0x95
              XUL`nsThread::ProcessNextEvent(bool, bool*)+0x3d8
              XUL`NS_ProcessNextEvent(nsIThread*, bool)+0x35

              XUL`nsTimerImpl::InitCommon(unsigned int, unsigned int)+0x1f7
              XUL`nsThread::ProcessNextEvent(bool, bool*)+0x3d8
              XUL`NS_ProcessPendingEvents(nsIThread*, unsigned int)+0x51

              Activity Monitor`0x000000010e918bd1+0xa76
              Activity Monitor`0x000000010e9038c8+0x141
              Activity Monitor`0x000000010e9032df+0x231
              Activity Monitor`0x000000010e90a156+0x192
              Activity Monitor`0x000000010e90a09b+0x91

              XUL`non-virtual thunk to nsBaseAppShell::OnDispatchedEvent(nsIThreadInternal*)+0x26
              XUL`nsThread::PutEvent(nsIRunnable*, nsThread::nsNestedEventTarget*)+0x95
              XUL`nsThread::ProcessNextEvent(bool, bool*)+0x3d8
              XUL`NS_ProcessNextEvent(nsIThread*, bool)+0x35

              XUL`nsThread::ProcessNextEvent(bool, bool*)+0x3ce
              XUL`NS_ProcessPendingEvents(nsIThread*, unsigned int)+0x51
              AppKit`-[NSApplication nextEventMatchingMask:untilDate:inMode:dequeue:]+0xc2
              XUL`-[GeckoNSApplication nextEventMatchingMask:untilDate:inMode:dequeue:]+0x56
              AppKit`-[NSApplication run]+0x252


You can see that timers going off in Firefox are a major source of wakeups (those are the stacks with TimerThread::Run() in them).

Even with stacks causing wakeups, it’s not always obvious what sequence of events leads to those wakeups; this is especially true in the e10s era, where a particular stack ending in an IPC message send might only give you half of the picture.  Figuring that puzzle out will help solve problems like frequently waking up when animating images with CSS.

what’s new in xpcom


I was talking to somebody at Mozilla’s recent all-hands meeting in Portland, and in the course of attempting to provide a reasonable answer for “What have you been doing lately?”, I said that I had been doing a lot of reviews, mostly because of my newfound duties as XPCOM module owner. My conversational partner responded with some surprise that people were still modifying code in XPCOM at such a clip that reviews would be a burden. I responded that while the code was not rapidly changing, people were still finding reasons to do significant modifications to XPCOM code, and I mentioned a few recent examples.

But in light of that conversation, it’s good to broadcast some of the work I’ve had the privilege of reviewing this year.  I apologize in advance for not citing everybody; in particular, my reviews email folder only goes back to August, so I have a very incomplete record of what I’ve reviewed in the past year.  In no particular order:

table-driven register reading in rr


Many of the changes done for porting rr to x86-64 were straightforward, mechanical changes: change this type or format directive, add a template parameter to this function, search-and-replace all SYS_* occurrences and so forth. But the way register reading for rr’s GDB stub fell out was actually a marked improvement on what went before and is worth describing in a little more detail.

At first, if rr’s GDB stub wanted to know register values, it pulled them directly out of the Registers class:

enum DbgRegisterName {
  // ...and so on.

// A description of a register for the GDB stub.
struct DbgRegister {
  DbgRegisterName name;
  long value;
  bool defined;

static DbgRegister get_reg(const Registers* regs, DbgRegisterName which) {
  DbgRegister reg; = which;
  reg.defined = true;
  switch (which) {
  case DREG_EAX:
    reg.value = regs->eax;
    return reg;
  case DREG_ECX:
    reg.value = regs->ecx;
    return reg;
  // ...and so on.
    reg.defined = false;
    return reg;

That wasn’t going to work well if Registers supported several different architectures at once, so we needed to push that reading into the Registers class itself. Not all registers on a target fit into a long, either, so we needed to indicate how many bytes wide each register was. Finally, we also needed to indicate whether the target knew about the register or not: GDB is perfectly capable of asking for SSE (or even AVX) registers, but the chip running the debuggee might not support those registers, for instance. So we wound up with this:

struct DbgRegister {
  unsigned int name;
  uint8_t value[DBG_MAX_REG_SIZE];
  size_t size;
  bool defined;

static DbgRegster get_reg(const Registers* regs, unsigned int which) {
  DbgRegister reg;
  memset(&reg, 0, sizeof(reg));
  reg.size = regs->read_register(&reg.value[0], which, &reg.defined);
  return reg;
template<typename T>
static size_t copy_register_value(uint8_t* buf, T src) {
  memcpy(buf, &src, sizeof(src));
  return sizeof(src);

// GDBRegister is an enum defined in Registers.
size_t Registers::read_register(uint8_t* buf, GDBRegister regno, bool* defined) const {
  *defined = true;
  switch (regno) {
  case DREG_EAX:
    return copy_register_value(buf, eax);
  case DREG_ECX:
    return copy_register_value(buf, ecx);
  // ...many more integer and segment registers...
  case DREG_XMM0:
    // XMM registers aren't supplied by Registers, but they are supplied someplace else.
    *defined = false;
    return 16;
  // ...and so on

The above changes were serviceable, if a little repetitive. When it came time to add x86-64 support to Registers, however, writing out that same basic switch statement for x86-64 registers sounded unappealing.

And there were other things that would require the same duplicated code treatment, too: when replaying traces, rr compares the replayed registers to the recorded registers at syscalls and halts the replay if the registers don’t match (rr calls this “divergence” and it indicates some kind of bug in rr). The x86-only comparison routine looked like:

// |name1| and |name2| are textual descriptions of |reg1| and |reg2|, respectively,
// for error messages.
bool Registers::compare_register_files(const char* name1, const Registers* reg1,
                                       const char* name2, const Registers* reg2,
                                       int mismatch_behavior) {
  bool match = true;

#define REGCMP(_reg)                                     \
  do {                                                   \
    if (reg1-> _reg != reg2-> _reg) {              \
      maybe_print_reg_mismatch(mismatch_behavior, #_reg, \
                               name1, reg1-> _reg,	   \
                               name2, reg2-> _reg);	   \
      match = false;					   \
    }							   \
  } while (0)

  // ...and so forth.

  return match;

Duplicating the comparison logic for twice as many registers on x86-64 didn’t sound exciting, either.

I didn’t have to solve both problems at once, so I focused solely on getting the registers for GDB. A switch statement with almost-identical cases indicates that you’re doing too much in code, when you should be doing things with data instead. In this case, what we really want is this: given a GDB register number, we want to know whether we know about this register, and if so, how many bytes to copy into the outparam buffer and where to copy those bytes from. A lookup table is really the data structure we should be using here:

// A data structure describing a register.
struct RegisterValue {
  // The offsetof the register in user_regs_struct.
  size_t offset;

  // The size of the register.  0 means we cannot read it.
  size_t nbytes;

  // Returns a pointer to the register in |regs| represented by |offset|.
  // |regs| is assumed to be a pointer to |struct user_regs_struct| of the appropriate
  // architecture.
  void* pointer_into(void* regs) {
    return static_cast<char*>(regs) + offset;

  const void* pointer_into(const char* regs) const {
    return static_cast<const char*>(regs) + offset;

template<typename Arch>
struct RegisterInfo;

struct RegisterInfo<rr::X86Arch> {
  static struct RegisterValue registers[DREG_NUM_LINUX_I386];
struct RegisterValue RegisterInfo<rr::X86Arch>::registers[DREG_NUM_LINUX_I386];

static void initialize_register_tables() {
  static bool initialized = false;

  if (initialized) {

  initialized = true;
#define RV_X86(gdb_suffix, name) \
  RegisterInfo<rr::X86Arch>::registers[DREG_##gdb_suffix] = { \
    offsetof(rr::X86Arch::user_regs_struct, name), \
    sizeof(((rr::X86Arch::user_regs_struct*)0)->name) \
  RV_X86(EAX, eax);
  RV_X86(ECX, ecx);
  // ...and so forth.  Registers not explicitly initialized here, e.g. DREG_XMM0, will
  // be zero-initialized and therefore have a size of 0.

template<typename Arch>
size_t Registers::read_register_arch(uint8_t buf, GDBRegister regno, bool* defined) const {
  RegisterValue& rv = RegisterInfo::registers[regno];
  if (rv.nbytes == 0) {
    *defined = false;
  } else {
    *defined = true;
    memcpy(buf, rv.pointer_into(ptrace_registers()), rv.nbytes);

  return rv.nbytes;

size_t Registers::read_register(uint8_t buf, GDBRegister regno, bool* defined) const {
  // RR_ARCH_FUNCTION performs the return for us.
  RR_ARCH_FUNCTION(read_register_arch, arch(), buf, regno, defined);

Much better, and templating enables us to (re)use the RR_ARCH_FUNCTION macro we use elsewhere in the rr codebase.

Having to initialize the tables before using them is annoying, though. And after debugging a failure or two because I didn’t ensure the tables were initialized prior to using them, I wanted something more foolproof. Carefully writing out the array definition with zero-initialized members in the appropriate places sounded long-winded, difficult to read, and prone to failures that were tedious to debug. Ideally, I’d just specify the array indices to initialize along with the corresponding values, and let zero-initialization take care of the rest.

Initially, I wanted to use designated initializers, which are a great C99 feature that provided exactly what I was looking for, readable initialization of specific array indices:

#define RV_X86(gdb_suffix, name) \
  [DREG_##gdb_suffix] = { \
    offsetof(rr::X86Arch::user_regs_struct, name), \
    sizeof(((rr::X86Arch::user_regs_struct*)0)->name) \

struct RegisterValue RegisterInfo<rr::X86Arch>::registers[DREG_NUM_LINUX_I386] = {
  RV_X86(EAX, eax),
  RV_X86(ECX, ecx),

I assumed that C++11 (which is the dialect of C++ used in rr) would support something like this, and indeed, my first attempts worked just fine. Except when I added more initializers:

struct RegisterValue RegisterInfo<rr::X86Arch>::registers[DREG_NUM_LINUX_I386] = {
  RV_X86(EAX, eax),
  RV_X86(ECX, ecx),
  // ...other integer registers, eflags, segment registers...
  RV_X86(ORIG_EAX, orig_eax);

I got this curious error message from g++: “Sorry, designated initializers not supported.”

But that’s a lie! You clearly do support designated initializers, because I’ve been able to compile code with them!

Well, yes, g++ supports designated initializers for arrays. Except that the indices have to be in consecutive order from the beginning of the array. And all the integer registers, EIP, and the segment registers appeared in consecutive order in the enumeration, starting with DREG_EAX = 0. But DREG_ORIG_EAX came some 20 members after DREG_GS (the last segment register), and so there we had initialization of non-consecutive array indices, and g++ complained. In other words, designated initializers in g++ are useful for documentation and/or readability purposes, but they’re not useful for sparse arrays. Hence the explicit initialization approach described above.

Turns out there’s a nicer, C++11-ier way to handle these kind of sparse arrays, at the cost of some extra infrastructure. Instead of an explicit [] variable, you can have a class with an operator[] overload and a constructor that takes std::initializer_list<std::pair<size_t, YourActualClass>>. The first member of the pair is the index into your sparse array, and the second member is the value you want placed there. The constructor takes care of putting everything in its place; the actual code looks like this:

typedef std::pair<size_t, RegisterValue> RegisterInit;

// Templated over N because different architectures have differently-sized register files.
template<size_t N>
struct RegisterTable : std::array<RegisterValue, N> {
  RegisterTable(std::initializer_list<RegisterInit> list) {
    for (auto& ri : list) {
      (*this)[ri.first] = ri.second;

struct RegisterInfo<rr::X86Arch> {
  static const size_t num_registers = DREG_NUM_LINUX_I386;
  typedef RegisterTable<num_registers> Table;
  static Table registers;

RegisterInfo<rr::X86Arch>::Table RegisterInfo<rr::X86Arch>::registers = {
  // RV_X86 macro defined similarly to the previous version.
  RV_X86(EAX, eax),
  RV_X86(ECX, ecx),
  // ...many more register definitions

// x86-64 versions defined similarly...

There might be a better way to do this that’s even more C++11-ier. Comments welcome!

Once this table-driven scheme was in place, using it for comparing registers was straightforward: each RegisterValue structure adds a name field, for error messages, and a comparison_mask field, for applying to register values prior to comparison. Registers that we don’t care about comparing can have a register mask of zero, just like registers that we don’t want to expose to GDB can have a “size” of zero. A mask is more flexible than a bool compare/don’t compare field, because eflags in particular is not always determinstic between record and replay, so we mask off particular bits of eflags prior to comparison. The core comparison routine then looks like:

template<typename Arch>
static bool compare_registers_core(const char* name1, const Registers* reg1,
                                   const char* name2, const Registers* reg2,
                                   int mismatch_behavior) {
  bool match = true;

  for (auto& rv : RegisterInfo<Arch>::registers) {
    if (rv.nbytes == 0) {

    // Disregard registers that will trivially compare equal.
    if (rv.comparison_mask == 0) {

    // XXX correct but oddly displayed for big-endian processors.
    uint64_t val1 = 0, val2 = 0;
    memcpy(&val1, rv.pointer_into(reg1.ptrace_registers()), rv.nbytes);
    memcpy(&val2, rv.pointer_into(reg2.ptrace_registers()), rv.nbytes);
    val1 &= rv.comparison_mask;
    val2 &= rv.comparison_mask;

    if (val1 != val2) {
      maybe_print_reg_mismatch(mismatch_behavior,, name1, val1, name2,
      match = false;

  return match;

which I thought was a definite improvement on the status quo. Additional, architecture-specific comparisons can be layered on top of this (see, for instance, the x86-specialized versions in rr’s

Eliminating or reducing code by writing data structures instead is one of those coding tasks that gives me a warm, fuzzy feeling after I’ve done it. In addition to the above, I was also able to use the RegisterValue structures to provide architecture-independent dumping of register files, which was another big win in making Registers multi-arch aware. I was quite pleased that everything worked out so well.

porting rr to x86-64


(TL;DR: rr from git can record and replay 64-bit programs.  Try it for yourself!)

Over the last several months, I’ve been devoting an ever-increasing amount of my time to making rr able to trace x86-64 programs.  I’ve learned a lot along the way and thought I’d lay out all the major pieces of work that needed to be done to make this happen.

Before explaining the major pieces, it will be helpful to define some terms: the host architecture is the architecture that the rr binary itself is compiled for.  The target architecture is the architecture of the binary that rr is tracing.  These are often equivalent, but not necessarily so: you could be tracing a 64-bit binary with a 64-bit rr (host == target), but then the program starts to run a 32-bit subprocess, which rr also begins to trace (host != target).  And you have to handle both cases in a single rr session, with a single rr binary.  (64-bit rr doesn’t handle the host != target case quite yet, but all the infrastructure is there.)

All of the pieces described below are not new ideas: the major programs you use for development (compiler, linker, debugger, etc.) all have done some variation of what I describe below.  However, it’s not every day that one takes a program written without any awareness of host/target distinctions and endows it with the necessary awareness.

Quite often, a program written exclusively for 32-bit hosts has issues when trying to compile for 64-bit hosts, and rr was no exception in this regard.  Making the code 64-bit clean by fixing all the places that triggered compiler warnings on x86-64, but not on i386, was probably the easiest part of the whole porting effort.  Format strings were a big part of this: writing %llx when you wanted to print a uint64_t, for instance, which assumes that uint64_t is implemented as unsigned long long (not necessarily true on 64-bit hosts).  There were several places where long was used instead of uint32_t.  And there were even places that triggered signed/unsigned comparison warnings on 64-bit platforms only.  (Exercise for the reader: construct code where this happens before looking at the solution.)

Once all the host issues are dealt with, removing all the places where rr assumed semantics or conventions of the x86 architecture was the next step.  In short, all of the code assumed host == target: we were compiled on x86, so that must be the architecture of the program we’re debugging.  How many places actually assumed this, though?  Consider what the very simplified pseudo-code of the rr main recording loop looks like:

while (true) {
  wait for the tracee to make a syscall
  grab the registers at the point of the syscall
  extract the syscall number from the registers (1)
  switch (syscall_number) {
    case SYS_read: (2)
      extract pointer to the data read from the registers (3)
      record contents of data
    case SYS_clock_gettime:
      extract pointers for argument structures from the registers
      record contents of those argument structures (4)
    case SYS_mmap: (5)
    case SYS_mmap2: (6)
    case SYS_clone: (7)
      complain about an unhandled syscall
  let the tracee resume

Every line marked with a number at the end indicates a different instance where host and target differences come into play and/or the code might have assumed x86 semantics.  (And the numbering above is not exhaustive!)  Taking them in order:

  1. You can obtain the registers of your target with a single ptrace call, but the layout of those registers depends on your target.  ptrace returns the registers as a struct user_regs, which differs between targets; the syscall number location obviously differs between different layouts of struct user_regs.
  2. The constant SYS_read refers to the syscall number for read on the host.  If you want to identify the syscall number for the target, you’ll need to do something different.
  3. This instance is a continuation of #1: syscall arguments are passed in different registers for each target, and the locations of those registers differ in size and location between different layouts of struct user_regs.
  4. SYS_clock_gettime takes a pointer to a struct timespec.  How much data should we read from that pointer for recording purposes?  We can’t just use sizeof(struct timespec), since that’s the size for the host, not the target.
  5. Like SYS_read, SYS_mmap refers to the syscall number for mmap on the host, so we need to do something similar to SYS_read here.  But just because two different architectures have a SYS_mmap, it doesn’t mean that the calling conventions for those syscalls at the kernel level are identical.  (This distinction applies to several other syscalls as well.)  SYS_mmap on x86 takes a single pointer argument, pointing to a structure that contains the syscall’s arguments.  The x86-64 version takes its arguments in registers.  We have to extract arguments appropriately for each calling convention.
  6. SYS_mmap2 only exists on x86; x86-64 has no such syscall.  So we have to handle host-only syscalls or target-only syscalls in addition to things like SYS_read.
  7. SYS_clone has four (!) different argument orderings at the kernel level, depending on the architecture, and x86 and x86-64 of course use different argument orderings.  You must take these target differences into account when extracting arguments.  SYS_clone implementations also differ in how they treat the tls parameter, and those differences have to be handled as well.

So, depending on the architecture of our target, we want to use different constants, different structures, and do different things depending on calling conventions or other semantic differences.

The approach rr uses is that the Registers of every rr Task (rr’s name for an operating system thread) has an architecture, along with a few other things like recorded events.  Every structure for which the host/target distinction matters has an arch() accessor.  Additionally, we define some per-architecture classes.  Each class contains definitions for important kernel types and structures, along with enumerations for syscalls and various constants.

Then we try to let C++ templates do most of the heavy lifting.  In code, it looks something like this:

enum SupportedArch {

class X86Arch {
  /* many typedefs, structures, enums, and constants defined... */

class X64Arch {
  /* many typedefs, structures, enums, and constants defined... */

#define RR_ARCH_FUNCTION(f, arch, args...) \
  switch (arch) { \
    default: \
      assert(0 && "Unknown architecture"); \
    case x86: \
      return f<X86Arch>(args); \
    case x86_64: \
      return f<X64Arch>(args); \

class Registers {
  SupportedArch arch() const { ... }

  intptr_t syscallno() const {
    switch (arch()) {
      case x86:
        return u.x86.eax;
      case x86_64:
        return u.x64.rax;

  // And so on for argN accessors and so forth...

  union RegisterUnion {
    X86Arch::user_regs x86;
    X64Arch::user_regs x64;
  } u.

template <typename Arch>
static void process_syscall_arch(Task* t, int syscall_number) {
  switch (syscall_number) {
    case Arch::read:
      remote_ptr buf = t->regs().arg2();
      // do stuff with buf
    case Arch::clock_gettime:
      // We ensure Arch::timespec is defined with the appropriate types so it
      // is exactly the size |struct timespec| would be on the target arch.
      remote_ptr tp = t->regs().arg2();
      // do stuff with tp
    case Arch::mmap:
      switch (Arch::mmap_argument_semantics) {
        case Arch::MmapRegisterArguments:
          // x86-64
        case Arch::MmapStructArguments:
          // x86
    case Arch::mmap2:
      // Arch::mmap2 is always defined, but is a negative number on architectures
      // where SYS_mmap2 isn't defined.
      // do stuff
    case Arch::clone:
      switch (Arch::clone_argument_ordering) {
        case Arch::FlagsStackParentTLSChild:
          // x86
        case Arch::FlagsStackParentChildTLS:
          // x86-64

void process_syscall(Task* t, int syscall_number) {
  int syscall_number = t->regs().syscallno();
  RR_ARCH_FUNCTION(process_syscall_arch, t->arch(), t, syscall_number);

The definitions of X86Arch and X64Arch also contain static_asserts to try and ensure that we’ve defined structures correctly for at least the host architecture.  And even now the definitions of the structures aren’t completely bulletproof; I don’t think the X86Arch definitions of some structures are robust on a 64-bit host because of differences in structure field alignment between 32-bit and 64-bit, for instance.  So that’s still something to fix in rr.

Templates handle the bulk of target-specific code in rr.  There are a couple of places where we need to care about how the target implements mmap and other syscalls which aren’t amenable to templates (or, at least, we didn’t use them; it’s probably possible to (ab)use templates for these purposes), and so we have code like:

Task* t = ...
if (has_mmap2_syscall(t->arch())) {
  // do something specifically for mmap2
} else {
  // do something with mmap

Finally, various bits of rr’s main binary and its testsuite are written in assembly, so of course those needed to be carefully ported over.

That’s all the major source-code related work that needed to be done. I’ll leave the target-specific runtime work required for a future post.

x86-64 support for rr hasn’t been formally released, but the x86-64 support in the github repository is functional: x86-64 rr passes all the tests in rr’s test suite and is able to record and replay Firefox mochitests.  I will note that it’s not nearly as fast as the x86 version; progress is being made in improving performance, but we’re not quite there yet.

If you’re interested in trying 64-bit rr out, you’ll find the build and installation instructions helpful, with one small modification: you need to add the command-line option -Dforce64bit=ON to any cmake invocations.  Therefore, to build with Makefiles, one needs to do:

git clone
mkdir obj64
cd obj64
cmake -Dforce64bit=ON ../rr
make -j4
make check

Once you’ve done that, the usage instructions will likely be helpful.  Please try it out and report bugs if you find any!

xpcom and move constructors


Benjamin Smedberg recently announced that he was handing over XPCOM module ownership duties to me.  XPCOM contains basic data structures used throughout the Mozilla codebase, so changes to its code can have wide-ranging effects.  I’m honored to have been given responsibility for a core piece of the Gecko platform.

One issue that’s come up recently and I’m sure will continue to come up is changing XPCOM data structures to support two new C++11 features, rvalue references and their killer app, move constructors.  If you aren’t familiar with C++11’s new rvalue references feature, I highly recommend C++ Rvalue References Explained.  Move constructors are already being put to good use elsewhere in the codebase, notably mozilla::UniquePtr, which can be used to replace XPCOM’s nsAutoPtr and nsAutoRef (bug 1055035).  And some XPCOM data structures have received the move constructor treatment, notably nsRefPtr (bug 980753) and nsTArray (bug 982212).

A recent discussion and the associated bug, however, decided that the core referenced-counted smart pointer class in XPCOM, nsCOMPtr, shouldn’t support move constructors.  While move constructors could have replaced the already_AddRefed usage associated with nsCOMPtr, such as:

  nsCOMPtr<nsIMyInterface> interface = ...;
  // do some initialization stuff
  return interface.forget();

with the slightly shorter:

  nsCOMPtr<nsIMyInterface> interface = ...;
  // do some initialization stuff
  return interface;

There were two primary arguments against move constructor support.  The first argument was that the explicitness of having to call .forget() on an nsCOMPtr (along with the explicitness of the already_AddRefed type), rather than returning it, is valuable for the code author, the patch reviewer, and subsequent readers of the code.  When dealing with ownership issues in C++, it pays to be more explicit, rather than less.  The second argument was that due to the implicit conversion of nsCOMPtr<T> to a bare T* pointer (a common pattern in smart pointer classes), returning nsCOMPtr<T> from functions makes it potentially easy to write buggy code:

// What really happens in the below piece of code is something like:
// nsIMyInterface* p;
// {
//   nsCOMPtr<nsIMyInterface> tmp(NS_DoSomething(...));
//   p = tmp.get();
// }
// which is bad if NS_DoSomething is returning the only ref to the object.
// p now points to deleted memory, which is a security risk.
nsIMyInterface* p = NS_DoSomething(...);

(I should note that we can return nsCOMPtr<T> from functions today, and in most cases, thanks to compiler optimizations, it will be as efficient as returning already_AddRefed.  But Gecko culture is such that a function returning nsCOMPtr<T> would be quite unusual, and therefore unlikely to pass code review.)

The changes to add move constructors to nsRefPtr and nsTArray?  They were reviewed by me.  And the nixing of move constructors for nsCOMPtr?  That was also done by me (with a lot of feedback from other people).

I accept the charge of inconsistency.  However, I offer the following defense.  In the case of nsTArray, there are no ownership issues like there are with nsCOMPtr: you either own the array, or you don’t, so many of the issues raised about nsCOMPtr don’t apply in that case.

For the case of nsRefPtr, it is true that I didn’t seek out as much input from other people before approving the patch.  But the nsRefPtr patch was also done without the explicit goal of removing already_AddRefed from the code base, which made it both smaller in scope and more palatable.  Also, my hunch is that nsRefPtr is used somewhat less than nsCOMPtr (although this may be changing somewhat given other improvements in the codebase, like WebIDL), and so it provides an interesting testbed for whether move constructors and/or less explicit transfers of ownership are as much of a problem as argued above.

on code review and commit policies


I’ve been doing what feels like a lot of reviews this year.  Groveling imperfectly through mozilla-inbound’s commit log:

[froydnj@cerebro mi.hg]$ hg log -d '>2014-01-01' -k r=froydnj -k r=nfroyd|grep -c summary:

tells me I’ve been doing slightly over 1 review/day, which seems like a healthy clip.  Most of these reviews are hardly rocket science: some minor changes to make a compiler happy(ier), some sweeping style changes, and tweaks to make things more robust against previously unforeseen problems.  I call out reviews that take me “significant” amounts of time in my weekly status updates to my team; last week was the first week in a while where I could say that a review would lead to significantly better code being committed into mozilla-central.

You could take this as commentary on the generally high quality of patches that I get to review (the “our contributors are incredible” argument). You could take this as people asking me to review fairly trivial patches and saving the “real” patches for “real” reviewers (the “inferiority complex” argument). You could also take this as evidence that not all patches need an explicitly-requested and mandatory code review prior to commit (the “you’re doing code review wrong” argument).

All the projects I’ve worked on in my professional life have what I’ll call “hard” code review policies: everything that gets committed must undergo review, save for “obvious” patches. Definitions of “obvious” vary, but spelling fixes, backing out build/test bustages, and no-questions-asked correctness fixes generally qualify. (GDB has a “will the person who hates my work the most be able to find fault with the change” test for obviousness that seems both whimsical and helpful.) Such a policy applies to everyone, from the person who just submitted their first patch to the veteran working for 10+ years on the same project.

Other, equally successful projects that I’ve had contact with have what I’ll call “soft” code review policies: patches from people without commit bits undergo review, but being given commit privileges is an expression of trust. (This trust is qualitatively different from the trust expressed in granting commit privileges in the “hard” review model.) Your prior work has demonstrated sufficient taste and competence that you may commit patches freely, but you are also expected to exhibit appropriate humility in knowing when to ask for a second opinion. Of course, your code may undergo post-commit review by interested parties, where changes ranging from minor edits to complete backouts may be requested.

What percentage of commits benefit from a “hard” review policy? Can we classify patches so as to apply the “correct” review policy? (Can the people writing the patches even identify the “correct” review policy themselves?)

the compiler is always right


I think everybody who programs has had a bug in one of their programs that they were positive was the compiler’s fault.  (Perhaps “it’s the operating system’s fault” or “it’s the hardware’s fault”, though these are less common.) At this point, you learn the rules of programming:

  1. The compiler is always right.
  2. If the compiler is ever wrong, see rule #1.

The corollary, of course, is that it is your program that has the bug, not the compiler.  (The third through sixth laws are restatements of these two with the operating system and the hardware, respectively.)

Yesterday was one of those occasions where I thought the compiler might be wrong and spent a little time remembering the First Rule.  It’s instructive to look at the example and understand why the compiler is always right.  I was looking at bug 611781, reducing the size of the NSS library shipped with Firefox, and ran across this Linux x86-64 assembly code, compiled with gcc -Os:

0000000000000000 <NSSTrustDomain_GenerateSymmetricKeyFromPassword>:
   0:	41 51                	push   %r9
   2:	48 8b 05 00 00 00 00 	mov    0x0(%rip),%rax
			5: R_X86_64_GOTPCREL	NSS_ERROR_NOT_FOUND+0xfffffffffffffffc
   9:	8b 38                	mov    (%rax),%edi
   b:	e8 00 00 00 00       	callq  10 <NSSTrustDomain_GenerateSymmetricKeyFromPassword+0x10>
			c: R_X86_64_PLT32	nss_SetError+0xfffffffffffffffc
  10:	31 c0                	xor    %eax,%eax
  12:	41 5a                	pop    %r10
  14:	c3                   	retq   

You can find a lot of these small functions in lib/pki/trustdomain.c in an NSS source tree. Looking over this, you notice two things if you know a little bit about x86-64 assembler:

  1. The push and pop instructions are suspiciously mismatched; if I save a register to the stack with push, I ought to restore the same register from the stack with a pop. The compiler must have a bug!
  2. Also, there’s an efficiency concern here. Why is the compiler saving and restoring any registers here? We never use %r9 in the body of the function, so we shouldn’t need to save it at all.

What is the compiler doing here? It’s easiest to explain away the second issue first. If you look at the x86-64 ABI document, you’ll see that in section 3.2.2, the stack must be kept 16-byte aligned before calling other functions. Since our function calls another function, our function must ensure that the stack is properly aligned when that other function begins execution. And since the call instruction on x86-64 (which is how we would have arrived at our function) adjusts the stack pointer by 8 bytes (in addition to all the other work it does), our function must adjust by an additional 8 bytes to maintain 16-byte alignment. The compiler has chosen to use a push instruction to manipulate the stack pointer in this case. This instruction subtracts 8 bytes from the stack pointer and stores the indicated register into the memory at the new stack pointer.

Another way to do this would be to subtract 8 bytes from the stack pointer (sub $0x8, %rsp), which avoids writing the register to memory at all. If you compile with -O2, optimizing for speed, instead of -Os, you would indeed see the compiler using sub $0x8, %rsp. But since we compiled this code with -Os, optimizing for size, the compiler knows that the instruction for pushing a register onto the stack (2 bytes) is smaller than the instruction for subtracting 8 bytes from the stack pointer (4 bytes). Likewise, the instruction for popping a register from the stack (2 bytes) is smaller than the instruction for adding 8 bytes to the stack pointer (4 bytes).

These “useless” instructions are therefore doing real work, which is maintaining the contract of the ABI.

OK, so the efficiency claim has been addressed. What about the correctness claim with mismatched registers? Again, if you look at the aforementioned ABI document, section 3.2.3 describes how registers are used for function calls. The registers %r9 and %r10 are caller-saved registers, which means that a called function is free to overwrite any values stored in those registers. It doesn’t matter what value, if any, our function stores in %r10, because we know that if the caller had cared about the value, the caller would have stored that value away somewhere. Since we need spare registers for maintaining stack alignment via push and pop, caller-saved registers are ideal for pushing and popping with abandon.

In this case, it turned out that my understanding of what the program was doing had the bug, not the compiler.  It’s also worth pointing out that if the compiler really was mismatching register saves and restores, lots and lots of things would be broken.  The likelihood of the code produced in this instance being wrong—but the same problem not occurring in the millions of lines of code the compiler has compiled to produce the system on your computer—is vanishingly small.  The next time you see the compiler doing something weird, remember that the compiler is always right and try to figure out why it’s doing that.

(I should say, of course, that compilers, just like many other computer programs, do have bugs; GCC’s bugzilla or LLVM’s bugzilla would not exist otherwise, nor would bugfix releases continue to come out for your favorite compiler.  But in the vast, vast majority of cases, you have a bug to fix, not the compiler.)

my code search engine


Christian Legnitto wrote a blog post where he mentioned Firefox developers being forced to deal with “crufty code-search tools” (and many other perceived suboptimalities in the development process).  I’m looking forward to reading his followup, but I also thought it was worth blogging about what I use for my day-to-day code search needs.

I use Emacs’s rgreprgrep (and its cousin commands grep and lgrep) executes grep on a group of files in a given directory, recursively. The grep results are then displayed in a hyperlinked format for easy browsing between matches. Well, being Emacs, I use it with a few modifications of my own.  Here’s my setup.

First, a small utility I wrote for making quick selections with a single keypress:

(defun character-prompt (alist description)
  (message "Select [%s]: "
           (apply #'string (mapcar #'car alist)))
  (let* ((ch (save-window-excursion
               (select-window (minibuffer-window))
         (x (find ch alist :key #'car)))
      ((null x)
       (message (format "No %s for character: %s" description ch))
       (sleep-for 1)
       (character-prompt alist description))
      (t (cdr x)))))

This function gets used in the small wrapper I wrote around rgrep. Some preliminaries first, like where the Firefox tree lives, files that contain overly long lines and therefore mess with Emacs’s hyperlinking, and directories that I generally don’t deal with in my day-to-day work.

(defvar froydnj-mozilla-srcdir (expand-file-name "~/src/gecko-dev.git/"))
(defvar froydnj-mozilla-ignored-files
  (list "searchindex.js"
(defvar froydnj-mozilla-ignored-directories
  (list "nss" "nsprpub" "js/src/tests" "intl/icu"))

Next, the way I select subsets of files to search in. I learned after writing all this that rgrep already has built-in functionality for this (see the grep-files-aliases variable), but I like my setup better.

(defvar froydnj-mozilla-files
  '((?a . "*")                    ; All of it
    (?c . "*.[cm]*")              ; C/C++/Obj-C
    (?C . "*.[cmh]*")             ; Same, plus headers (and HTML, sadly)
    (?h . "*.h")
    (?H . "*.html")
    (?i . "*.idl")
    (?j . "*.js*")
    (?l . "*.lisp")
    (?m . "")
    (?p . "*.py")
    (?v . "*.java")
    (?w . "*.webidl")))

Finally, the wrapper itself, which prompts for the search pattern, the filename pattern, makes sure the directories and files above are ignored, and executes the search.

(defun froydnj-mozilla-rgrep ()
  (let ((regexp (grep-read-regexp))
        (files (character-prompt froydnj-mozilla-files "filename pattern"))
        (grep-find-ignored-files (append grep-find-ignored-files
        (grep-find-ignored-directories (append grep-find-ignored-directories
    (rgrep regexp files froydnj-mozilla-srcdir)))

One other bit that I find useful is a custom name for each buffer. By default, the rgrep results are deposited in a buffer named *grep* (likewise for grep and lgrep; the following advice applies to them automatically), and future greps overwrite this buffer. Having records of your greps lying around is occasionally useful, so I’ve changed the hook that determines the buffer name for rgrep. The comments that refer to compilation are from a previous job where it was useful to launch compilation jobs from within Emacs. I keep meaning to tweak those bits to launch mach in various configurations instead, but I haven’t done so yet.

(defun froydnj-compilation-buffer-name (major-mode-name)
  (let ((cfg-regexp "\\([-0-9_.a-z/+]+\\.cfg\\)"))
      ;; We're doing local compilation, stick the name of the release
      ;; configuration in the buffer name.
      ((or (string-match "^cd /scratch/froydnj/\\([^ ;]+\\)" command)
           (string-match "^build-config" command))
       (string-match cfg-regexp command)
       (concat "*compilation " (match-string 1 command) "*"))
      ;; We're doing remote compilation, note the machine name and
      ;; the release configuration name.
      ((string-match "^ssh \\([^ ]+\\)" command)
       (let ((machine (match-string 1 command)))
         (string-match cfg-regexp command)
         (concat "*compilation@" machine " " (match-string 1 command) "*")))
      ;; grep.el invokes compile, we might as well take advantage of that.
      ((string-equal major-mode-name "grep")
       (if (boundp 'regexp)
           (concat "*grep for " regexp "*")
      ;; We have no idea, just use the default.

(setq compilation-buffer-name-function 'froydnj-compilation-buffer-name)

Search times are comparable to web-based tools, and browsing results is more convenient. It has its shortcomings (overloaded C++ method names can be a pain to deal with, for instance), but it works well enough for 95%+ of my searching needs.