18
Apr 16

rr talk post-mortem

On Wednesday last week, I gave an invited talk on rr to a group of interested students and faculty at Rose-Hulman. The slides I used are available, though I doubt they make a lot of sense without the talk itself to go with them. Things I was pleased with:

  • I didn’t overrun my time limit, which was pretty satisfying.  I would have liked to have an hour (40 minutes talk/20 minutes for questions or overrun), but the slot was for a standard class period of 50 minutes.  I also wanted to leave some time for questions at the end, of which there were a few. Despite the talk being scheduled for the last class period of the day, it was well-attended.
  • The slides worked well  My slides are inspired by Lawrence Lessig’s style of presenting, which I also used for my lightning talk in Orlando.  It forces you to think about what you’re putting on each slide and make each slide count.  (I realize I didn’t use this for my Gecko onboarding presentation; I’m not sure if the Lessig method would work for things like that.  Maybe at the next onboarding…)
  • The level of sophistication was just about right, and I think the story approach to creating rr helped guide people through the presentation.  At least, it didn’t look as though many people were nodding off or completely confused, despite rr being a complex systems-heavy program.

Most of the above I credit to practicing the talk repeatedly.  I forget where I heard it, but a rule of thumb I use for presentations is 10 hours of prep time minimum (!) for every 1 hour of talk time.  The prep time always winds up helping: improving the material, refining the presentation, and boosting my confidence giving the presentation.  Despite all that practice, opportunities for improvement remain:

  • The talk could have used any amount of introduction on “here’s how debuggers work”.  This is kind of old hat to me, but I realized after the fact that to many students (perhaps even some faculty), blithely asserting that rr can start and stop threads at will, for instance, might seem mysterious.  A slide or two on the differences between how rr record works vs. how rr replay works and interacts with GDB would have been clarifying as well.
  • The above is an instance where a diagram or two might have been helpful.  I dislike putting diagrams in my talks because I dislike the thought of spending all that time to find a decent, simple app for drawing things, actually drawing them, and then exporting a non-awful version into a presentation.  It’s just a hurdle that I have to clear once, though, so I should just get over it.
  • Checkpointing and the actual mechanisms by which rr can run forwards or backwards in your program got short shrift and should have been explained in a little more detail.  (Diagrams again…)  Perhaps not surprisingly, the checkpointing material got added later during the talk prep and therefore didn’t get practiced as much.
  • The demo received very little practice (I’m sensing a theme here) and while it was able to show off a few of rr‘s capabilities, it wasn’t very polished or impressive.  Part of that is due to rr mysteriously deciding to cease working on my virtual machine, but part of that was just my own laziness and assuming things would work out just fine at the actual talk.  Always practice!

03
Nov 14

table-driven register reading in rr

Many of the changes done for porting rr to x86-64 were straightforward, mechanical changes: change this type or format directive, add a template parameter to this function, search-and-replace all SYS_* occurrences and so forth. But the way register reading for rr’s GDB stub fell out was actually a marked improvement on what went before and is worth describing in a little more detail.

At first, if rr’s GDB stub wanted to know register values, it pulled them directly out of the Registers class:

enum DbgRegisterName {
  DREG_EAX,
  DREG_ECX,
  // ...and so on.
};

// A description of a register for the GDB stub.
struct DbgRegister {
  DbgRegisterName name;
  long value;
  bool defined;
};

static DbgRegister get_reg(const Registers* regs, DbgRegisterName which) {
  DbgRegister reg;
  reg.name = which;
  reg.defined = true;
  switch (which) {
  case DREG_EAX:
    reg.value = regs->eax;
    return reg;
  case DREG_ECX:
    reg.value = regs->ecx;
    return reg;
  // ...and so on.
  default:
    reg.defined = false;
    return reg;
  }
}

That wasn’t going to work well if Registers supported several different architectures at once, so we needed to push that reading into the Registers class itself. Not all registers on a target fit into a long, either, so we needed to indicate how many bytes wide each register was. Finally, we also needed to indicate whether the target knew about the register or not: GDB is perfectly capable of asking for SSE (or even AVX) registers, but the chip running the debuggee might not support those registers, for instance. So we wound up with this:

struct DbgRegister {
  unsigned int name;
  uint8_t value[DBG_MAX_REG_SIZE];
  size_t size;
  bool defined;
};

static DbgRegster get_reg(const Registers* regs, unsigned int which) {
  DbgRegister reg;
  memset(&reg, 0, sizeof(reg));
  reg.size = regs->read_register(&reg.value[0], which, &reg.defined);
  return reg;
}
...
template<typename T>
static size_t copy_register_value(uint8_t* buf, T src) {
  memcpy(buf, &src, sizeof(src));
  return sizeof(src);
}

// GDBRegister is an enum defined in Registers.
size_t Registers::read_register(uint8_t* buf, GDBRegister regno, bool* defined) const {
  *defined = true;
  switch (regno) {
  case DREG_EAX:
    return copy_register_value(buf, eax);
  case DREG_ECX:
    return copy_register_value(buf, ecx);
  // ...many more integer and segment registers...
  case DREG_XMM0:
    // XMM registers aren't supplied by Registers, but they are supplied someplace else.
    *defined = false;
    return 16;
  // ...and so on
  }
}

The above changes were serviceable, if a little repetitive. When it came time to add x86-64 support to Registers, however, writing out that same basic switch statement for x86-64 registers sounded unappealing.

And there were other things that would require the same duplicated code treatment, too: when replaying traces, rr compares the replayed registers to the recorded registers at syscalls and halts the replay if the registers don’t match (rr calls this “divergence” and it indicates some kind of bug in rr). The x86-only comparison routine looked like:

// |name1| and |name2| are textual descriptions of |reg1| and |reg2|, respectively,
// for error messages.
bool Registers::compare_register_files(const char* name1, const Registers* reg1,
                                       const char* name2, const Registers* reg2,
                                       int mismatch_behavior) {
  bool match = true;

#define REGCMP(_reg)                                     \
  do {                                                   \
    if (reg1-> _reg != reg2-> _reg) {              \
      maybe_print_reg_mismatch(mismatch_behavior, #_reg, \
                               name1, reg1-> _reg,	   \
                               name2, reg2-> _reg);	   \
      match = false;					   \
    }							   \
  } while (0)

  REGCMP(eax);
  REGCMP(ebx);
  // ...and so forth.

  return match;
}

Duplicating the comparison logic for twice as many registers on x86-64 didn’t sound exciting, either.

I didn’t have to solve both problems at once, so I focused solely on getting the registers for GDB. A switch statement with almost-identical cases indicates that you’re doing too much in code, when you should be doing things with data instead. In this case, what we really want is this: given a GDB register number, we want to know whether we know about this register, and if so, how many bytes to copy into the outparam buffer and where to copy those bytes from. A lookup table is really the data structure we should be using here:

// A data structure describing a register.
struct RegisterValue {
  // The offsetof the register in user_regs_struct.
  size_t offset;

  // The size of the register.  0 means we cannot read it.
  size_t nbytes;

  // Returns a pointer to the register in |regs| represented by |offset|.
  // |regs| is assumed to be a pointer to |struct user_regs_struct| of the appropriate
  // architecture.
  void* pointer_into(void* regs) {
    return static_cast<char*>(regs) + offset;
  }

  const void* pointer_into(const char* regs) const {
    return static_cast<const char*>(regs) + offset;
  }
};

template<typename Arch>
struct RegisterInfo;

template<>
struct RegisterInfo<rr::X86Arch> {
  static struct RegisterValue registers[DREG_NUM_LINUX_I386];
};
struct RegisterValue RegisterInfo<rr::X86Arch>::registers[DREG_NUM_LINUX_I386];

static void initialize_register_tables() {
  static bool initialized = false;

  if (initialized) {
    return;
  }

  initialized = true;
#define RV_X86(gdb_suffix, name) \
  RegisterInfo<rr::X86Arch>::registers[DREG_##gdb_suffix] = { \
    offsetof(rr::X86Arch::user_regs_struct, name), \
    sizeof(((rr::X86Arch::user_regs_struct*)0)->name) \
  }
  RV_X86(EAX, eax);
  RV_X86(ECX, ecx);
  // ...and so forth.  Registers not explicitly initialized here, e.g. DREG_XMM0, will
  // be zero-initialized and therefore have a size of 0.
}

template<typename Arch>
size_t Registers::read_register_arch(uint8_t buf, GDBRegister regno, bool* defined) const {
  initialize_register_tables();
  RegisterValue& rv = RegisterInfo::registers[regno];
  if (rv.nbytes == 0) {
    *defined = false;
  } else {
    *defined = true;
    memcpy(buf, rv.pointer_into(ptrace_registers()), rv.nbytes);
  }

  return rv.nbytes;
}

size_t Registers::read_register(uint8_t buf, GDBRegister regno, bool* defined) const {
  // RR_ARCH_FUNCTION performs the return for us.
  RR_ARCH_FUNCTION(read_register_arch, arch(), buf, regno, defined);
}

Much better, and templating enables us to (re)use the RR_ARCH_FUNCTION macro we use elsewhere in the rr codebase.

Having to initialize the tables before using them is annoying, though. And after debugging a failure or two because I didn’t ensure the tables were initialized prior to using them, I wanted something more foolproof. Carefully writing out the array definition with zero-initialized members in the appropriate places sounded long-winded, difficult to read, and prone to failures that were tedious to debug. Ideally, I’d just specify the array indices to initialize along with the corresponding values, and let zero-initialization take care of the rest.

Initially, I wanted to use designated initializers, which are a great C99 feature that provided exactly what I was looking for, readable initialization of specific array indices:

#define RV_X86(gdb_suffix, name) \
  [DREG_##gdb_suffix] = { \
    offsetof(rr::X86Arch::user_regs_struct, name), \
    sizeof(((rr::X86Arch::user_regs_struct*)0)->name) \
  }

struct RegisterValue RegisterInfo<rr::X86Arch>::registers[DREG_NUM_LINUX_I386] = {
  RV_X86(EAX, eax),
  RV_X86(ECX, ecx),
};

I assumed that C++11 (which is the dialect of C++ used in rr) would support something like this, and indeed, my first attempts worked just fine. Except when I added more initializers:

struct RegisterValue RegisterInfo<rr::X86Arch>::registers[DREG_NUM_LINUX_I386] = {
  RV_X86(EAX, eax),
  RV_X86(ECX, ecx),
  // ...other integer registers, eflags, segment registers...
  RV_X86(ORIG_EAX, orig_eax);
};

I got this curious error message from g++: “Sorry, designated initializers not supported.”

But that’s a lie! You clearly do support designated initializers, because I’ve been able to compile code with them!

Well, yes, g++ supports designated initializers for arrays. Except that the indices have to be in consecutive order from the beginning of the array. And all the integer registers, EIP, and the segment registers appeared in consecutive order in the enumeration, starting with DREG_EAX = 0. But DREG_ORIG_EAX came some 20 members after DREG_GS (the last segment register), and so there we had initialization of non-consecutive array indices, and g++ complained. In other words, designated initializers in g++ are useful for documentation and/or readability purposes, but they’re not useful for sparse arrays. Hence the explicit initialization approach described above.

Turns out there’s a nicer, C++11-ier way to handle these kind of sparse arrays, at the cost of some extra infrastructure. Instead of an explicit [] variable, you can have a class with an operator[] overload and a constructor that takes std::initializer_list<std::pair<size_t, YourActualClass>>. The first member of the pair is the index into your sparse array, and the second member is the value you want placed there. The constructor takes care of putting everything in its place; the actual code looks like this:

typedef std::pair<size_t, RegisterValue> RegisterInit;

// Templated over N because different architectures have differently-sized register files.
template<size_t N>
struct RegisterTable : std::array<RegisterValue, N> {
  RegisterTable(std::initializer_list<RegisterInit> list) {
    for (auto& ri : list) {
      (*this)[ri.first] = ri.second;
    }
  }
};

template<>
struct RegisterInfo<rr::X86Arch> {
  static const size_t num_registers = DREG_NUM_LINUX_I386;
  typedef RegisterTable<num_registers> Table;
  static Table registers;
};

RegisterInfo<rr::X86Arch>::Table RegisterInfo<rr::X86Arch>::registers = {
  // RV_X86 macro defined similarly to the previous version.
  RV_X86(EAX, eax),
  RV_X86(ECX, ecx),
  // ...many more register definitions
};

// x86-64 versions defined similarly...

There might be a better way to do this that’s even more C++11-ier. Comments welcome!

Once this table-driven scheme was in place, using it for comparing registers was straightforward: each RegisterValue structure adds a name field, for error messages, and a comparison_mask field, for applying to register values prior to comparison. Registers that we don’t care about comparing can have a register mask of zero, just like registers that we don’t want to expose to GDB can have a “size” of zero. A mask is more flexible than a bool compare/don’t compare field, because eflags in particular is not always determinstic between record and replay, so we mask off particular bits of eflags prior to comparison. The core comparison routine then looks like:

template<typename Arch>
static bool compare_registers_core(const char* name1, const Registers* reg1,
                                   const char* name2, const Registers* reg2,
                                   int mismatch_behavior) {
  bool match = true;

  for (auto& rv : RegisterInfo<Arch>::registers) {
    if (rv.nbytes == 0) {
      continue;
    }

    // Disregard registers that will trivially compare equal.
    if (rv.comparison_mask == 0) {
      continue;
    }

    // XXX correct but oddly displayed for big-endian processors.
    uint64_t val1 = 0, val2 = 0;
    memcpy(&val1, rv.pointer_into(reg1.ptrace_registers()), rv.nbytes);
    memcpy(&val2, rv.pointer_into(reg2.ptrace_registers()), rv.nbytes);
    val1 &= rv.comparison_mask;
    val2 &= rv.comparison_mask;

    if (val1 != val2) {
      maybe_print_reg_mismatch(mismatch_behavior, rv.name, name1, val1, name2,
                               val2);
      match = false;
    }
  }

  return match;
}

which I thought was a definite improvement on the status quo. Additional, architecture-specific comparisons can be layered on top of this (see, for instance, the x86-specialized versions in rr’s Registers.cc).

Eliminating or reducing code by writing data structures instead is one of those coding tasks that gives me a warm, fuzzy feeling after I’ve done it. In addition to the above, I was also able to use the RegisterValue structures to provide architecture-independent dumping of register files, which was another big win in making Registers multi-arch aware. I was quite pleased that everything worked out so well.


30
Oct 14

porting rr to x86-64

(TL;DR: rr from git can record and replay 64-bit programs.  Try it for yourself!)

Over the last several months, I’ve been devoting an ever-increasing amount of my time to making rr able to trace x86-64 programs.  I’ve learned a lot along the way and thought I’d lay out all the major pieces of work that needed to be done to make this happen.

Before explaining the major pieces, it will be helpful to define some terms: the host architecture is the architecture that the rr binary itself is compiled for.  The target architecture is the architecture of the binary that rr is tracing.  These are often equivalent, but not necessarily so: you could be tracing a 64-bit binary with a 64-bit rr (host == target), but then the program starts to run a 32-bit subprocess, which rr also begins to trace (host != target).  And you have to handle both cases in a single rr session, with a single rr binary.  (64-bit rr doesn’t handle the host != target case quite yet, but all the infrastructure is there.)

All of the pieces described below are not new ideas: the major programs you use for development (compiler, linker, debugger, etc.) all have done some variation of what I describe below.  However, it’s not every day that one takes a program written without any awareness of host/target distinctions and endows it with the necessary awareness.

Quite often, a program written exclusively for 32-bit hosts has issues when trying to compile for 64-bit hosts, and rr was no exception in this regard.  Making the code 64-bit clean by fixing all the places that triggered compiler warnings on x86-64, but not on i386, was probably the easiest part of the whole porting effort.  Format strings were a big part of this: writing %llx when you wanted to print a uint64_t, for instance, which assumes that uint64_t is implemented as unsigned long long (not necessarily true on 64-bit hosts).  There were several places where long was used instead of uint32_t.  And there were even places that triggered signed/unsigned comparison warnings on 64-bit platforms only.  (Exercise for the reader: construct code where this happens before looking at the solution.)

Once all the host issues are dealt with, removing all the places where rr assumed semantics or conventions of the x86 architecture was the next step.  In short, all of the code assumed host == target: we were compiled on x86, so that must be the architecture of the program we’re debugging.  How many places actually assumed this, though?  Consider what the very simplified pseudo-code of the rr main recording loop looks like:

while (true) {
  wait for the tracee to make a syscall
  grab the registers at the point of the syscall
  extract the syscall number from the registers (1)
  switch (syscall_number) {
    case SYS_read: (2)
      extract pointer to the data read from the registers (3)
      record contents of data
      break;
    case SYS_clock_gettime:
      extract pointers for argument structures from the registers
      record contents of those argument structures (4)
      break;
    case SYS_mmap: (5)
      ...
    case SYS_mmap2: (6)
      ...
    case SYS_clone: (7)
      ...
    ...
    default:
      complain about an unhandled syscall
  }
  let the tracee resume
}

Every line marked with a number at the end indicates a different instance where host and target differences come into play and/or the code might have assumed x86 semantics.  (And the numbering above is not exhaustive!)  Taking them in order:

  1. You can obtain the registers of your target with a single ptrace call, but the layout of those registers depends on your target.  ptrace returns the registers as a struct user_regs, which differs between targets; the syscall number location obviously differs between different layouts of struct user_regs.
  2. The constant SYS_read refers to the syscall number for read on the host.  If you want to identify the syscall number for the target, you’ll need to do something different.
  3. This instance is a continuation of #1: syscall arguments are passed in different registers for each target, and the locations of those registers differ in size and location between different layouts of struct user_regs.
  4. SYS_clock_gettime takes a pointer to a struct timespec.  How much data should we read from that pointer for recording purposes?  We can’t just use sizeof(struct timespec), since that’s the size for the host, not the target.
  5. Like SYS_read, SYS_mmap refers to the syscall number for mmap on the host, so we need to do something similar to SYS_read here.  But just because two different architectures have a SYS_mmap, it doesn’t mean that the calling conventions for those syscalls at the kernel level are identical.  (This distinction applies to several other syscalls as well.)  SYS_mmap on x86 takes a single pointer argument, pointing to a structure that contains the syscall’s arguments.  The x86-64 version takes its arguments in registers.  We have to extract arguments appropriately for each calling convention.
  6. SYS_mmap2 only exists on x86; x86-64 has no such syscall.  So we have to handle host-only syscalls or target-only syscalls in addition to things like SYS_read.
  7. SYS_clone has four (!) different argument orderings at the kernel level, depending on the architecture, and x86 and x86-64 of course use different argument orderings.  You must take these target differences into account when extracting arguments.  SYS_clone implementations also differ in how they treat the tls parameter, and those differences have to be handled as well.

So, depending on the architecture of our target, we want to use different constants, different structures, and do different things depending on calling conventions or other semantic differences.

The approach rr uses is that the Registers of every rr Task (rr’s name for an operating system thread) has an architecture, along with a few other things like recorded events.  Every structure for which the host/target distinction matters has an arch() accessor.  Additionally, we define some per-architecture classes.  Each class contains definitions for important kernel types and structures, along with enumerations for syscalls and various constants.

Then we try to let C++ templates do most of the heavy lifting.  In code, it looks something like this:

enum SupportedArch {
  x86,
  x86_64,
};

class X86Arch {
  /* many typedefs, structures, enums, and constants defined... */
};

class X64Arch {
  /* many typedefs, structures, enums, and constants defined... */
};

#define RR_ARCH_FUNCTION(f, arch, args...) \
  switch (arch) { \
    default: \
      assert(0 && "Unknown architecture"); \
    case x86: \
      return f<X86Arch>(args); \
    case x86_64: \
      return f<X64Arch>(args); \
  }

class Registers {
public:
  SupportedArch arch() const { ... }

  intptr_t syscallno() const {
    switch (arch()) {
      case x86:
        return u.x86.eax;
      case x86_64:
        return u.x64.rax;
    }
  }

  // And so on for argN accessors and so forth...

private:
  union RegisterUnion {
    X86Arch::user_regs x86;
    X64Arch::user_regs x64;
  } u.
};

template <typename Arch>
static void process_syscall_arch(Task* t, int syscall_number) {
  switch (syscall_number) {
    case Arch::read:
      remote_ptr buf = t->regs().arg2();
      // do stuff with buf
      break;
    case Arch::clock_gettime:
      // We ensure Arch::timespec is defined with the appropriate types so it
      // is exactly the size |struct timespec| would be on the target arch.
      remote_ptr tp = t->regs().arg2();
      // do stuff with tp
      break;
    case Arch::mmap:
      switch (Arch::mmap_argument_semantics) {
        case Arch::MmapRegisterArguments:
          // x86-64
          break;
        case Arch::MmapStructArguments:
          // x86
          break;
      }
      break;
    case Arch::mmap2:
      // Arch::mmap2 is always defined, but is a negative number on architectures
      // where SYS_mmap2 isn't defined.
      // do stuff
      break;
    case Arch::clone:
      switch (Arch::clone_argument_ordering) {
        case Arch::FlagsStackParentTLSChild:
          // x86
          break;
        case Arch::FlagsStackParentChildTLS:
          // x86-64
          break;
      }
      break;
    ...
  }
}

void process_syscall(Task* t, int syscall_number) {
  int syscall_number = t->regs().syscallno();
  RR_ARCH_FUNCTION(process_syscall_arch, t->arch(), t, syscall_number);
}

The definitions of X86Arch and X64Arch also contain static_asserts to try and ensure that we’ve defined structures correctly for at least the host architecture.  And even now the definitions of the structures aren’t completely bulletproof; I don’t think the X86Arch definitions of some structures are robust on a 64-bit host because of differences in structure field alignment between 32-bit and 64-bit, for instance.  So that’s still something to fix in rr.

Templates handle the bulk of target-specific code in rr.  There are a couple of places where we need to care about how the target implements mmap and other syscalls which aren’t amenable to templates (or, at least, we didn’t use them; it’s probably possible to (ab)use templates for these purposes), and so we have code like:

Task* t = ...
if (has_mmap2_syscall(t->arch())) {
  // do something specifically for mmap2
} else {
  // do something with mmap
}

Finally, various bits of rr’s main binary and its testsuite are written in assembly, so of course those needed to be carefully ported over.

That’s all the major source-code related work that needed to be done. I’ll leave the target-specific runtime work required for a future post.

x86-64 support for rr hasn’t been formally released, but the x86-64 support in the github repository is functional: x86-64 rr passes all the tests in rr’s test suite and is able to record and replay Firefox mochitests.  I will note that it’s not nearly as fast as the x86 version; progress is being made in improving performance, but we’re not quite there yet.

If you’re interested in trying 64-bit rr out, you’ll find the build and installation instructions helpful, with one small modification: you need to add the command-line option -Dforce64bit=ON to any cmake invocations.  Therefore, to build with Makefiles, one needs to do:

git clone https://github.com/mozilla/rr.git
mkdir obj64
cd obj64
cmake -Dforce64bit=ON ../rr
make -j4
make check

Once you’ve done that, the usage instructions will likely be helpful.  Please try it out and report bugs if you find any!