Oct 14

porting rr to x86-64

(TL;DR: rr from git can record and replay 64-bit programs.  Try it for yourself!)

Over the last several months, I’ve been devoting an ever-increasing amount of my time to making rr able to trace x86-64 programs.  I’ve learned a lot along the way and thought I’d lay out all the major pieces of work that needed to be done to make this happen.

Before explaining the major pieces, it will be helpful to define some terms: the host architecture is the architecture that the rr binary itself is compiled for.  The target architecture is the architecture of the binary that rr is tracing.  These are often equivalent, but not necessarily so: you could be tracing a 64-bit binary with a 64-bit rr (host == target), but then the program starts to run a 32-bit subprocess, which rr also begins to trace (host != target).  And you have to handle both cases in a single rr session, with a single rr binary.  (64-bit rr doesn’t handle the host != target case quite yet, but all the infrastructure is there.)

All of the pieces described below are not new ideas: the major programs you use for development (compiler, linker, debugger, etc.) all have done some variation of what I describe below.  However, it’s not every day that one takes a program written without any awareness of host/target distinctions and endows it with the necessary awareness.

Quite often, a program written exclusively for 32-bit hosts has issues when trying to compile for 64-bit hosts, and rr was no exception in this regard.  Making the code 64-bit clean by fixing all the places that triggered compiler warnings on x86-64, but not on i386, was probably the easiest part of the whole porting effort.  Format strings were a big part of this: writing %llx when you wanted to print a uint64_t, for instance, which assumes that uint64_t is implemented as unsigned long long (not necessarily true on 64-bit hosts).  There were several places where long was used instead of uint32_t.  And there were even places that triggered signed/unsigned comparison warnings on 64-bit platforms only.  (Exercise for the reader: construct code where this happens before looking at the solution.)

Once all the host issues are dealt with, removing all the places where rr assumed semantics or conventions of the x86 architecture was the next step.  In short, all of the code assumed host == target: we were compiled on x86, so that must be the architecture of the program we’re debugging.  How many places actually assumed this, though?  Consider what the very simplified pseudo-code of the rr main recording loop looks like:

while (true) {
  wait for the tracee to make a syscall
  grab the registers at the point of the syscall
  extract the syscall number from the registers (1)
  switch (syscall_number) {
    case SYS_read: (2)
      extract pointer to the data read from the registers (3)
      record contents of data
    case SYS_clock_gettime:
      extract pointers for argument structures from the registers
      record contents of those argument structures (4)
    case SYS_mmap: (5)
    case SYS_mmap2: (6)
    case SYS_clone: (7)
      complain about an unhandled syscall
  let the tracee resume

Every line marked with a number at the end indicates a different instance where host and target differences come into play and/or the code might have assumed x86 semantics.  (And the numbering above is not exhaustive!)  Taking them in order:

  1. You can obtain the registers of your target with a single ptrace call, but the layout of those registers depends on your target.  ptrace returns the registers as a struct user_regs, which differs between targets; the syscall number location obviously differs between different layouts of struct user_regs.
  2. The constant SYS_read refers to the syscall number for read on the host.  If you want to identify the syscall number for the target, you’ll need to do something different.
  3. This instance is a continuation of #1: syscall arguments are passed in different registers for each target, and the locations of those registers differ in size and location between different layouts of struct user_regs.
  4. SYS_clock_gettime takes a pointer to a struct timespec.  How much data should we read from that pointer for recording purposes?  We can’t just use sizeof(struct timespec), since that’s the size for the host, not the target.
  5. Like SYS_read, SYS_mmap refers to the syscall number for mmap on the host, so we need to do something similar to SYS_read here.  But just because two different architectures have a SYS_mmap, it doesn’t mean that the calling conventions for those syscalls at the kernel level are identical.  (This distinction applies to several other syscalls as well.)  SYS_mmap on x86 takes a single pointer argument, pointing to a structure that contains the syscall’s arguments.  The x86-64 version takes its arguments in registers.  We have to extract arguments appropriately for each calling convention.
  6. SYS_mmap2 only exists on x86; x86-64 has no such syscall.  So we have to handle host-only syscalls or target-only syscalls in addition to things like SYS_read.
  7. SYS_clone has four (!) different argument orderings at the kernel level, depending on the architecture, and x86 and x86-64 of course use different argument orderings.  You must take these target differences into account when extracting arguments.  SYS_clone implementations also differ in how they treat the tls parameter, and those differences have to be handled as well.

So, depending on the architecture of our target, we want to use different constants, different structures, and do different things depending on calling conventions or other semantic differences.

The approach rr uses is that the Registers of every rr Task (rr’s name for an operating system thread) has an architecture, along with a few other things like recorded events.  Every structure for which the host/target distinction matters has an arch() accessor.  Additionally, we define some per-architecture classes.  Each class contains definitions for important kernel types and structures, along with enumerations for syscalls and various constants.

Then we try to let C++ templates do most of the heavy lifting.  In code, it looks something like this:

enum SupportedArch {

class X86Arch {
  /* many typedefs, structures, enums, and constants defined... */

class X64Arch {
  /* many typedefs, structures, enums, and constants defined... */

#define RR_ARCH_FUNCTION(f, arch, args...) \
  switch (arch) { \
    default: \
      assert(0 && "Unknown architecture"); \
    case x86: \
      return f<X86Arch>(args); \
    case x86_64: \
      return f<X64Arch>(args); \

class Registers {
  SupportedArch arch() const { ... }

  intptr_t syscallno() const {
    switch (arch()) {
      case x86:
        return u.x86.eax;
      case x86_64:
        return u.x64.rax;

  // And so on for argN accessors and so forth...

  union RegisterUnion {
    X86Arch::user_regs x86;
    X64Arch::user_regs x64;
  } u.

template <typename Arch>
static void process_syscall_arch(Task* t, int syscall_number) {
  switch (syscall_number) {
    case Arch::read:
      remote_ptr buf = t->regs().arg2();
      // do stuff with buf
    case Arch::clock_gettime:
      // We ensure Arch::timespec is defined with the appropriate types so it
      // is exactly the size |struct timespec| would be on the target arch.
      remote_ptr tp = t->regs().arg2();
      // do stuff with tp
    case Arch::mmap:
      switch (Arch::mmap_argument_semantics) {
        case Arch::MmapRegisterArguments:
          // x86-64
        case Arch::MmapStructArguments:
          // x86
    case Arch::mmap2:
      // Arch::mmap2 is always defined, but is a negative number on architectures
      // where SYS_mmap2 isn't defined.
      // do stuff
    case Arch::clone:
      switch (Arch::clone_argument_ordering) {
        case Arch::FlagsStackParentTLSChild:
          // x86
        case Arch::FlagsStackParentChildTLS:
          // x86-64

void process_syscall(Task* t, int syscall_number) {
  int syscall_number = t->regs().syscallno();
  RR_ARCH_FUNCTION(process_syscall_arch, t->arch(), t, syscall_number);

The definitions of X86Arch and X64Arch also contain static_asserts to try and ensure that we’ve defined structures correctly for at least the host architecture.  And even now the definitions of the structures aren’t completely bulletproof; I don’t think the X86Arch definitions of some structures are robust on a 64-bit host because of differences in structure field alignment between 32-bit and 64-bit, for instance.  So that’s still something to fix in rr.

Templates handle the bulk of target-specific code in rr.  There are a couple of places where we need to care about how the target implements mmap and other syscalls which aren’t amenable to templates (or, at least, we didn’t use them; it’s probably possible to (ab)use templates for these purposes), and so we have code like:

Task* t = ...
if (has_mmap2_syscall(t->arch())) {
  // do something specifically for mmap2
} else {
  // do something with mmap

Finally, various bits of rr’s main binary and its testsuite are written in assembly, so of course those needed to be carefully ported over.

That’s all the major source-code related work that needed to be done. I’ll leave the target-specific runtime work required for a future post.

x86-64 support for rr hasn’t been formally released, but the x86-64 support in the github repository is functional: x86-64 rr passes all the tests in rr’s test suite and is able to record and replay Firefox mochitests.  I will note that it’s not nearly as fast as the x86 version; progress is being made in improving performance, but we’re not quite there yet.

If you’re interested in trying 64-bit rr out, you’ll find the build and installation instructions helpful, with one small modification: you need to add the command-line option -Dforce64bit=ON to any cmake invocations.  Therefore, to build with Makefiles, one needs to do:

git clone https://github.com/mozilla/rr.git
mkdir obj64
cd obj64
cmake -Dforce64bit=ON ../rr
make -j4
make check

Once you’ve done that, the usage instructions will likely be helpful.  Please try it out and report bugs if you find any!

Sep 14

xpcom and move constructors

Benjamin Smedberg recently announced that he was handing over XPCOM module ownership duties to me.  XPCOM contains basic data structures used throughout the Mozilla codebase, so changes to its code can have wide-ranging effects.  I’m honored to have been given responsibility for a core piece of the Gecko platform.

One issue that’s come up recently and I’m sure will continue to come up is changing XPCOM data structures to support two new C++11 features, rvalue references and their killer app, move constructors.  If you aren’t familiar with C++11’s new rvalue references feature, I highly recommend C++ Rvalue References Explained.  Move constructors are already being put to good use elsewhere in the codebase, notably mozilla::UniquePtr, which can be used to replace XPCOM’s nsAutoPtr and nsAutoRef (bug 1055035).  And some XPCOM data structures have received the move constructor treatment, notably nsRefPtr (bug 980753) and nsTArray (bug 982212).

A recent discussion and the associated bug, however, decided that the core referenced-counted smart pointer class in XPCOM, nsCOMPtr, shouldn’t support move constructors.  While move constructors could have replaced the already_AddRefed usage associated with nsCOMPtr, such as:

  nsCOMPtr<nsIMyInterface> interface = ...;
  // do some initialization stuff
  return interface.forget();

with the slightly shorter:

  nsCOMPtr<nsIMyInterface> interface = ...;
  // do some initialization stuff
  return interface;

There were two primary arguments against move constructor support.  The first argument was that the explicitness of having to call .forget() on an nsCOMPtr (along with the explicitness of the already_AddRefed type), rather than returning it, is valuable for the code author, the patch reviewer, and subsequent readers of the code.  When dealing with ownership issues in C++, it pays to be more explicit, rather than less.  The second argument was that due to the implicit conversion of nsCOMPtr<T> to a bare T* pointer (a common pattern in smart pointer classes), returning nsCOMPtr<T> from functions makes it potentially easy to write buggy code:

// What really happens in the below piece of code is something like:
// nsIMyInterface* p;
// {
//   nsCOMPtr<nsIMyInterface> tmp(NS_DoSomething(...));
//   p = tmp.get();
// }
// which is bad if NS_DoSomething is returning the only ref to the object.
// p now points to deleted memory, which is a security risk.
nsIMyInterface* p = NS_DoSomething(...);

(I should note that we can return nsCOMPtr<T> from functions today, and in most cases, thanks to compiler optimizations, it will be as efficient as returning already_AddRefed.  But Gecko culture is such that a function returning nsCOMPtr<T> would be quite unusual, and therefore unlikely to pass code review.)

The changes to add move constructors to nsRefPtr and nsTArray?  They were reviewed by me.  And the nixing of move constructors for nsCOMPtr?  That was also done by me (with a lot of feedback from other people).

I accept the charge of inconsistency.  However, I offer the following defense.  In the case of nsTArray, there are no ownership issues like there are with nsCOMPtr: you either own the array, or you don’t, so many of the issues raised about nsCOMPtr don’t apply in that case.

For the case of nsRefPtr, it is true that I didn’t seek out as much input from other people before approving the patch.  But the nsRefPtr patch was also done without the explicit goal of removing already_AddRefed from the code base, which made it both smaller in scope and more palatable.  Also, my hunch is that nsRefPtr is used somewhat less than nsCOMPtr (although this may be changing somewhat given other improvements in the codebase, like WebIDL), and so it provides an interesting testbed for whether move constructors and/or less explicit transfers of ownership are as much of a problem as argued above.

Jun 14

on code review and commit policies

I’ve been doing what feels like a lot of reviews this year.  Groveling imperfectly through mozilla-inbound’s commit log:

[froydnj@cerebro mi.hg]$ hg log -d '>2014-01-01' -k r=froydnj -k r=nfroyd|grep -c summary:

tells me I’ve been doing slightly over 1 review/day, which seems like a healthy clip.  Most of these reviews are hardly rocket science: some minor changes to make a compiler happy(ier), some sweeping style changes, and tweaks to make things more robust against previously unforeseen problems.  I call out reviews that take me “significant” amounts of time in my weekly status updates to my team; last week was the first week in a while where I could say that a review would lead to significantly better code being committed into mozilla-central.

You could take this as commentary on the generally high quality of patches that I get to review (the “our contributors are incredible” argument). You could take this as people asking me to review fairly trivial patches and saving the “real” patches for “real” reviewers (the “inferiority complex” argument). You could also take this as evidence that not all patches need an explicitly-requested and mandatory code review prior to commit (the “you’re doing code review wrong” argument).

All the projects I’ve worked on in my professional life have what I’ll call “hard” code review policies: everything that gets committed must undergo review, save for “obvious” patches. Definitions of “obvious” vary, but spelling fixes, backing out build/test bustages, and no-questions-asked correctness fixes generally qualify. (GDB has a “will the person who hates my work the most be able to find fault with the change” test for obviousness that seems both whimsical and helpful.) Such a policy applies to everyone, from the person who just submitted their first patch to the veteran working for 10+ years on the same project.

Other, equally successful projects that I’ve had contact with have what I’ll call “soft” code review policies: patches from people without commit bits undergo review, but being given commit privileges is an expression of trust. (This trust is qualitatively different from the trust expressed in granting commit privileges in the “hard” review model.) Your prior work has demonstrated sufficient taste and competence that you may commit patches freely, but you are also expected to exhibit appropriate humility in knowing when to ask for a second opinion. Of course, your code may undergo post-commit review by interested parties, where changes ranging from minor edits to complete backouts may be requested.

What percentage of commits benefit from a “hard” review policy? Can we classify patches so as to apply the “correct” review policy? (Can the people writing the patches even identify the “correct” review policy themselves?)

May 14

the compiler is always right

I think everybody who programs has had a bug in one of their programs that they were positive was the compiler’s fault.  (Perhaps “it’s the operating system’s fault” or “it’s the hardware’s fault”, though these are less common.) At this point, you learn the rules of programming:

  1. The compiler is always right.
  2. If the compiler is ever wrong, see rule #1.

The corollary, of course, is that it is your program that has the bug, not the compiler.  (The third through sixth laws are restatements of these two with the operating system and the hardware, respectively.)

Yesterday was one of those occasions where I thought the compiler might be wrong and spent a little time remembering the First Rule.  It’s instructive to look at the example and understand why the compiler is always right.  I was looking at bug 611781, reducing the size of the NSS library shipped with Firefox, and ran across this Linux x86-64 assembly code, compiled with gcc -Os:

0000000000000000 <NSSTrustDomain_GenerateSymmetricKeyFromPassword>:
   0:	41 51                	push   %r9
   2:	48 8b 05 00 00 00 00 	mov    0x0(%rip),%rax
			5: R_X86_64_GOTPCREL	NSS_ERROR_NOT_FOUND+0xfffffffffffffffc
   9:	8b 38                	mov    (%rax),%edi
   b:	e8 00 00 00 00       	callq  10 <NSSTrustDomain_GenerateSymmetricKeyFromPassword+0x10>
			c: R_X86_64_PLT32	nss_SetError+0xfffffffffffffffc
  10:	31 c0                	xor    %eax,%eax
  12:	41 5a                	pop    %r10
  14:	c3                   	retq   

You can find a lot of these small functions in lib/pki/trustdomain.c in an NSS source tree. Looking over this, you notice two things if you know a little bit about x86-64 assembler:

  1. The push and pop instructions are suspiciously mismatched; if I save a register to the stack with push, I ought to restore the same register from the stack with a pop. The compiler must have a bug!
  2. Also, there’s an efficiency concern here. Why is the compiler saving and restoring any registers here? We never use %r9 in the body of the function, so we shouldn’t need to save it at all.

What is the compiler doing here? It’s easiest to explain away the second issue first. If you look at the x86-64 ABI document, you’ll see that in section 3.2.2, the stack must be kept 16-byte aligned before calling other functions. Since our function calls another function, our function must ensure that the stack is properly aligned when that other function begins execution. And since the call instruction on x86-64 (which is how we would have arrived at our function) adjusts the stack pointer by 8 bytes (in addition to all the other work it does), our function must adjust by an additional 8 bytes to maintain 16-byte alignment. The compiler has chosen to use a push instruction to manipulate the stack pointer in this case. This instruction subtracts 8 bytes from the stack pointer and stores the indicated register into the memory at the new stack pointer.

Another way to do this would be to subtract 8 bytes from the stack pointer (sub $0x8, %rsp), which avoids writing the register to memory at all. If you compile with -O2, optimizing for speed, instead of -Os, you would indeed see the compiler using sub $0x8, %rsp. But since we compiled this code with -Os, optimizing for size, the compiler knows that the instruction for pushing a register onto the stack (2 bytes) is smaller than the instruction for subtracting 8 bytes from the stack pointer (4 bytes). Likewise, the instruction for popping a register from the stack (2 bytes) is smaller than the instruction for adding 8 bytes to the stack pointer (4 bytes).

These “useless” instructions are therefore doing real work, which is maintaining the contract of the ABI.

OK, so the efficiency claim has been addressed. What about the correctness claim with mismatched registers? Again, if you look at the aforementioned ABI document, section 3.2.3 describes how registers are used for function calls. The registers %r9 and %r10 are caller-saved registers, which means that a called function is free to overwrite any values stored in those registers. It doesn’t matter what value, if any, our function stores in %r10, because we know that if the caller had cared about the value, the caller would have stored that value away somewhere. Since we need spare registers for maintaining stack alignment via push and pop, caller-saved registers are ideal for pushing and popping with abandon.

In this case, it turned out that my understanding of what the program was doing had the bug, not the compiler.  It’s also worth pointing out that if the compiler really was mismatching register saves and restores, lots and lots of things would be broken.  The likelihood of the code produced in this instance being wrong—but the same problem not occurring in the millions of lines of code the compiler has compiled to produce the system on your computer—is vanishingly small.  The next time you see the compiler doing something weird, remember that the compiler is always right and try to figure out why it’s doing that.

(I should say, of course, that compilers, just like many other computer programs, do have bugs; GCC’s bugzilla or LLVM’s bugzilla would not exist otherwise, nor would bugfix releases continue to come out for your favorite compiler.  But in the vast, vast majority of cases, you have a bug to fix, not the compiler.)

May 14

my code search engine

Christian Legnitto wrote a blog post where he mentioned Firefox developers being forced to deal with “crufty code-search tools” (and many other perceived suboptimalities in the development process).  I’m looking forward to reading his followup, but I also thought it was worth blogging about what I use for my day-to-day code search needs.

I use Emacs’s rgreprgrep (and its cousin commands grep and lgrep) executes grep on a group of files in a given directory, recursively. The grep results are then displayed in a hyperlinked format for easy browsing between matches. Well, being Emacs, I use it with a few modifications of my own.  Here’s my setup.

First, a small utility I wrote for making quick selections with a single keypress:

(defun character-prompt (alist description)
  (message "Select [%s]: "
           (apply #'string (mapcar #'car alist)))
  (let* ((ch (save-window-excursion
               (select-window (minibuffer-window))
         (x (find ch alist :key #'car)))
      ((null x)
       (message (format "No %s for character: %s" description ch))
       (sleep-for 1)
       (character-prompt alist description))
      (t (cdr x)))))

This function gets used in the small wrapper I wrote around rgrep. Some preliminaries first, like where the Firefox tree lives, files that contain overly long lines and therefore mess with Emacs’s hyperlinking, and directories that I generally don’t deal with in my day-to-day work.

(defvar froydnj-mozilla-srcdir (expand-file-name "~/src/gecko-dev.git/"))
(defvar froydnj-mozilla-ignored-files
  (list "searchindex.js"
(defvar froydnj-mozilla-ignored-directories
  (list "nss" "nsprpub" "js/src/tests" "intl/icu"))

Next, the way I select subsets of files to search in. I learned after writing all this that rgrep already has built-in functionality for this (see the grep-files-aliases variable), but I like my setup better.

(defvar froydnj-mozilla-files
  '((?a . "*")                    ; All of it
    (?c . "*.[cm]*")              ; C/C++/Obj-C
    (?C . "*.[cmh]*")             ; Same, plus headers (and HTML, sadly)
    (?h . "*.h")
    (?H . "*.html")
    (?i . "*.idl")
    (?j . "*.js*")
    (?l . "*.lisp")
    (?m . "Makefile.in")
    (?p . "*.py")
    (?v . "*.java")
    (?w . "*.webidl")))

Finally, the wrapper itself, which prompts for the search pattern, the filename pattern, makes sure the directories and files above are ignored, and executes the search.

(defun froydnj-mozilla-rgrep ()
  (let ((regexp (grep-read-regexp))
        (files (character-prompt froydnj-mozilla-files "filename pattern"))
        (grep-find-ignored-files (append grep-find-ignored-files
        (grep-find-ignored-directories (append grep-find-ignored-directories
    (rgrep regexp files froydnj-mozilla-srcdir)))

One other bit that I find useful is a custom name for each buffer. By default, the rgrep results are deposited in a buffer named *grep* (likewise for grep and lgrep; the following advice applies to them automatically), and future greps overwrite this buffer. Having records of your greps lying around is occasionally useful, so I’ve changed the hook that determines the buffer name for rgrep. The comments that refer to compilation are from a previous job where it was useful to launch compilation jobs from within Emacs. I keep meaning to tweak those bits to launch mach in various configurations instead, but I haven’t done so yet.

(defun froydnj-compilation-buffer-name (major-mode-name)
  (let ((cfg-regexp "\\([-0-9_.a-z/+]+\\.cfg\\)"))
      ;; We're doing local compilation, stick the name of the release
      ;; configuration in the buffer name.
      ((or (string-match "^cd /scratch/froydnj/\\([^ ;]+\\)" command)
           (string-match "^build-config" command))
       (string-match cfg-regexp command)
       (concat "*compilation " (match-string 1 command) "*"))
      ;; We're doing remote compilation, note the machine name and
      ;; the release configuration name.
      ((string-match "^ssh \\([^ ]+\\)" command)
       (let ((machine (match-string 1 command)))
         (string-match cfg-regexp command)
         (concat "*compilation@" machine " " (match-string 1 command) "*")))
      ;; grep.el invokes compile, we might as well take advantage of that.
      ((string-equal major-mode-name "grep")
       (if (boundp 'regexp)
           (concat "*grep for " regexp "*")
      ;; We have no idea, just use the default.

(setq compilation-buffer-name-function 'froydnj-compilation-buffer-name)

Search times are comparable to web-based tools, and browsing results is more convenient. It has its shortcomings (overloaded C++ method names can be a pain to deal with, for instance), but it works well enough for 95%+ of my searching needs.

Apr 14

getting older

I have been reading The Eighth Day of Creation by Horace Freeland Judson, which is a superb book, and thought this passage was relevant to writing software as well as scientific research:

At lunch one day in Paris, early in December of 1975, I asked Monod whether he missed doing research directly. “Oh, I miss it,” he said; then what began as a shrug became instantaneously more thoughtful. “I do more than miss it. It’s too short a question.” He paused, began again. “No, I don’t know that it is actually working at the bench that I miss—miss so very much, although I do, at times; but it is in fact not being this permanent contact with what’s going on in science, in the doing, which I do miss.” I was reminded of a parallel conversation in which Watson had tried to claim the opposite, that he could stay close to what was happening in science. But if one was not actively working, Monod said, “Then you don’t have that. And also if you’re overburdened with general responsibilities, it becomes not so much a question of time but your subjective preoccupations. There’s a displacement—the internal conversation that you keep running in your head concerns all sorts of subjects, things that have got to be done, rather than just thinking about situations [in research]. That’s what bothers me most.”

When his term as director was up? “No, it’s too late to go back to research.” Why? Monod paused once more, and then said, “Well, you know, I always had a sort of amused and—amused, pitiful sympathy for the wonderful old guys who were still doing something at the bench when it was quite clear that whatever they did, it would be less than one hundredth of what they had been able to do before.” We spoke of examples—of scientists whose work became gradually less interesting as they aged, of others who lost their critical judgement and fooled themselves into believing they had solved problems that were beyond them…

[Kornberg said] “Almost every scientist winds up working on a problem he can’t bear to solve. And that’s where his life in science ends. It’s probably being very cruel to the older scientists, but I really believe it’s true. Or sometimes it’s a gradual loss of energy, of the ability to focus the energy on the problem. Or perhaps it’s a loss of edge—of the hunger. Some younger scientists—a few—have that quality that Francis has exemplified; he was ruthless in solving problems, I mean he would just carve them up and solve them in the most brutal way, much to the dismay of people like Chargaff who enjoyed the mystery of those problems and didn’t want to see it disappear, to them the mystery was the beauty of it….It probably does happen to all aging scientists.”

Feb 14

finding addresses of virtual functions

Somebody on #developers this morning wanted to know if there was an easy way to find the address of the virtual function that would be called on a given object…without a debugger. Perhaps this address could be printed to a logfile for later analysis. Perhaps it just sounds like an interesting exercise.

Since my first attempt had the right idea, but was incomplete in some details, I thought I’d share the actual solution. The solution is highly dependent on the details of the C++ ABI for your particular platform; the code below is for Linux and Mac. If somebody wants to write up a solution for Windows, where the size of pointer-to-member functions can vary (!), I’d love to see it.


/* AbstractClass may have several different concrete implementations.  */
class AbstractClass {
  virtual int f() = 0;
  virtual int g() = 0;

/* Return the address of the `f' function of `aClass' that would be
   called for the expression:


   regardless of the concrete type of `aClass'.

   It is left as an exercise for the reader to templatize this function for
   arbitrary `f'.  */
find_f_address(AbstractClass* aClass)
  /* The virtual function table is stored at the beginning of the object.  */
  void** vtable = *(void***)aClass;

  /* This structure is described in the cross-platform "Itanium" C++ ABI:


     The particular layout replicated here is described in:

     http://mentorembedded.github.io/cxx-abi/abi.html#member-pointers  */
  struct pointerToMember
    /* This field has separate representations for non-virtual and virtual
       functions.  For non-virtual functions, this field is simply the
       address of the function.  For our case, virtual functions, this
       field is 1 plus the virtual table offset (in bytes) of the function
       in question.  The least-significant bit therefore discriminates
       between virtual and non-virtual functions.

       "Ah," you say, "what about architectures where function pointers do
       not necessarily have even addresses?"  (ARM, MIPS, and AArch64 are
       the major ones.)  Excellent point.  Please see below.  */
    size_t pointerOrOffset;

    /* This field is only interesting for calling the function; it
       describes the amount that the `this' pointer must be adjusted
       prior to the call.  However, on architectures where function
       pointers do not necessarily have even addresses, this field has the

       2 * adjustment + (virtual_function_p ? 1 : 0)  */
    ptrdiff_t thisAdjustment;

  /* Translate from the opaque pointer-to-member type representation to
     the representation given above.  */
  pointerToMember p;
  int ((AbstractClass::*m)()) = &AbstractClass::f;
  memcpy(&p, &m, sizeof(p));

  /* Compute the actual offset into the vtable.  Given the differing meaing
     of the fields between architectures, as described above, and that
     there's no convenient preprocessor macro, we have to do this
     ourselves.  */
#if defined(__arm__) || defined(__mips__) || defined(__aarch64__)
  /* No adjustment required to `pointerOrOffset'.  */
  static const size_t pfnAdjustment = 0;
  /* Strip off the lowest bit of `pointerOrOffset'.  */
  static const size_t pfnAdjustment = 1;

  size_t offset = (p.pointerOrOffset - pfnAdjustment) / sizeof(void*);

  /* Now grab the address out of the vtable and return it.  */
  return vtable[offset];

Jan 14

space saving miscellany

Yesterday’s post on space saving techniques generated a few comments.  It seemed worthwhile to highlight a few of the comments for a wider audience.

  • Various people have pointed out that clang and GCC support a -Wpadded option to warn when padding is necessary inside of a structure.  Visual C++ supports warning C4280 that does the same thing.  You can enable this warning in Visual C++ by passing /we4280 on the compiler command line.  I’m fairly certain this warning would generate a lot of output, but it might be worthwhile to comb through the output and see if anything interesting turns up.
  • David Major pointed out the /d1reportAllClassLayout switch for Visual C++, which prints the class layout of all the classes in the compilation unit.  If you’re only interested in a single class, you can use /d1reportSingleClass$NAME to narrow the report down to the class with $NAME. GCC used to have something similar in -fdump-class-hierarchy, but that option has been removed.
  • Benoit Girard asked if he could see a list of the 50 largest things on a Linux build.  Forthwith, I present the 50 largest objects and the 50 largest functions in libxul for an x86-64 Linux optimized build.  One thing to note about the objects is that they’re not all in the same section; the seventh field in readelf output tells you the section index.  So for the linked list of objects above, section 15 is .rodata (read-only, shareable data), section 22 is .data (read-write non-shareable data), section 27 is .data.rel.ro (data that needs to have relocations applied at load time, but can be read-only thereafter, e.g. virtual function tables), and section 29 is .bss (zero-initialized memory).  Unsurprisingly, string encoding/decoding tables are the bulk of the large objects, with various bits from WebRTC, JavaScript, and the Gecko profiler also making an appearance.  Media codec-related functions appear to take up a large amount of space, along with some JavaScript functions, and a few things related to the HTML parser.
  • A commenter by the name of “AnimalFriend” correctly pointed out that what you really want to know is which structures both have a lot of instances hanging around and have holes that you could fill.  I don’t know of a good way to answer the first part without adding a lot of instrumentation (though perhaps you could catch a lot of cases by augmenting the MOZ_COUNT_CTOR macro to tell you which structures get allocated a lot). The second part can be answered by something like pahole.
  • Alternatively, you could use something like access_profiler to tell you what fields in your objects get accessed and how often, then carefully packing those fields into the same cache line.  The techniques access_profiler uses are also applicable to counting allocations of individual objects.  Maybe we should start using something more access_profiler-like instead of MOZ_COUNT_CTOR and friends!  Definitely more C++-ish, more flexible, and it eliminates the need to write the corresponding MOZ_COUNT_DTOR.

Jan 14

finding space savings

My last post talked about trimming down JSJitInfo.  But it left the question of how one might find things like this unanswered (and even unasked!).  Herein follows a short list of my strategies, with a large amount of explanation, for finding things that take up too much space.

The first thing to look at is what objects (as distinct from functions, see below) are taking up a lot of space (on Linux/Android):

readelf -sW dist/bin/libxul.so | awk '$4 == "OBJECT" { print }' | sort -k 3 -n -r | head -n 50 | c++filt

(I don’t know of a good way of doing this on Mac or Windows. Mach-O symbol tables on Mac don’t contain the sizes of objects like ELF symbol tables do. The sizes are derivable from the addresses, of course, but that’s non-trivial code to write and insert into the pipeline above. dumpbin on Windows appears to print sizes of symbols with /symbols, but its output is so verbose and there appears to be some sort of indirection between the symbol and the actual data associated with the symbol. If anybody has non-Linux/Android insight here, I would appreciate it!)

That pipeline is a mouthful, so breaking it down:

  • readelf -sW dist/bin/libxul.so: The -s argument tells readelf to print the symbol table from libxul.so. The -W argument says to format it for a wide display, so the long symbol names that are so common with C++ programming don’t get cut off.
  • awk '$4 == "OBJECT" { print }': The lines readelf prints out look like this:

    3: 00000000021ef9c0 0 FUNC GLOBAL DEFAULT 13 _fini@@xul29a1

    The interesting ones for our purposes are the third field (the size of the object), the fourth field (the kind of object), and the last field (the name of the object). The awk invocation is selecting all the symbols with a type of OBJECT.

  • sort -k 3 -n -r: Now that we have all of the information about the symbols, let’s sort them by their size. -k 3 selects the third field, which is the size, and -n says that we have a numeric field to sort on. -r is just for convenience to list the biggest symbols first.
  • head -n 50 selects the top fifty biggest symbols.
  • c++filt converts all the mangled C++ symbol names to something more human-readable. So instead of _Z25js_GetterOnlyPropertyStubP9JSContextN2JS6HandleIP8JSObjectEENS2_I4jsidEEbNS1_13MutableHandleINS1_5ValueEEE@@xul29a1, you get js_GetterOnlyPropertyStub(JSContext*, JS::Handle<JSObject*>, JS::Handle, bool, JS::MutableHandle)@@xul29a1, which is more understandable.

You don’t have to look at libxul, of course: you can look at whatever objects or executables are of interest to you.

Once you have this information in hand, you can start investigating. Maybe the object is an array of structures that need better bitfield packing.  Maybe the object is large and doesn’t get used.  Maybe we could do a better job of packing string data if there are embedded pointers which require relocations.  Maybe we could use smaller datatypes to store values (saved nearly a megabyte on 32-bit platforms!).  Maybe a structure’s fields just need to be reordered to pack better.  Maybe we have large, duplicated tables lying around.

For discovering whether structures are wasting space, I recommend pahole.  Specifically, I recommend using my fork of pahole, which contains some build fixes for non-Fedora systems and some improvements so pahole copes with modern C++ debug information better. My experience with pahole suggests that its options for only selecting structures with holes are buggy and unhelpful, so you’re going to have to examine the full output. Using the -M option to avoid printing out all the methods on individual objects is tremendously useful.

The second, but less fruitful, strategy is to look for large functions, rather than large objects:

readelf -sW dist/bin/libxul.so | awk '$4 == "FUNC" { print }' | sort -k 3 -n -r | head -n 50 | c++filt

This is less effective, because functions are often harder to trim down: there’s some amount of work that needs to be done and reducing that is difficult. The biggest functions also tend to be quite a bit smaller than the biggest objects in the program, so there’s less room for winning space back.

Even so, maybe you could find cases of excessive inlining.  Finding cases of excessive inlining is a little more difficult than just looking at the symbol table, but the principles are the same. Or you could find cases where lookup tables should be used instead of long if-else chains.  Or macro-constructed case statements could be made smaller.

The third strategy recognizes that sometimes individual functions or objects aren’t that large, but there are so many of them that they wind up costing you a significant amount of space.  For instance, if you do:

readelf -sW dist/bin/libxul.so | c++filt | grep '::Read' | grep StandardURLParams

you’ll notice that IPDL codegen has created several copies of the exact same function.  Perhaps you could find a way to teach the IPDL compiler to not do that.  Or maybe there are several copies of functions that differ only slightly; find a way to merge those functions together. Templates are great places for finding code duplication, maybe they could be restructured to duplicate less.

A word of caution: link-time optimization, which gets used on a majority of our major platforms, does essentially the same thing as some of the aforementioned by-hand merging strategies. So just because you get some huge win locally doesn’t mean that a release build is going to exhibit the same win. (I would argue that the by-hand merging is beneficial in any event, but that is a topic for some other time.)

Finally, you need to decide what is a reasonable amount of space savings for you to try and hunt down. There are likely several thousand places in Gecko where we could win back ten or twenty bytes. But finding them all is liable to be difficult. Changing them to effect the savings may require significant amounts of code gymnastics that not all reviewers would accept. My personal threshold is somewhere around 10K and varies with how difficult the patch is to write. If I can save 5K with a two-line, fairly obvious patch, I’ll do it. If I can save 12K, but the patch is more involved, it goes on the todo list for a rainy day.

Jan 14

packing structures for fun and profit

When the DOM bindings code first started generating information for the JavaScript JIT about getters and setters, the generated information was rather sparse. JSJitInfo, the container for such information, looked like this:

struct JSJitInfo {
    JSJitPropertyOp op;
    uint32_t protoID;
    uint32_t depth;
    bool isInfallible;
    bool isConstant;

On a 32-bit platform, sizeof(JSJitInfo) was 16. You could shrink some of the fields, and even pack things into bitfields, but there weren’t that many getters and setters, so it probably wasn’t worth it.

Of course, nothing ever gets smaller: the bindings code started generating these structures for methods, the amount of information the bindings code wanted to communicate to the JIT expanded, and some of the parallel JS work started using the structure for its own purposes. And a bunch more interfaces got converted to WebIDL, so many more of these “little” structures were generated. Eventually, we wound up with:

struct JSJitInfo {
    enum OpType { ... };
    enum AliasSet { ... };
    union {
        JSJitGetterOp getter;
        JSJitSetterOp setter;
        JSJitMethodOp method;
    uint32_t protoID;
    uint32_t depth;
    OpType type;
    bool isInfallible;
    bool isMovable;
    AliasSet aliasSet;
    bool isInSlot;
    size_t slotIndex;
    JSValueType returnType;
    const ArgType* const argTypes;
    JSParallelNative parallelNative;

This structure has several issues:

  • There’s wasted space after the isMovable and isInSlot fields, since the aliasSet and slotIndex fields need to be aligned on 4-byte boundaries (for 32-bit platforms). There’s even more wasted space on 64-bit platforms.
  • It’s less obvious, but JSValueType is only a single byte, so there’s wasted space after the returnType field so argTypes can be properly aligned.
  • OpType and AliasSet both have only a few values, yet take up four bytes each.
  • The argTypes field is only used for DOM methods, and even then, it isn’t used for very many.
  • The parallelNative field is never used for any of the DOM bindings code.
  • We’re unlikely to have four billion prototypes in the bindings. The web platform is large, but not that large. Similarly, we are unlikely to have an inheritance graph that’s four billion layers deep, or even several thousand layers deep.

This definition weighed in at 44 bytes on a 32-bit system. With 7k+ of these being generated by the bindings code, and more being added every release cycle, now seemed like a worthwhile time to slim these structures down.

This work has gone on in bug 952777, bug 960653, and yet-to-be-landed bug 960109. After all of those bugs land, we’ll have something that looks like:

struct JSJitInfo {
    union {
        JSJitGetterOp getter;
        JSJitSetterOp setter;
        JSJitMethodOp method;
        JSParallelNative parallelNative;
    uint16_t protoID;
    uint16_t depth;
    uint32_t type_ : 4;
    uint32_t aliasSet_ : 4;
    uint32_t returnType_ : 8;
    uint32_t isInfallible : 1;
    uint32_t isMovable : 1;
    uint32_t isInSlot : 1;
    uint32_t isTypedMethod : 1;
    uint32_t slotIndex : 12;

Compared to our earlier version, we’ve addressed every complaint:

  • No internally wasted space due to field alignment, on 32-bit or 64-bit platforms.
  • Enum fields don’t take up a full four bytes of space anymore; they take much closer to the minimum amount of space needed.
  • DOM methods with type information available now have a separate subclass, JSTypedMethodJitInfo, so the argTypes field is only present when needed.
  • The parallelNative field has been moved into the union, so we’re not wasting that field anymore.
  • The protoID and depth fields are now a more reasonable size.

It’s worth noting that several of these fields could be smaller, but there’s no reason for them to be currently, given that shrinking them wouldn’t make the overall structure smaller.

The final size of the structure is 12 bytes on 32-bit platforms, and 16 bytes on 64-bit platforms. With 7k+ JSJitInfo structures, that means we’ve saved ~220K of space in a 32-bit libxul.  For a 32-bit libxul, this is almost 1% of the entire size of libxul, which is an extremely pleasant win for size optimizations.  Smaller downloads, less space on your device, and less memory at runtime!

If you’re interested in examining how much space particular structures take up, you can use pahole for Linux or struct_layout for Mac. I’m not aware of any similar program for Windows, though I imagine such a beast does exist. These programs work by examining the debugging information generated by the compiler and displaying the structure layout, along with any “holes” in the structure. pahole will also tell you about cacheline boundaries, so that you can pack frequently-accessed members in the same cacheline to minimize cache misses.

Given the above (and that there have been other wins, somewhat less spectacular in nature), I’ve wondered if it isn’t worth adding some sort of annotations, either in-source or off to the side, about how big we expect certain structures to be or how much internal waste we expect them to have. Then people modifying those structures–and the reviewers as well–would be aware that they were modifying something space-critical, and take appropriate steps to minimize the space increase or even trim things down a bit.