Category: Programming

How to get the size of Rust types with -Zprint-type-sizes

Post author By Nicholas Nethercote
Post date November 9, 2018

When optimizing Rust code it’s sometimes useful to know how big a type is, i.e. how many bytes it takes up in memory. std::mem::size_of can tell you, but often you want to know the exact layout as well. For example, an enum might be surprisingly big, in which case you probably will want to know if, for example, there is one variant that is much bigger than the others.

The -Zprint-type-sizes option does exactly this. It isn’t enabled on release versions of rustc, so you’ll need to a nightly version of rustc. Here is one possible invocation:

cargo +nightly rustc -- -Zprint-type-sizes

It will print out details of the size, layout, and alignment of all types in use. For example, for this type:

enum E {
    A,
    B(i32),
    C(u64, u8, u64, u8),
    D(Vec<u32>),
}

it prints the following, plus info about a few built-in types:

print-type-size type: `E`: 32 bytes, alignment: 8 bytes
print-type-size     discriminant: 1 bytes
print-type-size     variant `A`: 0 bytes
print-type-size     variant `B`: 7 bytes
print-type-size         padding: 3 bytes
print-type-size         field `.0`: 4 bytes, alignment: 4 bytes
print-type-size     variant `C`: 23 bytes
print-type-size         field `.1`: 1 bytes
print-type-size         field `.3`: 1 bytes
print-type-size         padding: 5 bytes
print-type-size         field `.0`: 8 bytes, alignment: 8 bytes
print-type-size         field `.2`: 8 bytes
print-type-size     variant `D`: 31 bytes
print-type-size         padding: 7 bytes
print-type-size         field `.0`: 24 bytes, alignment: 8 bytes

It shows:

the size and alignment of the type;
for enums, the size of the discriminant;
for enums, the size of each variant;
the size, alignment, and ordering of all fields (note that the compiler has reordered variant C‘s fields to minimize the size of E);
the size and location of all padding.

Every detail you could possibly want is there. Brilliant!

For rustc developers, there’s an extra-special trick for getting the size of a type within rustc itself. Put code like this into a file a.rs:

#![feature(rustc_private)]
extern crate syntax;
use syntax::ast::Expr;
fn main() {
    let _x = std::mem::size_of::<Expr>();
}

and then compile it like this:

RUSTC_BOOTSTRAP=1 rustc -Zprint-type-sizes a.rs

I won’t pretend to understand how it works, but the use of rustc_private and RUSTC_BOOTSTRAP somehow let you see inside rustc while using it, rather than while compiling it. I have used this trick for PRs such as this one.

Firefox Memory consumption MemShrink Programming

Slimmer and simpler static atoms

Post author By Nicholas Nethercote
Post date October 12, 2018
2 Comments on Slimmer and simpler static atoms

String interning is:

a method of storing only one copy of each distinct string value, which must be immutable. Interning strings makes some string processing tasks more time- or space-efficient at the cost of requiring more time when the string is created or interned. The distinct values are stored in a string intern pool. The single copy of each string is called its intern.

In Firefox’s code we use the term atom rather than intern, and atom table rather than string intern pool. I don’t know why; those names have been used for a long time.

Furthermore, Firefox distinguishes between static atoms, which are those that are chosen at compile time and can be directly referred to via an identifier, and dynamic atoms, which are added on-demand at runtime. This post is about the former.

In 2016, Firefox’s implementation of static atoms was complex and inefficient. I filed a bug about this that included the following ASCII diagram showing all the data structures involved for a single atom for the string “foobar”.

static nsFakeStringBuffer<N=7> foobar_buffer (.data, 8+2N bytes)
/-----------------------------------------\ <------+
| int32_t mRefCnt = 1 // never reaches 0  |        | 
| uint32_t mSize = 14 // 7 x 16-bit chars |        | 
| u"foobar"           // the actual chars | <----+ | 
\-----------------------------------------/      | | 
                                                 | | 
PermanentAtomImpl (heap, 32 bytes)               | | 
/----------------------------------------------\ | | <-+
| void* vtablePtr    // implicit               | | |   | 
| uint32_t mLength = 6                         | | |   | 
| uint32_t mHash = ...                         | | |   | 
| char16_t* mString = @------------------------|-+ |   | 
| uintptr_t mRefCnt  // from NS_DECL_ISUPPORTS |   |   | 
\----------------------------------------------/   |   | 
                                                   |   | 
static nsIAtom* foobar (.bss, 8 bytes)             |   | 
/---\ <-----------------------------------+        |   | 
| @-|-------------------------------------|------------+
\---/                                     |        |   | 
                                          |        |   | 
static nsStaticAtom (.d.r.ro.l, 16 bytes) |        |   | 
(this element is part of a larger array)  |        |   | 
/------------------------------------\    |        |   | 
| nsStringBuffer* mStringBuffer = O--|----|--------+   | 
| nsIAtom** mAtom = @----------------|----+            | 
\------------------------------------/                 | 
                                                       | 
AtomTableEntry (heap, ~2 x 16 bytes[*])                | 
(this entry is part of gAtomTable)                     | 
/-------------------------\                            | 
| uint32_t mKeyHash = ... |                            | 
| AtomImpl* mAtom = @-----|----------------------------+
\-------------------------/                            | 
                                                       | 
StaticAtomEntry (heap, ~2 x 16 bytes[*])               | 
(this entry is part of gStaticAtomTable)               | 
/-------------------------\                            | 
| uint32_t mKeyHash = ... |                            | 
| nsIAtom* mAtom = @------|----------------------------+
\-------------------------/

[*] Each hash table is half full on average, so each entry takes up
approximately twice its actual size.

There is a lot going on in that diagram, but putting that all together gave the following overhead per atom.

Static shared: 0 bytes
Static unshared: 8 + 2(length+1) + 8 + 16
Dynamic: 32 + ~32 + ~32 bytes
Total bytes: (2(length+1) + 64 + ~64) * num_processes

(Although these atoms are “static” in the sense of being known at compile-time, a lot of the associated data was allocated dynamically.)

At the time there were about 2,700 static atoms, and avg_length was about 11, so the overhead was roughly:

0 bytes fixed, and
410,400 bytes per process. (Or more, depending on how the relocations required for the static pointers were represented, which depended on the platform.)

Today, things have improved greatly and now look like the following.

const char16_t[7] (.rodata, 2(N+1) bytes)
(this is detail::gGkAtoms.foobar_string)
/-----------------------------------------\ <--+
| u"foobar"           // the actual chars |    | 
\-----------------------------------------/    | 
                                               | 
const nsStaticAtom (.rodata, 12 bytes)         | 
(this is within detail::gGkAtoms.mAtoms[])     | 
/-------------------------------------\ <---+  | 
| uint32_t mLength:30 = 6             |     |  | 
| uint32_t mKind:2 = AtomKind::Static |     |  | 
| uint32_t mHash = ...                |     |  | 
| uint32_t mStringOffset = @----------|-----|--+
\-------------------------------------/     | 
                                            | 
constexpr nsStaticAtom* (0 bytes) @---------+
(this is nsGkAtoms::foobar)                 | 
                                            | 
AtomTableEntry (heap, ~2 x 16 bytes[*])     | 
(this entry is part of gAtomTable)          | 
/-------------------------\                 | 
| uint32_t mKeyHash = ... |                 | 
| nsAtom* mAtom = @-------|-----------------+
\-------------------------/

[*] Each hash table is half full on average, so each entry takes up
approximately twice its actual size.

That gives the following overhead per atom.

Static shared: 12 + 2(length+1) bytes
Static unshared: 0 bytes
Dynamic: ~32 bytes
Total: 12 + 2(length+1) + ~32 * num_processes

We now have about 2,300 static atoms and avg_length is still around 11, so the overhead is roughly:

82,800 bytes fixed, and
73,600 bytes per process.

I won’t explain all the parts of the two diagrams, but it can be seen that we’ve gone from six pieces per static atom to four; the size and complexity of the remaining pieces are greatly reduced; there are no static pointers (only constexpr pointers and integral offsets) and thus no relocations; and there is a lot more interprocess sharing thanks to more use of const. Also, there is no need for a separate static atom table any more, because the main atom table is thread-safe and the HTML5 parser (the primary user of the separate static atom table) now has a small but highly effective static atoms cache.

Things that aren’t visible from the diagrams: atoms are no longer exposed to JavaScript code via XPIDL, there are no longer any virtual methods involved, and all atoms are defined in a single place (with no duplicates) instead of 7 or 8 different places. Notably, the last few steps were blocked for some time by a bug in MSVC involving the handling of constexpr.

The bug dependency tree gives a good indication of how many separate steps were involved in this work. If there is any lesson to be had here, it’s that small improvements add up over time.

Performance Programming Work habits

Ad Hoc Profiling

I have used a variety of profiling tools over the years, including several I wrote myself.

But there is one profiling tool I have used more than any other. It is capable of providing invaluable, domain-specific profiling data of a kind not obtainable by any general-purpose profiler.

It’s a simple text processor implemented in a few dozen lines of code. I use it in combination with logging print statements in the programs I am profiling. No joke.

Post-processing

The tool is called counts, and it tallies line frequencies within text files, like an improved version of the Unix command chain sort | uniq -c. For example, given the following input.

a 1
b 2
b 2
c 3
c 3
c 3
d 4
d 4
d 4
d 4

counts produces the following output.

10 counts:
(  1)        4 (40.0%, 40.0%): d 4
(  2)        3 (30.0%, 70.0%): c 3
(  3)        2 (20.0%, 90.0%): b 2
(  4)        1 (10.0%,100.0%): a 1

It gives a total line count, and shows all the unique lines, ordered by frequency, with individual and cumulative percentages.

Alternatively, when invoked with the -w flag, it assigns each line a weight, determined by the last integer that appears on the line (or 1 if there is no such integer). On the same input, counts -w produces the following output.

30 counts:
(  1)       16 (53.3%, 53.3%): d 4
(  2)        9 (30.0%, 83.3%): c 3
(  3)        4 (13.3%, 96.7%): b 2
(  4)        1 ( 3.3%,100.0%): a 1

The total and per-line counts are now weighted; the output incorporates both frequency and a measure of magnitude.

That’s it. That’s all counts does. I originally implemented it in 48 lines of Perl, then later rewrote it as 48 lines of Python, and then later again rewrote it as 71 lines of Rust.

In terms of benefit-to-effort ratio, it is by far the best code I have ever written.

`counts` in action

As an example, I added print statements to Firefox’s heap allocator so it prints a line for every allocation that shows its category, requested size, and actual size. A short run of Firefox with this instrumentation produced a 77 MB file containing 5.27 million lines. counts produced the following output for this file.

5270459 counts:
( 1) 576937 (10.9%, 10.9%): small 32 (32)
( 2) 546618 (10.4%, 21.3%): small 24 (32)
( 3) 492358 ( 9.3%, 30.7%): small 64 (64)
( 4) 321517 ( 6.1%, 36.8%): small 16 (16)
( 5) 288327 ( 5.5%, 42.2%): small 128 (128)
( 6) 251023 ( 4.8%, 47.0%): small 512 (512)
( 7) 191818 ( 3.6%, 50.6%): small 48 (48)
( 8) 164846 ( 3.1%, 53.8%): small 256 (256)
( 9) 162634 ( 3.1%, 56.8%): small 8 (8)
( 10) 146220 ( 2.8%, 59.6%): small 40 (48)
( 11) 111528 ( 2.1%, 61.7%): small 72 (80)
( 12) 94332 ( 1.8%, 63.5%): small 4 (8)
( 13) 91727 ( 1.7%, 65.3%): small 56 (64)
( 14) 78092 ( 1.5%, 66.7%): small 168 (176)
( 15) 64829 ( 1.2%, 68.0%): small 96 (96)
( 16) 60394 ( 1.1%, 69.1%): small 88 (96)
( 17) 58414 ( 1.1%, 70.2%): small 80 (80)
( 18) 53193 ( 1.0%, 71.2%): large 4096 (4096)
( 19) 51623 ( 1.0%, 72.2%): small 1024 (1024)
( 20) 45979 ( 0.9%, 73.1%): small 2048 (2048)

Unsurprisingly, small allocations dominate. But what happens if we weight each entry by its size? counts -w produced the following output.

2554515775 counts:
( 1) 501481472 (19.6%, 19.6%): large 32768 (32768)
( 2) 217878528 ( 8.5%, 28.2%): large 4096 (4096)
( 3) 156762112 ( 6.1%, 34.3%): large 65536 (65536)
( 4) 133554176 ( 5.2%, 39.5%): large 8192 (8192)
( 5) 128523776 ( 5.0%, 44.6%): small 512 (512)
( 6) 96550912 ( 3.8%, 48.3%): large 3072 (4096)
( 7) 94164992 ( 3.7%, 52.0%): small 2048 (2048)
( 8) 52861952 ( 2.1%, 54.1%): small 1024 (1024)
( 9) 44564480 ( 1.7%, 55.8%): large 262144 (262144)
( 10) 42200576 ( 1.7%, 57.5%): small 256 (256)
( 11) 41926656 ( 1.6%, 59.1%): large 16384 (16384)
( 12) 39976960 ( 1.6%, 60.7%): large 131072 (131072)
( 13) 38928384 ( 1.5%, 62.2%): huge 4864000 (4866048)
( 14) 37748736 ( 1.5%, 63.7%): huge 2097152 (2097152)
( 15) 36905856 ( 1.4%, 65.1%): small 128 (128)
( 16) 31510912 ( 1.2%, 66.4%): small 64 (64)
( 17) 24805376 ( 1.0%, 67.3%): huge 3097600 (3100672)
( 18) 23068672 ( 0.9%, 68.2%): huge 1048576 (1048576)
( 19) 22020096 ( 0.9%, 69.1%): large 524288 (524288)
( 20) 18980864 ( 0.7%, 69.9%): large 5432 (8192)

This shows that the cumulative count of allocated bytes (2.55GB) is dominated by a mixture of larger allocation sizes.

This example gives just a taste of what counts can do.

(An aside: in both cases it’s good the see there isn’t much slop, i.e. the difference between the requested sizes and actual sizes are mostly 0. That 5432 entry at the bottom of the second table is curious, though.)

Other Uses

This technique is often useful when you already know something — e.g. a general-purpose profiler showed that a particular function is hot — but you want to know more.

Exactly how many times are paths X, Y and Z executed? For example, how often do lookups succeed or fail in data structure D? Print an identifying string each time a path is hit.
How many times does loop L iterate? What does the loop count distribution look like? Is it executed frequently with a low loop count, or infrequently with a high loop count, or a mix? Print the iteration count before or after the loop.
How many elements are typically in hash table H at this code location? Few? Many? A mixture? Print the element count.
What are the contents of vector V at this code location? Print the contents.
How many bytes of memory are used by data structure D at this code location? Print the byte size.
Which call sites of function F are the hot ones? Print an identifying string at the call site.

Then use counts to aggregate the data. Often this domain-specific data is critical to fully optimize hot code.

Worse is better

Print statements are an admittedly crude way to get this kind of information, profligate with I/O and disk space. In many cases you could do it in a way that uses machine resources much more efficiently, e.g. by creating a small table data structure in the code to track frequencies, and then printing that table at program termination.

But that would require:

writing the custom table (collection and printing);
deciding where to define the table;
possibly exposing the table to multiple modules;
deciding where to initialize the table; and
deciding where to print the contents of the table.

That is a pain, especially in a large program you don’t fully understand.

Alternatively, sometimes you want information that a general-purpose profiler could give you, but running that profiler on your program is a hassle because the program you want to profile is actually layered under something else, and setting things up properly takes effort.

In contrast, inserting print statements is trivial. Any measurement can be set up in no time at all. (Recompiling is often the slowest part of the process.) This encourages experimentation. You can also kill a running program at any point with no loss of profiling data.

Don’t feel guilty about wasting machine resources; this is temporary code. You might sometimes end up with output files that are gigabytes in size. But counts is fast because it’s so simple… and the Rust version is 3–4x faster than the Python version, which is nice. Let the machine do the work for you. (It does help if you have a machine with an SSD.)

Ad Hoc Profiling

For a long time I have, in my own mind, used the term ad hoc profiling to describe this combination of logging print statements and frequency-based post-processing. Wikipedia defines “ad hoc” as follows.

In English, it generally signifies a solution designed for a specific problem or task, non-generalizable, and not intended to be able to be adapted to other purposes

The process of writing custom code to collect this kind of profiling data — in the manner I disparaged in the previous section — truly matches this definition of “ad hoc”.

But counts is valuable specifically because it makes this type of custom profiling less ad hoc and more repeatable. I should arguably call it “generalized ad hoc profiling” or “not so ad hoc profiling”… but those names don’t have quite the same ring to them.

Tips

Use unbuffered output for the print statements. In C and C++ code, use fprintf(stderr, ...). In Rust code use eprintln!. (Update: Rust 1.32 added the dbg! macro, which also works well.)

Pipe the stderr output to file, e.g. firefox 2> log.

Sometimes programs print other lines of output to stderr that should be ignored by counts. (Especially if they include integer IDs that counts -w would interpret as weights!) Prepend all logging lines with a short identifier, and then use grep $ID log | counts to ignore the other lines. If you use more than one prefix, you can grep for each prefix individually or all together.

Occasionally output lines get munged together when multiple print statements are present. Because there are typically many lines of output, having a few garbage ones almost never matters.

It’s often useful to use both counts and counts -w on the same log file; each one gives different insights into the data.

To find which call sites of a function are hot, you can instrument the call sites directly. But it’s easy to miss one, and the same print statements need to be repeated multiple times. An alternative is to add an extra string or integer argument to the function, pass in a unique value from each call site, and then print that value within the function.

It’s occasionally useful to look at the raw logs as well as the output of counts, because the sequence of output lines can be informative. For example, I recently diagnosed an occurrences of quadratic behaviour in the Rust compiler by seeing that a loop iterated 1, 2, 3, …, 9000+ times.

The Code

counts is available here.

Conclusion

I use counts to do ad hoc profiling all the time. It’s the first tool I reach for any time I have a question about code execution patterns. I have used it extensively for every bout of major performance work I have done in the past few years, as well as in plenty of other circumstances. I even built direct support for it into rustc-perf, the Rust compiler’s benchmark suite, via the profile eprintln subcommand. Give it a try!

GDB Programming

Using GDB to get stack traces at particular program points

Post author By Nicholas Nethercote
Post date February 13, 2015
2 Comments on Using GDB to get stack traces at particular program points

A while back I wrote how I sometimes use Valgrind to print a stack trace every time a particular program point is reached. I just learned how to do likewise with GDB. Here’s an example session that illustrates what to do.

(gdb) break je_chunk_alloc_mmap     # set a breakpoint
Breakpoint 1 at 0x1aa32c0
(gdb) commands                      # enter breakpoint command sequence
Type commands for breakpoint(s) 1, one per line.
End with a line saying just "end".
>bt                                 # print stack trace
>c                                  # continue execution
>end                                # end command sequence
(gdb) set pagination off            # don't pause when the screen fills up
(gdb) set logging on                # copy output to a file
Copying output to gdb.txt.

This is better than the Valgrind technique because (a) it doesn’t require modifying the source code, and (b) programs tend to run faster under GDB than they do under Valgrind.

Thanks to Mike Hommey for helping with this.

Programming

Premature Optimisation

Post author By Nicholas Nethercote
Post date November 25, 2013

I loved this sentence from Olin Shivers’ description of some Scheme history:

I fashionably decried premature optimisation in college without really understanding it until I once committed an act of premature opt so horrific that I can now tell when it is going to rain by the twinges I get in the residual scar tissue. Now I understand premature optimisation.

I’d love to know exactly what the premature optimisation was.

I also read Olin’s Dissertation Advice about fifty times in 2004. Great stuff.

about:memory MemShrink Programming

Libraries should permit custom allocators

Post author By Nicholas Nethercote
Post date November 8, 2013

Some C and C++ libraries permit the use of custom allocators, which are registered through some kind of external API. For example, the following libraries used by Firefox provide this facility.

FreeType provides this via the FT_MemoryRec_ argument of the FT_New_Library() function.
ICU provides this via the u_setMemoryFunctions() function.
SQLite provides this via the sqlite3_config() function.

This gives the users of these libraries additional flexibility that can be very helpful. For example, in Firefox we provide custom allocators that measure the size of all the live allocations done by the library; these measurements are shown in about:memory.

In contrast, libraries that don’t allow custom allocator are very hard to account for in about:memory. Such libraries are major contributors to the dreaded “heap-unclassified” value in about:memory. These include Cairo and the WebRTC libraries.

Now, supporting custom allocators in a library takes some effort. You have to be careful to always allocate in a fashion that will use the custom allocators if they have been registered. Direct calls to vanilla allocation/free functions like malloc(), realloc(), and free() must be avoided. For example, SpiderMonkey allows custom allocators (although Firefox doesn’t need to use that functionality), and I just fixed a handful of cases where it was accidentally using vanilla allocation/free functions.

But, it’s a very useful facility to provide, and I encourage all library writers to consider it.

Firefox Programming Software Engineering

Duplicated abstraction layers in Firefox

Post author By Nicholas Nethercote
Post date May 12, 2011
3 Comments on Duplicated abstraction layers in Firefox

Just about every operating system provides a mechanism for directly allocating and deallocating memory at the page level (ie. not malloc/free or new/delete). The functions to do this vary from OS to OS:

Windows: VirtualAlloc/FreeAlloc.
Posix (e.g. Mac and Linux): mmap/munmap.
Mac also has: vm_allocate/vm_deallocate.

So it’s very natural to add an abstraction layer: your own functions (let’s call them Map and Unmap) that use conditional compilation to choose the appropriate OS-specific call.

An abstraction layer like this appears in lots of software projects. Firefox happens to incorporate code from a lot of other projects, and so what happens is you end up with lots of duplicate abstraction layers. For example, in the JS engine alone we have five Map/Unmap abstraction layers.

In js/src/jsgcchunk.cpp, used to allocate chunks for the GC heap.
in js/src/vm/Stack.cpp, used to allocate some stack space.
In js/src/nanojit/avmplus.cpp, used to allocate space for code generated by the trace JIT.
In js/src/assembler/jit/ExecutableAllocator*.cpp, used to allocate space for code generated by the method JIT.
In js/src/ctypes/libffi/src/dlmalloc.c, used to allocate chunks of memory that are handed out in pieces by the heap allocator defined in that file.

The duplication of 3, 4 and 5 are understandable — they all involve large chunks of code that were imported from other projects. (Furthermore, you can see that ctypes/ has its own heap allocator, thus duplicating jemalloc’s functionality.) The duplication between 1 and 2 is less forgiveable; neither of those cases were imported and so they should share an abstraction layer.

How many other Map/Unmap abstraction layers are there in the rest of Firefox? The JS engine may be more guilty of this than other parts of the code. Is there a sane way to avoid this duplication in a world where we import code from other projects?

Programming Valgrind

Using Valgrind to get stack traces

Post author By Nicholas Nethercote
Post date January 11, 2011
10 Comments on Using Valgrind to get stack traces

Sometimes I want to do some printf-style debugging where I print not only some values, but also the stack trace each time a particular code point is hit. GNU provides a backtrace() function that supposedly does this, but I tried it and got hopeless results, little more than code addresses.

Fortunately, you can do this pretty easily with Valgrind. First, add this line somewhere in your source code:

  #include <valgrind/valgrind.h>

Then, at the point where you want to print the stack trace, add this:

  VALGRIND_PRINTF_BACKTRACE("foo");

You can of course print something other than “foo”. In fact, VALGRIND_PRINTF_BACKTRACE is a variadic printf-style function, so you can do stuff like this:

  VALGRIND_PRINTF_BACKTRACE("%s: %d\n", str, i);

You then have to run the program under Valgrind as usual, except you probably should use --tool=none because that’ll run the quickest.

This is a trick I find occasionally invaluable.

Go Programming

Another Go at language design

Post author By Nicholas Nethercote
Post date September 10, 2010
2 Comments on Another Go at language design

The other day I attended a very interesting talk at the University of Melbourne given by Rob Pike. The title was “Another Go at language design” and it was all about Google’s new programming language, Go. It was a high level overview of the language with an emphasis on why certain design decisions were made.

Here is a random selection of the things that struck me as most interesting.

Pike lamented the state of modern industrial languages, by which he meant C++ and Java. He started with a quote from Dick Gabriel “Old programs read like quiet conversations between a well-spoken research worker and a well-studied mechanical colleague, not as a debate with a compiler.” In particular, one of the aims of Go is to show programmers whose only exposure to static typing is through C++ and Java that statically typed languages can have the simpler, more concise feel (fewer declarations!) of dynamically typed languages like JavaScript and Python.
Compile times of Go programs are small. This is because the compiler doesn’t need to know about transitive dependencies between packages (their name for modules). For example, if you have a file A.go which depends on B.go which depends on C.go, at first you have to compile C.go, then B.go, then A.go. But all the information exported from A.go is stored in the compiled B.o file, which means that if you recompile A.go you don’t have to recompile anything else. Hmm, now I’m unsure if that’s exactly right. But the broader point is that the language avoids C++’s problem where bazillions of header files have to be read for every module. He said with a completely straight face that their goal was to have a 1,000,000x speed-up over C++ for the compilation time of large programs, though they’d probably be satisfied with 100,000x. Impressive! Imagine if Firefox compiled in less than a second.
Identifiers that start with an upper-case letter are public, and identifiers that start with a lower-case letter are private. The other language I’m familiar with that has a similar distinction is Haskell, where identifiers that start with an upper-case letter are used for types, and identifiers that start with a lower-case letter are used for values. The nice thing about Go’s approach is that it gives you strictly more information: you can determine from an identifier’s use point whether it’s public or private, which saves you from having to find it’s declaration.
There is no automatic conversion between numeric types. This is for simplicity; Pike said (probably exaggerating) that 1/3 of the C standard deals with this topic. But the handling of numeric constants avoids many cases where explicit conversions are needed, because numeric constants are platonic in the sense that they don’t have a particular type until they are assigned to a variable.
They have a reformatting program, gofmt, that rewrites Go code into an “approved” layout. This ensures that all Go code looks consistent, and avoids fights over style. (Robert O’Callahan would approve!) Interestingly, gofmt is separate from the compiler, and the idea is that you run it once you have finished a change. At Google when code is committed into a repository, gofmt is run and if the layout doesn’t match the code is rejected. I asked why they made it a separate program, rather than having stronger syntax/layout checking in of the compiler. He said that (a) they hadn’t thought of putting it in the compiler, (b) it seemed like it would be less painful to just run it occasionally rather than having to worry about it every time you compile, and (c) it would slow down the compiler.

It was a good talk, Pike is an engaging speaker. I’m glad he made the trip to Melbourne.

Correctness Programming Tracemonkey

A win for code hygiene

Post author By Nicholas Nethercote
Post date February 1, 2010

I’ve written recently about clean-ups I’ve been making in Nanojit — simplifying code, refactoring, adding more internal sanity checking. I’m a firm believer that these kinds of changes are worth doing, that “if it ain’t broke, don’t fix it” is not always true. But there’s always a voice in the back of my head wondering “am I just polishing this code for little gain? Should I be working on something else instead?” Yesterday I received confirmation that this work has been worthwhile. But before I can explain why, I need to give some background.

Nanojit serves as the back-end of TraceMonkey. It converts LIR (a low-level intermediate representation) generated by TraceMonkey’s front-end into native code. So Nanojit is a critical part of TraceMonkey. And TraceMonkey is a critical part of Firefox.

However, code generators (and register allocators) are tricky beasts — easy to get wrong and hard to debug. (A long time ago I heard someone, I can’t remember who, say “code generators and garbage collectors should crash as early and noisily as possible”.) So with a piece of code this important and inherently difficult, it’s a good idea to arm yourself with some weapons that give you confidence that it is correct.

Weapon 1: clean IR semantics

One way to do this is to give your IR clean semantics. LIR’s semantics are mostly clean, but there are a few nasty cases. One of the worst is the ‘ov’ instruction. It performs an overflow check on an arithmetic (add, mul, sub, neg) operation.

JavaScript numbers are doubles by default, but TraceMonkey sometimes demotes them to integers for speed. TraceMonkey has to check for integer overflow in these cases; if an integer does overflow TraceMonkey has to step back and use doubles for the computation.

Here’s some example LIR that uses ‘ov’:

 ld6 = ld sp[-8]
 add1 = add ld6, 1
 ov1 = ov add1
 xt4: xt ov1 -> pc=0x883ed8 imacpc=0x0 sp+0 rp+0 (GuardID=218)

The first instructions loads a value from the stack. The second instruction adds 1 to that value. The third instruction performs an overflow check and puts the result in ‘ov1’. The fourth instruction is a guard; if ‘ov1’ is true it exits the code block.

So why is ‘ov’ semantically nasty? First consider ‘add’, which is a semantically cleaner instruction. Its result (in this case ‘add1’) is a function of its inputs (‘ld6’ and 1). In comparison, ‘ov’ does not have this property. The ‘ov’ doesn’t really take ‘add1’ as an input — you can’t tell by looking at ‘add1’ whether the addition overflowed. In fact you can’t really understand what is happening here at the LIR level; you have to know what happens in the native code that is generated from this LIR. It turns out that the native code for the ‘add’ also sets some condition codes, and the native code generated for the ‘ov’/’xt’ pair inspects those condition codes. For this to work, the ‘add’ has to immediately precede the ‘ov’; if it does not, whatever intermediate native code that is generated could overwrite the condition codes, in which case the behaviour of the guard becomes completely unpredictable. This is obviously bad.

The real problem is that the addition has two outputs: the result, and the overflow status indicator (which maps to condition codes in the hardware). But LIR has no explicit way to represent that second output. So the second output is implicit, and an extra constraint must be imposed on the LIR instead, i.e. ‘ov’ must immediately follow an arithmetic operation. Because this constraint is so arbitrary it’s easy to forget and easy to break.

In May 2009 Julian Seward opened a bug to improve LIR’s semantics, giving five suggestions. It generated a lot of discussion and Julian drafted a patch to replace ‘ov’ with some cleaner opcodes, but there was enough disagreement that it never went anywhere. But I didn’t forget about it, and ‘ov’ has annoyed me enough times recently that I decided to resurrect Julian’s idea, this time in a new bug to escape the baggage of the old one. The idea is simple: if ‘ov’ always has to follow an arithmetic operation, then why not create new opcodes that fuse the two parts together? These new opcodes are ‘addxov’, ‘subxov’ and ‘mulxov’. With my patch the code from above now looks like this:

 ld6 = ld sp[-8]
 add1 = addxov ld6, 1 -> pc=0x883ed8 imacpc=0x0 sp+0 rp+0 (GuardID=218)

The ‘addxov’ adds ‘ld6’ and 1, puts the result in ‘add1’, and exits if there was an overflow. The generated native code is unchanged, but the implicit output of the addition is now hidden within a single LIR instruction, and it’s impossible for overflow checks in the native code to become separated from the potentially-overflowing arithmetic operation.

Weapon 2: strong IR sanity checking

Compilers are usually built in a pipeline fashion, where you have multiple passes over the code representation, and each pass does a task such as optimisation, register allocation or code generation. It’s also a really good idea to have a pass that does a thorough sanity check of the code representation.

In November 2008 Jim Blandy filed a bug suggesting such a pass for Nanojit, one that performs type-checking of the LIR. The bug sat untouched for over six months until some extra discussion occurred and then (once again) Julian wrote a patch. It found at least one case where TraceMonkey was generating bad LIR code, but again the bug stalled, this time because we spent three months merging Mozilla’s and Adobe’s copies of Nanojit. I resurrected the bug again recently and added some “structure checking” for things like the ‘ov’ constraint, and it landed in the TraceMonkey repository on January 21st. Happily, my version of the checker was simpler than Julian’s; this was possible because LIR had had some other semantic clean-ups (e.g. properly distinguishing 64-bit integers from 64-bit floats). My patch replaced a very basic type-checker (called “SanityFilter”) that had been in TraceMonkey, and immediately found two bugs, one of which involved ‘ov’.

Synergy

It’s not often that a bug report involving your code makes you smile. Yesterday Gary Kwong hit an assertion failure in Nanojit when fuzz-testing:

    LIR structure error (end of writer pipeline):
    in instruction with opcode: ov
    argument 1 has opcode: add
    it should be: located immediately prior, but isn't
  One way to debug this:  change the failing NanoAssertMsgf(0, ...) call to a
  printf(...) call and rerun with verbose output.  If you're lucky, this error
  message will appear before the block containing the erroneous instruction.

In other words, there’s an ‘ov’ following an ‘add’, but there are one or more other LIR instructions between them. This means that the code generated will be bogus, as I explained above. This bug may have been in TraceMonkey for a long time, but it wasn’t detected until strong internal sanity checking was added. The good news is that the ‘ov’-removal patch (which hasn’t landed as it’s still awaiting review) will fix this patch.

Lessons learned

This story tied together a number of related ideas for me. In particular:

Clean IR semantics are worth the effort. Hacks may save time initially but they will come back to bite you.
Strong IR sanity checking is worth the effort. (And cleaner semantic allows for stronger checking.)
Listen to Julian, especially when he talks about correctness. (And read his code! VEX/pub/libvex_ir.h in the Valgrind source code describes Valgrind’s IR, which is a wonderfully clean IR, particularly in the way it separates statements (which have side-effects) from expressions (which don’t).)
Don’t put too many ideas into one bug report. Too much discussion can kill a bug, or at least put it in a coma.
Follow-through is important. A patch can’t do much good unless it lands.
Fuzz testing is awesome (but we already knew that).

Post-processing

counts in action

Other Uses

Worse is better

Ad Hoc Profiling

Tips

The Code

Conclusion

Weapon 1: clean IR semantics

Weapon 2: strong IR sanity checking

Synergy

Lessons learned

`counts` in action