Category Archives: Correctness

Asynchronous database connections must be closed with asyncClose()

TL;DR:  if you’re familiar with any Mozilla (C++ or JS) code that opens an async database connection, please go and check that it calls asyncClose() to close the connection;  not doing so can lead to crashes and/or memory leaks.

Specifically, in Firefox 6, when such a connection is destroyed (because it’s explicitly deallocated or its refcount drops to zero in C++, or it’s garbage collected in JS) it’ll currently cause sporadic crashes in about:memory or telemetry code or Test Pilot code.  This is because memory reporters end up pointing to connections that have been freed, and so when they are used they end up (innocently) accessing memory they shouldn’t.

I’ve opened bug 662989 to avoid the crashes, but even once it’s implemented, I think that if you fail to call asyncClose() it will still cause a memory leak, because sqlite3_close() is never called on the connection and so SQLite won’t free up memory used by the connection.  Also, connections that needlessly remain open can have active locks that can starve out other connections or cause extreme growth of the write-ahead-log.

As I write this, thanks to Marco Bonardo, I’m aware of three places in JS code that fail to call asyncClose() when they should:

  • browser/components/preferences/aboutPermissions.js.  Bug 654573 covers this, I’m about to land a patch to fix it.
  • toolkit/components/contentprefs/nsContentPrefService.js.  Bug 662986 covers this.
  • browser/app/profile/extensions/testpilot@labs.mozilla.com/modules/experiment_data_store.js.  Bug 662985 covers this.  Note that Test Pilot seems to fail to use asyncClose() when it should, and it also uses memory reporters.  So it’s unclear which of these two things is responsible for the Test Pilot crashes of this sort seen so far;  in other words, Test Pilot may be a culprit or an innocent bystander, or both.

These were found via a crude MXR search.  I haven’t looked for C++ code that makes this mistake.

If you only do synchronous transactions over your database connection, the connection will be cleaned up properly, even if you don’t call close(), when it is deallocated or garbage collected.  However, it’s better to close the connection as soon as you can so that the resources (memory, locks) are freed immediately.

See this MDC page for more details about database connections and the relevant APIs.  It appears that these APIs are somewhat error-prone.  As it happens, an NS_ENSURE_TRUE macro will fail when an async connection is destroyed without having been asyncClose’d first, and this causes a warning message to be printed to stderr.  Unfortunately, stderr is spammed with all sorts of warnings (something that was discussed in the comments of a previous post of mine), and so this warning gets lost among the noise.  Addressing this problem is something I’ll write about shortly.

Thanks to Andrew Sutherland and Marco Bonardo for helping me greatly with the bugs mentioned above.  Any errors in this post are mine!

UPDATE: I did some investigation and found that about:permissions leaked about 180KB of memory on my machine every time it failed to asyncClose a DB connection.   This showed up under the explicit/storage/sqlite/other reporter in about:memory.

What is the point of non-fatal assertions?

When you run a debug build of Firefox you get a lot of stuff printed to stderr.  Some of it looks like this:

 ++DOCSHELL 0x7f53d836b800 == 3
 ++DOMWINDOW == 5 (0x7f53d836c078) [serial = 5] [outer = (nil)]

and I imagine it’s occasionally useful.

You also see a lot of warnings like these, which I just saw (with duplicates removed) when I just started up Firefox, loaded www.mozilla.com, and then exited:

 WARNING: 1 sort operation has occurred for the SQL statement '0x7f53e3d32208'.  See https://developer.mozilla.org/En/Storage/Warnings details.: file /home/njn/moz/mc2/storage/src/mozStoragePrivateHelpers.cpp, line 144
 WARNING: GetDefaultCharsetForLocale: need to add multi locale support: file /home/njn/moz/mc2/intl/locale/src/unix/nsUNIXCharset.cpp, line 146
 WARNING: Ignoring duplicate observer.: file /home/njn/moz/mc2/modules/libpref/src/nsPrefBranch.cpp, line 594
 WARNING: not an nsIRDFRemoteDataSource: 'remote != nsnull', file /home/njn/moz/mc2/rdf/datasource/src/nsLocalStore.cpp, line 312
 WARNING: NS_ENSURE_SUCCESS(rv, 0) failed with result 0x8000FFFF: file /home/njn/moz/mc2/content/base/src/nsContentUtils.cpp, line 2527
 WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80040111: file /home/njn/moz/mc2/content/base/src/nsFrameLoader.cpp, line 415
 WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x8053000C: file /home/njn/moz/mc2/content/base/src/nsGenericElement.cpp, line 5484
 WARNING: NS_ENSURE_TRUE(aURI) failed: file /home/njn/moz/mc2/content/base/src/nsContentUtils.cpp, line 4706
 WARNING: NS_ENSURE_TRUE(!(mAsyncExecutionThread)) failed: file /home/njn/moz/mc2/storage/src/mozStorageConnection.cpp, line 784
 WARNING: NS_ENSURE_TRUE(pusher.Push(aBoundElement)) failed: file /home/njn/moz/mc2/content/xbl/src/nsXBLProtoImplMethod.cpp, line 327
 WARNING: nsExceptionService ignoring thread destruction after shutdown: file /home/njn/moz/mc2/xpcom/base/nsExceptionService.cpp, line 197
 WARNING: SQLite returned error code 1 , Storage will convert it to NS_ERROR_FAILURE: file /home/njn/moz/mc2/storage/src/mozStoragePrivateHelpers.cpp, line 113
 WARNING: Subdocument container has no content: file /home/njn/moz/mc2/layout/base/nsDocumentViewer.cpp, line 2398
 WARNING: Subdocument container has no frame: file /home/njn/moz/mc2/layout/base/nsDocumentViewer.cpp, line 2418
 WARNING: Subdocument container has no frame: file /home/njn/moz/mc2/layout/base/nsDocumentViewer.cpp, line 2418

The NS_ENSURE_SUCCESS and NS_ENSURE_TRUE ones in particular look like assertion failures.  I’ve heard before that Firefox assertions (outside the JS engine) are non-fatal.  I was wondering about this last week and had planned to ask someone about it in more detail.

But before I did that, I spent several hours yesterday debugging the crash in bug 654573.  Turns out the problem is that this assertion in mozStorageConnection.cpp fails:

 NS_ENSURE_FALSE(mAsyncExecutionThread, NS_ERROR_UNEXPECTED);

This causes this warning to be printed out at run-time:

 WARNING: NS_ENSURE_TRUE(!(mAsyncExecutionThread)) failed: file /home/njn/moz/mc1/storage/src/mozStorageConnection.cpp, line 799

Oh, and just to make things really complicated, here’s the definition of NS_ENSURE_FALSE, from nsDebug.h:

 #define NS_ENSURE_TRUE(x, ret)                                \
   PR_BEGIN_MACRO                                              \
     if (NS_UNLIKELY(!(x))) {                                  \
        NS_WARNING("NS_ENSURE_TRUE(" #x ") failed");           \
        return ret;                                            \
     }                                                         \
   PR_END_MACRO
 #define NS_ENSURE_FALSE(x, ret)
   NS_ENSURE_TRUE(!(x), ret)

Oh look;  if the condition fails it not only prints out a warning message, it also does an early return.  And that’s exactly what was causing the crash, because some code that followed the assertion wasn’t run, which meant that a database connection wasn’t closed properly, which caused a use-after-free error.

If this assertion had been fatal, I wouldn’t have spent several hours yesterday debugging this crash, because there wouldn’t have been a crash;  there would have been an assertion failure, and the problem would have been immediately obvious.  (As it happens, I eventually worked out the problem when I noticed that warning message among the pages of output that Firefox produces.)  If the assertion had been non-fatal but did not have the early return, there also probably wouldn’t have been a crash, but it’s hard to know, because once an assertion fails you’re pretty much in no-man’s land.

So, in summary, here are my thoughts about assertions:

  • Non-fatal assertions are close to useless.  They don’t get fixed.  The warnings just get lost among the noise of all the messages that go to stderr.
  • Non-fatal assertions that return early are worse than useless.  Especially since the early-return behaviour is entirely non-obvious from their names.

One suggestion I heard on IRC yesterday was that there just aren’t enough developers in general to fix all the assertion failures if we were to make them fatal.  But in contrast, the JS engine is able to use fatal assertions because it has a higher developer-to-code ratio than the rest of Firefox.  (This makes me also think of compiler warnings, where there’s a similar disparity between the JS engine and the rest of Firefox.)

Because this is Mozilla, I’m sure there is a ton of history behind this current situation that I’m unaware of.  I’d love to know what that is, and why we let obvious defects — because an assertion failure unambiguously indicates there’s a defect, be it in the code or the assertion itself — remain.  Thanks!

 

Limits of reliability

Julian Seward asked me an interesting question a while ago:  “what are the factors that limit Firefox’s reliability?”  (You can use “crash rate” as a reasonable definition of “reliability”.)

He suggested two things:

  1. Firefox depends on external code, such as plug-ins.
  2. Many crashes are hard to reproduce and so don’t get fixed.

For the first, Electrolysis (a.k.a. process separation) is on track to pretty much make it a non-problem.  It’s already in place for Flash, and will eventually be for other plug-ins.  So that’s good.

For the second, I see two main sub-factors.

  1. Firefox is implemented in C++ which is prone to memory-related bugs and data races, both of which can make crash reproduction difficult.  Using a safer language like Rust would make many (all?) of these bugs impossible.  Unfortunately, Rust isn’t production-ready, and rewriting even parts of the browser is a huge undertaking.  So we better get started ASAP :)
  2. Second, Firefox has some nasty low-level code like the garbage collector;  bugs in it be very difficult to reproduce.  I don’t see an obvious way to improve this other than the usual:  testing, code review, using simple algorithms, etc.

My best patch of 2010

I was going to write one of those “everything I did last year” posts, but now I don’t feel like it.  Here’s a “one thing I did last year” post instead.

The most important patch I landed in 2010 was probably the one that added LIR type-checking to Nanojit (which was based on an earlier patch from Julian Seward).  At the time of writing, it’s caught at least 14 type errors, most of which could have caused a crash or security problem.

Consistency checks within compilers are wonderful things.

Green

I landed a TraceMonkey patch yesterday that was completely green on the tinderbox pushlog:

All green TBPL!

I don’t think that’s ever happened to me before.  Even the full tinderbox was completely green except for a single burning N810 Talos column.

I pushed another patch afterwards, that one didn’t get all green :(

A win for code hygiene

I’ve written recently about clean-ups I’ve been making in Nanojit — simplifying code, refactoring, adding more internal sanity checking.  I’m a firm believer that these kinds of changes are worth doing, that “if it ain’t broke, don’t fix it” is not always true.  But there’s always a voice in the back of my head wondering “am I just polishing this code for little gain?  Should I be working on something else instead?”  Yesterday I received confirmation that this work has been worthwhile.  But before I can explain why, I need to give some background.

Nanojit serves as the back-end of TraceMonkey.  It converts LIR (a low-level intermediate representation) generated by TraceMonkey’s front-end into native code.  So Nanojit is a critical part of TraceMonkey.  And TraceMonkey is a critical part of Firefox.

However, code generators (and register allocators) are tricky beasts — easy to get wrong and hard to debug.  (A long time ago I heard someone, I can’t remember who, say “code generators and garbage collectors should crash as early and noisily as possible”.)  So with a piece of code this important and inherently difficult, it’s a good idea to arm yourself with some weapons that give you confidence that it is correct.

Weapon 1: clean IR semantics

One way to do this is to give your IR clean semantics.  LIR’s semantics are mostly clean, but there are a few nasty cases.  One of the worst is the ‘ov’ instruction.  It performs an overflow check on an arithmetic (add, mul, sub, neg) operation.

JavaScript numbers are doubles by default, but TraceMonkey sometimes demotes them to integers for speed.  TraceMonkey has to check for integer overflow in these cases;  if an integer does overflow TraceMonkey has to step back and use doubles for the computation.

Here’s some example LIR that uses ‘ov’:

 ld6 = ld sp[-8]
 add1 = add ld6, 1
 ov1 = ov add1
 xt4: xt ov1 -> pc=0x883ed8 imacpc=0x0 sp+0 rp+0 (GuardID=218)

The first instructions loads a value from the stack.  The second instruction adds 1 to that value.  The third instruction performs an overflow check and puts the result in ‘ov1′.  The fourth instruction is a guard;  if ‘ov1′ is true it exits the code block.

So why is ‘ov’ semantically nasty?  First consider ‘add’, which is a semantically cleaner instruction.  Its result (in this case ‘add1′) is a function of its inputs (‘ld6′ and 1).  In comparison, ‘ov’ does not have this property.  The ‘ov’ doesn’t really take ‘add1′ as an input –  you can’t tell by looking at ‘add1′ whether the addition overflowed.  In fact you can’t really understand what is happening here at the LIR level;  you have to know what happens in the native code that is generated from this LIR.  It turns out that the native code for the ‘add’ also sets some condition codes, and the native code generated for the ‘ov’/’xt’ pair inspects those condition codes.  For this to work, the ‘add’ has to immediately precede the ‘ov’;  if it does not, whatever intermediate native code that is generated could overwrite the condition codes, in which case the behaviour of the guard becomes completely unpredictable.  This is obviously bad.

The real problem is that the addition has two outputs:  the result, and the overflow status indicator (which maps to condition codes in the hardware).  But LIR has no explicit way to represent that second output.  So the second output is implicit, and an extra constraint must be imposed on the LIR instead, i.e. ‘ov’ must immediately follow an arithmetic operation.  Because this constraint is so arbitrary it’s easy to forget and easy to break.

In May 2009 Julian Seward opened a bug to improve LIR’s semantics, giving five suggestions.  It generated a lot of discussion and Julian drafted a patch to replace ‘ov’ with some cleaner opcodes, but there was enough disagreement that it never went anywhere.  But I didn’t forget about it, and ‘ov’ has annoyed me enough times recently that I decided to resurrect Julian’s idea, this time in a new bug to escape the baggage of the old one.  The idea is simple: if ‘ov’ always has to follow an arithmetic operation, then why not create new opcodes that fuse the two parts together?  These new opcodes are ‘addxov’, ‘subxov’ and ‘mulxov’.  With my patch the code from above now looks like this:

 ld6 = ld sp[-8]
 add1 = addxov ld6, 1 -> pc=0x883ed8 imacpc=0x0 sp+0 rp+0 (GuardID=218)

The ‘addxov’ adds ‘ld6′ and 1, puts the result in ‘add1′, and exits if there was an overflow.  The generated native code is unchanged, but the implicit output of the addition is now hidden within a single LIR instruction, and it’s impossible for overflow checks in the native code to become separated from the potentially-overflowing arithmetic operation.

Weapon 2: strong IR sanity checking

Compilers are usually built in a pipeline fashion, where you have multiple passes over the code representation, and each pass does a task such as optimisation, register allocation or code generation.  It’s also a really good idea to have a pass that does a thorough sanity check of the code representation.

In November 2008 Jim Blandy filed a bug suggesting such a pass for Nanojit, one that performs type-checking of the LIR.  The bug sat untouched for over six months until some extra discussion occurred and then (once again) Julian wrote a patch.  It found at least one case where TraceMonkey was generating bad LIR code, but again the bug stalled, this time because we spent three months merging Mozilla’s and Adobe’s copies of Nanojit.  I resurrected the bug again recently and added some “structure checking” for things like the ‘ov’ constraint, and it landed in the TraceMonkey repository on January 21st.  Happily, my version of the checker was simpler than Julian’s;  this was possible because LIR had had some other semantic clean-ups (e.g. properly distinguishing 64-bit integers from 64-bit floats).  My patch replaced a very basic type-checker (called “SanityFilter”) that had been in TraceMonkey, and immediately found two bugs, one of which involved ‘ov’.

Synergy

It’s not often that a bug report involving your code makes you smile.  Yesterday Gary Kwong hit an assertion failure in Nanojit when fuzz-testing:

    LIR structure error (end of writer pipeline):
    in instruction with opcode: ov
    argument 1 has opcode: add
    it should be: located immediately prior, but isn't
  One way to debug this:  change the failing NanoAssertMsgf(0, ...) call to a
  printf(...) call and rerun with verbose output.  If you're lucky, this error
  message will appear before the block containing the erroneous instruction.

In other words, there’s an ‘ov’ following an ‘add’, but there are one or more other LIR instructions between them.  This means that the code generated will be bogus, as I explained above.  This bug may have been in TraceMonkey for a long time, but it wasn’t detected until strong internal sanity checking was added.  The good news is that the ‘ov’-removal patch (which hasn’t landed as it’s still awaiting review) will fix this patch.

Lessons learned

This story tied together a number of related ideas for me.  In particular:

  • Clean IR semantics are worth the effort.  Hacks may save time initially but they will come back to bite you.
  • Strong IR sanity checking is worth the effort.  (And cleaner semantic allows for stronger checking.)
  • Listen to Julian, especially when he talks about correctness.  (And read his code!  VEX/pub/libvex_ir.h in the Valgrind source code describes Valgrind’s IR, which is a wonderfully clean IR, particularly in the way it separates statements (which have side-effects) from expressions (which don’t).)
  • Don’t put too many ideas into one bug report.  Too much discussion can kill a bug, or at least put it in a coma.
  • Follow-through is important.  A patch can’t do much good unless it lands.
  • Fuzz testing is awesome (but we already knew that).

Safer refactoring in Nanojit

Recently I’ve been making a lot of refactoring changes to Nanojit that I think of as “code hygiene” — changes that clean up the code, but don’t change its behaviour at all.  (Example 1, example 2.)  When making changes like these I like to do the changes carefully and gradually to make sure that I don’t affect behaviour unintentionally.  In Nanojit’s case, this means that the generated code shouldn’t change.  But it’s always possible to make mistakes, and sometimes a particular change is disruptive enough that you can’t evolve the code from one state to another without breaking it temporarily.

In such cases, if the regression tests pass it tells me that the new code works, but it doesn’t tell me if I’ve managed to avoid changing the behaviour as I intended.  So I’ve started doing comparisons of the native code generated by Nanojit for parts of SunSpider using TraceMonkey’s verbose output, which is controlled with the TMFLAGS environment variable.  For such changes, the generated code should be identical to an unchanged version.  The way I organise my workspaces is such that I always have an unchanged repository available, which makes such comparisons easy.

Except, the generated code is never quite identical because it contains addresses and the slightest change in the executable can affect these.  So I run the verbose output through this script:

perl -p -e 's/[0-9a-fA-F]{4,}/....../g'

It just replaces numbers with four or more digits with “……”, which is enough to filter out these address differences.

I tried this technique on all of SunSpider, but it turns out it doesn’t work reliably, because crypto-aes.js is non-deterministic — it contains a call to Date.getTime() which introduces enough non-determinism that TraceMonkey sometimes traces different code.  A couple of the V8 benchmarks have similar non-deterministic behaviours.  Non-determinism is a rather undesirable feature of a benchmark suite so I filed a SunSpider bug which is yet to be acted on.

Fortunately, experience has shown me that I don’t need to do code diffs on all of SunSpider to be confident that I haven’t changed Nanojit’s behaviour — doing code diffs on two or three of the bigger benchmarks suffices, as enough code is generated that even small behavioural differences are quickly found.

In fact, this code diff technique is useful enough that I’ve started using it sometimes even when I do want to change behaviour — I split such changes into two parts, one which doesn’t change the behaviour and one which does.  For example, when I added an opcode to distinguish between 64-bit integer loads and 64-bit integer floats I landed a non-behaviour-changing patch first, then did a follow-up change that improved x86-64 code generation.  This turned out to be a good idea because I introduced a regression with the follow-up change, and the fact that it was separated from the non-behavioural changes made it easy to determine what had gone wrong, because the regressing patch was much smaller than a combined patch would have been.

I also recently wrote a script to automate the diff process, which makes it that much easier to perform, which in turn makes it that much more likely that I’ll actually do it.

In summary, it’s always worth thinking about new ways to test your code, and when you find a good one, it’s worth making it as easy as possible to run those tests.

“No-else-after-return” considered harmful

(Update below)

At the time of writing this, the Mozilla Coding Style guidelines have this recommendation under “General C/C++ Practices”:

Don’t put an else right after a return. Delete the else, it’s unnecessary and increases indentation level.

I can appreciate this in some circumstances, such as when checking for an error case early in a long function:

void f()
{
    if (errorcase())
        return;

    a();
    b();
    c();
    d();
}

But as a blanket guideline I think it’s misguided because it promotes an impure style of programming.  And by “impure” here I mean the opposite of “pure” in this sense, i.e. “free from side-effects and other such error-prone junk”.  Let’s consider a simple example:  the max() function.  Here’s a way to write it that is quite sensible, in my opinion:

int max(int a, int b)
{
    if (a > b)
        return a;
    else
        return b;
}

But this violates the no-else-after-return guideline.  The Mozilla way to write this is like so:

int max(int a, int b)
{
    if (a > b)
        return a;
    return b;
}

(I’m not exaggerating, see this comment and this comment from Mozilla’s VP of Engineering and CTO respectively, about an example very similar to this.)

Urk.  That’s horrible.  Any time you’re using an ‘if’ without a corresponding ‘else’, you’re relying on an impure feature — you must be doing something with a side-effect in the ‘then’ branch, or some kind of impure control flow.  Mozilla’s coding guidelines insist that we write max(), a pure function, in an impure way.  Also, it’s just harder to read because the two return statements have different indentation levels despite the fact that they are closely related.

In this case there’s a way around the problem:

int max(int a, int b)
{
    return (a > b ? a : b);
}

Even though C’s if-then-else expression syntax is awful, it’s a better way to do it.

In general, statements suck, expressions rule.  Even in a language like C/C++ you can do something approaching functional programming by avoiding statements where possible.  In a tiny example like max() the difference may seem trivial, but in bigger examples the difference becomes more important.  Consider the following code from Nanojit’s register allocator which picks a register by looking at the free registers, the registers that are allowed to be used, and possibly the preferred registers.  You don’t have to understand it closely, but it helps to know that RegisterMask is just a bitfield with one bit per register.

Register Assembler::registerAlloc(RegisterMask allow)
{
    RegAlloc &regs = _allocator;
    RegisterMask prefer = SavedRegs & allow;
    RegisterMask free = regs.free & allow;

    RegisterMask set = prefer;
    if (set == 0) set = allow;

    if (free)
    {
        // at least one is free
        set &= free;

        // ok we have at least 1 free register so let's try to pick
        // the best one given the profile of the instruction
        if (!set)                                                             
        {
            // desired register class is not free so pick first of any class
            set = free;
        }
        NanoAssert((set & allow) != 0);
        Register r = nRegisterAllocFromSet(set);
        regs.used |= rmask(r);
        return r;
    }

    // ... more code here ...
}

I find this code incredibly hard to understand.  There are three ‘ifs’ that lack ‘elses’, so it relies heavily on state updates to do its work.  In fact, in order to understand it, I ended up rewriting it in a more pure style, by using a sequence of small refactorings:

Register Assembler::registerAlloc(RegisterMask allow)
{
    RegAlloc &regs = _allocator;
    RegisterMask allowedAndFree = allow & regs.free;

    if (allowedAndFree)
    {
        // At least one usable register is free -- no need to steal.  
        // Pick a preferred one if possible.
        RegisterMask preferredAndFree = allowedAndFree & SavedRegs;
        RegisterMask set = ( preferredAndFree ? preferredAndFree : allowedAndFree );
        Register r = nRegisterAllocFromSet(set);
        regs.used |= rmask(r);
        return r;
    }
    // ... more code here ...
}

There’s now only a single ‘if’ lacking an ‘else’, and it turns out even that can be removed by putting the “… more code here …” part into an else branch, and then the ‘return r;’ can be moved to the end of the function.

The code is now comprehensible — it picks a register that is free and allowed, and preferred as well if possible.

Part of the art of programming involves knowing when to avoid using a language feature.  Everybody knows that ‘goto’ is occasionally highly useful, but should be avoided whenever something less dangerous (a for-loop, a while-loop, a ‘break’ statement) can be used.  And everybody knows it because everybody is taught it.  But programmers aren’t taught to avoid unnecessary side-effects… at least, I wasn’t, certainly not via examples such as those I’ve given above.  I’ve only come to realise these things as I’ve switched several times between using C/C++ and some pure functional languages over the years, and come to appreciate how purity helps make programs more readable and less error-prone.

To come full circle, I’d like to see the no-else-after-return guideline removed from the Mozilla style guide, or at the very least, be qualified in a way that allows the cases I’ve argued for.  I’d also love to see a guideline (with examples) added such as “prefer expressions to statements”.

Update:

(The arguments had the wrong names in the initial post, I’ve fixed them now.)

All three versions of max() are pure, because it’s possible to combine impure pieces of code (such as an else-less if) into a pure piece of code.  So I’ll augment my recommendation: pure code is preferable, and furthermore, given that C/C++ makes it so easy to write impure code, it’s preferable that pure code be obviously pure. I argue that the 3rd version of max() with the ?: operator is the most obviously pure, because a ?: is pure if its operands are pure.  In comparison, it’s harder to tell if an if-then-else is pure (although this case is still pretty easy thanks to the matching return statements).  And an if-without-else almost never results in pure code;  the max() example above is a rare exception.

I also omitted a fourth formulation of max:

int max(int a, int b)
{
    int m;
    if (a > b)
        m = a;
    else
        m = b;
    return m;
}

For this very simple case, such a formulation is overkill.  But if the ‘then’ and ‘else’ cases are more complicated it makes more sense.  (Furthermore, much of this discussion should be thought of in more complex cases, as these kinds of small decision can make a bigger difference then.)  This version also doesn’t violate the no-else-after-return rule.  So perhaps that rule isn’t as bad as I thought, if only because it’s easier to work around than I thought… but I still think rewriting the 1st version of max() into the 2nd version is a bad idea.

What I currently hate most about C++

Everyone knows that global variables are bad and should be avoided wherever possible.  Why?  Because each global variable is, in effect, an implicit argument to every function that can see the global variable.  The same thing is true of any non-local state.

And the presence of non-local state means that you can’t reason locally about your code.  That makes your code more complex, and complex code is likely to have more defects.

And the thing I hate about C++ (and other object-oriented languages) is that it vigorously encourages non-local state.

Non-local state within classes

First, of all, C++ encourages (nay, forces) non-local state within classes, because all class methods have access to all fields within a class, even the ones they don’t need to.  In other words, every class field is an implicit argument to every class method.  This can work well for, let’s say, a “Date” class, because the number of fields is small, and most class methods will access most fields.

But problems appear when classes grow larger, when they start to look like what would be a whole module in a non-OO language like C.  For example, Nanojit, the compiler core in TraceMonkey, contains a class called Assembler, which encapsulates the translation of Nanojit’s low-level intermediate representation (called “LIR”) to assembly code.  If you exclude members that are only included when debugging is enabled, there are 18 data fields and 102 methods.  And some of those 18 data fields are pointers to objects that are themselves complex.

Let’s consider a single field, _thisfrag, which holds a fragment of LIR code. It gets set via an argument passed into the method beginAssembly().  It then gets overwritten — but with the same value! — via an argument passed into the method assemble().  It is accessed directly in only 7 of those 103 methods:

  • assemble(): which increments _thisfrag->compileNbr
  • gen(), printActivationState(), asmspilli(): which use _thisfrag->lirbuf->names, but only when verbose output is asked-for
  • assignSavedRegs(), reserveSavedRegs(), assignParamRegs(): where parts of _thisfrag->lirbuf are read

And that’s just one example, which I chose because I’d been thinking about this problem and then just this morning I had to hunt down all those uses of _thisfrag in order to understand its purpose and whether I could change some related code safely.  I’m sure a similar story will hold for a lot of the fields in this class.

Just imagine, if you were writing Assembler as a C module, would you make _thisfrag a (module-level) global variable?  Almost certainly not, you’d pass it only to the functions that need it;  actually you’d probably only pass parts of _thisfrag around.  But C++ encourages you to make everything a class, and stick everything a class ever needs in as a data field, creating lots of non-local state that complicates everything.

(An aside:  Assembler probably also isn’t a very good basis for a class because it’s a *process*.  I figure that if you’d write something as a struct in C, then it makes for a good class in C++.  But I need to think about that some more.)

Non-local state beyond classes

But it gets even worse.  Good C++ practice encourages everyone to create private fields and use public get/set methods to access class data fields from outside the class.  But get/set methods are just lipstick on a pig; all too often you end up with something like this example, again from the Assembler class:

    private:
        AssmError   _err;

    public:
        void        setError(AssmError e) { _err = e; }
        AssmError   error() { return _err; }

Oh great, I feel much safer now.

It would be better to just make _err public and avoid the get/set obfuscation;  at least then it would be obvious how exposed _err is.  It also saves you from having to check the definitions of error() and setError().

Even better, in this case _err gets set from various places within class Assembler, but also from various places outside class Assembler.  I’ve tried twice to simplify this, by passing error codes around explicitly instead of implicitly through this quasi-global variable, but both times I was defeated by the complexity of the control flow governing how _err is accessed, in particular the fact that’s it’s set on some control paths but not others.  This is a big part of the reason why out-of-memory handling in Nanojit is a total nightmare.

The end result

Currently Nanojit has a number of large, complex classes, and many of them link to other large complex classes.  At many points in the code there is a bewildering amount of accessible non-local state.  (And I haven’t even mentioned how this can complicate memory management, if you end up with multiple pointers to objects.)  The complexity caused by this is a tax on development that we are all paying daily.

A better way

Before joining Mozilla, I spent three years programming in a functional language called Mercury.  Mercury entirely lacks global variables (except for some very restricted cases which are rarely used).  This means that you have to pass more data around as arguments than you do in C++.  But it also means that when you look at a function, you know exactly what its inputs and outputs are, and so you can use purely local reasoning to understand what it does.  This is an *enormous* help, and one that’s easy to underestimate if you haven’t experienced it.

Obviously we’re not going to rewrite Firefox in a functional language any time soon.  And of course non-local state is necessary sometimes. But even C is better than C++ in this respect, because at least in C global variables are obvious and everyone knows that you should minimise their use — the language doesn’t actively encourage you to put non-local state everywhere and let you feel good about it.  Information hiding is one of the fundamental principles of programming, and object-oriented programming is meant to promote it, but unless you are very disciplined it tends to do the opposite.

So next time you are thinking about adding a field to a class, ask yourself: is it really necessary?  Could it be passed in as an argument instead, or something else?  Can you make your life easier by avoiding some non-local state?

atol() considered harmful

Here’s what sounds like a simple question: in C, what’s the best way to convert a string representing an integer to an integer?

The good way

One way is to use the standard library function called strtol(). There are a number of similar functions, such as strtoll() (convert to a long long) and strtod (convert to a double), but they all work much the same so I’ll focus on strtol().

Here’s the prototype for strtol() from the man page on my Linux box:

#include <stdlib.h>
long int strtol(const char *nptr, char **endptr, int base);

The first argument is the string to convert.  The third argument is the base;  if you give it zero the base used will be 10 unless the string begins with “0x” (hexadecimal) or ’0′ (octal).  Any whitespace at the front is skipped over, and a ‘+’ or ‘-’ just before the digits is allowed.  The return value is the long integer value that the string was converted into.

The second argument is a call-by-reference return value.  If it’s non-NULL, it gets filled in with a pointer to the first non-converted char in the string.  If the string was entirely numbers, such as “123″, endptr will point to the terminating NUL char.  If the string had no valid integral prefix, e.g. “one two three”, endptr will point to the start of the string.  If the string was “123xyz”, i.e. contains numbers followed by non-numbers, endptr will point to the ‘x’ char in the string.

There are two good things about endptr.  First, it lets you do error-detection.  For example, if you are expecting a number without any extra chars at the end, endptr must point to a NUL char:

char* endptr;
long int x = strtol(s, &endptr, 0);
if (*endptr) { /* error case */ }

Second, it allows more flexible parsing.  For example, if you need to parse a string consisting of three comma-separated integers (e.g. “1, 2, 3″) you can do this:

int i1 = strtol(s,        &endptr, 0);  if (*endptr != ',')  goto bad;
int i2 = strtol(endptr+1, &endptr, 0);  if (*endptr != ',')  goto bad;
int i3 = strtol(endptr+1, &endptr, 0);  if (*endptr != '\0') goto bad;
...
bad: /* error case */

(Nb: This example allows whitespace before each number but not before each comma.)

Finally, strtol() returns 0 and sets errno to EINVAL if the conversion could not be performed because the given base was invalid.  Also, it clamps the value and sets errno to ERANGE if an overflow or underflow occurrs (as could be the case in the above code examples).

The bad way

Another way is to use the standard library function called atol().  (Again, it’s one of a family of similar functions, such as atoi() and atof().)  atol() is equivalent to this call to strtol():

strtol(str, (char **)NULL, 10);

Seems like a nice simplification, especially if you know you want base 10 numbers, right?  The problem is that there is no scope for error-detection.  atol(“123xyz”) will return 123.  atol(“John Smith”) will return 0.  Even better, atol() doesn’t have to set errno in any case!  (And this means it’s actually not quite equivalent to the above strtol() call).

It’s quite amazing, really;  I’m not aware of any other C standard library functions that actually make error-detection impossible.  Furthermore, the documentation involving atol() doesn’t make this clear.  Compare this to the dangerous function gets():  on my Linux box, the man page has a BUGS section that says “Never use gets().”  The man page on my Mac is a little less forthright;  in the SECURITY CONSIDERATIONS section it says “The gets() function cannot be used securely.”

The POSIX specification is a little more forthright about atol():

The atol() function is subsumed by strtol() but is retained because it is used extensively in existing code. If the number is not known to be in range, strtol() should be used because atol() is not required to perform any error checking.

The only use I can think of for atol() is in the case where you’ve already scanned the string for some reason and know it contains only digits.

A few of the options in the current release of Valgrind (3.4.0) that take numbers currently accept non-numbers because they are implemented using atoll().  (Most of them use strtoll(), however, and the discrepancy has been removed in the trunk.)  For example, if you pass the option –log-fd=foo it will interpret the “foo” as 0.  Lovely.