Bugs, defects, infections and failures

People often use the word “bug”. Unfortunately it’s a very imprecise word. “Error” suffers from the same problems. Both of them are used at different times to mean incorrect program code, incorrect program states, and visibly incorrect program behaviour. The conflation of these distinct things inhibits clear thinking and can mask subtleties regarding program correctness.

Better terminology

Because of this, I like the following terminology used by Andreas Zeller in his book Why Programs Fail:

A defect is an erroneous piece of code, one that can cause an infection when executed (but it may not always, i.e. it may be masked). Defects are created by programmers.
An infection is an erroneous piece of program state, i.e. one for which there is a discrepancy between the intended and actual program state. Infections are caused by defects and/or prior infections. Infections can be masked via overwriting or correction.
A failure is an erroneous user-visible behaviour, i.e. one for which there is discrepancy between the intended and actual user-visible behaviour. Failures are caused by infections.
An infection chain is a cause-effect chain from a defect to one or more infections to a failure.

(Nb: “Intended state” and “intended behaviour” can be fuzzy concepts. Few programs have complete specifications, and these concepts typically reside partly in the mind of the programmer, partly in the mind of the user, partly in the documentation, and partly nowhere!)

Zeller’s book has received high praise from many quarters. I personally found these definitions, which appear in chapter 1, the single best part of the whole book.

Some examples

Common terminology for memory-related “bugs” haphazardly covers all of these concepts. Consider the following examples.

A double free is a defect.
A memory leak is an infection; the underlying defect is (in a C program) a missing call to free() of a heap block, and the resulting failure may be degraded performance or an out-of-memory abort. (If the leak is minor enough that the user doesn’t notice any difference, then it’s arguably not a failure. There’s a whole separate philosophical discussion to be had on whether poor performance could be considered a failure, depending on what the user’s implicit mental specification of “fast enough” is.)
A segmentation fault (just one kind of crash) is a failure which may be caused by a number of different infections, each of which may be caused by a number of different defects.
A buffer overflow attack involves an entire infection chain; for example, a particular defect (a missing bounds check), causes an infection (an incorrect pointer value), which causes more infections (incorrect values on the stack), which causes yet more infections (an incorrect value for the program counter), which causes failures (incorrect and malicious behaviour from injected code).

Users care about failures, programmers care about defects

Failures affect users. But defects are the root cause of failures, and they are what programmers must fix. Furthermore, defects are still defects (and infections and still infections) even if they cannot cause failures; such defects may not cause problems now, but if the program is changed later, the defect may cause a failure.

Therefore, the aim of “bug detection” tools such as those built with Valgrind is to help the programmer identify defects, whether they cause failures or not. Sometimes a tool can identify defects directly; more often a tool will identify infections or failures, and it is the programmer’s task to work back through the infection chain to identify the defect. The usability of such tools is greatly affected by how easy this task is, and my next post will discuss a particular example in more detail, and may teach even veteran Valgrind users a new trick or two that will make their lives easier.