Data structures – Nicholas Nethercote

mfbt/SegmentedVector.h

I just landed a new container type called mozilla::SegmentedVector in MFBT. It’s similar to nsTArray and mozilla::Vector, but the the element storage is broken into segments rather than being contiguous. This makes it less flexible than those types — currently you can only append elements and iterate over all elements.

Hoever, in cases where those operations suffice, you should strongly consider using it. It uses multiple moderate-sized storage chunks instead of a single large storage chunk, which greatly reduces the likelihood of OOM crashes, especially on Win32 where large chunks of address space can be difficult to find. (See bug 1096624 for a good example; use of a large nsTArray was triggering ~1% of all OOM crashes.) It also avoids the need for repeatedly allocating new buffers and copying existing data into them as it grows.

The declaration is as follows.

template<typename T,
         size_t IdealSegmentSize,
         typename AllocPolicy = MallocAllocPolicy>
class SegmentedVector

T is the element type.
IdealSegmentSize is the size of each segment, in bytes. It should be a power-of-two (to avoid slop), not too small (so you can fit a reasonable number of elements in each chunk, which keeps the per-segmente book-keeping overhead low) and not too big (so virtual OOM crashes are unlikely). A value like 4,096 or 8,192 or 16,384 is likely to be good.
AllocPolicy is the allocation policy. A SegmentedVector can be made infallible by using InfallibleAllocPolicy from mozalloc.h.

If you want to use SegmentedVector but it doesn’t support the operations you need, please talk to me. While it will never be as flexible as a contiguous vector type, there’s definitely scope for adding new operations.

N-ary trees in C

If you want to implement an N-ary tree of integers in C, the obvious way to represent each node is something like the following.

struct t {
    int n;
    int numKids;
    struct t **kids;
}

Here kids is an array of kid pointers. Here’s an example tree.

This representation is a bit annoying because the kids array must be allocated separately from the node, and you need to reallocate the kids array every time you add a new kid to a node. Alternatively, you could instead over-allocate in anticipation of more nodes being added, but then you have to record the array capacity within the node. (The latter representation is exactly what I used for the tree of stack traces in Massif.)

While reading through the code for trace-malloc yesterday, I learnt a better representation.

struct t {
    int n;
    struct t *kids;
    struct t *siblings;
}

This looks odd at first. Here’s the example tree with this new representation.

From any single node you can only directly access its first kid, through the kids pointer. You access the second kid through kids->siblings, the third kid through kids->siblings->siblings, and so on. In other words, instead of storing a node’s kids in an array, you store them in a list; it just so happens that this list can be embedded elegantly in the nodes themselves. (There’s an argument to be made that the fields should be called firstKid and nextSibling, though I can see the sense in both naming schemes.)

There are several nice things about this representation.

Each node is a fixed size; no auxiliary arrays are needed.
Many algorithms can be expressed more easily because it’s just a binary tree.
You can treat each sibling list as a move-to-front list. In the example above, if you accessed node 11 you would move nodes 4 and 11 to the front of their sibling lists. This may make subsequent accesses faster if you have locality in your access patterns. (You could also do this with the first representation, but you’d have to shuffle some of the pointers in the kids array along by one.)

It appears this is a moderately well-known representation — this question on Stack Overflow has two answers, one for each of the representations I described. But I didn’t know about it, so I thought I’d share it.