Slimmer and simpler static atoms – Nicholas Nethercote

String interning is:

a method of storing only one copy of each distinct string value, which must be immutable. Interning strings makes some string processing tasks more time- or space-efficient at the cost of requiring more time when the string is created or interned. The distinct values are stored in a string intern pool. The single copy of each string is called its intern.

In Firefox’s code we use the term atom rather than intern, and atom table rather than string intern pool. I don’t know why; those names have been used for a long time.

Furthermore, Firefox distinguishes between static atoms, which are those that are chosen at compile time and can be directly referred to via an identifier, and dynamic atoms, which are added on-demand at runtime. This post is about the former.

In 2016, Firefox’s implementation of static atoms was complex and inefficient. I filed a bug about this that included the following ASCII diagram showing all the data structures involved for a single atom for the string “foobar”.

static nsFakeStringBuffer<N=7> foobar_buffer (.data, 8+2N bytes)
/-----------------------------------------\ <------+
| int32_t mRefCnt = 1 // never reaches 0  |        | 
| uint32_t mSize = 14 // 7 x 16-bit chars |        | 
| u"foobar"           // the actual chars | <----+ | 
\-----------------------------------------/      | | 
                                                 | | 
PermanentAtomImpl (heap, 32 bytes)               | | 
/----------------------------------------------\ | | <-+
| void* vtablePtr    // implicit               | | |   | 
| uint32_t mLength = 6                         | | |   | 
| uint32_t mHash = ...                         | | |   | 
| char16_t* mString = @------------------------|-+ |   | 
| uintptr_t mRefCnt  // from NS_DECL_ISUPPORTS |   |   | 
\----------------------------------------------/   |   | 
                                                   |   | 
static nsIAtom* foobar (.bss, 8 bytes)             |   | 
/---\ <-----------------------------------+        |   | 
| @-|-------------------------------------|------------+
\---/                                     |        |   | 
                                          |        |   | 
static nsStaticAtom (.d.r.ro.l, 16 bytes) |        |   | 
(this element is part of a larger array)  |        |   | 
/------------------------------------\    |        |   | 
| nsStringBuffer* mStringBuffer = O--|----|--------+   | 
| nsIAtom** mAtom = @----------------|----+            | 
\------------------------------------/                 | 
                                                       | 
AtomTableEntry (heap, ~2 x 16 bytes[*])                | 
(this entry is part of gAtomTable)                     | 
/-------------------------\                            | 
| uint32_t mKeyHash = ... |                            | 
| AtomImpl* mAtom = @-----|----------------------------+
\-------------------------/                            | 
                                                       | 
StaticAtomEntry (heap, ~2 x 16 bytes[*])               | 
(this entry is part of gStaticAtomTable)               | 
/-------------------------\                            | 
| uint32_t mKeyHash = ... |                            | 
| nsIAtom* mAtom = @------|----------------------------+
\-------------------------/

[*] Each hash table is half full on average, so each entry takes up
approximately twice its actual size.

There is a lot going on in that diagram, but putting that all together gave the following overhead per atom.

Static shared: 0 bytes
Static unshared: 8 + 2(length+1) + 8 + 16
Dynamic: 32 + ~32 + ~32 bytes
Total bytes: (2(length+1) + 64 + ~64) * num_processes

(Although these atoms are “static” in the sense of being known at compile-time, a lot of the associated data was allocated dynamically.)

At the time there were about 2,700 static atoms, and avg_length was about 11, so the overhead was roughly:

0 bytes fixed, and
410,400 bytes per process. (Or more, depending on how the relocations required for the static pointers were represented, which depended on the platform.)

Today, things have improved greatly and now look like the following.

const char16_t[7] (.rodata, 2(N+1) bytes)
(this is detail::gGkAtoms.foobar_string)
/-----------------------------------------\ <--+
| u"foobar"           // the actual chars |    | 
\-----------------------------------------/    | 
                                               | 
const nsStaticAtom (.rodata, 12 bytes)         | 
(this is within detail::gGkAtoms.mAtoms[])     | 
/-------------------------------------\ <---+  | 
| uint32_t mLength:30 = 6             |     |  | 
| uint32_t mKind:2 = AtomKind::Static |     |  | 
| uint32_t mHash = ...                |     |  | 
| uint32_t mStringOffset = @----------|-----|--+
\-------------------------------------/     | 
                                            | 
constexpr nsStaticAtom* (0 bytes) @---------+
(this is nsGkAtoms::foobar)                 | 
                                            | 
AtomTableEntry (heap, ~2 x 16 bytes[*])     | 
(this entry is part of gAtomTable)          | 
/-------------------------\                 | 
| uint32_t mKeyHash = ... |                 | 
| nsAtom* mAtom = @-------|-----------------+
\-------------------------/

[*] Each hash table is half full on average, so each entry takes up
approximately twice its actual size.

That gives the following overhead per atom.

Static shared: 12 + 2(length+1) bytes
Static unshared: 0 bytes
Dynamic: ~32 bytes
Total: 12 + 2(length+1) + ~32 * num_processes

We now have about 2,300 static atoms and avg_length is still around 11, so the overhead is roughly:

82,800 bytes fixed, and
73,600 bytes per process.

I won’t explain all the parts of the two diagrams, but it can be seen that we’ve gone from six pieces per static atom to four; the size and complexity of the remaining pieces are greatly reduced; there are no static pointers (only constexpr pointers and integral offsets) and thus no relocations; and there is a lot more interprocess sharing thanks to more use of const. Also, there is no need for a separate static atom table any more, because the main atom table is thread-safe and the HTML5 parser (the primary user of the separate static atom table) now has a small but highly effective static atoms cache.

Things that aren’t visible from the diagrams: atoms are no longer exposed to JavaScript code via XPIDL, there are no longer any virtual methods involved, and all atoms are defined in a single place (with no duplicates) instead of 7 or 8 different places. Notably, the last few steps were blocked for some time by a bug in MSVC involving the handling of constexpr.

The bug dependency tree gives a good indication of how many separate steps were involved in this work. If there is any lesson to be had here, it’s that small improvements add up over time.

2 replies on “Slimmer and simpler static atoms”