Categories
about:memory AdBlock Plus Firefox Memory consumption MemShrink

Firefox 41 will use less memory when running AdBlock Plus

Last year I wrote about AdBlock Plus’s effect on Firefox’s memory usage. The most important part was the following.

First, there’s a constant overhead just from enabling ABP of something like 60–70 MiB. […] This appears to be mostly due to additional JavaScript memory usage, though there’s also some due to extra layout memory.

Second, there’s an overhead of about 4 MiB per iframe, which is mostly due to ABP injecting a giant stylesheet into every iframe. Many pages have multiple iframes, so this can add up quickly. For example, if I load TechCrunch and roll over the social buttons on every story […], without ABP, Firefox uses about 194 MiB of physical memory. With ABP, that number more than doubles, to 417 MiB.

An even more extreme example is this page, which contains over 400 iframes. Without ABP, Firefox uses about 370 MiB. With ABP, that number jumps to 1960 MiB.

(This description was imprecise; the overhead is actually per document, which includes both top-level documents in a tab and documents in iframes.)

Last week Mozilla developer Cameron McCormack landed patches to fix bug 77999, which was filed more than 14 years ago. These patches enable sharing of CSS-related data — more specifically, they add data structures that share the results of cascading user agent style sheets — and in doing so they entirely fix the second issue, which is the more important of the two.

For example, on the above-mentioned “extreme example” (a.k.a. the Vim Color Scheme Test) memory usage dropped by 3.62 MiB per document. There are 429 documents on that page, which is a total reduction of about 1,550 MiB, reducing memory usage for that page down to about 450 MiB, which is not that much more than when AdBlock Plus is absent. (All these measurements are on a 64-bit build.)

I also did measurements on various other sites and confirmed the consistent saving of ~3.6 MiB per document when AdBlock Plus is enabled. The number of documents varies widely from page to page, so the exact effect depends greatly on workload. (I wanted to test TechCrunch again, but its front page has been significantly changed so it no longer triggers such high memory usage.) For example, for one of my measurements I tried opening the front page and four articles from each of nytimes.com, cnn.com and bbc.co.uk, for a total of 15 tabs. With Cameron’s patches applied Firefox with AdBlock Plus used about 90 MiB less physical memory, which is a reduction of over 10%.

Even when AdBlock Plus is not enabled this change has a moderate benefit. For example, in the Vim Color Scheme Test the memory usage for each document dropped by 0.09 MiB, reducing memory usage by about 40 MiB.

If you want to test this change out yourself, you’ll need a Nightly build of Firefox and a development build of AdBlock Plus. (Older versions of AdBlock Plus don’t work with Nightly due to a recent regression related to JavaScript parsing). In Firefox’s about:memory page you’ll see the reduction in the “style-sets” measurements. You’ll also see a new entry under “layout/rule-processor-cache”, which is the measurement of the newly shared data; it’s usually just a few MiB.

This improvement is on track to make it into Firefox 41, which is scheduled for release on September 22, 2015. For users on other release channels, Firefox 41 Beta is scheduled for release on August 11, and Firefox 41 Developer Edition is scheduled to be released in the next day or two.

Categories
about:memory Firefox Memory consumption Rust Servo

Measuring data structure sizes: Firefox (C++) vs. Servo (Rust)

Firefox’s about:memory page presents fine-grained measurements of memory usage. Here’s a short example.

725.84 MB (100.0%) -- explicit
├──504.36 MB (69.49%) -- window-objects
│ ├──115.84 MB (15.96%) -- top(https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound, id=2147483655)
│ │ ├───85.30 MB (11.75%) -- active
│ │ │ ├──84.75 MB (11.68%) -- window(https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound)
│ │ │ │ ├──36.51 MB (05.03%) -- dom
│ │ │ │ │ ├──16.46 MB (02.27%) ── element-nodes
│ │ │ │ │ ├──13.08 MB (01.80%) ── orphan-nodes
│ │ │ │ │ └───6.97 MB (00.96%) ++ (4 tiny)
│ │ │ │ ├──25.17 MB (03.47%) -- js-compartment(https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound)
│ │ │ │ │ ├──23.29 MB (03.21%) ++ classes
│ │ │ │ │ └───1.87 MB (00.26%) ++ (7 tiny)
│ │ │ │ ├──21.69 MB (02.99%) ++ layout
│ │ │ │ └───1.39 MB (00.19%) ++ (2 tiny)
│ │ │ └───0.55 MB (00.08%) ++ window(https://login.persona.org/communication_iframe)
│ │ └───30.54 MB (04.21%) ++ js-zone(0x7f131ed6e000)

A typical about:memory invocation contains many thousands of measurements. Although they can be hard for non-experts to interpret, they are immensely useful to Firefox developers. For this reason, I’m currently implementing a similar system in Servo, which is a next-generation browser engine that’s implemented in Rust. Although the implementation in Servo is heavily based on the Firefox implementation, Rust has some features that make the Servo implementation a lot nicer than the Firefox implementation, which is written in C++. This blog post is a deep dive that explains how and why.

Measuring data structures in Firefox

A lot of the measurements done for about:memory are of heterogeneous data structures that live on the heap and contain pointers. We want such data structures to be able to measure themselves. Consider the following simple example.

struct CookieDomainTuple
{
  nsCookieKey key;
  nsRefPtr<nsCookie> cookie;
 
  size_t SizeOfExcludingThis(mozilla::MallocSizeOf aMallocSizeOf) const;
};

The things to immediately note about this type are as follows.

  • The details of nsCookieKey and nsCookie don’t matter here.
  • nsRefPtr is a smart pointer type.
  • There is a method, called SizeOfExcludingThis, for measuring the size of a CookieDomainTuple.

That measurement method has the following form.

size_t
CookieDomainTuple::SizeOfExcludingThis(MallocSizeOf aMallocSizeOf) const
{
  size_t amount = 0;
  amount += key.SizeOfExcludingThis(aMallocSizeOf);
  amount += cookie->SizeOfIncludingThis(aMallocSizeOf);
  return amount;
}

Things to note here are as follows.

  • aMallocSizeOf is a pointer to a function that takes a pointer to a heap block and returns the size of that block in bytes. Under the covers it’s implemented with a function like malloc_usable_size. Using a function like this is superior to computing the size analytically, because (a) it’s less error-prone and (b) it measures the actual size of heap blocks, which is often larger than the requested size because heap allocators round up some request sizes. It will also naturally measure any padding between members.
  • The two data members are measured by invocations to size measurement methods that they provide.
  • The first of these is called SizeOfExcludingThis. The “excluding this” here is necessary because key is an nsCookieKey that sits within a CookieDomainTuple. We don’t want to measure the nsCookieKey struct itself, just any additional heap blocks that it has pointers to.
  • The second of these is called SizeOfIncludingThis. The “including this” here is necessary because cookie is just a pointer to an nsCookie struct, which we do want to measure, along with any additional heap blocks it has pointers to.
  • We need to be careful with these calls. If we call SizeOfIncludingThis when we should call SizeOfExcludingThis, we’ll likely get a crash due to calling aMallocSizeOf on a non-heap pointer. And if we call SizeOfExcludingThis when we should call SizeOfIncludingThis, we’ll miss measuring the struct.
  • If this struct had a pointer to a raw heap buffer — e.g. a char* member — it would measure it by calling aMallocSizeOf directly on the pointer.

With that in mind, you can see that this method is itself a SizeOfExcludingThis method, and indeed, it doesn’t measure the memory used by the struct instance itself. A method that did include that memory would look like the following.

size_t
CookieDomainTuple::SizeOfIncludingThis(MallocSizeOf aMallocSizeOf)
{ 
  return aMallocSizeOf(this) + SizeOfExcludingThis(aMallocSizeOf);
}

All it does is measure the CookieDomainTuple struct itself — i.e. this — and then call the SizeOfExcludingThis method, which measures all child structures.

There are a few other wrinkles.

  • Often we want to ignore a data member. Perhaps it’s a scalar value, such as an integer. Perhaps it’s a non-owning pointer to something and that thing would be better measured as part of the measurement of another data structure. Perhaps it’s something small that isn’t worth measuring. In these cases we generally use comments in the measurement method to explain why a field isn’t measured, but it’s easy for these comments to fall out-of-date. It’s also easy to forget to update the measurement method when a new data member is added.
  • Every SizeOfIncludingThis method body looks the same: return aMallocSizeOf(this) + SizeOfExcludingThis(aMallocSizeOf);
  • Reference-counting complicates things, because you end up with pointers that conceptually own a fraction of another structure.
  • Inheritance complicates things.

(The full documentation goes into more detail.)

Even with all the wrinkles, it all works fairly well. Having said that, there are a lot of SizeOfExcludingThis and SizeOfIncludingThis methods that are boilerplate-y and tedious to write.

Measuring data structures in SERVO

When I started implementing a similar system in Servo, I naturally followed a similar design. But I soon found I was able to improve upon it.

With the same functions defined for lots of types, it was natural to define a Rust trait, like the following.

pub trait HeapSizeOf { 
  fn size_of_including_self(&self) -> usize;
  fn size_of_excluding_self(&self) -> usize; 
}

Having to repeatedly define size_of_including_self when its definition always looks the same is a pain. But heap pointers in Rust are handled via the parameterized Box type, and it’s possible to implement traits for this type. This means we can implement size_of_excluding_this for all Box types — thus removing the need for size_of_including_this — in one fell swoop, as the following code shows.

impl<T: HeapSizeOf> HeapSizeOf for Box<T> {
  fn size_of_excluding_self(&self) -> usize {
    heap_size_of(&**self as *const T as *const c_void) + (**self).size_of_excluding_self()
  }
}

The pointer manipulations are hairy, but basically it says that if T implements the HeapSizeOf trait, then we can measure Box<T> by measuring the T struct itself (via heap_size_of, which is similar to the aMallocSizeOf function in the Firefox example), and then measuring the things hanging off T (via the size_of_excluding_self call). Excellent!

With the including/excluding distinction gone, I renamed size_of_excluding_self as heap_size_of_children, which I thought communicated the same idea more clearly; it seems better for the name to describe what it is measuring rather than what it is not measuring.

But there was still a need for a lot of tedious boilerplate code, as this example shows.

pub struct DisplayList {
  pub background_and_borders: LinkedList<DisplayItem>,
  pub block_backgrounds_and_borders: LinkedList<DisplayItem>,
  pub floats: LinkedList<DisplayItem>,
  pub content: LinkedList<DisplayItem>,
  pub positioned_content: LinkedList<DisplayItem>,
  pub outlines: LinkedList<DisplayItem>,
  pub children: LinkedList<Arc<StackingContext>>,
}

impl HeapSizeOf for DisplayList {
  fn heap_size_of_children(&self) -> usize {
    self.background_and_borders.heap_size_of_children() +
    self.block_backgrounds_and_borders.heap_size_of_children() +
    self.floats.heap_size_of_children() +
    self.content.heap_size_of_children() +
    self.positioned_content.heap_size_of_children() +
    self.outlines.heap_size_of_children() +
    self.children.heap_size_of_children()
  }
}

However, the Rust compiler has the ability to automatically derive implementations for some built-in traits. Even better, the compiler lets you write plug-ins that do arbitrary transformations of the syntax tree, which makes it possible to write a plug-in that does the same for non-built-in traits on request. And the delightful Manish Goregaokar has done exactly that. This allows the example above to be reduced to the following.

#[derive(HeapSizeOf)]
pub struct DisplayList {
  pub background_and_borders: LinkedList<DisplayItem>,
  pub block_backgrounds_and_borders: LinkedList<DisplayItem>,
  pub floats: LinkedList<DisplayItem>,
  pub content: LinkedList<DisplayItem>,
  pub positioned_content: LinkedList<DisplayItem>,
  pub outlines: LinkedList<DisplayItem>,
  pub children: LinkedList<Arc<StackingContext>>,
}

The first line is an annotation that triggers the plug-in to do the obvious thing: generate a heap_size_of_children definition that just calls heap_size_of_children on all the struct fields. Wonderful!

But you may remember that I mentioned that sometimes in Firefox’s C++ code we want to ignore a particular member. This is also true in Servo’s Rust code, so the plug-in supports an ignore_heap_size annotation which can be applied to any field in the struct definition; the plug-in will duly ignore any such field.

If a new field is added which has a type for which HeapSizeOf has not been implemented, the compiler will complain. This means that we can’t add a new field to a struct and forget to measure it. The ignore_heap_size_of annotation also requires a string argument, which (by convention) holds a brief explanation why the member is ignored, as the following example shows.

#[ignore_heap_size_of = "Because it is a non-owning reference."]
pub image: Arc<Image>,

(An aside: the best way to handle Arc is an open question. If one of the references is clearly the owner, it probably makes sense to count the full size for that one reference. Otherwise, it is probably best to divide the size equally among all the references.)

The plug-in also has a known_heap_size_of! macro that lets us easily dictate the heap size of built-in types (such as integral types, whose heap size is zero). This works because Rust allows implementations of custom traits for built-in types. It provides additional uniformity because built-in types don’t need special treatment. The following line says that all the built-in signed integer types have a heap_size_of_children value of zero.

known_heap_size_of!(0, i8, i16, i32, i64, isize);

Finally, if there is a type for which the measurement needs to do something more complicated, we can still implement heap_size_of_children manually.

Conclusion

The Servo implementation is much nicer than the Firefox implementation, in the following ways.

  •  There is no need for an including/excluding split thanks to trait implementations on Box. This avoids boilerplate some code and makes it impossible to accidentally call the wrong method.
  • Struct fields that use built-in types are handled the same way as all others, because Rust trait implementations can be defined for built-in types.
  • Even more boilerplate is avoided thanks to the compiler plug-in that auto-derives HeapSizeOf implementations; it can even ignore fields.
  • For ignored fields, the required string parameter makes it impossible to forget to explain why the field is ignored.

These are possible due to several powerful language and compiler features of Rust that C++ lacks. There may be some C++ features that could improve the Firefox code — and I’d love to hear suggestions — but it’s never going to be as nice as the Rust code.

Categories
about:memory Firefox Memory allocation Memory consumption MemShrink Performance

Better documentation for memory profiling and leak detection tools

Until recently, the documentation for all of Mozilla’s memory profiling and leak detection tools had some major problems.

  • It was scattered across MDN, the Mozilla Wiki, and the Mozilla archive site (yes, really).
  • Documentation for several tools was spread across multiple pages.
  • Documentation for some tools was meagre, non-existent, or overly verbose.
  • Some of the documentation was out of date, e.g. describing tools that no longer exist.

A little while back I fixed these problems.

  • The documentation for these tools is now all on MDN. If you look at the MDN Performance page in the “Memory profiling and leak detection tools” section, you’ll see a brief description of each tool that explains the circumstances in which it is useful, and a link to the relevant documentation.
  • The full list of documented tools includes: about:memory, DMD, areweslimyet.com, BloatView, Refcount tracing and balancing, GC and CC logs, Valgrind, LeakSanitizer, Apple tools, TraceMalloc, Leak Gauge, and LogAlloc.
  • As well as consolidating all the pages in one place, I also improved some of the pages (with the help of people like Andrew McCreight). In particular, about:memory now has reasonably detailed documentation, something it has lacked until now.

Please take a look, and if you see any problems let me know. Or, if you’re feeling confident just fix things yourself! Thanks.

Categories
about:memory Firefox MemShrink

Per-class JS object and shape measurements in Firefox’s about:memory

A few days ago I landed support for per-class reporting of JavaScript objects and shapes in about:memory. (Shapes are auxiliary, engine-internal data structures that are used to facilitate object property accesses. They can use large amounts of memory.)

Prior to this patch, the JavaScript objects and shapes within a single compartment (which corresponds to a JavaScript window or global object) would be covered by measurements in a small number of fixed categories.

10,179,152 B (02.59%) -- objects
├───6,749,600 B (01.72%) -- gc-heap
│   ├──3,512,640 B (00.89%) ── dense-array
│   ├──2,965,184 B (00.75%) ── ordinary
│   └────271,776 B (00.07%) ── function
├───3,429,552 B (00.87%) -- malloc-heap
│   ├──2,377,600 B (00.61%) ── slots
│   └──1,051,952 B (00.27%) ── elements/non-asm.js
└───────────0 B (00.00%) ── non-heap/code/asm.js
474,144 B (00.12%) -- shapes
├──316,832 B (00.08%) -- gc-heap
│  ├──167,320 B (00.04%) -- tree
│  │  ├──152,400 B (00.04%) ── global-parented
│  │  └───14,920 B (00.00%) ── non-global-parented
│  ├──125,352 B (00.03%) ── base
│  └───24,160 B (00.01%) ── dict
└──157,312 B (00.04%) -- malloc-heap
   ├───99,328 B (00.03%) ── compartment-tables
   ├───35,040 B (00.01%) ── tree-tables
   ├───12,704 B (00.00%) ── dict-tables
   └───10,240 B (00.00%) ── tree-shape-kids

These measurements are only interesting to those who understand the guts of the JavaScript engine.

In contrast, objects and shapes are now grouped by their class. Per-class measurements relate back to the JavaScript code in a more obvious way, making these measurements useful to a wider range of people.

10,515,296 B (02.69%) -- classes
├───4,566,840 B (01.17%) ++ class(Array)
├───3,618,464 B (00.93%) ++ class(Object)
├───1,755,232 B (00.45%) ++ class(HTMLDivElement)
├─────333,624 B (00.09%) ++ class(Function)
├─────165,624 B (00.04%) ++ class(<non-notable classes>)
├──────38,736 B (00.01%) ++ class(Window)
└──────36,776 B (00.01%) ++ class(CSS2PropertiesPrototype)

(The <non-notable classes> entry aggregates all classes that are smaller than a certain threshold. This prevents any long tail of classes from bloating about:memory too much.)

Expanding the sub-tree for the Object class, we see that the fixed categories are still present, for those who are interested in them.

3,618,464 B (00.93%) -- class(Object)
├──3,540,672 B (00.91%) -- objects
│  ├──2,349,632 B (00.60%) -- malloc-heap
│  │  ├──2,348,480 B (00.60%) ── slots
│  │  └──────1,152 B (00.00%) ── elements/non-asm.js
│  └──1,191,040 B (00.30%) ── gc-heap
└─────77,792 B (00.02%) -- shapes
      ├──57,376 B (00.01%) -- gc-heap
      │  ├──47,120 B (00.01%) ── tree
      │  ├───5,360 B (00.00%) ── dict
      │  └───4,896 B (00.00%) ── base
      └──20,416 B (00.01%) -- malloc-heap
         ├──11,552 B (00.00%) ── tree-tables
         ├───6,912 B (00.00%) ── tree-kids
         └───1,952 B (00.00%) ── dict-tables

Although the per-class measurements often aren’t surprising — Object and Array objects and shapes often dominate — sometimes they are. Consider the following examples.

  • The above example has 1.7 MiB of HTMLDivElement objects and shapes, which indicates that the compartment contains many div elements.
  • If you have lots of memory used by Function objects and shapes, it suggests that the code is creating excessive numbers of closures.
  • Just this morning a visitor to the #memshrink IRC channel was wondering why they had 11 MiB of XPC_WN_NoMods_NoCall_Proto_JSClass objects and shapes in one compartment. (This is a question I currently don’t have a good answer for.)

Historically, the data-dependent measurements in about:memory — e.g. those done on a per-tab, or per-compartment, or per-image, or per-script basis — have been more useful and interesting than the ones in fixed categories, because they map obviously to browser and code artifacts. For example, per-tab measurements let you know if a particular web page is using excessive memory, and per-compartment measurements revealed the existence of zombie compartments, a kind of bad memory leak that used to be common in Firefox and its add-ons.

I’m hoping that these per-class measurements will prove similarly useful. Keep an eye on them, and please let me know and/or file bugs if you see any surprising cases.

A final note: Mozilla’s devtools team is currently making great progress on a JavaScript memory profiler, which will give finer-grained measurements of JavaScript memory usage in web content. Although there will be some overlap between that tool and these new measurements in about:memory, it will useful to have both tools, because each one will be appropriate in different circumstances.

Categories
about:memory DMD Firefox Memory consumption MemShrink

Measuring memory used by third-party code

Firefox’s memory reporting infrastructure, which underlies about:memory, is great. And when it lacks coverage — causing the “heap-unclassified” number to get large — we can use DMD to identify where the unreported allocations are coming from. Using this information, we can extend existing memory reporters or write new ones to cover the missing heap blocks.

But there is one exception: third-party code. Well… some libraries support custom allocators, which is great, because it lets us provide a counting allocator. And if we have a copy of the third-party code within Firefox, we can even use some pre-processor hacks to forcibly provide custom counting allocators for code that doesn’t support them.

But some heap allocations are done by code we have no control over, like OpenGL drivers. For example, after opening a simple WebGL demo on my Linux box, I have over 50% “heap-unclassified”.

208.11 MB (100.0%) -- explicit
├──107.76 MB (51.78%) ── heap-unclassified

DMD’s output makes it clear that the OpenGL drivers are responsible. The following record is indicative.

Unreported: 1 block in stack trace record 2 of 3,597
 15,486,976 bytes (15,482,896 requested / 4,080 slop)
 6.92% of the heap (20.75% cumulative); 10.56% of unreported (31.67% cumulative)
 Allocated at
 replace_malloc (/home/njn/moz/mi8/co64dmd/memory/replace/dmd/../../../../memory/replace/dmd/DMD.cpp:1245) 0x7bf895f1
 _swrast_CreateContext (??:?) 0x3c907f03
 ??? (/usr/lib/x86_64-linux-gnu/dri/i965_dri.so) 0x3cd84fa8
 ??? (/usr/lib/x86_64-linux-gnu/dri/i965_dri.so) 0x3cd9fa2c
 ??? (/usr/lib/x86_64-linux-gnu/dri/i965_dri.so) 0x3cd8b996
 ??? (/usr/lib/x86_64-linux-gnu/dri/i965_dri.so) 0x3ce1f790
 ??? (/usr/lib/x86_64-linux-gnu/dri/i965_dri.so) 0x3ce1f935
 glXGetDriverConfig (??:?) 0x3dce1827
 glXDestroyGLXPixmap (??:?) 0x3dcbc213
 glXCreateNewContext (??:?) 0x3dcbc48a
 mozilla::gl::GLContextGLX::CreateGLContext(mozilla::gfx::SurfaceCaps const&, mozilla::gl::GLContextGLX*, bool, _XDisplay*, unsigned long, __GLXFBConfigRec*, bool, gfxXlibSurface*) (/home/njn/moz/mi8/co64dmd/gfx/gl/../../../gfx/gl/GLContextProviderGLX.cpp:783) 0x753c99f4

The bottom-most frame is for a function (CreateGLContext) within Firefox’s codebase, and then control passes to the OpenGL driver, which eventually does a heap allocation, which ends up in DMD’s replace_malloc function.

The following DMD report is a similar case that shows up on Firefox OS.

Unreported: 1 block in stack trace record 1 of 463
 1,454,080 bytes (1,454,080 requested / 0 slop)
 9.75% of the heap (9.75% cumulative); 21.20% of unreported (21.20% cumulative)
 Allocated at
 replace_calloc /Volumes/firefoxos/B2G/gecko/memory/replace/dmd/DMD.cpp:1264 (0xb6f90744 libdmd.so+0x5744)
 os_calloc (0xb25aba16 libgsl.so+0xda16) (no addr2line)
 rb_alloc_primitive_lists (0xb1646ebc libGLESv2_adreno.so+0x74ebc) (no addr2line)
 rb_context_create (0xb16446c6 libGLESv2_adreno.so+0x726c6) (no addr2line)
 gl2_context_create (0xb16216f6 libGLESv2_adreno.so+0x4f6f6) (no addr2line)
 eglCreateClientApiContext (0xb25d3048 libEGL_adreno.so+0x1a048) (no addr2line)
 qeglDrvAPI_eglCreateContext (0xb25c931c libEGL_adreno.so+0x1031c) (no addr2line)
 eglCreateContext (0xb25bfb58 libEGL_adreno.so+0x6b58) (no addr2line)
 eglCreateContext /Volumes/firefoxos/B2G/frameworks/native/opengl/libs/EGL/eglApi.cpp:527 (0xb423dda2 libEGL.so+0xeda2)
 mozilla::gl::GLLibraryEGL::fCreateContext(void*, void*, void*, int const*) /Volumes/firefoxos/B2G/gecko/gfx/gl/GLLibraryEGL.h:180 (discriminator 3) (0xb4e88f4c libxul.so+0x789f4c)

We can’t traverse these allocations in the usual manner to measure them, because we have no idea about the layout of the relevant data structures. And we can’t provide a custom counting allocator to code outside of Firefox’s codebase.

However, although we pass control to the driver, control eventually comes back to the heap allocator, and that is something that we do have some power to change. So I had an idea to toggle some kind of mode that records all the allocations that occur within a section of code, as the following code snippet demonstrates.

SetHeapBlockTagForThread("webgl-create-new-context");
context = glx.xCreateNewContext(display, cfg, LOCAL_GLX_RGBA_TYPE, glxContext, True);
ClearHeapBlockTagForThread();

The calls on either side of glx.xCreateNewContext tell the allocator that it should tag all allocations done within that call. And later on, the relevant memory reporter can ask the allocator how many of these allocations remain and how big they are. I’ve implemented a draft version of this, and it basically works, as the following about:memory output shows.

216.97 MB (100.0%) -- explicit
├───78.50 MB (36.18%) ── webgl-contexts
├───32.37 MB (14.92%) ── heap-unclassified

The implementation is fairly simple.

  • There’s a global hash table which records which live heap blocks have a tag associated with them. (Most heap blocks don’t have a tag, so this table stays small.)
  • When SetHeapBlockTagForThread is called, the given tag is stored in thread-local storage. When ClearHeapBlockTagForThread is called, the tag is cleared.
  • When an allocation happens, we (quickly) check if there’s a tag set for the current thread and if so, put a (pointer,tag) pair into the table. Otherwise, we do nothing extra.
  • When a deallocation happens, we check if the deallocated block is in the table, and remove it if so.
  • To find all the live heap blocks with a particular tag, we simply iterate over the table looking for tag matches. This can be used by a memory reporter.

Unfortunately, the implementation isn’t suitable for landing in Firefox’s code, for several reasons.

  • It uses Mike Hommey’s replace_malloc infrastructure to wrap the default allocator (jemalloc). This works well — DMD does the same thing — but using it requires doing a special build and then setting some environment variables at start-up. This is ok for an occasional-use tool that’s only used by Firefox developers, but it’s important that about:memory works in vanilla builds without any additional effort.
  • Alternatively, I could modify jemalloc directly, but we’re hoping to one day move away from our old, heavily-modified version of jemalloc and start using an unmodified jemalloc3.
  • It may have a non-trivial performance hit. Although I haven’t measured performance yet — the above points are a bigger show-stopper at the moment — I’m worried about having to do a hash table lookup on every deallocation. Alternative implementations (store a marker in each block, or store tagged blocks in their own zone) are possible but present their own difficulties.
  • It can miss some allocations. When adding a tag for a particular section of code, you don’t want to mark every allocation that occurs while that section executes, because there could be multiple threads running and you don’t want to mark allocations from other threads. So it restricts the marking to a single thread, but if the section creates a new thread itself, any allocations done on that new thread will be missed. This might sound unlikely, but my implementation appears to miss some allocations and this is my best theory as to why.

This issue of OpenGL drivers and other kinds of third-party code has been a long-term shortcoming with about:memory. For the first time I have a partial solution, though it still has major problems. I’d love to hear if anyone has additional ideas on how to make it better.

Categories
about:memory Memory consumption MemShrink

areweslimyet.com data is now exportable and diffable

areweslimyet.com (a.k.a. AWSY) tracks Firefox’s memory usage on a basic workload that opens lots of websites. It’s not a perfect tool — you shouldn’t consider its measurements as a reliable proxy for Firefox’s memory usage in general — but it does help detect regressions.

One thing it doesn’t support is doing diffs between separate runs. Until now, that is! Thanks to work done by Eric Rahm, it’s now possible to download the data for each snapshot done during a run. This file can then be loaded in about:memory. It’s also possible to download the data for two snapshots and diff them in about:memory. Yay! This diff workflow isn’t super slick, as it requires the downloading of two files and then the loading of them in about:memory. But it’s a lot better than manually eyeballing two sets of data in two separate AWSY tabs, which was the best we could do previously. Furthermore, AWSY and about:memory already duplicate some functionality, and this implementation avoids increasing the amount of duplication.

To do the export, select a single run (zooming in on the graph appropriately) and click on the red “[export]” link next to the appropriate snapshot, as seen in the following screenshot.

Screenshot of AWSY showing the export links

Once it has finished generating the data, the “[export]” link changes to “[download]”, and you can click on it again to do the download.

This is a first step towards improving AWSY so that it can detect memory usage regressions with much higher sensitivity than it currently has.

Categories
about:memory B2G Memory consumption MemShrink

Nuwa has landed

A big milestone for Firefox OS was reached this week: after several bounces spread over several weeks, Nuwa finally landed and stuck.

Nuwa is a special Firefox OS process from which all other app processes are forked. (The name “Nuwa” comes from the Chinese creation goddess.) It allows lots of unchanging data (such as low-level Gecko things like XPCOM structures) to be shared among app processes, thanks to Linux’s copy-on-write forking semantics. This greatly increases the number of app processes that can be run concurrently, which is why it was the #3 item on the MemShrink “big ticket items” list.

One downside of this increased sharing is that it renders about:memory’s measurements less accurate than before, because about:memory does not know about the sharing, and so will over-report shared memory. Unfortunately, this is very difficult to fix, because about:memory’s reports are generated entirely within Firefox, whereas the sharing information is only available at the OS level. Something to be aware of.

Thanks to Cervantes Yu (Nuwa’s primary author), along with those who helped, including Thinker Li, Fabrice Desré, and Kyle Huey.

Categories
about:memory B2G Memory consumption MemShrink

System-wide memory measurement for Firefox OS

Have you ever wondered exactly how all the physical memory in a Firefox OS device is used?   Wonder no more.  I just landed a system-wide memory reporter which works on any Firefox product running on a Linux system.  This includes desktop Firefox builds on Linux, Firefox for Android, and Firefox OS.

This memory reporter is a bit different to the existing ones, which work entirely within Mozilla processes.  The new reporter provides measurements for the entire system, including every user-space process (Mozilla or non-Mozilla) that is running.  It’s aimed primarily at profiling Firefox OS devices, because we have full control over the code running on those devices, and so it’s there that a system-wide view is most useful.

Here is some example output from a GeeksPhone Keon.

System
Other Measurements 
397.24 MB (100.0%) -- mem
├──215.41 MB (54.23%) ── free
├──105.72 MB (26.61%) -- processes
│  ├───57.59 MB (14.50%) -- process(/system/b2g/b2g, pid=709)
│  │   ├──42.29 MB (10.65%) -- anonymous
│  │   │  ├──42.25 MB (10.63%) -- outside-brk
│  │   │  │  ├──41.94 MB (10.56%) ── [rw-p] [69]
│  │   │  │  └───0.31 MB (00.08%) ++ (2 tiny)
│  │   │  └───0.05 MB (00.01%) ── brk-heap/[rw-p]
│  │   ├──13.03 MB (03.28%) -- shared-libraries
│  │   │  ├───8.39 MB (02.11%) -- libxul.so
│  │   │  │   ├──6.05 MB (01.52%) ── [r-xp]
│  │   │  │   └──2.34 MB (00.59%) ── [rw-p]
│  │   │  └───4.64 MB (01.17%) ++ (69 tiny)
│  │   └───2.27 MB (00.57%) ++ (2 tiny)
│  ├───21.73 MB (05.47%) -- process(/system/b2g/plugin-container, pid=756)
│  │   ├──12.49 MB (03.14%) -- anonymous
│  │   │  ├──12.48 MB (03.14%) -- outside-brk
│  │   │  │  ├──12.41 MB (03.12%) ── [rw-p] [30]
│  │   │  │  └───0.07 MB (00.02%) ++ (2 tiny)
│  │   │  └───0.02 MB (00.00%) ── brk-heap/[rw-p]
│  │   ├───8.88 MB (02.23%) -- shared-libraries
│  │   │   ├──7.33 MB (01.85%) -- libxul.so
│  │   │   │  ├──4.99 MB (01.26%) ── [r-xp]
│  │   │   │  └──2.34 MB (00.59%) ── [rw-p]
│  │   │   └──1.54 MB (00.39%) ++ (50 tiny)
│  │   └───0.36 MB (00.09%) ++ (2 tiny)
│  ├───14.08 MB (03.54%) -- process(/system/b2g/plugin-container, pid=836)
│  │   ├───7.53 MB (01.89%) -- shared-libraries
│  │   │   ├──6.02 MB (01.52%) ++ libxul.so
│  │   │   └──1.51 MB (00.38%) ++ (47 tiny)
│  │   ├───6.24 MB (01.57%) -- anonymous
│  │   │   ├──6.23 MB (01.57%) -- outside-brk
│  │   │   │  ├──6.23 MB (01.57%) ── [rw-p] [22]
│  │   │   │  └──0.00 MB (00.00%) ── [r--p]
│  │   │   └──0.01 MB (00.00%) ── brk-heap/[rw-p]
│  │   └───0.31 MB (00.08%) ++ (2 tiny)
│  └───12.32 MB (03.10%) ++ (23 tiny)
└───76.11 MB (19.16%) ── other

The data is obtained entirely from the operating system, specifically from /proc/meminfo and the /proc/<pid>/smaps files, which are files provided by the Linux kernel specifically for measuring memory consumption.

I wish that the mem entry at the top was the amount of physical memory available. Unfortunately there is no way to get that on a Linux system, and so it’s instead the MemTotal value from /proc/meminfo, which is “Total usable RAM (i.e. physical RAM minus a few reserved bits and the kernel binary code)”.  And if you’re wondering about the exact meaning of the other entries, as usual if you hover the cursor over an entry in about:memory you’ll get a tool-tip explaining what it means.

The measurements given for each process are the PSS (proportional set size) measurements.  These attribute any shared memory equally among all processes that share it, and so PSS is the only measurement that can be sensibly summed across processes (unlike “Size” or “RSS”, for example).

For each process there is a wealth of detail about static code and data.  (The above example only shows a tiny fraction of it, because a number of the sub-trees are collapsed.  If you were viewing it in about:memory, you could expand and collapse sub-trees to your heart’s content.)  Unfortunately, there is little information about anonymous mappings, which constitute much of the non-static memory consumption.  I have some patches that will add an extra level of detail there, distinguishing major regions such as the jemalloc heap, the JS GC heap, and JS JIT code.  For more detail than that, the existing per-process memory reports in about:memory can be consulted.  Unfortunately the new system-wide reporter cannot be sensibly combined with the existing per-process memory reporters because the latter are unaware of implicit sharing between processes.  (And note that the amount of implicit sharing is increased significantly by the new Nuwa process.)

Because this works with our existing memory reporting infrastructure, anyone already using the get_about_memory.py script with Firefox OS will automatically get these reports along with all the usual ones once they update their source code, and the system-wide reports can be loaded and viewed in about:memory as usual. On Firefox and Firefox for Android, you’ll need to set the memory.system_memory_reporter flag in about:config to enable it.

My hope is that this reporter will supplant most or all of the existing tools that are commonly used to understand system-wide memory consumption on Firefox OS devices, such as ps, top and procrank.  And there will certainly be other interesting, available OS-level measurements that are not currently obtained. For example, Jed Davis has plans to measure the pmem subsystem.  Please file a bug or email me if you have other suggestions for adding such measurements.

Categories
about:memory DMD Firefox MemShrink

DMD now works on Windows

DMD is our tool for improving Firefox’s memory reporting.  It helps identify where new memory reporters need to be added in order to reduce the “heap-unclassified” value in about:memory.

DMD has always worked well on Linux, and moderately well on Mac (it is crashy for some people).  And it works on Android and B2G.  But it has never worked on Windows.

So I’m happy to report that DMD now does work on Windows, thanks to the excellent efforts of Catalin Iacob.  If you’re on Windows and you’ve been seeing high “heap-unclassified” values, and you’re able to build Firefox yourself, please give DMD a try.

Categories
about:memory MemShrink Programming

Libraries should permit custom allocators

Some C and C++ libraries permit the use of custom allocators, which are registered through some kind of external API.  For example, the following libraries used by Firefox provide this facility.

  • FreeType provides this via the FT_MemoryRec_ argument of the FT_New_Library() function.
  • ICU provides this via the u_setMemoryFunctions() function.
  • SQLite provides this via the sqlite3_config() function.

This gives the users of these libraries additional flexibility that can be very helpful.  For example, in Firefox we provide custom allocators that measure the size of all the live allocations done by the library;  these measurements are shown in about:memory.

In contrast, libraries that don’t allow custom allocator are very hard to account for in about:memory.  Such libraries are major contributors to the dreaded “heap-unclassified” value in about:memory.  These include Cairo and the WebRTC libraries.

Now, supporting custom allocators in a library takes some effort.  You have to be careful to always allocate in a fashion that will use the custom allocators if they have been registered.  Direct calls to vanilla allocation/free functions like malloc(), realloc(), and free() must be avoided.  For example, SpiderMonkey allows custom allocators (although Firefox doesn’t need to use that functionality), and I just fixed a handful of cases where it was accidentally using vanilla allocation/free functions.

But, it’s a very useful facility to provide, and I encourage all library writers to consider it.