<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Julian Seward&#039;s blog</title>
	<atom:link href="http://blog.mozilla.org/jseward/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.mozilla.org/jseward</link>
	<description>Just another Blog.mozilla.com weblog</description>
	<lastBuildDate>Wed, 29 Feb 2012 23:03:24 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.2</generator>
		<item>
		<title>Visualising profiling results by aggregating stack traces</title>
		<link>http://blog.mozilla.org/jseward/2012/03/01/visualising-profiling-results-by-aggregating-stack-traces/</link>
		<comments>http://blog.mozilla.org/jseward/2012/03/01/visualising-profiling-results-by-aggregating-stack-traces/#comments</comments>
		<pubDate>Wed, 29 Feb 2012 23:03:24 +0000</pubDate>
		<dc:creator>jseward</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.mozilla.org/jseward/?p=169</guid>
		<description><![CDATA[On Android, we have a couple of ways to periodically extract stack traces from a running Firefox: the built in profiler, and Valgrind.  In principle oprofile could also produce stacks at intervals, although it doesn&#8217;t. Looking at zillions of stacks, or text-only output derived from them, isn&#8217;t much fun.  What I&#8217;d really like to do [...]]]></description>
			<content:encoded><![CDATA[<p>On Android, we have a couple of ways to periodically extract stack traces from a running Firefox: the built in profiler, and Valgrind.  In principle oprofile could also produce stacks at intervals, although it doesn&#8217;t.</p>
<p>Looking at zillions of stacks, or text-only output derived from them, isn&#8217;t much fun.  What I&#8217;d really like to do is take a large number of stacks sampled at equal cost intervals, and use Josef Weidendorfer&#8217;s excellent <a title="KCachegrind GUI" href="http://kcachegrind.sourceforge.net/html/Home.html">KCachegrind GUI</a> to visualise the data.  KCachegrind, if you haven&#8217;t tried it, makes it easy to examine the dynamic call graph and to see how costs flow between callers and callees.</p>
<p>To do this, the call stacks need to be merged into an approximate dynamic call graph, cycles found and collapsed, and costs propagated back up from leaf nodes.  This produces the approximate inclusive and exclusive costs for each node (function) encountered.</p>
<p>So I wrote a program to do this.  It takes Valgrind-style call stacks and produces KCachegrind output files.  It could easily be generalised to other call stack generators &#8212; the call stack elements only need to be comparable for equality.</p>
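<p>The core of the merging step can be sketched in a few lines of Python.  This is only an illustration, not the program described above: the function name <code>merge_stacks</code> and the input format (lists of function names, innermost frame first) are assumptions, each sample counts as one unit of cost, and recursion is handled by charging a function at most once per stack rather than by finding and collapsing cycles.</p>
<pre>from collections import Counter

def merge_stacks(stacks):
    """Merge sampled call stacks into approximate call graph costs.

    Each stack is a list of function names, innermost frame first.
    Returns (exclusive, inclusive, edges) as Counters.
    """
    exclusive = Counter()   # samples where the function was the leaf
    inclusive = Counter()   # samples where the function was anywhere on the stack
    edges = Counter()       # samples flowing along each (caller, callee) pair
    for stack in stacks:
        if not stack:
            continue
        exclusive[stack[0]] += 1
        for fn in set(stack):
            inclusive[fn] += 1
        # stack[i + 1] is the caller of stack[i]
        for pair in set(zip(stack[1:], stack)):
            edges[pair] += 1
    return exclusive, inclusive, edges

# Three hypothetical samples, innermost frame first.
samples = [["c", "b", "a"], ["b", "a"], ["c", "a"]]
exclusive, inclusive, edges = merge_stacks(samples)
print(inclusive["a"], exclusive["c"], edges[("a", "b")])
</pre>
<p>Here <code>a</code> is on every stack, so its inclusive cost is 3, while <code>c</code> is the leaf of two samples and so has an exclusive cost of 2.  The edge counts are what a viewer needs in order to show how cost flows from callers to callees.</p>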
<p>I ran Firefox on Android, used the STR (steps to reproduce) in <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=728846">bug 728846</a> as a test workload, and obtained about 32,000 16-entry stacks, sampled at intervals of approximately one million instructions.  Merging them and viewing the results in KCachegrind produces output which at least roughly tallies with the ad-hoc observations listed in the bug report.</p>
<p>There&#8217;s clearly room for improvement, particularly in the handling of cycles, and the fact that it&#8217;s limited by how often the stack unwinder produces an incomplete trace.  Nevertheless it&#8217;s an interesting bit of code to have around.  Here&#8217;s a snapshot of a bit of the graph leading up to fast_composite_over_8888_0565, which features prominently in the report. <a href="http://blog.mozilla.org/jseward/files/2012/02/stack-merge-1.png"><img class="alignleft size-full wp-image-171" title="stack-merge-1" src="http://blog.mozilla.org/jseward/files/2012/02/stack-merge-1.png" alt="" width="582" height="691" /></a></p>
<p>&nbsp;</p>
<p>And here&#8217;s a picture of KCachegrind showing the caller and callee relationships for moz_pixman_image_composite32.</p>
<p><a href="http://blog.mozilla.org/jseward/files/2012/02/stack-merge-2.png"><img class="alignleft size-full wp-image-172" title="stack-merge-2" src="http://blog.mozilla.org/jseward/files/2012/02/stack-merge-2.png" alt="" width="418" height="441" /></a></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mozilla.org/jseward/2012/03/01/visualising-profiling-results-by-aggregating-stack-traces/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Valgrind for OSX 10.7 (Lion) status update</title>
		<link>http://blog.mozilla.org/jseward/2011/10/05/valgrind-for-osx-10-7-lion-status-update/</link>
		<comments>http://blog.mozilla.org/jseward/2011/10/05/valgrind-for-osx-10-7-lion-status-update/#comments</comments>
		<pubDate>Wed, 05 Oct 2011 08:38:45 +0000</pubDate>
		<dc:creator>jseward</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.mozilla.org/jseward/?p=159</guid>
		<description><![CDATA[Various sets of fixes have been committed for Valgrind on Lion.  It now works well enough to run 64 bit Firefox builds and get sane results.  32 bit builds run too, but appear to hang, for threading related reasons I cannot figure out, despite quite some investigative effort. There may be some false positives from [...]]]></description>
			<content:encoded><![CDATA[<p>Various sets of fixes have been committed for Valgrind on Lion.  It now works well enough to run 64 bit Firefox builds and get sane results.  32 bit builds run too, but appear to hang, for threading related reasons I cannot figure out, despite quite some investigative effort.</p>
<p>There may be some false positives from Memcheck as a result of kludged-up syscall wrappers for some new syscalls that are 10.7-specific.  Let me know if you see errors which you think are obviously bogus.</p>
<p>Feedback is welcomed.  If you&#8217;re developing on Mac and have migrated to 10.7, I&#8217;d be interested to hear if it works for you. If you&#8217;re still on 10.6, I&#8217;d be interested to hear if I broke anything :-)  Btw, 10.5 support is pretty much dropped now &#8212; is anybody still using 10.5 for development?</p>
<p>The tracking bug report is <a href="http://bugs.kde.org/show_bug.cgi?id=275168">valgrind bug 275168</a>.</p>
<p>Quick reminder of how to check out and build:</p>
<pre>  svn co svn://svn.valgrind.org/valgrind/trunk
  cd trunk
  ./autogen.sh
  ./configure --prefix=`pwd`/Inst
  make -j2
  make -j2 install</pre>
]]></content:encoded>
			<wfw:commentRss>http://blog.mozilla.org/jseward/2011/10/05/valgrind-for-osx-10-7-lion-status-update/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
		<item>
		<title>Valgrind on Android &#8212; Current Status</title>
		<link>http://blog.mozilla.org/jseward/2011/09/27/valgrind-on-android-current-status/</link>
		<comments>http://blog.mozilla.org/jseward/2011/09/27/valgrind-on-android-current-status/#comments</comments>
		<pubDate>Tue, 27 Sep 2011 21:39:23 +0000</pubDate>
		<dc:creator>jseward</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.mozilla.org/jseward/?p=135</guid>
		<description><![CDATA[This is a long post.  Here&#8217;s a summary. the good news: Valgrind&#8217;s Memcheck tool now works well enough on Nexus S that it can run Firefox and find real bugs, for example bug 688733.  The false error rate from Memcheck is pretty low, certainly low enough to be usable. as of this morning, all required [...]]]></description>
			<content:encoded><![CDATA[<p>This is a long post.  Here&#8217;s a summary.</p>
<ul>
<li>the good news: Valgrind&#8217;s Memcheck tool now works well enough on Nexus S that it can run Firefox and find real bugs, for example <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=688733">bug 688733</a>.  The false error rate from Memcheck is pretty low, certainly low enough to be usable.</li>
</ul>
<ul>
<li>as of this morning, all required Valgrind fixes are in the Valgrind SVN trunk repo.</li>
</ul>
<ul>
<li>the bad news: this requires building a custom ROM and kernel for the Nexus S, that is to say, mucho hoop jumping.  Not for the faint-hearted.</li>
</ul>
<p>The rest of this post explains what the constraints are and approximately what is necessary to get started.</p>
<h3>Constraints</h3>
<p>The main difficulty is the need to build a custom ROM and kernel.  There are three reasons for this:</p>
<ul>
<li>To have a swap enabled kernel.  Starting a debuggable build of Firefox on Memcheck gives a process size of around 800MB.  Without swap, it gets OOM-killed at around 270MB.  But the default kernel isn&#8217;t configured for swap, hence a rebuild is required, as per <a href="http://glandium.org/blog/?p=2214">Mike Hommey&#8217;s instructions</a>.  Once in place, I gave the kernel a 1GB swap file parked in /sdcard, and that seemed to work ok.</li>
</ul>
<ul>
<li>Libraries with symbols.  Memcheck needs to have symbol information for /system/lib/libc.so and /system/bin/linker (libc and the dynamic linker).  Without these it generates huge numbers of false error reports and is unusable.  Symbols for the other libraries are nice (better stacktraces) but not essential.</li>
</ul>
<ul>
<li>To make it possible to start Firefox on Valgrind.  Valgrind can only insert itself into a process at an exec() transition, that is, when the process starts from new.  On normal Unixes this is no problem, since the shell from which you start Valgrind does the normal fork-exec thing.  But application start on Android is completely different, and doesn&#8217;t involve exec().</li>
</ul>
<p style="padding-left: 30px;">Instead, there is a master process called the Zygote.  To start an application (eg Firefox), a message is sent via a socket to the Zygote.  This creates a child with fork(), and the child then goes on to load the relevant bytecode and (presumably) native code and &#8220;becomes&#8221; Firefox.  So there&#8217;s no exec()  boundary for Valgrind to enter at.</p>
<p style="padding-left: 30px;">Fortunately the AOSP folks provided a solution a couple of months back.  They modified Zygote so that it can start selected processes under the control of a user-specified wrapper, which is precisely the hook we need.  The AOSP tree now has this fix.</p>
<h3>Overview of getting started</h3>
<p>Here&#8217;s an overview of the process.  It doesn&#8217;t contain enough details to simply copy and paste, but it does give some idea of the hoop jumping that is unfortunately still required.</p>
<p>Download sources and build Android images, as per directions at <a href="http://source.android.com/source/building.html">http://source.android.com/source/building.html</a>.  This in itself is a major exercise.  The relevant &#8220;lunch&#8221; flavour is full_crespo-eng, I think.  At the end of this stage, you&#8217;ll have (amongst other things) libraries with symbols and a wrapper-enabled Zygote.  But not a swap enabled kernel.</p>
<p>Build a swap enabled kernel as per <a href="http://glandium.org/blog/?p=2214">Mike Hommey&#8217;s instructions</a>, and incorporate it into the images built in the previous stage.  In fact, I skipped this step &#8212; Mike kindly did it for me.</p>
<p>Push the images onto the phone, reboot, check it&#8217;s still alive.</p>
<p>Check out a copy of the Valgrind trunk from svn://svn.valgrind.org/valgrind/trunk, and build as described in detail in README.android.  If you complete that successfully, you&#8217;ll have a working installation of Valgrind on the phone at /data/local/Inst/bin/valgrind.</p>
<p>Install busybox on the phone, to make life a little less austere in the shell.</p>
<p>On the Linux host, generate a 1GB swap file and transfer it to /sdcard on the phone (that&#8217;s the only place it will fit).  Then enable swapping:</p>
<pre>  cat /proc/swaps
  /data/local/Bin/busybox swapon /sdcard/swapfile1G
  cat /proc/swaps</pre>
<p>Note you&#8217;ll have to manually re-enable swapping every time the phone is rebooted.</p>
<p>Copy from the host, the contents of out/target/product/crespo/symbols/system to /sdcard/symbols/system.  These are the debuginfo objects for the system libraries.  Valgrind expects them to be present, as per comments above, so it can read symbols for libc.so and /system/bin/linker.  This will copy far more than that, which is not essential but nice for debugging.</p>
<p>Build a Firefox you want to debug.  That of course means with line number info and with the flags --disable-jemalloc --enable-valgrind.  I strongly suggest you use &#8220;-O -g&#8221; for a good compromise between speed and debuggability.  When the build finishes, ask the build system to make an .apk file with the debug info in place, and install it.  The .apk will be huge, about 125MB:</p>
<pre>  (cd $objdir &amp;&amp; make package PKG_SKIP_STRIP=1)
  adb install -r $objdir/dist/fennec-9.0a1.en-US.eabi-arm.apk</pre>
<p>We&#8217;re nearly there.  We have a device which is all set up, and a debuggable Firefox on it.  But we need to tell Zygote that we want to start Firefox with a wrapper, namely Valgrind.  In the shell on the phone, do this:</p>
<pre>  setprop wrap.org.mozilla.fennec_sewardj "logwrapper /data/local/start_valgrind_fennec"</pre>
<p>This tells Zygote that any startup of &#8220;org.mozilla.fennec_sewardj&#8221; should be done via an exec() of /data/local/start_valgrind_fennec applied to Zygote-specified arguments.  So, now we can put any old thing in a shell script, and Zygote will run it.  Here&#8217;s what I have for /data/local/start_valgrind_fennec:</p>
<pre>  #!/system/bin/sh
  VGPARAMS='--error-limit=no'
  export TMPDIR=/data/data/org.mozilla.fennec_sewardj
  exec /data/local/Inst/bin/valgrind $VGPARAMS "$@"</pre>
<p>Obviously you can put any Valgrind params you want in VGPARAMS; you get the general idea.  Note that this is ARM, so you don&#8217;t need the --smc-check= flag that&#8217;s necessary on x86 targets.</p>
<p>Only two more hoops to jump through now.  One question is where the Valgrind output should go.  Initially I tried using Valgrind&#8217;s little-known but very useful --log-socket= parameter (see <a href="http://valgrind.org/docs/manual/manual-core.html#manual-core.basicopts">here</a> for details), but it seemed to crash the phone on a regular basis.</p>
<p>So I abandoned that.  By default, Valgrind&#8217;s output winds up in the phone&#8217;s system log, mixed up with lots of other stuff.  In the end I wound up running the following on the host, which works pretty well:</p>
<pre>  adb logcat -c ; adb logcat | grep --line-buffered start_valgrind \
    | sed -u sQ/data/local/start_valgrind_QQg | tee logfile.txt</pre>
<p>And finally .. we need to start Firefox.  Now, due to recent changes in how the libraries are packaged for Android, you can&#8217;t start it by pressing on the Fennec icon (well, you can, but Valgrind won&#8217;t read the debuginfo.)  Instead, issue this command in a shell on the phone:</p>
<pre>  am start -a org.mozilla.gecko.DEBUG -n org.mozilla.fennec_sewardj/.App</pre>
<p>This requests a &#8220;debug intent&#8221; startup of Firefox, which sidesteps the fancy dynamic unpacking of libraries into ashmem, and instead does the old style thing of unpacking them into /data/data/org.mozilla.fennec_sewardj.  From there Valgrind can read debuginfo in the normal way.</p>
<p>One minor last hint: run &#8220;top -d 2 -s cpu -m 19&#8221;.  Because Valgrind runs slowly on the phone, I&#8217;m often in the situation of wondering am-I-waiting-for-it or is-it-waiting-for-me?  Running top pretty much answers that question.</p>
<p>And .. so .. it works!  It&#8217;s slow, but it appears to be stable, and, crucially, the false error rate from  Memcheck is low enough to be usable.</p>
<p>So, what&#8217;s next?  Writing this reminded me what a hassle it is to get all the ducks lined up right.  We need to streamline it.  Suggestions welcome!</p>
<p>One thing I&#8217;ve been thinking about is to avoid the need to have debuginfo on the target, by allowing Valgrind to query the host somehow.  Another thing I plan to do is make the Callgrind tool work, so we can get profile information too.</p>
<p>&nbsp;</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mozilla.org/jseward/2011/09/27/valgrind-on-android-current-status/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>A low overhead, always-on, system-wide memory event log</title>
		<link>http://blog.mozilla.org/jseward/2011/05/26/a-low-overhead-always-on-system-wide-memory-event-log/</link>
		<comments>http://blog.mozilla.org/jseward/2011/05/26/a-low-overhead-always-on-system-wide-memory-event-log/#comments</comments>
		<pubDate>Thu, 26 May 2011 10:03:29 +0000</pubDate>
		<dc:creator>jseward</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.mozilla.org/jseward/?p=131</guid>
		<description><![CDATA[This is really an RFC. It seems to me that many end-user bugzilla reports about excessive memory use share a common structure.  First, the user reports that &#8220;I did A, B and C, had a coffee, went back to my machine and found that Firefox was sitting on N gigabytes of memory.&#8221;  Then follows lots [...]]]></description>
			<content:encoded><![CDATA[<p>This is really an RFC.</p>
<p>It seems to me that many end-user bugzilla reports about excessive memory use share a common structure.  First, the user reports that &#8220;I did A, B and C, had a coffee, went back to my machine and found that Firefox was sitting on N gigabytes of memory.&#8221;  Then follows lots of discussion along the lines of &#8220;oh, so the cycle collector did/didn&#8217;t run while Fx was idle&#8221;, and &#8220;but no, that&#8217;s not right, because the tertiary FooBar timer inhibit mechanism will have made the XYZ collector run instead&#8221; kind of thing.</p>
<p>This goes on and on, while everybody tries to figure out what Firefox was actually doing in the period leading up to the too-much-memory observation.  Meanwhile the original reporter gets bored with the discussion and moves on to something else, and there&#8217;s general confusion, annoyance, and lack of reproducibility all round.</p>
<p>So the idea is simple: make a log file listing events which are known to have a significant impact on memory usage.  Nothing heavyweight, just one line per event, timestamped, plus brief relevant stats.  For example:</p>
<ul>
<li> GC started / ended, total size N, reclaimed M bytes</li>
</ul>
<ul>
<li> GC mapped in new heap / unmapped heap</li>
</ul>
<ul>
<li> CC started / ended, total size N, reclaimed M bytes</li>
</ul>
<ul>
<li> image discard ran</li>
</ul>
<ul>
<li> jemalloc mmap&#8217;d more pages / munmapped some pages</li>
</ul>
<ul>
<li> nanojit / mjit mapped / unmapped code pages</li>
</ul>
<p>Perhaps with some indirectly relevant events such as</p>
<ul>
<li>downloaded another chunk of the phishing database</li>
</ul>
<ul>
<li> no user input seen for 17 minutes</li>
</ul>
<ul>
<li> new tab created; now there are 23 of them</li>
</ul>
<ul>
<li> user requested about:memory (+ what it produced)</li>
</ul>
<ul>
<li> extension Xyzzy loaded/initialised</li>
</ul>
<p>That way, we&#8217;d at least have some information about the space history leading up to a situation where a user says &#8220;urk!  massive memory leak!&#8221;.</p>
<p>The log would be low overhead, so it can be used in production.  For sure we don&#8217;t want to log more than a couple of events per second.  Perhaps a moderate sized circular buffer (64KB? 1MB?) that could be dumped to disk on request.  Then, when someone reports excessive memory use, the first thing we ask for is the log file.</p>
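<p>To make the circular buffer idea concrete, here&#8217;s a minimal Python sketch.  It does not describe any existing Firefox code: the class name, the method names and the line format are invented for illustration, and the buffer is bounded by number of events rather than by bytes to keep the code short.</p>
<pre>import time
from collections import deque

class MemoryEventLog:
    """Keep only the most recent memory-related events, one line each."""

    def __init__(self, max_events=4096):
        # A deque with maxlen silently drops the oldest entry when full.
        self.events = deque(maxlen=max_events)

    def record(self, event, **stats):
        details = " ".join(f"{key}={value}" for key, value in stats.items())
        self.events.append(f"{time.time():.3f} {event} {details}".rstrip())

    def dump(self, path):
        """Write the buffered events to a file, oldest first."""
        with open(path, "w") as out:
            out.write("\n".join(self.events) + "\n")

log = MemoryEventLog(max_events=3)
log.record("GC ended", total_bytes=1000, reclaimed_bytes=200)
log.record("CC ended", total_bytes=900, reclaimed_bytes=50)
log.record("image discard ran")
log.record("new tab created", tabs=23)
print(len(log.events))   # the first GC event has been overwritten
</pre>
<p>Appending to a bounded deque is cheap and never allocates without limit, which is the property that would make such a log safe to leave on in production.</p>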
<p>Partial implementations of this exist already.  There&#8217;s javascript.options.mem.log, for example, and I&#8217;m sure other subsystems have their own schemes.  But AFAIK there&#8217;s no uniform,  system-wide, lightweight, easy-to-use mechanism.  Would one be useful?</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mozilla.org/jseward/2011/05/26/a-low-overhead-always-on-system-wide-memory-event-log/feed/</wfw:commentRss>
		<slash:comments>10</slash:comments>
		</item>
		<item>
		<title>MVP: A Memory Value Profiler</title>
		<link>http://blog.mozilla.org/jseward/2011/05/09/mvp-a-memory-value-profiler/</link>
		<comments>http://blog.mozilla.org/jseward/2011/05/09/mvp-a-memory-value-profiler/#comments</comments>
		<pubDate>Mon, 09 May 2011 15:59:13 +0000</pubDate>
		<dc:creator>jseward</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.mozilla.org/jseward/?p=112</guid>
		<description><![CDATA[I&#8217;ve been wondering how much of the data we keep in memory is actually useful, as opposed to being bits which are mostly or entirely constant.  Recently I hacked up a Valgrind tool to get some initial measurements. Imagine a machine instruction that does a 32-bit memory read.  If the read value is observed only [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve been wondering how much of the data we keep in memory is actually useful, as opposed to being bits which are mostly or entirely constant.  Recently I hacked up a Valgrind tool to get some initial measurements.</p>
<p>Imagine a machine instruction that does a 32-bit memory read.  If the read value is observed only ever to be one of a small set, let&#8217;s say, 1, 2 or 3, then we might suspect that most of the bits in the word are unused.  Should we be concerned?  It depends whether the instruction accesses a wide range of addresses: if it does, that might be a clue that we&#8217;ve got some inefficient representation going on over a large area of memory.</p>
<p>MVP puts this into practice.  It watches all 1, 2, 4 and 8 byte non-stack memory read instructions, observing both the spread of read values and the spread of accessed addresses.  For each monitored instruction, it computes an aggregate wastefulness score.  This is the number of always-constant bits in the values multiplied by (some approximation of) the number of different accessed addresses.</p>
<p>You can level various kinds of &#8220;Bogus!&#8221; accusations at it, but it does give some indication of structure and array fields that are not working very hard.  For example, given this:</p>
<pre>  #define MILLION 1000000
  int a[MILLION];

  for (i = 0; i &lt; MILLION; i++)   a[i] = (i &amp; 1) ? 1 : -1;

  // ... later ...
  for (i = 0; i &lt; MILLION; i++)   sum += a[i];</pre>
<p>it will tell you that the &#8220;sum += a[i]&#8221; reads are mostly pulling in constant bits:</p>
<pre>  wasted 968,781  (31/32 cbs, 31,251 sects)  0x40057E: main (mvp1.c:25)</pre>
<p>31 out of the 32 bits read are constant (it has some sophistication about ignoring constant sign bits), and the accesses are spread over 31,251 different &#8220;sectors&#8221; (128-byte chunks of address space).  Hence this read instruction gets a relatively high wastefulness metric of 968,781.</p>
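<p>The arithmetic behind that line can be reproduced with a short Python sketch.  The way sign bits are discounted here (inverting words whose top bit is set before comparing them) is a guess that happens to reproduce the numbers above, not MVP&#8217;s actual algorithm, and the function names and the base address are made up.</p>
<pre>SECTOR_BYTES = 128

def fold_sign(value, bits):
    """Invert negative words so that sign-extension bits read as zeros."""
    value %= 2 ** bits                      # view as an unsigned word
    if value // 2 ** (bits - 1) == 1:       # top bit set: negative
        return value ^ (2 ** bits - 1)
    return value

def constant_bits(values, bits):
    """Count bit positions that never change across the observed values."""
    folded = [fold_sign(v, bits) for v in values]
    if not folded:
        return bits
    changed = 0
    for word in folded:
        changed |= word ^ folded[0]
    return bits - bin(changed).count("1")

def wastefulness(values, addresses, bits):
    """Constant bits multiplied by the number of distinct sectors touched."""
    sectors = len({addr // SECTOR_BYTES for addr in addresses})
    return constant_bits(values, bits) * sectors

# The array from the example: a million 4-byte ints alternating -1 and 1.
base = 64   # hypothetical start address, not sector-aligned
values = [1 if i % 2 else -1 for i in range(1000000)]
addresses = [base + 4 * i for i in range(1000000)]
print(constant_bits(values, 32), wastefulness(values, addresses, 32))
</pre>
<p>Under these assumptions the sketch finds 31 constant bits and 31,251 sectors, and 31 times 31,251 is the 968,781 reported above.  With a sector-aligned base address the array would span 31,250 sectors instead.</p>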
<p>MVP runs Firefox without difficulty.  Results are interesting.  A couple of high-roller tidbits:</p>
<pre>  wasted 1,143,304  (8/32 cbs, 142,913 sects)
     0x6431A7B: fast_composite_src_x888_0565 (pixman-fast-path.c:904)</pre>
<p>fast_composite_src_x888_0565 converts image data from 24 bit 8:8:8 format to 16-bit 5:6:5 format, and the 24 bit values are parked in 32 bit words.  The data is spread widely (142,913 sectors) and the top 8 bits of the read data are observed to be constant, as we&#8217;d expect.</p>
<pre>  wasted 989,696  (16/16 cbs, 61,856 sects) 
     0x652B054: js::NameNode::create(JSAtom*, JSTreeContext*)
     (jsparse.cpp:798)</pre>
<p>This has to do with</p>
<pre>  pn_dflags = (!tc-&gt;topStmt || tc-&gt;topStmt-&gt;type == STMT_BLOCK)
              ? PND_BLOCKCHILD : 0;</pre>
<p>16 constant bits out of 16 implies that tc-&gt;topStmt-&gt;type (a uint16) has only ever been observed to be STMT_BLOCK here.</p>
<p>The most striking observation from the initial numbers, though, is somewhat obvious: that on a 64 bit platform, pointer-intensive structures are space-inefficient.  There are large numbers of 64-bit load instructions that pull in values with 36 or so constant bits.  That would correspond to fetching addresses of objects scattered across a few hundred megabytes of address space.</p>
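<p>As a rough sanity check on that last claim, in Python (treating &#8220;36 or so&#8221; as exactly 36, which is an assumption):</p>
<pre>POINTER_BITS = 64
constant_bits = 36                       # "36 or so", taken as exact here
varying_bits = POINTER_BITS - constant_bits
span_bytes = 2 ** varying_bits           # address range the pointers can cover
print(span_bytes // 2 ** 20, "MiB")      # prints: 256 MiB
</pre>
<p>So 28 varying bits are enough to address 256MB, which is in line with objects scattered across a few hundred megabytes.</p>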
<p>Perhaps we should work to minimise the number of heap objects.  Maybe what we need is a way to detect object pairs with coupled lifetimes, so they can be merged into a single object, and the connecting pointers removed.</p>
<p>Anyway, enough speculation.  The tool is very much WIP.  If anybody is interested in trying it out, or has comments/suggestions, let me know.</p>
<p>&nbsp;</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mozilla.org/jseward/2011/05/09/mvp-a-memory-value-profiler/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>A thread-checking toolkit for Firefox</title>
		<link>http://blog.mozilla.org/jseward/2011/03/24/a-thread-checking-toolkit-for-firefox/</link>
		<comments>http://blog.mozilla.org/jseward/2011/03/24/a-thread-checking-toolkit-for-firefox/#comments</comments>
		<pubDate>Thu, 24 Mar 2011 13:58:42 +0000</pubDate>
		<dc:creator>jseward</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.mozilla.org/jseward/?p=103</guid>
		<description><![CDATA[Back in January I blogged about using Helgrind to check for threading errors in Firefox&#8217;s JS engine.  That effort was the first step towards a bigger goal, namely to find and remove all unintended data races in the browser proper. I have been wanting to get to a point where our C++ developers can routinely use [...]]]></description>
			<content:encoded><![CDATA[<p>Back in January I blogged about <a href="http://blog.mozilla.org/jseward/2011/01/11/finding-data-races-in-the-js-engine">using Helgrind to check for threading errors in Firefox&#8217;s JS engine</a>.  That effort was the first step towards a bigger goal, namely to find and remove all unintended data races in the browser proper.</p>
<p>I have been wanting to get to a point where our C++ developers can routinely use Helgrind to check for threading bugs in code, both new and old, in the same way that Valgrind&#8217;s Memcheck tool is now widely used to check for memory errors.  For the reasons discussed in my January posting, race checking is more difficult than memory checking.  Now, though, I believe we&#8217;re approaching the point where routine Helgrinding is feasible.</p>
<p>I&#8217;d like to introduce what amounts to a kit for thread-checking Firefox.  The main resource for this is at the MDC page &#8220;<a href="https://developer.mozilla.org/en/Debugging_Mozilla_with_Helgrind">Debugging Mozilla with Helgrind</a>&#8221;.  Here&#8217;s a summary.</p>
<p>There are three parts to the kit:</p>
<ul>
<li>A markup patch for the Mozilla code base.  This describes to Helgrind the effect of some synchronisation events it doesn&#8217;t understand and stops it complaining about some harmless races in the JS engine.</li>
</ul>
<ul>
<li>A suppression file that hides error reports in system libraries.</li>
</ul>
<ul>
<li>A development version of Helgrind.  This contains a bunch of correctness, diagnostic and scalability improvements.  A stock Valgrind installation won&#8217;t work.</li>
</ul>
<p>With this framework in place, I completed a first run through Mochitests with Helgrind.  It took 32 CPU hours.  Around 15 bugs have been filed.  Some of them are now fixed, and others have been declared harmless.  But that&#8217;s just a beginning: there are many more uninvestigated reports lurking in the mochitests output.</p>
<p>Have a look at the <a href="https://developer.mozilla.org/en/Debugging_Mozilla_with_Helgrind">MDC page</a> for more details, including directions on how to get started.  And, of course, if you want help with any of this, please feel free to contact me.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mozilla.org/jseward/2011/03/24/a-thread-checking-toolkit-for-firefox/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Profiling the browser&#8217;s virtual memory behaviour</title>
		<link>http://blog.mozilla.org/jseward/2011/01/27/profiling-the-browsers-virtual-memory-behaviour/</link>
		<comments>http://blog.mozilla.org/jseward/2011/01/27/profiling-the-browsers-virtual-memory-behaviour/#comments</comments>
		<pubDate>Thu, 27 Jan 2011 14:01:14 +0000</pubDate>
		<dc:creator>jseward</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.mozilla.org/jseward/?p=74</guid>
		<description><![CDATA[We&#8217;ve been chipping away at memory use of Firefox 4 for a couple of months now, with good results.  Recently, though, I&#8217;ve been wondering if we&#8217;re measuring the right things.  It seems to me there&#8217;s two important things to measure: Maximum virtual address space use for the process.  Why is this important?  Because if the [...]]]></description>
			<content:encoded><![CDATA[<p>We&#8217;ve been chipping away at memory use of Firefox 4 for a couple of<br />
months now, with good results.  Recently, though, I&#8217;ve been wondering<br />
if we&#8217;re measuring the right things.  It seems to me there&#8217;s two<br />
important things to measure:</p>
<ul>
<li>Maximum virtual address space use for the process.  Why is this<br />
important?  Because if the process runs out of address space, it&#8217;s<br />
in serious trouble.  Ditto, perhaps worse, if the process uses up<br />
all the machine&#8217;s swap.</li>
</ul>
<ul>
<li>But the normal case is different: we don&#8217;t run out of address space<br />
or swap.  In this case I don&#8217;t care how much memory the browser<br />
uses.  Really.  When we talk about memory use in the non-OOM<br />
situation, we&#8217;re using that measure as a proxy for responsiveness.<br />
Excessive memory use isn&#8217;t intrinsically bad.  Rather, it&#8217;s the side<br />
effect that&#8217;s the problem: it causes paging, both for the browser<br />
and for everything else running on the machine, slowing<br />
everything down.</li>
</ul>
<p>Trying to gauge responsiveness by looking at peak RSS figures strikes<br />
me as a losing prospect.  The RSS values are set by some more-or-less<br />
opaque kernel page discard algorithm, and depend on the behaviour of<br />
all processes in the system, not just Firefox.  Worse, it&#8217;s uninformative:<br />
we get no information about which parts of our code base are causing<br />
paging.</p>
<p>So I hacked up a VM profiler.  This tells me the page fault behaviour<br />
when running Firefox using a given amount of real memory.  It isn&#8217;t as<br />
big a task as it sounds, since we already have 99.9% of the required<br />
code in place: Valgrind&#8217;s Cachegrind tool.  It just required replacing<br />
the cache simulator with a virtual-to-physical address map simulator.</p>
<p>The profiler does a pretty much textbook pseudo-LRU clock algorithm<br />
simulation.  It differentiates between page faults caused by data and<br />
instruction accesses, since these require different fixes &#8212; make the<br />
data smaller vs make the code smaller.  It also differentiates between<br />
clean (page unmodified) and dirty (page modified, requires writeback)<br />
faults.</p>
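<p>For readers who haven&#8217;t met the clock algorithm, here&#8217;s a minimal Python sketch of that kind of simulation.  It is not the profiler&#8217;s code: the class and method names are invented, the 4096-byte page size is an assumption, and a fault is counted as dirty when the page evicted to make room had been written to, which is my reading of &#8220;requires writeback&#8221;.</p>
<pre>class ClockSimulator:
    """Simulate a fixed number of page frames with the clock algorithm."""

    def __init__(self, num_frames, page_size=4096):
        self.num_frames = num_frames
        self.page_size = page_size
        self.frames = []        # each entry: [page, referenced, dirty]
        self.where = {}         # page number to index in self.frames
        self.hand = 0
        self.faults = {(kind, state): 0
                       for kind in ("I", "D") for state in ("clean", "dirty")}

    def access(self, address, kind, is_write=False):
        """Record one access; kind is "I" (instruction) or "D" (data)."""
        page = address // self.page_size
        if page in self.where:
            frame = self.frames[self.where[page]]
            frame[1] = True
            frame[2] = frame[2] or is_write
            return
        if len(self.frames) != self.num_frames:
            # A free frame is available: nothing to evict or write back.
            self.where[page] = len(self.frames)
            self.frames.append([page, True, is_write])
            self.faults[(kind, "clean")] += 1
            return
        # Sweep the hand, clearing referenced bits, until a victim is found.
        while self.frames[self.hand][1]:
            self.frames[self.hand][1] = False
            self.hand = (self.hand + 1) % self.num_frames
        victim = self.frames[self.hand]
        state = "dirty" if victim[2] else "clean"
        self.faults[(kind, state)] += 1
        del self.where[victim[0]]
        self.frames[self.hand] = [page, True, is_write]
        self.where[page] = self.hand
        self.hand = (self.hand + 1) % self.num_frames

sim = ClockSimulator(num_frames=2)
sim.access(0x0000, "D")
sim.access(0x1000, "D", is_write=True)
sim.access(0x2000, "D")     # evicts page 0, which was never written
sim.access(0x3000, "D")     # evicts page 1, which needs writeback
print(sim.faults[("D", "clean")], sim.faults[("D", "dirty")])
</pre>
<p>With 4096-byte pages, simulating 100MB of real memory would correspond to <code>num_frames=25600</code>.  The tiny run above ends with three clean data faults and one dirty one.</p>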
<p>Here are some preliminary results.  Bear in mind the profiler has only<br />
just started to work, so the potential for bogosity is still large.</p>
<p>First question is: we know that 4.0 uses more memory than 3.6.x.  But<br />
does that result in more paging?  I profiled both, loading 5 cad-comic<br />
tabs (http://www.cad-comic.com/cad/random) and idling for a while, for<br />
about 8 billion instructions.  Results, simulating 100MB of real memory:</p>
<p>3.6.x, release build, using jemalloc:</p>
<p>VM I accesses: 8,250,840,547  (3,186 clean faults + 350 dirty faults)<br />
VM D accesses: 3,089,412,941  (5,239 clean faults + 552 dirty faults)</p>
<p>M-C, release build, using jemalloc:</p>
<p>VM I accesses: 8,473,182,041  ( 8,140 clean faults +  4,979 dirty faults)<br />
VM D accesses: 3,372,806,043  (22,720 clean faults + 14,335 dirty faults)</p>
<p>Apparently it does page more.  Most of the paging is due to data<br />
rather than instruction accesses.  Requires further investigation.</p>
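<p>To put numbers on that comparison, the fault counts from the two listings can be totalled.  The snippet below is only arithmetic on the figures quoted above; the variable names are illustrative.</p>

```python
# Fault counts copied from the two listings above, as (clean, dirty) pairs.
runs = {
    "3.6.x": {"I": (3186, 350), "D": (5239, 552)},
    "M-C": {"I": (8140, 4979), "D": (22720, 14335)},
}


def summarise(run):
    """Return (total faults, fraction of faults caused by data accesses)."""
    instr = sum(run["I"])
    data = sum(run["D"])
    total = instr + data
    return total, data / total


for name, run in runs.items():
    total, data_share = summarise(run)
    print(f"{name}: {total:,} faults, {data_share:.0%} from data accesses")
# prints:
# 3.6.x: 9,327 faults, 62% from data accesses
# M-C: 50,174 faults, 74% from data accesses
```

<p>On these figures M-C takes roughly five times as many faults as 3.6.x for this workload.</p>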
<p>Second question is: where does that paging come from?  Are we missing<br />
any easy wins?  From a somewhat longer run with bigger workload, I got<br />
this (w/ apologies for terrible formatting):<br />
<code><br />
Da (# data accesses)<br />
.                Dfc (# clean data faults)<br />
.                          function<br />
------------------------------------------<br />
18,921,574,436   382,023   PROGRAM TOTALS</code></p>
<p><code> </code></p>
<p><code>.   19,339,625    60,583   js::Shape::trace<br />
.    2,228,649    51,635   JSCompartment::purge<br />
.   32,583,809    22,223   js_TraceScript<br />
.   16,306,348    18,404   js::mjit::JITScript::purgePICs<br />
.   18,160,249    12,847   js::mjit::JITScript::purgePICs<br />
.   52,155,631    11,727   memset<br />
.   27,229,391    10,813   js::PropertyTree::sweepShapes<br />
.  120,482,308    10,256   js::gc::MarkChildren<br />
.  138,049,859     9,134   memcpy<br />
.    2,228,649     8,779   JSCompartment::sweep<br />
.  179,083,731     8,057   js_TraceObject<br />
.    6,269,454     5,949   js::mjit::JITScript::sweepCallICs<br />
</code></p>
<p>About 16% of the clean data faults come from js::Shape::trace.</p>
<p>And quite a few come from js::mjit::JITScript::purgePICs (two<br />
versions) and js::mjit::JITScript::sweepCallICs.  According to Dave<br />
Anderson and Chris Leary, there might be some opportunity to poke<br />
the code pages in a less jumping-around-y fashion.</p>
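<p>As a cross-check, each function&#8217;s share of the clean data faults can be recomputed from the table.  This is plain arithmetic on the Dfc column; the grouping of the JIT-related rows is just for illustration.</p>

```python
TOTAL_CLEAN_DATA_FAULTS = 382_023  # Dfc for the PROGRAM TOTALS row

clean_data_faults = {
    "js::Shape::trace": 60_583,
    "purgePICs (both versions) + sweepCallICs": 18_404 + 12_847 + 5_949,
}

shares = {name: n / TOTAL_CLEAN_DATA_FAULTS
          for name, n in clean_data_faults.items()}

for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```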
]]></content:encoded>
			<wfw:commentRss>http://blog.mozilla.org/jseward/2011/01/27/profiling-the-browsers-virtual-memory-behaviour/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Finding data races in the JS engine</title>
		<link>http://blog.mozilla.org/jseward/2011/01/11/finding-data-races-in-the-js-engine/</link>
		<comments>http://blog.mozilla.org/jseward/2011/01/11/finding-data-races-in-the-js-engine/#comments</comments>
		<pubDate>Tue, 11 Jan 2011 01:58:15 +0000</pubDate>
		<dc:creator>jseward</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.mozilla.org/jseward/?p=61</guid>
		<description><![CDATA[Some background Back in March last year I spent some time using Valgrind&#8217;s Helgrind tool to look for data races in the browser.  Data races happen when two threads access the same piece of memory without any form of synchronisation (either locking or the use of atomic operations), and at least one of the accesses [...]]]></description>
			<content:encoded><![CDATA[<h3>Some background</h3>
<p>Back in March last year I spent some time using Valgrind&#8217;s Helgrind tool<br />
to look for data races in the browser.  Data races happen when two<br />
threads access the same piece of memory without any form of<br />
synchronisation (either locking or the use of atomic operations), and at<br />
least one of the accesses is a write.  That can lead to all manner of<br />
data structure corruption and crashes.  What&#8217;s really bad is that such<br />
bugs are often nearly impossible to reproduce, because they are timing<br />
dependent.  They therefore fall into the Very Scary Bugs category, and<br />
are something we really want to get rid of by any means possible.<br />
Before release!</p>
<p>Helgrind is a runtime analysis tool that looks for such races.  It is far<br />
from perfect, but better than nothing.  For those familiar with these<br />
things, it&#8217;s a pure happens-before race detector capable of showing full<br />
stacks for both memory accesses involved in a race.  It also checks for<br />
lock ordering inconsistencies (a.k.a. potential deadlocks) and various<br />
misuses of the POSIX pthreads API.</p>
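<p>For readers unfamiliar with happens-before detection, here is a toy Python sketch of the idea using vector clocks.  It is not how Helgrind is implemented.  The class, its methods and the event names are invented for illustration, and only lock acquire/release is modelled as synchronisation.</p>

```python
class HBDetector:
    """Toy happens-before race detector based on vector clocks."""

    def __init__(self):
        self.clocks = {}      # thread -> {thread: int} vector clock
        self.released = {}    # lock -> vector clock at its last release
        self.last_write = {}  # addr -> (thread, epoch) of the last write
        self.last_reads = {}  # addr -> {thread: epoch} since the last write
        self.races = []       # (addr, earlier thread, later thread)

    def _clock(self, t):
        # Each thread starts with its own component at 1.
        return self.clocks.setdefault(t, {t: 1})

    def acquire(self, t, lock):
        # Join: the acquirer learns everything the last releaser knew.
        mine = self._clock(t)
        for u, n in self.released.get(lock, {}).items():
            mine[u] = max(mine.get(u, 0), n)

    def release(self, t, lock):
        mine = self._clock(t)
        self.released[lock] = dict(mine)
        mine[t] += 1  # later events in t are not covered by this release

    def _ordered(self, earlier, t):
        # The earlier access happens-before t's current point if t has
        # already seen the earlier thread's clock at that epoch.
        u, epoch = earlier
        return self._clock(t).get(u, 0) >= epoch

    def read(self, t, addr):
        w = self.last_write.get(addr)
        if w and not self._ordered(w, t):
            self.races.append((addr, w[0], t))
        self.last_reads.setdefault(addr, {})[t] = self._clock(t)[t]

    def write(self, t, addr):
        w = self.last_write.get(addr)
        priors = [w] if w else []
        priors += list(self.last_reads.get(addr, {}).items())
        for earlier in priors:
            if not self._ordered(earlier, t):
                self.races.append((addr, earlier[0], t))
        self.last_write[addr] = (t, self._clock(t)[t])
        self.last_reads[addr] = {}


d = HBDetector()
# "x" is always accessed under lock "m": no race is reported.
d.acquire("T1", "m"); d.write("T1", "x"); d.release("T1", "m")
d.acquire("T2", "m"); d.write("T2", "x"); d.release("T2", "m")
# "y" is written by T1 and read by T2 with no synchronisation in between.
d.write("T1", "y")
d.read("T2", "y")
print(d.races)  # [('y', 'T1', 'T2')]
```

<p>Note that the verdict depends on the order in which the events were observed, which is why a detector of this kind only reports races that the particular run happened to expose.</p>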
<p>So what happened with Helgrinding the browser?  In short, I didn&#8217;t get<br />
far.  Mostly I just got greyer hair.  I had hoped to get the browser<br />
Helgrind-clean, in the same way it is now pretty much Memcheck-clean,<br />
but the effort got mired in difficulties:</p>
<ul>
<li> Race-checking is resource-intensive, more so than the memory checking<br />
that Valgrind (Memcheck) is normally used for.  A critical data<br />
structure in Helgrind (vector timestamps) turned out not to work well<br />
at the scale demanded for full browser runs, and they became<br />
infeasibly slow and memory hungry.</li>
</ul>
<ul>
<li> Happens-before race detection has the advantage of not giving false<br />
positives.  But the downside is it is scheduling-sensitive, so<br />
identical runs sometimes report different subsets of the races that<br />
are really present.  Add to that the nondeterminism of the browser and<br />
the slowness of Helgrind, and I had a major repeatability problem.</li>
</ul>
<ul>
<li> The browser has a number of deliberate or apparently-harmless races.<br />
Making sense of the race reports required examining bits of source<br />
code all over the tree that I&#8217;d never seen before and didn&#8217;t<br />
understand.  This proved to be difficult and time consuming.</li>
</ul>
<p>So I left it at that, resolving one day to come back and fix the vector<br />
timestamp representations, so as to at least avoid the resource<br />
problems.</p>
<p>Nothing happened for some months.  Then, in December, Paul Biggar wrote<br />
a nice self-contained threaded Javascript test case, <a href="http://bugzilla.mozilla.org/show_bug.cgi?id=619595">bug 619595</a>.</p>
<p>And I thought: hmm, maybe I should Helgrindify just the threaded jsshell<br />
running Paul&#8217;s test.  The standalone JS engine is smaller and more<br />
tractable than the browser, and I knew that Jason Orendorff had<br />
successfully used Helgrind on it earlier in the year.  Also, there&#8217;s<br />
been a vast amount of JS engine hackery in the past year, with<br />
particular emphasis on the threading aspects.  And 4.0 is coming up<br />
fast.  So, I thought, now might be a good time to give it a spin.</p>
<h3>Preparation</h3>
<p>To work properly, Helgrind needs to see all inter-thread synchronisation<br />
events in the program under test.  It can do that without help for<br />
programs which use only the POSIX pthreads API.  But many programs roll<br />
their own synchronisation primitives, and since it can&#8217;t see those,<br />
Helgrind reports huge numbers of races which don&#8217;t exist.  Both NSPR and<br />
the JS engine do this (eg ThinLocks).  A small amount of markup using<br />
client requests (<a href="http://bugzilla.mozilla.org/show_bug.cgi?id=551155">bug 551155</a>) provides Helgrind with the information it needs.</p>
<p>Paul&#8217;s test runs, or, at least, tries to run, all the Sunspider tests in<br />
parallel.  I modified it trivially to make it run up to 10 copies of<br />
each test in parallel.  This stresses Helgrind to the limit of<br />
feasibility on my machine, but shakes out more races.</p>
<p>Then we&#8217;re ready to go.</p>
<h3>Results</h3>
<p>I found a number of races &#8212; some expected, some not.  The unintended<br />
and dangerous-looking ones are:</p>
<ul>
<li> allocator for YARR-generated code is not thread safe (<a href="http://bugzilla.mozilla.org/show_bug.cgi?id=587288">bug 587288</a>)<br />
&#8211; potential crasher</li>
</ul>
<ul>
<li> race on JSContext::defaultCompartmentIsLocked (<a href="http://bugzilla.mozilla.org/show_bug.cgi?id=622691">bug 622691</a>)<br />
&#8211; consequences unknown to me, but doesn&#8217;t look correct</li>
</ul>
<ul>
<li>various races on parts of the property tree, eg kids[] array<br />
elements (<a href="http://bugzilla.mozilla.org/show_bug.cgi?id=609104#c3">bug 609104 comment 3</a>)</li>
</ul>
<p>Then there are two which are unintended but probably harmless.  In both<br />
cases each thread initialises a shared data structure to some value<br />
which never changes after that, so multiple initialisations are harmless:</p>
<ul>
<li>jsdate.cpp: global &#8220;static jsdouble LocalTZA;&#8221; is raced</li>
</ul>
<ul>
<li>nanojit::Assembler::nHints[] is raced.</li>
</ul>
<p>Then there are races which are intended and, so, presumably harmless, on<br />
the following fields:</p>
<ul>
<li>JSRuntime::gcMallocBytes (also gcBytes, I think)</li>
</ul>
<ul>
<li> JSRuntime::gcPoke</li>
</ul>
<ul>
<li> JSRuntime::protoHazardShape</li>
</ul>
<ul>
<li> JSThreadData::requestDepth</li>
</ul>
<ul>
<li> JSRuntime::gcIsNeeded</li>
</ul>
<p>Finally, there&#8217;s one I can&#8217;t decide about:</p>
<ul>
<li> The GC&#8217;s stack scanner races against other functions that touch the<br />
stack &#8212; that is, just about everything.  I don&#8217;t know if I expect<br />
that or not.  I would have thought that if one thread is scanning the<br />
stack, all the other threads are blocked waiting for it, hence there<br />
is no race.  So either (1) my understanding is wrong, (2) Helgrind<br />
doesn&#8217;t see the inter-thread sync events causing other threads to<br />
wait, or (3) the stack scanner is borked.  I suspect (1) or (2).</li>
</ul>
<h3>Comments</h3>
<p>It&#8217;s pleasing to have a list of at least some of the observable races in<br />
the JS engine, since it provides something to cross-check assumed racy<br />
behaviour against.  It&#8217;s also good to have found some unintended races<br />
before release.</p>
<p>I&#8217;m a little concerned about the intended races, eg JSRuntime::gcPoke.<br />
My sketchy understanding of the C++0x draft standard is that accesses to<br />
shared locations must be mediated either by locking or by machine-level<br />
atomic operations.  All other shared accesses count as races.  C++0x, in<br />
<a href="http://www.hpl.hp.com/personal/Hans_Boehm/c++mm/threadsintro.html#c++0x">the words of Hans J Boehm</a>, &#8216;guarantees nothing in the event of a data<br />
race.  Any program allowing a data race produces &#8220;undefined behavior&#8221;.&#8217;</p>
<p>Boehm has a good presentation which summarises the proposed C++0x memory<br />
model, at <a href="http://www.hpl.hp.com/personal/Hans_Boehm/misc_slides/c++mm.pdf">http://www.hpl.hp.com/personal/Hans_Boehm/misc_slides/c++mm.pdf</a>.<br />
See in particular slides 7, 9 and 11.</p>
<p>From a Helgrind-usage point of view, these races are easily suppressed<br />
by adding client requests to specify that the fields in question should<br />
not be race-checked.  But, overall, I still don&#8217;t like them: deliberate<br />
races are a hindrance to understandability and to automated checking.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mozilla.org/jseward/2011/01/11/finding-data-races-in-the-js-engine/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Space profiling the browser</title>
		<link>http://blog.mozilla.org/jseward/2011/01/07/space-profiling-the-browser/</link>
		<comments>http://blog.mozilla.org/jseward/2011/01/07/space-profiling-the-browser/#comments</comments>
		<pubDate>Fri, 07 Jan 2011 03:01:11 +0000</pubDate>
		<dc:creator>jseward</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.mozilla.org/jseward/?p=43</guid>
		<description><![CDATA[It&#8217;s been a good month or more since we started the current round of chasing space problems in Firefox.  Considerable effort has gone into identifying and fixing memory hogs.  Although the individual fixes are often excellent, I haven&#8217;t had the big picture on how we&#8217;re doing. So today I did some 3 way profiling, comparing [...]]]></description>
			<content:encoded><![CDATA[<p>It&#8217;s been a good month or more since we started the current round of<br />
chasing space problems in Firefox.  Considerable effort has gone into<br />
identifying and fixing memory hogs.  Although the individual fixes are<br />
often excellent, I haven&#8217;t had the big picture on how we&#8217;re doing.<br />
So today I did some 3 way profiling, comparing</p>
<ul>
<li>mozilla-central of today, incorporating essentially all the<br />
space fixes to date</li>
</ul>
<ul>
<li>mozilla-central of 1 Nov last year, before this really got going</li>
</ul>
<ul>
<li>1.9.2 of today, since that&#8217;s what we keep<br />
getting compared against</li>
</ul>
<p>These are release builds on x86_64 linux, using jemalloc, as that&#8217;s<br />
presumably the least fragmentful allocator we have.</p>
<p>Each run loads 20 cad-comic.com tabs.  I let the browser run through<br />
60 billion machine instructions, then stopped it.  By<br />
around 40 billion instructions it has loaded the tabs completely, and<br />
the last 20 billion are essentially idling, intended to give an<br />
identifiable steady-state plateau.  That plateau ought to indicate the<br />
minimum achievable residency, after the cycle collector, JS garbage<br />
collector, the method jit code thrower-awayer, the image discarder,<br />
and any other such things, have done their thing.  I regard the<br />
plateau as more indicative of the behaviour of the browser during a<br />
long run, than I do the peak.</p>
<p>I profiled using Valgrind&#8217;s Massif profiler, using the --pages-as-heap<br />
option.  This measures all mapped pages in the process, and so<br />
includes C++ heap, other mmap&#8217;d space, code, data and bss segments &#8211;<br />
everything.</p>
<p>Consequently a lot of the measured space is the constant overhead of<br />
the text, data and bss segments of the many shared objects involved.<br />
That cost is the same regardless of the browser&#8217;s workload.  To<br />
quantify it, I did a fourth profile run, loading a single blank page.<br />
This gives me a way to compute the incremental cost for each<br />
cad-comic.com tab.</p>
<p>The summary results of all this are (all numbers are MBs)</p>
<ul>
<li>Constant overhead: 526</li>
</ul>
<ul>
<li>Total costs: 1.9.2   907,  MC-Nov10  1149,  MC-now  1077</li>
</ul>
<ul>
<li>Hence incremental per-tab costs are:<br />
1.9.2  19.0,<br />
MC-Nov10  31.1 (63% above 1.9.2),<br />
MC-now  27.5  (45% above 1.9.2)</li>
</ul>
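<p>The per-tab figures follow from subtracting the constant overhead and dividing by the 20 tabs.  A small sketch of that arithmetic, using only the numbers listed above (the variable names are illustrative):</p>

```python
TABS = 20
CONSTANT_OVERHEAD_MB = 526
totals_mb = {"1.9.2": 907, "MC-Nov10": 1149, "MC-now": 1077}

# Incremental cost per tab = (total - constant overhead) / number of tabs.
per_tab = {name: (total - CONSTANT_OVERHEAD_MB) / TABS
           for name, total in totals_mb.items()}

baseline = per_tab["1.9.2"]
for name, cost in per_tab.items():
    print(f"{name}: {cost:.2f} MB per tab, "
          f"{cost / baseline - 1:+.0%} relative to 1.9.2")
```

<p>The results agree with the figures in the list to within rounding.</p>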
<p>So we&#8217;ve made considerable improvements since November.  But we&#8217;re<br />
still worse than 1.9.2.  Nick Nethercote tells me that bug 623428 should<br />
bring further improvements when it lands.</p>
<p>Here are the top-level visualisations for the three profiles.</p>
<p>Firstly, 1.9.2 (picture below).  What surprised me is the massive peak<br />
of around 1.6GB during page load.  Once that&#8217;s done, it falls back to a<br />
series of modest trough-peak variations.  I took the steady-state<br />
measurement above at the lowest trough, around 54 billion instructions<br />
on the horizontal axis.</p>
<p>Also interesting is that steady-state is reached before 25 billion<br />
instructions.  The M-C runs below took longer to get there.</p>
<div id="attachment_45" class="wp-caption alignleft" style="width: 748px"><a href="http://blog.mozilla.org/jseward/files/2011/01/192.png"><img class="size-large wp-image-45  " title="Profile for 1.9.2" src="http://blog.mozilla.org/jseward/files/2011/01/192-1024x775.png" alt="" width="738" height="558" /></a><p class="wp-caption-text">Profile for 1.9.2</p></div>
<p>The M-C Nov10 picture (below) is less dramatic.  It lacks the 1.6GB peak,<br />
instead climbing to pretty much the final level of around 1.2GB and<br />
staying there, with a slight decline into steady-state at around 44<br />
billion insns.</p>
<p><a href="http://blog.mozilla.org/jseward/files/2011/01/MC-1nov10.png"><img class="alignleft size-large wp-image-46" title="MC-1nov10" src="http://blog.mozilla.org/jseward/files/2011/01/MC-1nov10-1024x815.png" alt="Profile for M-C, 1 Nov 10" width="738" height="587" /></a></p>
<p>The M-C-of-now picture (below) is similar, although steady state<br />
is less steady, and somewhat lower, reflecting the fixes of the past<br />
few weeks.  Observe how the orange band steps down slightly in<br />
three stages after about 24 billion instructions.  I believe that&#8217;s Brian<br />
Hackett&#8217;s code discard patch, bug 617656.  Also, note the gradual<br />
slope up from around 38 billion to 53 billion insns.  That might be<br />
the excessively-infrequent GC problem investigated in bug 619822.</p>
<p>So what&#8217;s with the 1.6GB peak for 1.9.2?  It gives the interesting<br />
effect that, although M-C is worse in steady state than 1.9.2, M-C<br />
has more modest peak requirements, at least for this test case.</p>
<p>On investigation, the 1.9.2 spike seems to be caused by thread stacks.<br />
The implication is that it has more simultaneously live threads than<br />
M-C.  Why this should be, I don&#8217;t know.  I did however notice that<br />
1.9.2 seems to load all 20 tabs at the same time, whereas M-C appears<br />
to pull them in in smaller groups.  Related?  I don&#8217;t know.</p>
<p><a href="http://blog.mozilla.org/jseward/files/2011/01/MC-6jan111.png"><img class="alignleft size-large wp-image-48" title="MC-6jan11" src="http://blog.mozilla.org/jseward/files/2011/01/MC-6jan111-1024x815.png" alt="Profile for M-C, 6 Jan 11" width="738" height="587" /></a></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mozilla.org/jseward/2011/01/07/space-profiling-the-browser/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
		<item>
		<title>Fun &#8216;n&#8217; games with DHAT</title>
		<link>http://blog.mozilla.org/jseward/2010/12/05/fun-n-games-with-dhat/</link>
		<comments>http://blog.mozilla.org/jseward/2010/12/05/fun-n-games-with-dhat/#comments</comments>
		<pubDate>Sun, 05 Dec 2010 22:40:19 +0000</pubDate>
		<dc:creator>jseward</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.mozilla.org/jseward/?p=4</guid>
		<description><![CDATA[Back in 2007 Nick Nethercote morphed his Massif heap profiler into its present form.  Massif intercepts malloc/free et al, takes periodic snapshots of the heap and shows results using stack trees. It answers the questions &#8220;what&#8217;s in the heap?&#8221; and &#8220;who put it there?&#8221; Since then, I&#8217;d been mulling over a heap profiler that could [...]]]></description>
			<content:encoded><![CDATA[<p>Back in 2007 Nick Nethercote morphed his <a href="http://www.valgrind.org/docs/manual/ms-manual.html">Massif</a> heap profiler into<br />
its present form.  Massif intercepts malloc/free et al, takes<br />
periodic snapshots of the heap and shows results using stack trees.<br />
It answers the questions &#8220;what&#8217;s in the heap?&#8221; and &#8220;who put it there?&#8221;</p>
<p>Since then, I&#8217;d been mulling over a heap profiler that could also tell<br />
me something about block lifetimes and usages.  This year I finally<br />
got around to hacking something up &#8212; <a href="http://www.valgrind.org/docs/manual/dh-manual.html">DHAT</a>.  Like Massif, DHAT<br />
intercepts malloc/free, but it also inspects every (data) memory<br />
reference, to see which block, if any, it is to.  By doing that we can<br />
identify hot blocks and under-used ones.  For allocation points which<br />
always allocate blocks of the same size, DHAT keeps count of how often<br />
each block offset is accessed, thereby giving information on hot and<br />
cold object fields, and showing up probable alignment holes.</p>
<p>DHAT also records block lifetime information.  Time is measured in<br />
instructions executed, as in Massif.  DHAT notes the age at death of<br />
each block and shows the average value for each allocation point.<br />
Doing that makes it easy to find allocation points which chew through<br />
lots of heap, but don&#8217;t hold on to it for long, or, conversely, points<br />
that allocate heap and hold on to it for the entire process lifetime.</p>
<p>DHAT tracks two kinds of entities: blocks and allocation points (APs).  A block<br />
is just a heap block.  An AP is a stack trace that has allocated one<br />
or more blocks.  When a block is freed, its statistics are merged back<br />
into its AP.  At the end of the run, DHAT shows statistics for the top<br />
N APs, as sorted by one of three user-selectable metrics.  Most of the<br />
art of using DHAT is in interpreting the mass of numbers it produces.</p>
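<p>The block/AP bookkeeping can be sketched in a few lines of Python.  This is an illustration of the idea only, not DHAT&#8217;s implementation: the class names, field names and the <code>top</code> helper are all hypothetical, and time is passed in explicitly as an instruction count.</p>

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Block:
    ap: tuple   # the allocating stack trace, e.g. ("operator new", "caller")
    size: int
    born: int   # instruction count at allocation
    bytes_read: int = 0
    bytes_written: int = 0


@dataclass
class APStats:
    tot_bytes: int = 0
    tot_blocks: int = 0
    live_bytes: int = 0
    live_blocks: int = 0
    max_live_bytes: int = 0
    max_live_blocks: int = 0
    deaths: int = 0
    total_age: int = 0
    bytes_read: int = 0
    bytes_written: int = 0

    @property
    def avg_age_at_death(self):
        return self.total_age / self.deaths if self.deaths else 0.0


class Profiler:
    def __init__(self):
        self.aps = defaultdict(APStats)  # stack trace -> statistics

    def malloc(self, ap, size, now):
        s = self.aps[ap]
        s.tot_bytes += size
        s.tot_blocks += 1
        s.live_bytes += size
        s.live_blocks += 1
        s.max_live_bytes = max(s.max_live_bytes, s.live_bytes)
        s.max_live_blocks = max(s.max_live_blocks, s.live_blocks)
        return Block(ap, size, now)

    def free(self, block, now):
        # On death, the block's statistics are merged back into its AP.
        s = self.aps[block.ap]
        s.live_bytes -= block.size
        s.live_blocks -= 1
        s.deaths += 1
        s.total_age += now - block.born
        s.bytes_read += block.bytes_read
        s.bytes_written += block.bytes_written

    def top(self, n, metric="max_live_bytes"):
        # Rank APs by one of the metrics, largest first.
        ranked = sorted(self.aps.items(),
                        key=lambda kv: getattr(kv[1], metric), reverse=True)
        return ranked[:n]
```

<p>Sorting with <code>metric="max_live_blocks"</code> or <code>metric="tot_bytes"</code> instead gives the other two views of the same data.</p>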
<p>DHAT perhaps ought to be merged with Massif at some point.  For now,<br />
the emphasis was to get something up and running quickly, to see if it<br />
generates any useful information.</p>
<p>So does it show up anything interesting in Firefox?</p>
<p>Here&#8217;s a no-brainer, <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=609905">bug 609905</a>:</p>
<p><code>-------------------- 32 of 5000 --------------------<br />
max-live:    524,328 in 1 blocks<br />
tot-alloc:   524,328 in 1 blocks (avg size 524328.00)<br />
deaths:      1, at avg age 15,015,989,851 (99.60% of prog lifetime)<br />
acc-ratios:  0.00 rd, 0.00 wr  (192 b-read, 825 b-written)<br />
at 0x4C27ECA: operator new(unsigned long) (vg_replace_malloc.c:261)<br />
by 0x661D01D: js::InitJIT(js::TraceMonitor*) (jstracer.cpp:7807)<br />
by 0x651FBEB: JSThreadData::init() (jscntxt.cpp:497)<br />
by 0x652049D: js_CurrentThread(JSRuntime*) (jscntxt.cpp:588)<br />
by 0x652085C: js_InitContextThread(JSContext*) (jscntxt.cpp:659)<br />
</code></p>
<p>This is a half-megabyte block that&#8217;s allocated, held onto for the<br />
entire browser run, but never accessed.  The key is to look at the<br />
acc-ratios field, which shows the average number of times each byte<br />
in the block got read and written &#8212; here, 0.00 for both.  Turns out<br />
to be a leftover allocator from the regexp engine that predated YARR.</p>
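<p>For a single block, the acc-ratios are just bytes read and bytes written divided by the block size.  A quick check against the numbers in the listing above (the function name is made up for illustration):</p>

```python
def acc_ratios(bytes_allocated, bytes_read, bytes_written):
    """Average number of reads and writes per allocated byte."""
    return bytes_read / bytes_allocated, bytes_written / bytes_allocated


rd, wr = acc_ratios(524_328, 192, 825)
print(f"{rd:.2f} rd, {wr:.2f} wr")  # 0.00 rd, 0.00 wr
```

<p>A ratio that rounds to 0.00 on a long-lived block of this size is the signal to look for.</p>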
<p>Here&#8217;s another, <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=611400">bug 611400</a>, that&#8217;s more<br />
interesting.  When profiling the browser I noticed quite a lot of heap<br />
occupied by allocations from Jaegermonkey&#8217;s method<br />
js::mjit::Compiler::finishThisUp.  I wondered what it was but didn&#8217;t<br />
think much more about it until I profiled the JS engine running<br />
Kraken, and fell across this:</p>
<p><code>-------------------- 4 of 100 --------------------<br />
max-live:    12,197,056 in 1 blocks<br />
tot-alloc:   12,197,056 in 1 blocks (avg size 12197056.00)<br />
deaths:      1, at avg age 1,432,667,675 (42.98% of prog lifetime)<br />
acc-ratios:  0.00 rd, 0.00 wr  (59 b-read, 129 b-written)<br />
at 0x47B44FF: calloc (vg_replace_malloc.c:467)<br />
by 0x81FF24D: js::mjit::Compiler::finishThisUp(js::mjit::JITSc<br />
by 0x8215E47: js::mjit::Compiler::performCompilation(js::mjit:<br />
by 0xC2F402F: ???<br />
</code></p>
<p>Hmm, a 12.2 MB allocation which is never used!  That&#8217;s around 12% of<br />
the maximum live heap in this run.  Turns out the method JIT creates<br />
tables mapping JS bytecodes to the corresponding native code entry<br />
points.  This Kraken test (imaging-darkroom) includes huge tables of<br />
constants, for which there are no entry points, so the 12MB allocation<br />
is a completely empty table.</p>
<p>This is an extreme case of a more general problem, though.<br />
Instrumenting the method jit shows that about 98% of table entries are<br />
unused when running more &#8220;normal&#8221; JavaScript, such as when<br />
surfing fairly complex-looking web pages with 5 open tabs.</p>
<p>I changed the table representation to only store useful entries.  That<br />
saves the 12.2MB in Kraken.  For a browser with 5 tabs, it saves<br />
somewhere in the region of 1%-2% of the entire C++ heap, which is a<br />
nice outcome.</p>
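<p>To illustrate the difference between the two representations, here is a Python sketch contrasting a dense table (one slot per bytecode offset) with a sparse one that stores only the offsets that have an entry point.  The sorted-array-plus-binary-search layout is an assumption chosen for the example; the post does not say which representation the actual patch uses.</p>

```python
import bisect


def dense_table(script_length, entries):
    """One slot per bytecode offset; None where there is no entry point."""
    table = [None] * script_length
    for offset, native in entries.items():
        table[offset] = native
    return table


class SparseTable:
    """Stores only offsets that have an entry point, sorted for lookup."""

    def __init__(self, entries):
        pairs = sorted(entries.items())
        self.offsets = [offset for offset, _ in pairs]
        self.natives = [native for _, native in pairs]

    def lookup(self, offset):
        # Binary search for the offset; None if it has no entry point.
        i = bisect.bisect_left(self.offsets, offset)
        if i != len(self.offsets) and self.offsets[i] == offset:
            return self.natives[i]
        return None


# Hypothetical script: 10,000 bytecodes, only three of them entry points.
entries = {0: 0x1000, 7: 0x1040, 9000: 0x2000}
dense = dense_table(10_000, entries)
sparse = SparseTable(entries)
print(len(dense), len(sparse.offsets))  # 10000 3
```

<p>The sparse form trades a constant-time index for a logarithmic search, in exchange for space proportional to the number of useful entries.</p>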
<p>And there&#8217;s more:</p>
<ul>
<li>we were allocating a 3MB file buffer when writing Zip files (<a href="https://bugzilla.mozilla.org/show_bug.cgi?id=610040">610040</a>) (although JOrendorff beat me to it by a few minutes)</li>
</ul>
<ul>
<li>Inefficient layout of CSS style rules (<a href="https://bugzilla.mozilla.org/show_bug.cgi?id=596140">596140</a>)</li>
</ul>
<p>The above results were obtained by looking through allocation points<br />
sorted by decreasing maximum-live-volume.  Sorting the output on other<br />
metrics gives other perspectives:</p>
<ul>
<li>sorting by decreasing max-blocks-live shows places where we allocate large numbers of small blocks (CSS handling, the HTML5 parser)</li>
</ul>
<ul>
<li>sorting by decreasing total-bytes-allocated tends to show up places that turn over a lot of heap, even if it isn&#8217;t held onto for long (on Linux, the X client libraries seem particularly bad)</li>
</ul>
<p>DHAT is available in Valgrind-3.6.0 (--tool=exp-dhat).  It is stable<br />
but can sometimes produce misleading numbers, and is unreasonably slow<br />
for such a simple tool.  I have a fixed-up and 2-3x faster version,<br />
which I&#8217;ll ship in 3.6.1.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mozilla.org/jseward/2010/12/05/fun-n-games-with-dhat/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
	</channel>
</rss>

