22
Feb 12

Only pay for the entropy you use

Log Files Are Boring

Just an idea, based on hearing that build log transfers seem to consume large amounts of bandwidth. (Note that for all I know, this is already being done.)

Logs are pretty dull. In particular, two consecutive log files are usually quite similar. It’d be nice if we could take advantage of this redundancy to reduce the bandwidth/time consumed by log transfers.

rsync likes boring data

The natural thing that springs to mind is rsync. I grabbed two log files that are probably more similar to each than is really fair, but they shouldn’t be horribly unrepresentative. rsyncing one to the other found them to share 32% of their data, based on the |rsync –stat| output lines labeled “Matched data” and “Literal data”, for a speedup of 1.46x.

I suspected that rsync’s default block size is too large, and so most of the commonalities are not found. So I tried setting the block size ridiculously low, to 8 bytes, and it found them to be 98% similar. Which is silly, because it has to retrieve more block hashes at that block size than it saves. The total “speedup” is reported as 0.72x.

But the sweet spot in the middle, with a block size of 192, gives 84% similarity for a speedup of 4.73x.

compression likes boring data too

Take a step back: this only applies to uncompressed files. Simply gzipping the log file before transmitting it gives us a speedup of 14.5x. Oops!

Well, rsync can compress the stuff it sends around too. Adding a -z flag with block size 192 gives a speedup of 16.2x. Hey, we beat basic gzip!

But compression needs decent chunks to work with, so the sweet spot may be different. I tried various block sizes, and managed a speedup of 24.3x with -B 960. An additional 1.7x speedup over simple compression is pretty decent!

To summarize our story so far, let’s say you want to copy over a log file named log123.txt. The proposal is:

  1. Have a vaguely recent benchmark log file, call it log_compare.txt, available on all senders and receivers. (Actually, it’d probably be a different one per build configuration, but whatever.)
  2. On the server, hard link log123.txt to log_compare.txt.
  3. From the client, rsync -z -B 960 log123.txt server:log123.txt

stop repeating what I say!

But it still feels like there ought to be something better. The benchmark log file is re-hashed every time you do this and the hashes are sent back over the wire, costing bandwidth. So let’s eliminate that part. Note that we’ll drop the -z from flag because we may as well compress the data during the transfer instead:

 ssh server 'ln log_compare.txt log123.txt'
 rsync -B 960 log123.txt log_compare.txt --only-write-batch=batch.dat
 ssh -C server 'rsync --read-batch=- argleblargle log132.txt' < batch.dat

Note that “argleblargle” is ignored, since the source file isn’t needed.

So what’s the speedup now? Let’s only consider the bytes transmitted over the network. Assuming the compression from ssh -C has the same effect as gzipping the file locally, I get a speedup of 28.9x, about 2x the speedup of simply compressing the log file in the first place.

But wait. The block size of 960 was based on the cost of retrieving all those hashes from the remote side. We’re not doing that anymore, so a smaller block size should again be more effective. Let’s see… -B 192 gets a total speedup of 139x, which is almost exactly one order of magnitude faster than plain gzipped log files. Now we’re talking!

loose ends

Two things still bug me. One is a minor detail — the above is writing out batch.dat, then reading it back in to send over to the server. This uselessly consumes disk bandwidth. It would be better if rsync could directly read/write compressed batch files to stdin/stdout. (It can read uncompressed batches from stdin, but not write to stdout. You could probably hack it somehow, perhaps with /proc/pidN/fd/…, but it’s not a big deal. And you can just use use /dev/shm/batch.dat for your temporary filename, and remove it right after. It’d still be better if it never had to exist uncompressed anywhere, but whatever.)

The other is that we’re still checksumming that benchmark file locally for every log file we transfer. It doesn’t change the number of bytes spewed over the network, but it slows down the overall procedure. I wonder if librsync would allow avoiding that somehow…? (I think rsync uses two checksums, a fast rolling checksum and a slower precise one, so you’d need to compute both for all offsets. And reading those in would probably cost more than recomputing from the original file. But I haven’t thought too hard about this part.)

not just emacs and debuggers

I sent this writeup to Jim Blandy, who in a typically insightful fashion noticed that (1) this requires some fiddly bookkeeping to ensure that you have a comparison file, and (2) revision control systems already handle all of this. If you have one version of a file checked in and then you check in a modified version of it, the VCS can compute a delta to save storage costs. Then when you transmit the new revision to a remote repository, the VCS will know if the remote already has the baseline revision so it can just send the delta.

Or in other words, you could accomplish all of this by simply checking your log files into a suitable VCS and pushing them to the server. That’s not to say that you’re guaranteed that your VCS will be able to fully optimize this case, just that it’s possible for it to do the “right” thing.

I attempted to try this out with git, but I don’t know enough about how git does things. I checked in my baseline log file, then updated it with the new log file’s contents, then ran git repack to make a pack file containing both. I was hoping to use the increase in size from the original object file to the pack file as an estimate of the incremental cost of the new log file, but the pack file was *smaller* than either original object file. If I make a pack with just the baseline, then I end up with two pack files, but the new one is still smaller.

clients could play too

As a final thought, this idea is not fundamentally restricted to the server. You could do the same thing inside eg tbpl: keep the baseline log(s) in localStorage or IndexedDB. When requesting a log, add a parameter ?I_have_baseline_36fe137a1192. Then, at the server’s discretion, it could compute a delta from that baseline and send it over as a series of “insert this literal data, then copy bytes 3871..17313 from your baseline, then…”. tbpl would reconstruct the resulting log file, the unicorns would do their lewd tap dance, and everyone would profit.


06
Feb 12

Disagree

I’ve read Paul Graham’s “How To Disagree” essay, and I have to say, I disagree. There are some good ideas in there, but it’s clearly the work of a pretentious has-been.

Continue reading →


21
Jan 12

bzexport –new: crash test dummies wanted

Scenario 1: you have a patch to some bug sitting in our mercurial queue. You want to attach it to a bug, but the bugzilla interface is painful and annoying. What do you do?

Use bzexport. It’s great! You can even request review at the same time.

What I really like about bzexport is that while writing and testing a patch, I’m in an editor and the command line. I may not even have a browser running, if I’m constantly re-starting it to test something out. Needing to go to the bugzilla web UI interrupts my flow. With bzexport, I can stay in the shell and move onto something else immediately.

Scenario 2: You have a patch, but haven’t filed a bug yet. Neither has anybody else. But your patch has a pretty good description of what the bug is. (This is common, especially for small things.) Do you really have to go through the obnoxious bug-filing procedure? It sure is tempting just to roll this fix up into some other vaguely related bug, isn’t it? Surely there’s a simple way to do things the right way without bouncing between interfaces?

Well, you’re screwed. Unless you’re willing to test something out for me. If not, please stop reading.
Continue reading →


05
Jan 12

patch queue dependencies

A little while back, I was again contemplating a tangled patch queue, considering how to rework it for landing. I thought it’d be nice to see at a very basic level which patches in the queue were going to be problematic, and which I could freely reorder at whim.

So I whipped together a silly little script to do that at a file level only. Example output:

% patchdeps
Note: This is based on filename collisions only, so may overreport conflicts
if patches touch different parts of the same file. (TODO)
                                                                          
A bug-663281-deque                   X   *       *     *   * *     *      
A bug-663281-deque-test              |   :       :     :   : *     :      
A bug-642054-func-setline          X |   *       :     :   : :     :      
A bug-642054-js_MapPCToLineNumber--' |   *       :     :   : :     :      
A bug-642054-rwreentrant             |   : X     :     :   : :     :      
A algorithm--------------------------'   X |     *     *   * *     *      
A system-libunwind                     X | |     :   * : * : *   * :      
A try-libunwind------------------------' | |     :   X : * : *   * :      
A backtrace------------------------------' | X * * * | * : * * * : * * * *
U shell-backtrace                          | | : * : | : : : : : : : : : :
U M-reentr---------------------------------' | : : : | : : : : : : : : : :
U M-backtrace--------------------------------' X : : | : : : : : : : * : :
U activities-----------------------------------' X : | : : : : * * : X * *
U profiler---------------------------------------' X | * : * * X * * | * *
U bug-675096-valgrind-jit--------------------------' | * : * : | : : | : :
U bug-599499-opagent-config--------------------------' X * : * | * : | : :
U bug-599499-opagent-----------------------------------' X X * | : * | : :
U bug-642320-gdb-jit-config------------------------------' | * | * : | : :
U bug-642320-gdb-jit---------------------------------------' X | : * | : :
U import-libunwind                                           | | : : | : :
U libunwind-config-------------------------------------------' | X X | : :
U warnings-fixes-----------------------------------------------' | | | : *
U bug-696965-cfi-autocheck---------------------------------------' | | X :
U mystery-librt-stuff----------------------------------------------' | | :
U bug-637393-eval-lifetime                                           | | :
U register-dwarf-----------------------------------------------------' | :
U bug-652535-JM__JIT_code_performance_counters-------------------------' X
U JSOP_RUNMODE-----------------------------------------------------------'

How to read it: patches that have no conflicts earlier in the stack are shown without a line next to them. They’re free spirits; you can “sink” them anywhere earlier in your queue without getting conflicts. (The script removes their lines to make the grid take up less horizontal space.)

Any other patch gets a horizontal line that then bends up to show the interference pattern with earlier patches. All in all, you have a complete interference matrix showing whether the set of files touched by any patch intersects the set of files for any other patch.

‘X’ marks the first conflict. After that, the marker turns to ‘*’ and the vertical lines get broken. (That’s just because it’s mostly the first one that matters when you’re munging your queue.)

So the patch named “backtrace” conflicts with the earlier “algorithm” patch, as well as the even earlier “bug-642054-js_MapPCToLineNumber” and others. The “M-reentr” patch only touches the same stuff as “bug-642054-rwreentrant” (not surprising, since “M-…” is my notation for a patch that needs to be folded into an earlier patch.) “system-libunwind” doesn’t conflict with anything earlier in the queue, and so can be freely reordered in the series file to anywhere earlier than where it is now — but note that several later patches touch the same stuff as it does. (It happens to be a patch to js/src/configure.in.)

Useful? Not very. But it was kinda fun to write and I find myself running it occasionally just to see what it shows, so I feel the entertainment value was worth the small investment of time. Though now I’m tempted to enhance it by checking for collisions in line ranges, not just in the files…

I suppose I could make a mercurial extension out of it, but that’d require porting it from Perl to Python, which is more trouble than it’s worth. (Yes, I still use Perl as my preferred language for whipping things together. Even though I dislike the syntax for nested data structures, I very much like the feature set, and it’s still the best language I’ve found for these sorts of things. So phbbbttt!)