Feb 12

Only pay for the entropy you use

Log Files Are Boring

Just an idea, based on hearing that build log transfers seem to consume large amounts of bandwidth. (Note that for all I know, this is already being done.)

Logs are pretty dull. In particular, two consecutive log files are usually quite similar. It’d be nice if we could take advantage of this redundancy to reduce the bandwidth/time consumed by log transfers.

rsync likes boring data

The natural thing that springs to mind is rsync. I grabbed two log files that are probably more similar to each than is really fair, but they shouldn’t be horribly unrepresentative. rsyncing one to the other found them to share 32% of their data, based on the |rsync –stat| output lines labeled “Matched data” and “Literal data”, for a speedup of 1.46x.

I suspected that rsync’s default block size is too large, and so most of the commonalities are not found. So I tried setting the block size ridiculously low, to 8 bytes, and it found them to be 98% similar. Which is silly, because it has to retrieve more block hashes at that block size than it saves. The total “speedup” is reported as 0.72x.

But the sweet spot in the middle, with a block size of 192, gives 84% similarity for a speedup of 4.73x.

compression likes boring data too

Take a step back: this only applies to uncompressed files. Simply gzipping the log file before transmitting it gives us a speedup of 14.5x. Oops!

Well, rsync can compress the stuff it sends around too. Adding a -z flag with block size 192 gives a speedup of 16.2x. Hey, we beat basic gzip!

But compression needs decent chunks to work with, so the sweet spot may be different. I tried various block sizes, and managed a speedup of 24.3x with -B 960. An additional 1.7x speedup over simple compression is pretty decent!

To summarize our story so far, let’s say you want to copy over a log file named log123.txt. The proposal is:

  1. Have a vaguely recent benchmark log file, call it log_compare.txt, available on all senders and receivers. (Actually, it’d probably be a different one per build configuration, but whatever.)
  2. On the server, hard link log123.txt to log_compare.txt.
  3. From the client, rsync -z -B 960 log123.txt server:log123.txt

stop repeating what I say!

But it still feels like there ought to be something better. The benchmark log file is re-hashed every time you do this and the hashes are sent back over the wire, costing bandwidth. So let’s eliminate that part. Note that we’ll drop the -z from flag because we may as well compress the data during the transfer instead:

 ssh server 'ln log_compare.txt log123.txt'
 rsync -B 960 log123.txt log_compare.txt --only-write-batch=batch.dat
 ssh -C server 'rsync --read-batch=- argleblargle log132.txt' < batch.dat

Note that “argleblargle” is ignored, since the source file isn’t needed.

So what’s the speedup now? Let’s only consider the bytes transmitted over the network. Assuming the compression from ssh -C has the same effect as gzipping the file locally, I get a speedup of 28.9x, about 2x the speedup of simply compressing the log file in the first place.

But wait. The block size of 960 was based on the cost of retrieving all those hashes from the remote side. We’re not doing that anymore, so a smaller block size should again be more effective. Let’s see… -B 192 gets a total speedup of 139x, which is almost exactly one order of magnitude faster than plain gzipped log files. Now we’re talking!

loose ends

Two things still bug me. One is a minor detail — the above is writing out batch.dat, then reading it back in to send over to the server. This uselessly consumes disk bandwidth. It would be better if rsync could directly read/write compressed batch files to stdin/stdout. (It can read uncompressed batches from stdin, but not write to stdout. You could probably hack it somehow, perhaps with /proc/pidN/fd/…, but it’s not a big deal. And you can just use use /dev/shm/batch.dat for your temporary filename, and remove it right after. It’d still be better if it never had to exist uncompressed anywhere, but whatever.)

The other is that we’re still checksumming that benchmark file locally for every log file we transfer. It doesn’t change the number of bytes spewed over the network, but it slows down the overall procedure. I wonder if librsync would allow avoiding that somehow…? (I think rsync uses two checksums, a fast rolling checksum and a slower precise one, so you’d need to compute both for all offsets. And reading those in would probably cost more than recomputing from the original file. But I haven’t thought too hard about this part.)

not just emacs and debuggers

I sent this writeup to Jim Blandy, who in a typically insightful fashion noticed that (1) this requires some fiddly bookkeeping to ensure that you have a comparison file, and (2) revision control systems already handle all of this. If you have one version of a file checked in and then you check in a modified version of it, the VCS can compute a delta to save storage costs. Then when you transmit the new revision to a remote repository, the VCS will know if the remote already has the baseline revision so it can just send the delta.

Or in other words, you could accomplish all of this by simply checking your log files into a suitable VCS and pushing them to the server. That’s not to say that you’re guaranteed that your VCS will be able to fully optimize this case, just that it’s possible for it to do the “right” thing.

I attempted to try this out with git, but I don’t know enough about how git does things. I checked in my baseline log file, then updated it with the new log file’s contents, then ran git repack to make a pack file containing both. I was hoping to use the increase in size from the original object file to the pack file as an estimate of the incremental cost of the new log file, but the pack file was *smaller* than either original object file. If I make a pack with just the baseline, then I end up with two pack files, but the new one is still smaller.

clients could play too

As a final thought, this idea is not fundamentally restricted to the server. You could do the same thing inside eg tbpl: keep the baseline log(s) in localStorage or IndexedDB. When requesting a log, add a parameter ?I_have_baseline_36fe137a1192. Then, at the server’s discretion, it could compute a delta from that baseline and send it over as a series of “insert this literal data, then copy bytes 3871..17313 from your baseline, then…”. tbpl would reconstruct the resulting log file, the unicorns would do their lewd tap dance, and everyone would profit.

Feb 12


I’ve read Paul Graham’s “How To Disagree” essay, and I have to say, I disagree. There are some good ideas in there, but it’s clearly the work of a pretentious has-been.

Continue reading →