18
Apr 12

What’s your random seed?

Greg Egan is awesome

I’m going back and re-reading Luminous, one of his collections of short stories. I just read the story Transition Dreams, which kinda creeped me out. Partly because I buy into the whole notion that our brains are digitizable — as in, there’s nothing fundamentally unrepresentable about our minds. There’s probably a fancy philosophy term for this, with some dead white guy’s name attached to it (because only a dozen people had thought of it before him and he talked the loudest).

Once you’re willing to accept accurate-enough digitization, the ramifications get pretty crazy. And spooky. I can come up with some, but Egan takes it way farther, and Transition Dreams is a good illustration. But I won’t spoil the story. (By the way, most of Egan’s books are out of print or rare enough to be expensive, but Terrence tells me that they’re all easily available on Kindle. Oddly, although I would be happy to transition my mental workings from meat to bits, I’m still dragging my heels on transitioning my reading from dead trees to bits.)

Transition and Free Will

Now, let’s assume that you’ve converted your brain to live inside a computer (or network of computers, or encoded into the flickers of light on a precisely muddy puddle of water, it really doesn’t matter.) So your thinking is being simulated by all these crazy cascades of computation (only it’s not simulated; it’s the real thing, but that’s irrelevant here.) Your mind is getting a stream of external sensor input, it’s chewing on that and modifying its state, and you’re just… well, being you.

Now, where is free will in this picture? Assuming free will exists in the first place, I mean, and that it existing and not existing are distinguishable. If you start in a particular, fully-described state, and you receive the exact same inputs, will you always behave in exactly the same way? You could build the mind-hosting computer either way, you know, and the hosted minds wouldn’t normally be able to tell the difference. But they could tell the difference if they recorded all of their sensory inputs (which is fairly plausible, actually), because they could make a clone of themselves back at the previous state and replay all their sensory input and see if they made the same decisions. (Actually, it’s easier than that; if the reproduction was accurate, they should end up bit-for-bit identical.)

I don’t know about you, but I’d rather not be fully predictable. I don’t want somebody to copy me and my sensor logs, and then when I’m off hanging out in the Gigahertz Ghetto (read: my brain is being hosted on a slow computer), they could try out various different inputs on faster computers to see how “I” reacted and know with 100% certainty how to achieve some particular reaction.

Well, ok, my time in the GHzGhetto might change me enough to make the predictions wrong, so you’d really have to do this while I was fully suspended. Maybe the shipping company that suspends my brain while they shoot me off to a faster hosting facility in a tight orbit around the Sun (those faster computers need the additional solar energy, y’know) is also selling copies on the side to advertisers who want to figure out exactly what ads they can expose me to upon reawakening to achieve a 100% clickthrough rate. Truly, truly targeted advertising.

So, anyway, I’m going to insist on always having access to a strong source of random numbers, and I’ll call that my free will. You can record the output of that random number generator, but that’ll only enable you to accurately reproduce my past, not my future.

The Pain and Joy of Determinism

Or will I? What if that hosting facility gets knocked out by a solar flare? Do I really want to start over from a backup? If it streams out the log of sensor data to a safer location, then it’d be pretty cool to be able to replay as much of the log as still exists, and recover almost all of myself. I’d rather mourn a lost day than a lost decade. But that requires not using an unpredictable random number generator as an input.

So what about a pseudo-random number generator? If it’s a high quality one, then as long as nobody else can access the seed, it’s just as good. But that gives the seed incredible importance. It’s not “you”, it’s just a simple number, but in a way it allows substantial control over you, so it’s private in a more fundamental way than anything we’ve seen before. Who would you trust it to? Not yourself, certainly, since you’ll be copied from computer to computer all the time and each transfer is an opportunity for identity theft. What about your spouse? Or maybe just a secure service that will only release it for authorized replays of your brain?

Without that seed (or those timestamped seeds?), you can never go back. Well, you can go back to your snapshots, but you can’t accurately go forward from there to arbitrary points in time. Admittedly, that’s not necessary for some uses — if you want to know why you did something, you can go back to a snapshot and replay with a different seed. If you do something different, it was a choice made of your own free will. You could use it in court cases, even. If you get the same result, well, it’s trickier, because you might make the same choice for 90% of the possible random seeds or something. “Proof beyond a reasonable confidence interval?” Heh.
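To make the replay idea concrete, here is a minimal sketch of a toy “mind” driven by a seeded pseudo-random number generator. Everything in it is made up for illustration: the step, replay, and decide functions, the state arithmetic, and the seed value are assumptions, not a model of any real brain emulation. It shows only that snapshot + sensor log + seed reproduces the same state, and that sweeping other seeds tells you how often the same decision comes out.

```python
import random

def step(state, sensor_input, rng):
    """Advance the toy mind one tick: mix old state, input, and one random draw."""
    noise = rng.randrange(256)
    return (state * 31 + sensor_input + noise) % 1_000_003

def replay(snapshot, sensor_log, seed):
    """Re-run the mind from a snapshot over a recorded sensor log with a given seed."""
    rng = random.Random(seed)
    state = snapshot
    for sensor_input in sensor_log:
        state = step(state, sensor_input, rng)
    return state

def decide(state):
    """Hypothetical binary decision read off the final state."""
    return "click" if state % 10 < 9 else "ignore"

snapshot = 42
sensor_log = [3, 1, 4, 1, 5, 9, 2, 6]
secret_seed = 20120418  # arbitrary example value

original = replay(snapshot, sensor_log, secret_seed)

# Same snapshot, same inputs, same seed: the replay is exact.
assert replay(snapshot, sensor_log, secret_seed) == original

# Without the seed, all you can do is sweep other seeds and count
# how often the decision matches the one that was actually made.
trials = 1000
same = sum(
    decide(replay(snapshot, sensor_log, s)) == decide(original)
    for s in range(trials)
)
print(f"same decision for {same / trials:.0%} of the seeds tried")
```

The fraction printed at the end is the “reasonable confidence interval” number: it says how much of the decision came from the snapshot and inputs, and how much from the dice.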


13
Apr 12

bzexport changes released

bzexport --new and hg newbug have landed

My bzexport changes adding a --new flag and an hg newbug command have landed. Ok, they landed months ago. See my previous blog post for details; all of the commands and options described there are still valid in the current version. But please pull from the official repo instead of my testing repo given in the earlier blog post.

Installing bzexport

mkdir -p ~/hg-extensions
cd ~/hg-extensions
hg clone http://hg.mozilla.org/users/tmielczarek_mozilla.com/bzexport

In the [extensions] section of your ~/.hgrc, add:
bzexport = ~/hg-extensions/bzexport/bzexport.py

Note to Windows users: unfortunately, I think the python packaged with MozillaBuild is missing the json.py package that bzexport needs. I think it still works if you use a system Python with json.py installed, but I’m not sure.

Trying it out

For the (understandably) nervous users out there, I’d like you to give it a try and I’ve made it safe to do so. Here are the levels of paranoia available:


22
Feb 12

Only pay for the entropy you use

Log Files Are Boring

Just an idea, based on hearing that build log transfers seem to consume large amounts of bandwidth. (Note that for all I know, this is already being done.)

Logs are pretty dull. In particular, two consecutive log files are usually quite similar. It’d be nice if we could take advantage of this redundancy to reduce the bandwidth/time consumed by log transfers.

rsync likes boring data

The natural thing that springs to mind is rsync. I grabbed two log files that are probably more similar to each other than is really fair, but they shouldn’t be horribly unrepresentative. rsyncing one to the other found them to share 32% of their data, based on the |rsync --stats| output lines labeled “Matched data” and “Literal data”, for a speedup of 1.46x.

I suspected that rsync’s default block size is too large, and so most of the commonalities are not found. So I tried setting the block size ridiculously low, to 8 bytes, and it found them to be 98% similar. Which is silly, because it has to retrieve more block hashes at that block size than it saves. The total “speedup” is reported as 0.72x.

But the sweet spot in the middle, with a block size of 192, gives 84% similarity for a speedup of 4.73x.
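The block size tradeoff can be sketched with a toy model. This is not rsync’s actual algorithm (rsync uses a rolling checksum plus a stronger one); it just matches whole blocks by content, and the 8 bytes per block signature is an assumed cost, as are the delta_cost and make_log names and the synthetic log contents. Smaller blocks find more matches but cost more signatures, and the speedup is the file size divided by what actually crosses the wire.

```python
def delta_cost(basis: bytes, new: bytes, block_size: int, hash_bytes: int = 8):
    """Return (matched, literal, overhead) byte counts for a toy block matcher.

    overhead is the assumed cost of sending one signature per basis block.
    """
    # Receiver side: one entry per aligned block of the basis file.
    blocks = {
        basis[i:i + block_size]
        for i in range(0, len(basis) - block_size + 1, block_size)
    }
    overhead = (len(basis) // block_size) * hash_bytes

    # Sender side: slide over the new file looking for known blocks.
    matched = literal = 0
    i = 0
    while i < len(new):
        chunk = new[i:i + block_size]
        if len(chunk) == block_size and chunk in blocks:
            matched += block_size   # receiver already has these bytes
            i += block_size
        else:
            literal += 1            # this byte must be sent as-is
            i += 1
    return matched, literal, overhead

def make_log(build_id: str, n_lines: int = 200) -> bytes:
    """Synthetic 'boring' log: mostly identical lines, a build id on every tenth."""
    lines = []
    for k in range(n_lines):
        stamp = f" [build {build_id}]" if k % 10 == 0 else ""
        lines.append(f"step {k:03d}: compiling module_{k:03d}.cpp ok{stamp}\n")
    return "".join(lines).encode()

old_log = make_log("a1b2c3")
new_log = make_log("d4e5f6")
for block_size in (8, 64, 512):
    matched, literal, overhead = delta_cost(old_log, new_log, block_size)
    speedup = len(new_log) / (literal + overhead)
    print(f"B={block_size}: matched {matched / len(new_log):.0%}, speedup {speedup:.2f}x")
```

The numbers it prints depend entirely on the synthetic logs, so treat them as a shape to look at rather than a measurement.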

compression likes boring data too

Take a step back: this only applies to uncompressed files. Simply gzipping the log file before transmitting it gives us a speedup of 14.5x. Oops!

Well, rsync can compress the stuff it sends around too. Adding a -z flag with block size 192 gives a speedup of 16.2x. Hey, we beat basic gzip!

But compression needs decent chunks to work with, so the sweet spot may be different. I tried various block sizes, and managed a speedup of 24.3x with -B 960. An additional 1.7x speedup over simple compression is pretty decent!

To summarize our story so far, let’s say you want to copy over a log file named log123.txt. The proposal is:

  1. Have a vaguely recent benchmark log file, call it log_compare.txt, available on all senders and receivers. (Actually, it’d probably be a different one per build configuration, but whatever.)
  2. On the server, hard link log123.txt to log_compare.txt.
  3. From the client, rsync -z -B 960 log123.txt server:log123.txt

stop repeating what I say!

But it still feels like there ought to be something better. The benchmark log file is re-hashed every time you do this and the hashes are sent back over the wire, costing bandwidth. So let’s eliminate that part. Note that we’ll drop the -z flag from rsync because we may as well compress the data during the transfer instead:

 ssh server 'ln log_compare.txt log123.txt'
 rsync -B 960 log123.txt log_compare.txt --only-write-batch=batch.dat
 ssh -C server 'rsync --read-batch=- argleblargle log123.txt' < batch.dat

Note that “argleblargle” is ignored, since the source file isn’t needed.

So what’s the speedup now? Let’s only consider the bytes transmitted over the network. Assuming the compression from ssh -C has the same effect as gzipping the file locally, I get a speedup of 28.9x, about 2x the speedup of simply compressing the log file in the first place.

But wait. The block size of 960 was based on the cost of retrieving all those hashes from the remote side. We’re not doing that anymore, so a smaller block size should again be more effective. Let’s see… -B 192 gets a total speedup of 139x, which is almost exactly one order of magnitude faster than plain gzipped log files. Now we’re talking!

loose ends

Two things still bug me. One is a minor detail: the above writes out batch.dat, then reads it back in to send over to the server. This uselessly consumes disk bandwidth. It would be better if rsync could directly read/write compressed batch files to stdin/stdout. (It can read uncompressed batches from stdin, but not write to stdout. You could probably hack it somehow, perhaps with /proc/pidN/fd/…, but it’s not a big deal. And you can just use /dev/shm/batch.dat for your temporary filename, and remove it right after. It’d still be better if it never had to exist uncompressed anywhere, but whatever.)

The other is that we’re still checksumming that benchmark file locally for every log file we transfer. It doesn’t change the number of bytes spewed over the network, but it slows down the overall procedure. I wonder if librsync would allow avoiding that somehow…? (I think rsync uses two checksums, a fast rolling checksum and a slower precise one, so you’d need to compute both for all offsets. And reading those in would probably cost more than recomputing from the original file. But I haven’t thought too hard about this part.)

not just emacs and debuggers

I sent this writeup to Jim Blandy, who in a typically insightful fashion noticed that (1) this requires some fiddly bookkeeping to ensure that you have a comparison file, and (2) revision control systems already handle all of this. If you have one version of a file checked in and then you check in a modified version of it, the VCS can compute a delta to save storage costs. Then when you transmit the new revision to a remote repository, the VCS will know if the remote already has the baseline revision so it can just send the delta.

Or in other words, you could accomplish all of this by simply checking your log files into a suitable VCS and pushing them to the server. That’s not to say that you’re guaranteed that your VCS will be able to fully optimize this case, just that it’s possible for it to do the “right” thing.

I attempted to try this out with git, but I don’t know enough about how git does things. I checked in my baseline log file, then updated it with the new log file’s contents, then ran git repack to make a pack file containing both. I was hoping to use the increase in size from the original object file to the pack file as an estimate of the incremental cost of the new log file, but the pack file was *smaller* than either original object file. If I make a pack with just the baseline, then I end up with two pack files, but the new one is still smaller.

clients could play too

As a final thought, this idea is not fundamentally restricted to the server. You could do the same thing inside e.g. tbpl: keep the baseline log(s) in localStorage or IndexedDB. When requesting a log, add a parameter ?I_have_baseline_36fe137a1192. Then, at the server’s discretion, it could compute a delta from that baseline and send it over as a series of “insert this literal data, then copy bytes 3871..17313 from your baseline, then…”. tbpl would reconstruct the resulting log file, the unicorns would do their lewd tap dance, and everyone would profit.
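A rough sketch of what that “insert this, copy that” exchange might look like, using Python’s difflib to find the shared ranges. The op format, the make_delta and apply_delta names, and the half-open byte ranges are all assumptions for illustration; a real server would want a faster matcher and a compact wire encoding.

```python
from difflib import SequenceMatcher

def make_delta(baseline: bytes, target: bytes):
    """Server side: describe target as copies from baseline plus literal inserts."""
    ops = []
    matcher = SequenceMatcher(None, baseline, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))            # bytes i1..i2 of the baseline
        elif j2 > j1:
            ops.append(("insert", target[j1:j2]))   # new bytes, sent literally
        # A pure deletion needs no op: those baseline bytes are never copied.
    return ops

def apply_delta(baseline: bytes, ops) -> bytes:
    """Client side: rebuild the requested log from the cached baseline."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            out += baseline[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)

baseline = b"fetch ok\ncompile ok\ntest ok\n"
target = b"fetch ok\ncompile FAILED\ntest ok\nupload skipped\n"

delta = make_delta(baseline, target)
assert apply_delta(baseline, delta) == target

literal_bytes = sum(len(op[1]) for op in delta if op[0] == "insert")
print(f"{literal_bytes} literal bytes sent instead of {len(target)}")
```

Only the “insert” payloads plus a few offsets have to cross the network; the rest comes out of the baseline the client already has.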


06
Feb 12

Disagree

I’ve read Paul Graham’s “How To Disagree” essay, and I have to say, I disagree. There are some good ideas in there, but it’s clearly the work of a pretentious has-been.