I’ve been hacking on my copy of pynacl this week. pynacl is a set of Python bindings to the NaCl cryptography library (by djb and friends). The bindings, first written by Sean Lynch and then picked up by “k3d3”, are great. They even support py2.6 through py3.2.
But actually building them is a hassle, because the NaCl build process is so idiosyncratic. It consists of a 500-line undocumented shell script named “do”. Running it pegs the CPU at 100% for 25 minutes in stony silence (all progress messages are redirected to a logfile). If you can wait that long, and think to explore the directory afterwards, you’ll be rewarded with a build/HOSTNAME/ directory that contains a libnacl.a and a set of header files that are pretty easy to use. Behind the scenes, the script is exhaustively compiling and testing a large matrix of compiler flags (-O vs -O3 vs -O3 -funroll-loops), ABI variants, and alternative implementations. The goal is apparently to:
- select the fastest possible implementation and compiler options, using any assembly-language tricks specific to the processor (SSE3, etc)
- make sure the unit tests pass
- construct a performance report to send back to the authors
Unfortunately, this doesn’t play well with other build systems that might want to embed a copy, such as Python’s distutils, because:
- some compiler flags (-fPIC) are needed to build the .so files that python can load at runtime: distutils knows what these are, “do” doesn’t
- when building e.g. Debian packages, the results will be used on other machines, so processor-specific optimizations aren’t OK: the build machine might have a CPU feature that the machines installing the package lack, so you have to stick with the least common denominator.
- running “do” requires a Bourne shell interpreter, which is standard on Unix systems but not a given on Windows. distutils knows how to compile things on Windows, but you have to tell it the source files, and it will run the compiler itself.
- having a separate “./do” compile step means that “setup.py build” is not enough, which means that easy_install won’t work, making it hard to use pynacl as a dependency in virtualenv or pip environments.
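To make the contrast concrete, here is a minimal sketch of what distutils-style tooling wants instead: an explicit list of C files that it compiles itself, with its own flags. The directory name `nacl-minimal`, the module name `nacl._nacl`, and both helper functions are hypothetical names chosen for illustration, not pynacl’s actual layout, and the sketch uses setuptools’ `Extension` as a stand-in for the distutils class of the same shape.

```python
import os


def collect_c_sources(root):
    """Return every .c file under root, sorted so builds are reproducible."""
    sources = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".c"):
                sources.append(os.path.join(dirpath, name))
    return sorted(sources)


def make_extension(name="nacl._nacl", src_root="nacl-minimal"):
    """Describe the extension module; both default names are hypothetical."""
    # Imported lazily so collect_c_sources() works without setuptools.
    from setuptools import Extension

    return Extension(
        name,
        sources=collect_c_sources(src_root),
        include_dirs=[os.path.join(src_root, "include")],
    )


# A real setup.py would finish with something like:
#   from setuptools import setup
#   setup(name="pynacl", ext_modules=[make_extension()])
```

With a description like this, the build tool adds -fPIC (or the platform’s equivalent) on its own and never needs a shell, which is exactly what “do” cannot offer.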
In addition, waiting 25 minutes for an otherwise small and elegant library to build is just a drag.
So I’ve pursued two projects this week. The first is to simply embed a mostly-unmodified copy of the latest nacl release (I did add -fPIC) into the pynacl source tree, and modify pynacl’s setup.py to use it. The build process becomes “cd nacl-*; ./do; cd ..; python setup.py build”. This is way easier than downloading and building an external copy and then pointing environment variables at the result: it takes just as long, but is easier to run. This work is in my “embed-nacl” branch (don’t be surprised if the branch is gone by the time you read this; it may get merged to trunk). I also have a subset of “do”, named “dont”, which cuts out some of the unused pieces (like the C++ bindings and the large performance benchmarks), and runs in about 8 minutes.
The second is to embed a modified subset of the nacl sources in pynacl, making it look much more like a normal python extension module. This is in my “minimal” branch (again, this branch may get merged and be deleted).
This turned out to be pretty hard: the nacl “do” script synthesizes different versions of the same header file for each algorithm, and deletes many of them just after compilation. Also, the C code for e.g. crypto_hash_sha256 uses the same function name crypto_hash() as the code for crypto_hash_sha512, and depends upon #defines in the header file to prevent them from colliding. (The goal here is to let users call short function names and get the recommended algorithm, like crypto_box() instead of crypto_box_curve25519xsalsa20poly1305(), which is frequently a good idea, but not always.)
So I had to modify the .c files to have non-colliding names, and write new header files to provide the recommended-algorithm mapping. I only copied in the portable implementations (leaving out the non-portable asm speedups). I also left out the multiple-ABI support, the try-different-compiler-flags speedups, the automatic performance tests, and the unit tests (although I hope to bring those back). But the result is something that compiles in 13 seconds (roughly a 100x speedup over the 25-minute “do” run), and which can be built with easy_install.
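The two edits described above (renaming the colliding functions, and writing a header that maps short names to the recommended algorithm) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the script used on the branch: `rename_function`, `render_alias_header`, the `RECOMMENDED` table, and the include-guard name are all hypothetical.

```python
import re

# Illustrative mapping from short names to one specific primitive.
RECOMMENDED = {
    "crypto_box": "crypto_box_curve25519xsalsa20poly1305",
    "crypto_hash": "crypto_hash_sha512",
}


def rename_function(c_source, old, new):
    """Rename whole-word occurrences of `old` in a piece of C source.

    "_" counts as a word character, so renaming crypto_hash leaves an
    existing crypto_hash_sha256 untouched.
    """
    return re.sub(r"\b%s\b" % re.escape(old), new, c_source)


def render_alias_header(guard, aliases):
    """Build a C header that #defines each short name to its full name."""
    lines = ["#ifndef %s" % guard, "#define %s" % guard, ""]
    for short, full in sorted(aliases.items()):
        lines.append("#define %s %s" % (short, full))
    lines += ["", "#endif /* %s */" % guard, ""]
    return "\n".join(lines)


print(rename_function("int crypto_hash(unsigned char *out);",
                      "crypto_hash", "crypto_hash_sha256"))
print(render_alias_header("CRYPTO_ALIASES_H", RECOMMENDED))
```

Once every .c file exports a unique long name, the per-algorithm headers no longer need to be synthesized and deleted during the build; a single static alias header is enough.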
How does performance suffer without the specialized implementations and compiler flags? Here’s a comparison of the time it takes to call the various python functions (using 1000-byte messages, where applicable) in the two versions, on my 2010 MacBookPro (2.8GHz Core2Duo). The results are tolerable. The biggest victims are the Curve25519 scalar multiplication functions, since the whole algorithm was specifically designed to take advantage of 64-bit operations that are not available to the generic version (which needs to run on 32-bit machines). This slowdown also hits the high-level box()/box_open() functions, bringing them up to 5ms on my laptop. crypto_sign uses an early version of Ed25519, and nacl does not yet have an optimized version, so it takes 6ms in both branches. The next version of nacl will include the same carefully optimized version of Ed25519 as the SUPERCOP benchmark suite, in which signing takes about 19us, so at that point the portable code will show a noticeable slowdown for signatures too.
| function | minimal | embed-nacl | slowdown |
|---|---|---|---|
| crypto_auth | 8.50 us | 7.95 us | 1.1x |
| crypto_auth_verify | 8.49 us | 7.87 us | 1.1x |
| crypto_box | 4.64 ms | 198.61 us | 23.3x |
| crypto_box_keypair | 4.57 ms | 198.84 us | 23.0x |
| crypto_box_open | 4.65 ms | 199.33 us | 23.3x |
| crypto_hash_sha256 | 898.93 ns | 855.99 ns | 1.1x |
| crypto_hash_sha512 | 1.54 us | 1.25 us | 1.2x |
| crypto_onetimeauth | 49.06 us | 2.40 us | 20.4x |
| crypto_onetimeauth_verify | 67.07 us | 2.42 us | 27.7x |
| crypto_scalarmult | 4.57 ms | 191.88 us | 23.8x |
| crypto_scalarmult_base | 4.69 ms | 192.76 us | 24.3x |
| crypto_secretbox | 58.02 us | 5.09 us | 11.4x |
| crypto_secretbox_open | 58.59 us | 5.64 us | 10.4x |
| crypto_sign | 6.28 ms | 5.95 ms | 1.1x |
| crypto_sign_keypair_fromseed | 6.26 ms | 5.96 ms | 1.1x |
| crypto_sign_open | 16.58 ms | 15.67 ms | 1.1x |
| crypto_stream | 5.53 us | 2.66 us | 2.1x |
| crypto_stream_xor | 6.80 us | 2.27 us | 3.0x |
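Per-call timings of this kind can be collected with the standard `timeit` module. The sketch below is not the harness that produced the numbers above; it times `hashlib.sha256` on a 1000-byte message purely as a stand-in so that it runs without pynacl installed, and `time_per_call` and `format_duration` are hypothetical helper names.

```python
import hashlib
import timeit


def time_per_call(func, number=1000, repeat=3):
    """Return the best observed time, in seconds, for one call to func."""
    timer = timeit.Timer(func)
    return min(timer.repeat(repeat=repeat, number=number)) / number


def format_duration(seconds):
    """Render a duration with the largest unit that keeps the value >= 1."""
    for unit, scale in (("s", 1.0), ("ms", 1e-3), ("us", 1e-6)):
        if seconds >= scale:
            return "%.2f %s" % (seconds / scale, unit)
    return "%.2f ns" % (seconds / 1e-9)


# Stand-in workload: swap in the binding function you want to measure.
message = b"\x00" * 1000
elapsed = time_per_call(lambda: hashlib.sha256(message).digest())
print("sha256 (hashlib stand-in): " + format_duration(elapsed))
```

Taking the minimum over several repeats filters out most scheduling noise, which matters when the thing being measured takes under a microsecond.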
But I think the result is still pretty good: for many applications where you’re using Python, 5ms is just as good as 200us. And this branch makes it much easier to depend upon pynacl than either of the other approaches does.
My next steps are to clean up the branch a bit, talk to k3d3 about which approach seems the best, get it merged upstream, and then hopefully get it published on PyPI. After that I want to investigate how hard it’d be to build the alternative implementations: maybe with some distutils hacking, we could build all the variants (and see which ones actually work), do some quick performance tests, then call setup() again “for real” with just the fastest ones. That should recover most of the slowdown (all but the compiler-flag gains), but still give us something that compiles easily.
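The selection step in that plan (discard variants that don’t work, time the rest, keep the fastest) might look roughly like this. It is a sketch of the idea only: the variants here are toy Python functions standing in for compiled implementations, and `pick_fastest` is a hypothetical helper, not anything that exists in pynacl or distutils.

```python
import timeit


def pick_fastest(variants, test_input, expected, number=200):
    """Time every variant that passes a self-test; return (winner, timings)."""
    timings = {}
    for name, func in variants.items():
        try:
            ok = func(test_input) == expected
        except Exception:
            ok = False  # a variant that crashes counts as not working
        if not ok:
            continue
        timer = timeit.Timer(lambda: func(test_input))
        timings[name] = min(timer.repeat(repeat=3, number=number)) / number
    if not timings:
        raise RuntimeError("no variant passed its self-test")
    return min(timings, key=timings.get), timings


def checksum_loop(data):
    total = 0
    for byte in bytearray(data):
        total += byte
    return total


def checksum_builtin(data):
    return sum(bytearray(data))


def checksum_broken(data):
    raise NotImplementedError("stands in for a variant that fails to build")


# Toy stand-ins for alternative implementations of one primitive.
variants = {
    "loop": checksum_loop,
    "builtin": checksum_builtin,
    "broken": checksum_broken,
}
data = b"\x01" * 1000
winner, timings = pick_fastest(variants, data, expected=1000)
print("fastest working variant: " + winner)
```

In a real build the self-test would be the library’s own known-answer tests, and the winner’s source files would be the ones handed to the final setup() call.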