Improving the pynacl build process

I’ve been hacking on my copy of pynacl this week. pynacl is a set of Python bindings to the NaCl cryptography library (by djb and friends). The bindings, first written by Sean Lynch and then picked up by “k3d3”, are great. They even support py2.6 through py3.2.

But actually building them is a hassle, because the NaCl build process is so idiosyncratic. It consists of a 500-line undocumented shell script named “do”, and running it gets you 25 minutes of 100% CPU that executes in stony silence (all progress messages are redirected to a logfile). If you can wait that long, and think to explore the directory afterwards, you’ll be rewarded with a build/HOSTNAME/ directory that contains a libnacl.a and a set of header files that are pretty easy to use. What’s actually going on behind the scenes is that the script is exhaustively compiling and testing a large matrix of compiler flags (-O vs -O3 vs -O3 -funroll-loops), ABI variants, and alternative implementations. The goal is apparently to:

select the fastest possible implementation and compiler options, using any assembly-language tricks specific to the processor (SSE3, etc)
make sure the unit tests pass
construct a performance report to send back to the authors
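To get a feel for why this takes so long, here is a rough sketch of how such a build matrix multiplies out. The optimization flags are the ones mentioned above; the ABI flags and implementation names are placeholders chosen for illustration, not what “do” actually iterates over.

```python
import itertools

# Optimization levels named in the post.
OPT_FLAGS = ["-O", "-O3", "-O3 -funroll-loops"]
# Assumed ABI variants and implementation names (hypothetical placeholders).
ABIS = ["-m32", "-m64"]
IMPLEMENTATIONS = ["ref", "asm"]


def build_matrix(opt_flags, abis, implementations):
    """Return every (implementation, ABI, optimization) combination to try."""
    return [
        {"impl": impl, "cflags": f"{abi} {opt}"}
        for impl, abi, opt in itertools.product(implementations, abis, opt_flags)
    ]


matrix = build_matrix(OPT_FLAGS, ABIS, IMPLEMENTATIONS)
print(len(matrix), "compile-and-test runs per primitive")
```

Even this toy matrix needs a dozen compile-and-test runs for each primitive, and the real one has more entries on every axis.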

Unfortunately, this doesn’t play well with other build systems that might want to embed a copy, such as Python’s distutils, because:

some compiler flags (-fPIC) are needed to build the .so files that Python can load at runtime: distutils knows what these are, “do” doesn’t
when building e.g. Debian packages, the results will be used on other machines, so processor-specific optimizations aren’t ok (you might build your packages on a machine with some feature that’s not present on the machines that use those packages, so stick with the least common denominator).
running “do” requires a Bourne shell interpreter, standard on Unix systems but not so obvious on Windows. distutils knows how to compile things on Windows, but you have to tell it the source files, and it will run the compiler itself.
having a separate “./do” compile step means that “setup.py build” is not enough, which means that easy_install won’t work, making it hard to use pynacl as a dependency in virtualenv or pip environments.

In addition, waiting 25 minutes for an otherwise small and elegant library to build is just a drag.
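For contrast, what distutils wants from an embedded library is little more than a list of source files and include directories. The following is a minimal sketch of that shape, assuming a hypothetical module name (`nacl._nacl`) and a source directory of plain .c files; neither is taken from pynacl itself.

```python
import glob
import os


def extension_kwargs(src_root):
    """Collect the arguments a distutils/setuptools Extension would need."""
    # Every .c file under src_root is handed to the compiler by the build tool.
    sources = sorted(glob.glob(os.path.join(src_root, "**", "*.c"), recursive=True))
    return {
        "name": "nacl._nacl",       # hypothetical extension module name
        "sources": sources,
        "include_dirs": [src_root],  # where the headers live
    }


# In a real setup.py this would be used roughly as:
#   from setuptools import setup, Extension
#   setup(name="pynacl", ext_modules=[Extension(**extension_kwargs("src"))])
```

The point is that the build tool, not a shell script, decides which compiler and which flags (including -fPIC) to use.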

So I’ve pursued two projects this week. The first is to simply embed a mostly-unmodified (I did add -fPIC) copy of the latest nacl release into the pynacl source tree, and modify pynacl’s setup.py to use it. The build process becomes “cd nacl-*; ./do; cd ..; python setup.py build”. This is way easier than downloading+building an external copy and then pointing environment variables at the result. Takes just as long, but is easier to run. This work is in my “embed-nacl” branch (don’t be surprised if the branch is gone by the time you read this.. it may get merged to trunk). I also have a subset of “do”, named “dont”, which cuts out some of the unused pieces (like the C++ bindings and the large performance benchmarks), and runs in about 8 minutes.

The second is to embed a modified subset of the nacl sources in pynacl, making it look much more like a normal python extension module. This is in my “minimal” branch (again, this branch may get merged and be deleted).

This turned out to be pretty hard: the nacl “do” script synthesizes different versions of the same header file for each algorithm, and deletes many of them just after compilation. Also, the C code for e.g. crypto_hash_sha256 uses the same function name crypto_hash() as the code for crypto_hash_sha512, and depends upon #defines in the header file to prevent them from colliding. (The goal here is to let users call short function names and get the recommended algorithm, like crypto_box() instead of crypto_box_curve25519xsalsa20poly1305(), which is frequently a good idea, but not always.)

So I had to modify the .c files to have non-colliding names, and write new header files to provide the recommended-algorithm mapping. I only copied in the portable implementations (leaving out the non-portable asm speedups). I also left out the multiple-ABI support, the try-different-compiler-flags speedups, the automatic performance tests, and the unit tests (although I hope to bring those back). But the result is something that compiles in 13 seconds, a 100x speedup, and which can be built with easy_install.
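The renaming and the recommended-algorithm header can be sketched in a few lines. This is an illustration of the idea only, not the script used for the “minimal” branch: the helper names are made up, and the only mapping shown is the crypto_box one mentioned above.

```python
import re


def rename_generic(c_source, generic, primitive):
    """Rename e.g. crypto_hash -> crypto_hash_sha256, whole identifiers only."""
    specific = f"{generic}_{primitive}"
    # The lookarounds leave longer identifiers (crypto_hash_BYTES) untouched.
    pattern = re.compile(
        rf"(?<![A-Za-z0-9_]){re.escape(generic)}(?![A-Za-z0-9_])"
    )
    return pattern.sub(specific, c_source)


def recommended_header(mapping):
    """Emit #defines that map each short name to its chosen primitive."""
    lines = [
        f"#define {generic} {generic}_{primitive}"
        for generic, primitive in sorted(mapping.items())
    ]
    return "\n".join(lines) + "\n"


src = "int crypto_hash(unsigned char *out, const unsigned char *in);"
print(rename_generic(src, "crypto_hash", "sha256"))
print(recommended_header({"crypto_box": "curve25519xsalsa20poly1305"}))
```

With unique names in every .c file, all the primitives can be compiled into one extension module, and the short names become a thin layer of #defines on top.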

How does performance suffer without the specialized implementations and compiler flags? Here’s a comparison of the time it takes to call the various python functions (using 1000-byte messages, where applicable) in the two versions, on my 2010 MacBook Pro (2.8GHz Core2Duo): “ref” is the portable code in the “minimal” branch, “optimized” is the full nacl build. The results are tolerable. The biggest victims are the Curve25519 scalar multiplication functions, since the whole algorithm was specifically designed to take advantage of 64-bit operations not available to the generic version (which needs to run on 32-bit machines). This slowdown also hits the high-level box()/box_open() functions, bringing them up to 5ms on my laptop. crypto_sign uses an early version of Ed25519, and nacl does not yet have an optimized version, so it takes 6ms in both branches. The next version of nacl will include the same carefully optimized version of Ed25519 as the SUPERCOP benchmark suite, in which signing takes about 19us, so signing will then show a noticeable slowdown too.

name	ref	optimized	slowdown
crypto_auth	8.50 us	7.95 us	1.1x
crypto_auth_verify	8.49 us	7.87 us	1.1x
crypto_box	4.64 ms	198.61 us	23.3x
crypto_box_keypair	4.57 ms	198.84 us	23.0x
crypto_box_open	4.65 ms	199.33 us	23.3x
crypto_hash_sha256	898.93 ns	855.99 ns	1.1x
crypto_hash_sha512	1.54 us	1.25 us	1.2x
crypto_onetimeauth	49.06 us	2.40 us	20.4x
crypto_onetimeauth_verify	67.07 us	2.42 us	27.7x
crypto_scalarmult	4.57 ms	191.88 us	23.8x
crypto_scalarmult_base	4.69 ms	192.76 us	24.3x
crypto_secretbox	58.02 us	5.09 us	11.4x
crypto_secretbox_open	58.59 us	5.64 us	10.4x
crypto_sign	6.28 ms	5.95 ms	1.1x
crypto_sign_keypair_fromseed	6.26 ms	5.96 ms	1.1x
crypto_sign_open	16.58 ms	15.67 ms	1.1x
crypto_stream	5.53 us	2.66 us	2.1x
crypto_stream_xor	6.80 us	2.27 us	3.0x
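The slowdown column is simply the ratio of the two timings after converting them to a common unit. A small sketch of that arithmetic (the helper names are illustrative):

```python
# Conversion factors for the unit suffixes used in the table.
UNITS = {"ns": 1e-9, "us": 1e-6, "ms": 1e-3, "s": 1.0}


def to_seconds(text):
    """Convert a string such as '4.57 ms' to seconds."""
    value, unit = text.split()
    return float(value) * UNITS[unit]


def slowdown(ref, optimized):
    """Ratio of reference time to optimized time, rounded to one decimal."""
    return round(to_seconds(ref) / to_seconds(optimized), 1)


print(slowdown("49.06 us", "2.40 us"))    # crypto_onetimeauth row
print(slowdown("4.57 ms", "191.88 us"))   # crypto_scalarmult row
```

Recomputing from the rounded figures shown in the table can differ from the printed ratio in the last digit for some rows, presumably because the originals were computed from unrounded timings.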

But I think the result is still pretty good: for many applications where you’re using Python, 5ms is just as good as 200us. And this branch makes it much easier to depend upon pynacl than the others.

My next steps are to clean up the branch a bit, talk to k3d3 about which approach seems the best, get it merged upstream, and then hopefully get it copied to PyPI. And then investigate how hard it’d be to build the alternative implementations: maybe with some distutils hacking, we could build all the variants (and see which ones actually work), do some quick performance tests, then call setup() again “for real” with just the fastest ones. That should shave off most of the slowdown (leaving the compiler-flag gains), but still give us something that compiles easily.
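The “build everything, keep what works, pick the fastest” idea might look something like the sketch below. It is only a model of the selection step: the variants here are ordinary Python callables standing in for compiled implementations, and the names `pick_fastest` and `self_test` are made up for the example.

```python
import hashlib
import timeit


def pick_fastest(variants, self_test, message, repeat=3, number=100):
    """Return the name of the fastest variant that passes self_test, or None."""
    best_name, best_time = None, float("inf")
    for name, func in variants.items():
        try:
            if not self_test(func):
                continue  # built, but gives wrong answers
        except Exception:
            continue  # crashed, treat like a failed build
        elapsed = min(timeit.repeat(lambda: func(message), repeat=repeat, number=number))
        if elapsed < best_time:
            best_name, best_time = name, elapsed
    return best_name


def self_test(func):
    """Known-answer check against hashlib (stand-in for real unit tests)."""
    return func(b"abc") == hashlib.sha256(b"abc").digest()


def slow_sha256(message):
    """Correct but deliberately wasteful stand-in for an unoptimized variant."""
    for _ in range(200):
        hashlib.sha256(message).digest()
    return hashlib.sha256(message).digest()


VARIANTS = {
    "fast": lambda m: hashlib.sha256(m).digest(),
    "slow": slow_sha256,
    "broken": lambda m: b"\x00" * 32,  # wrong output
    "crashes": lambda m: 1 / 0,         # raises at call time
}
MESSAGE = b"x" * 1000  # 1000-byte message, as in the timings above
chosen = pick_fastest(VARIANTS, self_test, MESSAGE)
print("selected variant:", chosen)
```

In a real setup.py the candidates would be separately compiled extension modules, and only the winners would be passed to the final setup() call.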

This entry was posted on Thursday, January 19th, 2012 at 5:21 pm and is filed under Cryptography.

7 comments

Andrew Sutherland

January 19, 2012 at 7:43 pm

I fear this is too late to be any use, but for deuxdrop, I created a waf wscript that builds nacl. Its decisions of what algorithms to use are based on poor python heuristics, but it tries to otherwise mimic nacl’s own build infrastructure:

https://github.com/asutherland/nacl/blob/master/wscript

- warner
  
  January 23, 2012 at 12:46 pm
  
  Ah, yeah, wish I’d seen that a day earlier :). At least it confirms what I deduced from the ‘do’ script. I think the distutils-based scheme in my “minimal” branch is still a good way to go.. I’ll pursue pushing that upstream.
  
warner

January 23, 2012 at 12:52 pm

So, to figure out how to build alternate versions of the ed25519 signature code (like the amd64-51-30k version), I first needed to learn what compiler flags would work, so I started the SUPERCOP benchmark suite (which uses the same ‘do’ script), thinking I would just look at its output file and see which flags succeeded. It’s gotta be “-march=amd64” or something, but I wasn’t able to figure it out with a few minutes of experimentation, so I thought I’d just let the machine tell me.

That was last week. It’s still running, and it’s only up to the hash functions. SUPERCOP/okcompilers/c contains *1600* variations, and it tries them all. Sheesh.

- warner
  
  February 11, 2012 at 1:00 pm
  
  An update: the SUPERCOP suite finally finished, after 15 days! On my Athlon64 box, it got passing results with -m32 (but not -m64), and an -march= in the set [athlon, barcelona, core2, i386, i486, k6, k6-2, k6-3, k8, native, nocona, pentium, pentium(-m,-mmx,2,3,4,pro), prescott]. It also accepted -msse4, -msse41, and -mtune=native. And it liked -fno-schedule-insns, -fomit-frame-pointer, and -funroll-loops.
  
  When I tried to build python-ed25519 with the AMD-specific code (https://github.com/warner/python-ed25519/tree/amd64), the files compiled well, but the final link failed with a relocation complaint:
  
  src/ed25519-supercop-amd64-64-24k/fe25519_mul.o: relocation R_X86_64_32S against `crypto_sign_ed25519_amd64_64_24k_batch_38' can not be used when making a shared object; recompile with -fPIC
  
  which is a pity, because that file (and all the others) *were* compiled with
  -fPIC. Back when I used to do MIPS assembler, I remember this sort of warning
  happening when you had a large function and compiled with -fpic (lowercase),
  and the relative jumps got too large to fit in the limited opcode space that
  -fpic created. Recompiling with -fPIC (uppercase) used larger instructions
  that could accommodate a larger relative branch. So I wonder if this is saying
  that this code is too big for even -fPIC to handle, or if it’s telling me
  that the compiler somehow ignored my -fPIC request completely.
  
warner

January 27, 2012 at 11:00 am

Also an interesting datapoint: Matthew Dempsky’s pure-python Curve25519 code, from pages 6 and 7 of http://cr.yp.to/papers.html#naclcrypto , runs in about 22ms on this same laptop. So for human-initiated actions (like sending an email, as opposed to sending a packet), the easier-to-deploy pure-python version might be fast enough. (You still need the actual encryption part, of course.. this is just key-agreement.)

Sean Lynch

February 4, 2012 at 12:04 am

Wow, I’ve been blissfully ignorant that anyone was using pynacl. I somehow completely missed an email notification about a pull request from k3d3 back in December and didn’t realize until today when I looked through my github notifications that it was getting any attention at all.

I’m really glad to see that this is getting some love! I wrote much of it on an airplane on the way to New York, then did most of the rest during downtime on my vacation there.

Sean Lynch

February 4, 2012 at 12:06 am

I wonder if it only worked for me (given the PIC problems) because I am on a Mac.


Brian Warner Just another Blog.mozilla.com site
