Hm, haven’t posted in a while. Having your tonsils out will do that!
Mike Hommey recently blogged on some binary layout issues he’d run into with Thumb-2 binaries. For the last week or so, I’ve been working on one of those problems: Thumb-2/ARM trampolines.
memcpy calls, how do they work? After all, code in your executable or shared library can’t directly call memcpy, since the address of memcpy is unknown until runtime. You can handle this in various ways; the solution adopted for ELF systems is something called a procedure linkage table, or PLT. (The mechanisms on Windows and Mac are similar, but the details are probably somewhat different; I’m not familiar with them.)
The PLT is simply an extra layer of indirection. Each entry in the PLT has a corresponding entry in the global offset table (GOT) associated with it; the GOT entry is the address of the actual function the PLT entry should call. At load time, the GOT entries are set to the address of the entry point of the dynamic linker’s lookup function.
When you call memcpy, then, you’re actually calling the PLT entry associated with memcpy. The first time you make such a call, the dynamic linker is invoked to figure out where memcpy actually lives. Once it has done so, it updates the associated GOT entry with that address and calls the “real” memcpy. Subsequent calls to memcpy then find the actual address in the GOT and jump directly there, bypassing the dynamic linker.
(glibc on Linux plays elaborate games so that calls from libc to libc functions don’t go through the PLT; the details of doing so are best left for a separate blog post.)
OK, so onto the problem description. PLT entries on ARM Linux are ARM mode code. Of course, it’s perfectly valid for Thumb code to want to use these PLT entries; it’d be silly to have separate PLT entries for ARM calls and Thumb calls. For non-tail calls, Thumb code can just use blx, which switches to ARM mode and jumps to a given address. But for tail calls, there’s no jump-to-offset-and-switch-mode instruction (bx jumps to an address in a register, not an immediate offset). So some cleverness is required.
In the gold linker, the solution taken is general: whenever we have such a Thumb-to-ARM branch (whether to a PLT entry or otherwise), generate a small stub of code to do the mode switch and branch, then branch to that. Such stubs are generated for various other similar cases, so the linker was already doing such things anyway.
There’s two issues with this approach. The first is that the stubs take up extra space, then second is that they’re placed wherever the linker thinks is convenient, which might not be the best place for improving the layout of the compiled code. (The linker does put some effort into reusing stubs, so if you have several Thumb-to-ARM calls/jumps to the same address, the linker will reuse a stub, rather than duplicating it willy-nilly.) The second is the real dealbreaker here.
The GNU linker’s approach is more clever: it embeds the necessary stub into the PLT entry itself. Whereas the entry previously looked like:
22c: e28fc610 add ip, pc, #16777216 ; 0x1000000 230: e28cca08 add ip, ip, #32768 ; 0x8000 234: e5bcf080 ldr pc, [ip, #128]! ; 0x80
it now looks like:
228: 4778 bx pc 22a: 46c0 nop ; (mov r8, r8) 22c: e28fc610 add ip, pc, #16777216 ; 0x1000000 230: e28cca08 add ip, ip, #32768 ; 0x8000 234: e5bcf080 ldr pc, [ip, #128]! ; 0x80
(This works because bx pc actually branches to the PC of the bx insn + 4: i.e. 0x22c.)
So direct jumps from Thumb code use the entry point at 0×228, whereas all other calls and jumps from ARM or Thumb code use the entry point at 0x22c. This method is smaller (4 bytes for the PLT entry stub versus 16-ish bytes for the out-of-line, imperfectly located stub) and plays nicely with section reordering.
We could use GNU ld for our linking needs, but gold has several very nice features (performance, identical code folding, section ordering, incremental linking, etc.) that make its use desirable. So the last week of my time has been spent learning gold well enough to add these small Thumb stubs to PLT entries when necessary. gold can now do this, woohoo!
I believe my code is suitable for using in Mozilla builds. Before being submitted upstream, however, there need to be some kinks worked out with regards to incremental linking: gold’s incremental linking support assumes that all PLT entries are the same size. The above modifications mean that some will be 16 bytes and some will be 12. (A possible cheesy hack would be to make all stubs 16 bytes when incremental linking…) It might also be worthwhile exploring a more general solution for stub positioning so that stubs play nicely with section reordering.