the performance implications of strncpy

Last week, I was working on making Firefox compile for a OS X target on a Linux host.  As part of this effort, I ported Apple’s opensourced ld64 linker to compile and run on a Linux host.  Since OS X is a BSD-derived operating system, ld64 made use of the strlcpy and strlcat functions, designed to be safer than the strcpy/strncpy/strcat/strncat functions.  Linux’s libc doesn’t implement strlcpy and strlcat, so I had to find replacements.  strlcpy seemed easy enough, as a presentation on maloader suggested:

size_t strlcpy(char* dst, const char* src, size_t size)
{
    dst[size - 1] = '\0';
    strncpy(dst, src, size - 1);
    return strlen(dst);
}

I cribbed strlcat from someplace else and went about my merry way.

After I got ld64 to compile, then link simple programs, and worked my way through some configure bugs for non-Android cross-compiles, I ran into a problem: the JavaScript shell was taking 8 minutes to link.  This was unacceptable; it meant linking libxul was going to take over an hour, maybe over two, to link, which nobody would be happy about.  The equivalent link of the JavaScript shell on my Mac mini took about two seconds.

I started investigating what was going on with perf, just checking into what ld64 was doing for parts of those 8 minutes.  99%+ of the time was being spent inside strncpy.  Hm, that’s strange.

I fiddled around with a couple different things, none of which had much impact.  Then I took a close look at the code calling strlcpy (yes, all the time in strlcpy was through this function, which should have been a red flag in the first place):

int32_t StringPoolAtom::add(const char* str)
{
	int32_t offset = kBufferSize * _fullBuffers.size() + _currentBufferUsed;
	int lenNeeded = strlcpy(&_currentBuffer[_currentBufferUsed], str, kBufferSize-_currentBufferUsed)+1;
	if ( (_currentBufferUsed+lenNeeded) < kBufferSize ) {
 		_currentBufferUsed += lenNeeded;
 	}
 	else {
 		int copied = kBufferSize-_currentBufferUsed-1;
 		// change trailing '\0' that strlcpy added to real char
 		_currentBuffer[kBufferSize-1] = str[copied];
 		// alloc next buffer
 		_fullBuffers.push_back(_currentBuffer);
 		_currentBuffer = new char[kBufferSize];
 		_currentBufferUsed = 0;
 		// append rest of string
 		this->add(&str[copied+1]);
	}
	return offset;
}

In this code, kBufferSize is 16MB, so the size parameter passed to strlcpy can be rather large compared to the size of the string being copied to the destination.

I forget exactly where I read it, but I saw some small blurb about glibc’s strncpy having the crazy behavior of null padding the destination string, rather than just appending a null terminator. If strlcpy was implemented by calling out to strncpy, then just that function above would be writing hundreds or even thousands of megabytes of zeros more than required. That would definitely slow things down!

(I later discovered that this “crazy behavior” is documented in the strncpy man page and is actually required by standards.  Indeed, the original strlcpy paper cites this problem of strncpy as a motivating factor for strlcpy.  It is the only way the performance figures they give in the paper are actually relevant to their point.  But silly me, I just assumed I knew how strncpy works rather than actually reading documentation. I am really curious how this behavior of strncpy came to be and why folks thought it was worth preserving.)

Once I fixed the strlcpy implementation to do things properly, cross link times became comparable to native link times.  And then I could think about linking libxul in a reasonable amount of time. (And I did link libxul in a reasonable amount of time, if you read through the bug. And it even runs on a Mac!)

Lesson learned: don’t use strncpy!

2 comments

  1. So what does your strlcpy look like now?