Improving FTP Cluster Availability

jakem


Over the last several months we’ve had intermittent trouble with our FTP cluster, which powers ftp.mozilla.org and also acts as the origin host for the primary product delivery CDNs (download.cdn.mozilla.net). Despite the name, the vast majority of the traffic is actually Apache-powered HTTP and HTTPS, not FTP.

The Problem

This problem manifested as a simple cessation of HTTP services on one node at a time. The quick fix was to restart Apache, which usually worked. Sometimes Apache still wouldn’t come back up after the restart, or would fail to start entirely – rare, but common enough to be a recognized pattern. Generally speaking, simply trying two or three times would eventually restore everything to a working state. Fortunately, breakage during normal operations was rare.

Making the problem much worse, any time we changed a relevant config in Puppet, Puppet would attempt to restart Apache. Since Apache doesn’t always come back, this was frequently a fatal blow to one or more nodes. Even worse, the nodes would generally all get the same treatment within 30 minutes of each other… so one change would often be fatal to at least one node, if not several.

Of course our load balancer would notice that a node had bitten the dust and pull it out of rotation, but by then much of the damage was already done: any in-flight connections were lost and had to be re-established. This is very disruptive to some things, especially automation and testing, which frequently don’t respond well to service outages or disconnections. Or rather, they respond perfectly, by screaming loudly.

Mitigation

The first (and simplest) fix was to reduce the extra pain caused by Puppet. As far as we could tell, the vast majority of these events were coming from that situation. The fix itself was very straightforward. The relevant docs are here: http://docs.puppetlabs.com/references/latest/type.html#service.

There are two relevant flags: “restart” and “hasrestart”. The former defines a custom command that Puppet should use to restart the service in question, when needed. The latter is simply a boolean flag indicating whether the service in question already has a “restart” command built in that Puppet can use directly.

We had set both of these. The former was attempting to change the restart command to “graceful”, which is an Apache-specific restart mode that is much gentler to existing connections (they’re allowed to finish, rather than being abruptly cut off). The problem is that the latter overrides the former: if you tell Puppet that a service has a restart command (hasrestart => true), it will use that and ignore your custom one.

So that’s the first half of the fix – we simply removed the “hasrestart => true” line, making graceful restarts work properly. This way, at least new config changes won’t break all in-progress operations.
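For illustration, the corrected resource ends up looking something like the sketch below. The resource title and the exact graceful-restart command are assumptions for the example, not a copy of our actual manifest:

    # Sketch of a Puppet service resource with a custom graceful restart.
    # Note there is no "hasrestart => true" here; with it present, our
    # version of Puppet ignored the custom "restart" command entirely.
    service { 'httpd':
      ensure  => running,
      enable  => true,
      restart => '/sbin/service httpd graceful',  # assumed command; "apachectl graceful" would also work
    }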

The Real Problem

This was a puzzling issue because it’s one we hadn’t run into anywhere else (and we run a lot of Apache nodes). Clearly something was unique about the configuration on these nodes.

It didn’t take long to turn up a very promising suspect: this is the only RHEL6 cluster where we make heavy use of the “worker” MPM in Apache, rather than “prefork”. It’s also one of the most heavily-loaded clusters we run, in terms of requests per second.

Most of the errors generated by Apache were unhelpful, like this:

[Thu Jan 03 15:23:23 2013] [crit] (22)Invalid argument: ap_queue_pop failed

This is obscure enough to be virtually unactionable, and there were hundreds of lines like it (and other similarly obscure errors) across the cluster. Buried in the muck, however, were some very interesting gems like this:

[Thu Jan 03 15:23:23 2013] [alert] (11)Resource temporarily unavailable: apr_thread_create: unable to create worker thread

These happened often even on start-up, although typically Apache would still start. This is what ultimately led us to the real problem.

Our Apache config for the worker MPM looked like this:

    ThreadLimit          100
    ThreadsPerChild      100
    StartServers          30
    ServerLimit           50
    MaxClients          4000
    MinSpareThreads       75
    MaxSpareThreads       75
    MaxRequestsPerChild  500

Let’s break this down:

  • ThreadLimit is the maximum number of threads a worker process can have. Default is 64. A restart is needed to raise this.
  • ThreadsPerChild is how many threads are actually created per worker process. A worker always has precisely this many threads within it (plus 1 additional thread that doesn’t process requests). Default is 25. This can be raised up to ThreadLimit with just a “graceful” restart.
  • StartServers is the number of worker processes spawned at start-up.
  • ServerLimit is the maximum number of worker processes that can be spawned. A restart is needed to raise this.
  • MaxClients is the maximum number of simultaneous requests that can be served.
  • MaxSpareThreads is the maximum number of idle threads Apache will attempt to keep on hand. If it has more than this, it will kill off worker processes (each with ThreadsPerChild threads) until it has fewer than this many idle.
  • MinSpareThreads is the minimum number of idle threads Apache will attempt to keep on hand. If it has fewer than this, it will spawn new worker processes (each with ThreadsPerChild threads) until it has enough.
  • MaxRequestsPerChild is how many requests a worker process will serve (across all its threads) before it will automatically perish and be respawned.

One noteworthy point that’s not immediately obvious when starting to use the worker MPM is that Apache never changes the number of threads per process. That is fixed at the value of ThreadsPerChild. If Apache needs more or fewer processing threads, it will spawn or kill off whole worker processes, each with that fixed number of threads.
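You can watch this behavior on a live box with nothing more than ps. This is just a quick sketch; any equivalent invocation works:

    # One line of output per httpd child, prefixed with its thread count.
    # The thread count per child stays fixed (ThreadsPerChild + 1 overhead thread);
    # only the number of children changes as load rises and falls.
    ps -C httpd -L -o pid= | sort | uniq -c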

Now, some observations:

  • 30 servers spawned at start * 100 threads each = 3000 available threads. Given a MaxSpareThreads of only 75, on the surface this seems like a large mismatch. Most of these 30 processes will be killed off immediately in an attempt to reduce the number of spare threads, unless we really do need that many (spoiler: we don’t).
  • ServerLimit of 50 * 100 threads per process = 5000 max possible threads, but MaxClients is only 4000. This is okay but slightly odd, and misleading if you don’t work with “worker” very often. Side note: the other way around won’t work – MaxClients can never actually exceed ServerLimit * ThreadsPerChild.
  • MaxRequestsPerChild of only 500 means that each thread will serve (on average) only 5 requests before being killed. In all but the most horrible of memory-leaking applications, this is very wasteful and results in lots of unnecessary “churn” in worker processes.

All that is technically okay, but appears to be sub-optimal. Without knowing anything about the traffic itself, it’s hard to know what parts (if any) of that to criticize. The real secret to our troubles lies buried within that tantalizing “unable to create worker thread” error.

What’s Wrong?

After much gnashing of teeth, we ultimately discovered this seemingly-unrelated bug within Red Hat’s Bugzilla: Bug 432903 – /etc/security/limits.conf should reduce the risk of forkbombing.

To make a long story short (too late, I know), it turns out Red Hat instituted a global soft limit to the number of processes any user (including root) can spawn. This default is 1024. It’s been that way for a very long time. We just haven’t had the misfortune of running up against it. It’s likely that previous incarnations of this service (where the above settings came from) had already removed or modified this limit.

In Linux this limit counts threads just like processes. So:

  • 1 “master” httpd process
  • 30 child processes on start-up
  • Each child has 1 “overhead” thread
  • Each child has 100 “processing” threads
  • Total at start-up: 1 + 30 + 30*1 + 30*100 = 3061 processes and threads needed!

This is the source of all our troubles – we’re trying to spawn far more threads than the OS will allow us to do.
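If you want to see the collision for yourself, a couple of stock commands make it obvious. This is a sketch; on a stock RHEL6 box the default cap typically lives in /etc/security/limits.d/90-nproc.conf:

    # Soft limit on processes/threads for the current user (run as root)
    ulimit -u

    # The RHEL6 default cap described in bug 432903
    cat /etc/security/limits.d/90-nproc.conf

    # Count every httpd task (processes *and* their threads), which is
    # what the nproc limit is actually measured against
    ps -C httpd -L --no-headers | wc -l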

It also reveals an interesting observation: since there was no possible way Apache was ever spawning more than 1024 worker threads, we can now say with certainty that our worker MPM settings above were definitely poor for our environment. If we really did need more than 1024, we’d have been hitting these errors during normal usage, rather than primarily on service restart. StartServers was in effect highlighting a weakness of the configuration that we would inevitably have encountered if we ever did have that much traffic.

The Fix

Sadly, you cannot overcome this with a simple /etc/security/limits.conf entry for the “apache” user. Apache creates these threads as the root user, before it drops privileges, so they count against root’s limit (even though the processes later run as a normal user). You could certainly set a higher limit just for root, but that may not be desirable, as many things follow a pattern similar to Apache’s (start as root, drop privileges).

The best trick we’ve found is to tweak the Apache start-up workflow so that it runs “ulimit -u 4096” (or similar) right before starting Apache. Thankfully, there’s an easy way to do this in RHEL6 without having to modify the actual init script. The file /etc/sysconfig/httpd is sourced by the default init script… you can simply add this line in there, and it will be applied the next time Apache is started. We already manage this file via Puppet anyway, so this was a trivial one-line fix for us.
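Concretely, the addition looks like this (4096 is simply a value comfortably above our worst-case thread count, not a magic number):

    # In /etc/sysconfig/httpd, which the RHEL6 init script sources before starting httpd:
    # raise the per-user process/thread limit so the worker MPM can create all of its threads.
    ulimit -u 4096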

With that fix in place, we were able to go back and refine our worker MPM settings. Here’s the new config:

    ThreadLimit          150  # up from 100
    ThreadsPerChild      150  # up from 100
    StartServers           8  # down from 30
    ServerLimit           16  # down from 50
    MaxClients          2400  # down from 4000
    MinSpareThreads       75  # same
    MaxSpareThreads      500  # up from 75
    MaxRequestsPerChild 5000  # up from 500

You can see a number of improvements here:

  • ServerLimit * ThreadsPerChild = MaxClients. Not strictly necessary, but it keeps things easy to reason about. I found it best to think about MaxClients as if it were a derived value. It isn’t, but if you set it as though it were, it functions as a nice sanity check on the other settings.
  • StartServers * ThreadsPerChild = 1200 starting threads, a sane number compared to the other settings.
  • Min and MaxSpareThreads are set much more sanely, rather than to equal (and very low) values, so Apache doesn’t have to kill off or spawn processes as frequently.
  • MaxRequestsPerChild is much higher (though still somewhat low per-thread), further reducing process churn.

Even though MaxClients and ServerLimit are now much lower, we’re actually able to handle far more traffic than before, thanks to the ulimit tweak.
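Applying the same accounting as before to the new settings shows why, as a rough sanity check:

    # Start-up:       1 master + 8 children + 8 overhead + 8*150 worker threads
    echo $(( 1 + 8 + 8 + 8*150 ))      # 1217 tasks, well under the new 4096 limit
    # At ServerLimit: 1 master + 16 children + 16 overhead + 16*150 worker threads
    echo $(( 1 + 16 + 16 + 16*150 ))   # 2433 tasks, still well under 4096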

In the End

You could consider this a case of premature optimization coming back to bite us. In reality these settings were migrated from older servers which did not exhibit this problem. We weren’t optimizing so much as attempting to maintain consistency with earlier incarnations of this cluster, because we knew that would work. We were wrong.

The trouble with this problem was not that it’s a particularly tricky thing. It really isn’t: ulimit settings are something most Linux sysadmins grow accustomed to working with at some point. The real problem was the infrequency and unpredictability of the failures, coupled with settings that were believed to be appropriate. It just required someone to take some time out to dig into why this crazy little nonsense was happening every now and then.

If you have any questions, please feel free to ask in the comments below!

2 responses

  1. Ed M wrote on:

    An interesting read, thank you :-)

  2. Eugene wrote on:

    Great closing paragraph, thank you. I wonder how many RHEL5/6 configurations which worked okay will bomb when we all start to get familiar with RHEL7.