TL;DR: Recently, the Firefox Sync platform suffered serious performance degradation. We’d like to explain what happened, as well as the steps taken to deal with the issue from the perspective of Mozilla Services Operations (“Ops”).
The Sync platform melted down after the release of Firefox 7…
…to put it bluntly. This was the worst outage in Sync’s short (~1 year) post-Labs history, and it was caused by a confluence of factors including bugs, poor communication, and a lack of proper database maintenance. In the interest of being as open as Mozilla is with its products & source code, we want to share with you, our community, what happened, why, and what we’re going to do from now on to prevent issues like this from negatively affecting your experience with our products.
Obviously, everything we now know to be a contributing factor was identified in hindsight. Had we known about even one of these lurking issues, we might not have seen the level of service disruption that occurred. We will leverage this experience to help head off future incidents with this and other Services products.
Instant Sync

Instant Sync is a new feature in Firefox 7. It was first put into Beta sometime after the Firefox 5 release, which put it on track for the Firefox 7 release. Unfortunately, throughout the development and QA process, it wasn’t seen as having the potential to negatively impact Sync’s performance, so neither Ops nor QA had any attention focused on it.
This meant that while the Ops team was aware of the existence of the Instant Sync feature, it wasn’t something that we were tracking or worried about. This lack of concern and/or assessment of impact was responsible for the missing metrics that would have properly modeled the behavior of Instant Sync “in the wild”.
Broken server-initiated back-off
Firefox’s Sync client code is designed so that Services Ops has the ability to force clients to back off from their scheduled Sync activity. What this means is that if, for some reason, the back-end Sync platform gets overloaded, we can reply with a certain HTTP header (either our custom X-Weave-Backoff or the HTTP-spec Retry-After) and instruct the client to delay the next check-in by whatever time we determine to be necessary. This functionality can give us breathing room to make repairs or take other compensatory actions.
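The mechanism can be sketched in a few lines. This is a minimal illustration, not the actual Sync server code; the function name and the dict-of-headers shape are assumptions, but the two header names are the ones described above.

```python
# Minimal sketch of server-initiated back-off (not the real Sync server
# code): when Ops decides the platform needs breathing room, responses
# carry a header telling clients how long to delay their next sync.

def apply_backoff(headers, overloaded, backoff_seconds=1800):
    """Add back-off headers to a dict of response headers.

    `backoff_seconds` is whatever delay Ops determines to be necessary.
    """
    if overloaded:
        # Custom header understood by the Sync client...
        headers["X-Weave-Backoff"] = str(backoff_seconds)
        # ...plus the standard HTTP header as well.
        headers["Retry-After"] = str(backoff_seconds)
    return headers
```

A compliant client reads either header and postpones its next scheduled sync by at least that many seconds.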
Our faith in the stability of the Sync product proved to be our undoing in this case. Because we’ve never needed to use the server-initiated back-off feature, we didn’t notice that it had been broken at some point during the last few months of Firefox’s development. Essentially, when we needed it most, it didn’t work.
Among all the data Sync “syncs”, the “collection” (as it’s called in the code) that generates the most data is Browser History. Since your History is constantly changing, it constantly needs updating in Sync. Every time you use your browser, you are syncing more and more history data into our encrypted databases. To our detriment, and perhaps foolishly, we weren’t particularly rigid about our database cleanup of that data, and it backed way up. Until Firefox 7 (before Instant Sync existed), this old data was not a significant problem since we had more than enough storage overhead to deal with the existing backlog of old data and our processing resources were fast enough such that the presence of the old data had a negligible impact on our query load in MySQL.
In Ops, we do not often discuss hard numbers when discussing capacity. Server capacity in any environment with a changing workload is difficult to quantify, so we use rough numbers. As of September 2011, we were reporting that we were using “about 40%” of our capacity in Northern California and “about 60%” in our Phoenix data center. Those numbers, while imprecise, were reasonably correct and well within tolerances for perceived load.
The downfall is that we were not being precise in differentiating between “horizontal” capacity and “vertical” capacity. If these aren’t familiar terms to you, scaling horizontally is when you add more servers to your existing pool to better spread the load among more resources. Vertical scaling is when you add local (in-machine) resources that will better handle increased workload. These local resources might be an additional CPU, more RAM, faster, bigger hard drives, etc. To give a non-technical analogy, if you own a sedan and have two kids but have twins on the way, you could either buy a minivan (vertical scale) or another sedan and drive two cars when all 4 kids need to go somewhere (horizontal scale).
Services Ops had (and still have) plenty of horizontal capacity. If we suddenly gained 50% more new users this month we’d certainly be watching our metrics graphs carefully for bottlenecks, but we’re reasonably confident the platform would continue to deliver the same level of performance. However, if our existing users suddenly started sending 50% more traffic, we would be unable to handle that load because we aren’t operating with enough overhead to scale vertically to that level and performance would suffer. And, basically, that’s what happened. But before we start getting into those kinds of details, let’s go through the timeline of the event…
Firefox 7 released
On 27 September 2011, Firefox 7 came out on time. At this point, a Firefox release day is not a Red Alert, “all hands on deck” event like it has been in the past. We do check out our metrics graphs a little more than an average day, but other than that we’re business-as-usual. On this particular day, there was a small uptick in traffic to the Sync servers, but it wasn’t enough to generate any real concerns and was attributed to new users trying out the new release.
Load creeps up
Within 48 hours, it was clear that this new traffic wasn’t just new users. Traffic was steadily climbing, but the odd part was that we weren’t seeing any correlating increase in new user sign-ups. This meant that the traffic we were seeing was from existing users. As luck would have it, at about the same time we started seeing an increase in drive failures in our Northern California data center. Equipment housed at that facility is a bit older than the equipment in use in Phoenix, so a certain amount of failure is expected, but the failure rate suddenly jumped to more than four times normal. What we now realize is that the increased disk I/O the database nodes were trying to handle was pushing old equipment right over the edge into failure.
By the 1st of October, we were starting to receive notifications from our monitoring system indicating individual databases were overloading to the point where they were unresponsive. In order to repair the errors users would experience from their assigned database node being overloaded, we began initiating “migrations” of those users to other, less loaded database servers. The problem with this approach is that it involves forcing a “First Sync” of all the user’s data to a brand new node, inducing additional load on that server as well.
At this point, we’re sure there’s some sort of problem with Firefox 7 and how it’s interacting with the Sync servers, but we’re unable to find any correlation between failure rates and any functionality in Fx7. Outside of 503 errors occurring between our Zeus load-balancers and the back-end platform, we’re receiving nothing indicating what the root cause may be.
Firefox, both the project and the code, is an almost living thing. The nightly builds change nightly (hence the name) and it’s difficult to see what “Butterfly Effect” a change in some code will have on Aurora/Beta/Release in the following 6/12/18 weeks (respectively). Completely open communication between teams internally helps detect potential pitfalls, yes, but is not foolproof since some corners of the code are not as well-exercised as others (Example: if the Back button suddenly stopped working, everyone would notice. However, some bits of code are only triggered very rarely and without direct user input, making them harder to notice).
Keeping this in mind, when the release of FF7 was imminent, we asked our developers if anything “major” had changed in the Sync client portion of the Firefox code. From the perspective of the developers, the changes that went into this version of the browser should not have resulted in a significant change in behavior, so no red flags were raised. Since we were operating under the impression that there were no major behavior changes expected, Ops continued to dig through logs and trace connections, desperately trying to correlate behavior and events, yet continuously coming up empty.
Eng/Ops side-bar: Mozilla Services is unlike any organization I’ve worked for previously. It’s not just that Engineering and Operations are friendly with each other (which I’ve encountered at a few other places); we work very closely on a daily basis and are very cooperative. The two teams sit intermingled (those who are based out of the main Mountain View office do, anyway), and the bulk of our online team chats happen in a combined chat room. When there are all-hands events at the Mozilla offices, our team dinners include both Eng & Ops. It’s fantastic, and makes for an excellent working environment even when things aren’t going to plan.
I want to provide this perspective to help illustrate that we aren’t at-odds with Engineering. So, while trying to pin down what’s going on, we’re not accusing Eng of anything, we’re investigating collaboratively and still coming up dry.
It was a long weekend of fire-fighting, and we were losing the battle as traffic was overloading more and more servers.
On Monday, 3 Oct, after some trawling through the landed bugs listing for Fx7, the Ops Engineers noticed the Instant Sync feature bug. I asked in our intra-team IRC chat channel “Did Instant Sync land in Firefox 7?”
With the patch confirmed and after several pointed questions about specific expected Instant Sync behaviors, the investigation was on.
Knowing what to look for, finally, made it much easier to pin Instant Sync as the primary cause of the Sync storage (database) nodes’ load spike. At this point, server load was as high as a 50% increase over normal (which, for some nodes, meant load was pinned at 100%).
Warning: Technical details below.
We had already verified, via our personal Sync accounts and about:sync-log data, that Firefox 7 was doing the right thing and backing-off a little when it encountered timeouts connecting to the server. However, this wasn’t enough to quell the overall load increase and it also did nothing to prevent the Error Bar from showing up at the bottom of users’ Firefox browsers.
At this point we decided to start sending the back-off headers using X-Weave-Backoff, only to discover that neither Firefox 7 nor Firefox 6 would respect it. Sadly, the same was true of the Retry-After header.
Without the ability to force clients to back-off, our only available course of action is to attempt to compensate for the increased network traffic.
Prior to the release of Firefox 7, we had begun the process of upgrading our older MySQL installation to a new version of Percona. Our original plan was for this to be an interruption-free, multi-week process. However, we could plainly see that servers running the older MySQL builds were completely loaded and unable to keep up, while servers running the new Percona builds were humming along nicely. Realizing this, we dove in and started doing upgrades as fast as we could. From the perspective of the Sync client, a server down for a MySQL upgrade looks exactly the same as a server overloaded to the point that no MySQL traffic can reach it. In that way, it’s no worse than the existing situation for the user, so we just went for it.
While this was going on, another Operations engineer whipped up a better method to log the aforementioned 503 errors on the load balancers. This allowed us to see what was failing where more efficiently.
An additional problem-solving avenue that proved invaluable was Twitter. We asked users via @mozservices on Twitter to help by submitting links to pastebin’d copies of their about:sync-log files. This gave the Engineers the ability to home in on exactly what was wrong, as well as to discover and fix several previously-unknown client and server bugs.
We also added a previously decommissioned web node back into the load balancer pool in our Phoenix data center to help handle the load there, since First Syncs take up more CPU and this server could help handle that additional load.
We did some OS tuning within all the web nodes, upping the file descriptor limits from 1024 to 8192. This was done because the overload was causing nginx to hit the fd limit. We did some additional adjustment of the nginx config itself to allow for more concurrent connections to the Python back-ends (which run under gunicorn) due to the severe query latency. Finally, we changed some Linux kernel-level TCP parameters to allow processes to have a larger accept() queue, re-use TIME_WAIT sockets in a more timely manner, and generally take advantage of the fact that we have lots of RAM (adjusting buffer sizes, etc.).
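For the curious, this kind of tuning boils down to a couple of small config fragments. The 8192 fd limit is from the text above; the specific sysctl values below are illustrative assumptions, not the exact values we deployed.

```
# /etc/security/limits.conf -- raise the per-process fd limit for nginx
nginx  soft  nofile  8192
nginx  hard  nofile  8192

# sysctl settings -- kernel TCP adjustments (illustrative values)
net.core.somaxconn = 4096        # larger accept() backlog per listener
net.ipv4.tcp_tw_reuse = 1        # re-use TIME_WAIT sockets sooner
net.core.rmem_max = 16777216     # bigger socket buffers, since RAM is plentiful
net.core.wmem_max = 16777216
```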
On the load balancers we dropped the LDAP-slaves VIP timeout to 0. LDAP could handle its own timeouts fine, and chopping them off short was just making Sync upset when it happened, causing more 503s to be sent to users’ Firefox browsers which triggered the error bar even more.
On the load balancers in our Northern California data center (which run on older/slower hardware than in Phoenix), we upped the conntrack limits, allowing for more connections inbound to the cluster and to the load-balancer back-ends (web servers and LDAP servers). The increased load and longer-lived connections made for a higher number of concurrent connections, which bumped us into the nf_conntrack limit (a Linux iptables thing). We also added monitoring so we can tell when we’re approaching the new maximum. Then we lowered the pool_time for the Sync storage MySQL connection pool. There is a bug in the version of the Sync storage code we were running that would serve an error (“503”) to a user when a MySQL connection was dropped (instead of retrying), and we were seeing long-lived connections getting dropped (possibly by a firewall in the middle). A newer version of Sync storage, slated to be live in production this week, fixes this problem.
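The dropped-connection fix amounts to retrying on a fresh connection instead of surfacing the failure. Here is a minimal sketch of that pattern; none of these names come from the actual Sync storage code, and the pool object with a `get_connection()` method is an assumption.

```python
# Sketch of the class of fix described above: retry a query once on a
# dropped MySQL connection rather than turning it into a 503.

class ConnectionDropped(Exception):
    """Stand-in for the driver error raised on a dropped connection."""

def execute_with_retry(pool, query, retries=1):
    """Run `query`, checking out a fresh connection and retrying on a drop.

    Without the retry, a single dropped connection surfaces as an error
    (a 503) to the user, which is the bug described above.
    """
    for attempt in range(retries + 1):
        conn = pool.get_connection()  # assumed to hand back a live connection
        try:
            return conn.execute(query)
        except ConnectionDropped:
            if attempt == retries:
                raise  # out of retries; the caller turns this into a 503
```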
All of this tuning bought us a small amount of breathing room, but we were still smacking right into the true bottleneck: the databases simply could not churn through the increased query load this additional traffic required. So we developed a plan to very rapidly, very aggressively prune out the old data, which was, as mentioned, primarily users’ old History data.
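Aggressive pruning of this sort is usually done in batches so that each DELETE holds locks only briefly. A sketch under assumed names follows; the table and column names are hypothetical stand-ins, not the real Sync schema, and `conn.execute` is assumed to return the affected row count, as MySQL drivers typically do for DELETE statements.

```python
# Illustrative sketch of batched pruning of stale history rows.

PRUNE_BATCH = 10_000  # rows per DELETE; small enough to hold locks briefly

def prune_old_history(conn, cutoff_ts):
    """Delete history rows older than `cutoff_ts` in batches,
    returning the total number of rows removed."""
    total = 0
    while True:
        deleted = conn.execute(
            "DELETE FROM history WHERE modified < %s LIMIT %s",
            (cutoff_ts, PRUNE_BATCH),
        )
        total += deleted
        if deleted < PRUNE_BATCH:
            return total  # last partial batch: nothing older remains
```

The batching is the whole point: one giant DELETE would block live traffic for the duration, which is exactly what an overloaded node cannot afford.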
Bulking up the Infrastructure
While we hoped we could simply accomplish this “in place” on the databases, we knew it wouldn’t be too far-fetched to imagine that the maintenance work would both be blocked by the existing load and add its own load into the mix, as these were very intensive clean-up processes. To prepare for that potential scenario, we began to prep additional hardware that was being held in reserve in our Northern California data center. Some quick background on that: we had purchased a lot of gear in bulk from a bankruptcy sale that was handed over in situ. It was about double what we actually needed for Sync when we brought it live in March of 2011, so we had several cabinets full of servers powered off, waiting to be needed. This ended up being very fortuitous for us, since if we had needed to buy new servers, there was no way we could have gotten them selected, shipped, racked, configured, built, and online within a reasonable period. Even with the work of updating BIOSes and getting the IPMI management configured (which has to be done one-by-one with a crash cart), we managed to get 60 servers prepped and kickstarted in less than 48 hours.
So, while some of the team got started prepping these servers, we tried to “live prune” the old data. On the least-loaded servers we saw some improvement, but it was very slow to finish. On the most severely overloaded systems, the process failed as it simply couldn’t complete. So, with our fears realized, we had to try Plan B.
It was theorized that if the problem was old data, we could just “nuke” all the data from the server so that the next time the client connected, Firefox would automatically initiate a “First Sync” and re-push all the data bundles (which would only include the user’s last 5,000 history entries, rather than months of old junk data). We started gingerly, by nuking the data of 10% of users on a few selected nodes. It was immediately apparent that this was a terrible idea. First, the database loads were already quite high from the other 90% of users, and First Sync events push a LOT of data very quickly, so that just made it worse. Second, with Instant Sync in the mix, users were not exactly trickling in to attempt a First Sync. They were all trying to do it Right Now. And then fail, and then try again. And then fail…
By now (Thursday, 6 Oct) the new database servers were built and ready to be brought into the production pools. This allowed us to launch Plan C: mass user migration. This process entails marking a database as “down” via an administrative process on our side, which blocks all incoming traffic to that server. We then delete the user’s Sync storage node assignment in the LDAP system and clear the user’s data from our caches. Once complete, the next time the client tries to Sync, it will get a 401 response code and re-request the user’s Sync storage node. We configured the node-assignment server to only allocate users to these new nodes (including a few non-overloaded nodes in Phoenix), and the users were able to initiate their First Sync successfully. Having done this, Sync’s behavior returned to normal for them.
That was the theory, anyway.
Unfortunately, it takes a long time to do the above for several million users (more than 3 hours per node, across many dozens of nodes, and we could only do them in batches to avoid overloading the new servers). Still, based on what we had learned from our earlier, less-successful attempts, we felt confident it would work, and we started the user migrations in earnest, targeting the nodes in the worst condition first. We saw immediate positive impact. As nodes were cleared of users, we truncated the data and returned them to the database pool to accept new users.
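The per-node migration flow can be sketched at a high level. Every name here (`directory`, `cache`, and their methods) is a hypothetical stand-in for our internal tooling, not actual code:

```python
# High-level sketch of the "Plan C" migration flow described above.

def migrate_users_off_node(node, directory, cache):
    """Drain `node` so its users get reassigned on their next sync.

    `directory` stands in for the LDAP system holding node assignments;
    `cache` stands in for the cache layer in front of it.
    """
    node.mark_down()  # administratively block all incoming traffic
    for user in list(directory.users_on(node)):
        directory.delete_assignment(user)  # drop the LDAP node assignment
        cache.clear(user)                  # purge the cached data
    # From here the client does the rest: its next request gets a 401,
    # it re-requests a storage node, is assigned a fresh one, and a
    # First Sync repopulates its (freshly pruned) data there.
```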
Emerging into the sunlight
By early Friday (7 Oct) we saw substantial, sustained improvement in the Sync platform, with a very sharp reduction of the signature “503” errors on the load balancer. After another 24 hours (36 hours in total) of migrations and cycling the cleaned databases back in, 503s were essentially eliminated, and our metrics showed we had transient error rates in-line with our pre-Firefox 7 numbers (and in some areas even better).
By Monday (10 Oct) we had complete confidence that Sync was back to normal, as we had 48 hours of normal metrics. We announced as such via the @mozservices Twitter stream and asked for any users having continued problems to contact us. While we have had ongoing conversations with some users experiencing Sync misbehavior and exposing new and interesting client bugs, we have had no recurrence of the 503 overloading problem.
All of the mitigation efforts described above are permanent changes to the Sync platform. It’s stable, and we could catch our breath. However, these changes only address the immediate problem and were implemented in a reactive fashion. That is simply an unacceptable end-point to this process, leaving us wondering, “What if there’s another feature around the corner that increases load on another portion of the platform?” We would be just as unprepared as we were three weeks ago.
Procedure & bureaucracy
Our primary concern from the outset was “How did this happen?” The confluence of factors contributing to the melt-down has been laid out above, but that’s not what the question is asking. How, with all of the coordinated inter-team status meetings and load testing and roll-outs, did such a huge change to the Sync client get “snuck” into production?
It didn’t sneak, really. The Engineering team followed protocol, they announced it in the weekly meeting, it went through peer review, and every other prescribed step needed to have this feature land. Following that, the client went through QA, Sync functionality was tested (and successful), and it rolled out. Only to have this be the result. What did they do wrong?
Nothing according to the procedure, and that is the problem: the procedure itself.
Most urgently, there was no formal Ops review of upcoming features, either to flag expected behavior changes or to anticipate misbehavior. There will now be regularly scheduled Ops reviews of all “major” features, as well as an overall review before major version releases. This goes not only for the client, which was the antagonist in this case, but for the server products as well.
Similarly, there was no explicit hand-off to QA to test these new features for functionality, much less a modeled-load test. This was exacerbated by the lack of testing of the X-Weave-Backoff header “fail-safe” feature. To remedy this, there will now be an explicit QA review of all feature changes, as well as coordinated test-case development for these features. Additionally, there will be a punch list of required-functionality tests that are absolute blockers to release (“respects the X-Weave-Backoff headers” being the item right after “Syncs stuff”).
First things last
Here is our internal punch list, including which team(s) each item falls to. Many of these are already done, in progress, or ongoing, but this is directly from our internal Wiki:
- Build and document model of current behavior, expected traffic (Ops/Eng)
- Bring back and enforce system-review requirement for all client behavior changes (Eng)
- Identify required metrics for reliability, create dashboard to compare Nightly/Aurora/Beta/Release channels (Ops)
- Enhanced User-Agent data required (Eng)
- Standing review, one week before Rapid Release cut-over day, of all changes landed and all metrics from the above actions (Eng/Ops)
- Fix backoff handling for Fx8 + add automated tests (Eng) – Done: Bug 691612 Bug 691663 Bug 691988
- Add back-off handling to pre-Aurora-merge extended QA validation (Eng/QA)
- Document and make available current system capacity numbers (Ops)
- Determine and implement standard for required headroom on each server (Ops)
- Coordinate metrics-ops observation of Aurora/Beta traffic patterns (Ops/Metrics)
- zeus log analysis and reporting improvements (Ops)
- deploy TTL pruning cron (Ops)
- push 2.6.x server-core update (Eng/Ops)
- improve understanding of platform structure outside of Ops (Ops)
- automated stack trace processing and reporting, top crasher equivalent, server-focused (Ops/Eng)
- Re-triage of all outstanding bugs to identify future lurking issues (Ops/Eng)
- 90-second minimum sync interval (Eng) — Done: Bug 694149
- Create and document engineering contact list + escalation path for all applications (Eng)
- Update new user activation load test, identify optimal+maximum user uptake rate per node (Ops/Eng)
- gradual user reassignment per node, not per db (Ops)
- Ensure ongoing reliability work is accounted for in planning, and deferral of maintenance (Eng/Ops) – Ongoing
We (Mozilla Services Operations) hope that this helps you to understand why this Sync melt-down took as long as it did to resolve. We strive to prevent even minor blips for our users, and any failure at that goal that results in a visible outage is an embarrassment to us, but this one was particularly frustrating since hour after hour we couldn’t dig ourselves out using the tools and techniques we had previously relied on.
We know we are better for it, though we still wish this problem hadn’t happened.