For about 48 hours the AMO API was thrashing because of the popularity (and hunger!) of the new add-ons manager in Firefox 3 Beta 3. The dust has settled and the servers are humming happily along, so now is a good time to blog about what happened and how we’ll handle future releases successfully.
Stop. Take a deep breath. Alright, here we go.
What happened this week?
Now that the API is functional (most major bugs have been ironed out), we got a rude awakening this week: we found out exactly how much traffic the improved Add-ons Manager can generate. It's a nice problem to have, and we're happy it's been well received.
Wednesday, around peak time, the API started clobbering our databases.
Shortly after we entered our peak traffic window, we had to turn off the API to keep the normal AMO working. Diagnosis found that:
- Load was not utilizing the read-only slave and was focused mainly on the master read/write database (mrdb03).
- Cache hit rates for memcached were down to 60% from the usual 90%.
- When our databases hit peak CPU, the app cluster would tumble because of the piling requests
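To put the cache numbers in perspective, here is a rough back-of-the-envelope calculation. It assumes the request volume stays the same and that every cache miss costs the database about the same amount of work; neither was strictly true this week, since request volume was also up. The function name is made up for illustration.

```python
def db_load_factor(old_hit_rate: float, new_hit_rate: float) -> float:
    """Ratio of cache misses (requests that reach the database) after vs. before."""
    return (1.0 - new_hit_rate) / (1.0 - old_hit_rate)


# Hit rate fell from the usual 90% to 60%.
factor = db_load_factor(old_hit_rate=0.90, new_hit_rate=0.60)
print(f"Database sees about {factor:.1f}x the usual query traffic")
```

In other words, a drop from 90% to 60% means roughly four times as many requests fall through to the database, before counting any increase in traffic.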
How was it fixed?
Wednesday, IT and Webdev spent quite a bit of time getting the API back up. Starting with the three points above, we:
- Off-loaded read-only traffic to DB slaves
- Investigated optimizations for the API
- Looked at cache rules and cache policies for both memcache and the hardware load balancer
However, Thursday didn’t fare any better for the cluster. This time the slaves started to melt near peak time — forcing us to once again temporarily disable the API. Under-utilizing memcache was the main issue. Cache headers were fine, slave was utilized, app nodes were fine — just too many damn queries flying at our database servers!
So on Thursday we continued our look into what was going on. We tried to figure out why our cache hit rate was so low (60% instead of 90%). Digging through the AMO code, we found that CACHE_PAGES_FOR, which sets the expiry time on memcache records when calling Memcache::set(), was set to 60 seconds. We increased it to 7200 seconds (two hours) to cache database results far more aggressively, and then collectively headed off for Valentine's dinner.
The next day, Memcache was our valentine.
The combination of our efforts worked:
- Overall query traffic was reduced dramatically
- The traffic that did make it past memcache was well distributed across 2 read-only slaves (db04, db04-2)
- App code was optimized to reduce overhead and unnecessary database traffic; this was done by placing a hard limit on the number of search results returned by the API, among other things
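For readers who want to see the idea rather than just read about it, here is a minimal sketch of the two application-side changes: a longer cache expiry and a hard cap on search results. It is written in Python with a tiny in-process dictionary standing in for memcached, so it is not the AMO code (which is PHP on CakePHP). CACHE_PAGES_FOR is the setting named above; TTLCache, search_addons, run_query and the value of MAX_SEARCH_RESULTS are all made up for illustration.

```python
import time
from typing import Any, Callable, Dict, List, Tuple

CACHE_PAGES_FOR = 7200   # seconds; raised from 60 as described above
MAX_SEARCH_RESULTS = 10  # hypothetical cap; the real limit is not stated here


class TTLCache:
    """Tiny in-process stand-in for memcached with per-entry expiry."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._store: Dict[str, Tuple[float, Any]] = {}
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, key: str, compute: Callable[[], Any],
                       ttl: int = CACHE_PAGES_FOR) -> Any:
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            self.hits += 1           # still fresh: no database work
            return entry[1]
        self.misses += 1             # absent or expired: go to the database
        value = compute()
        self._store[key] = (now + ttl, value)
        return value


def search_addons(cache: TTLCache, term: str,
                  run_query: Callable[[str], List[str]]) -> List[str]:
    """Return at most MAX_SEARCH_RESULTS results, served from cache when possible."""
    key = f"search:{term}:{MAX_SEARCH_RESULTS}"
    return cache.get_or_compute(
        key, lambda: run_query(term)[:MAX_SEARCH_RESULTS]
    )
```

With a 60-second expiry, a popular search hits the database once a minute; at 7200 seconds it hits the database once every two hours. The trade-off is that cached results can be up to two hours stale.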
How will we scale?
So these growing pains will help us move forward. Here is our plan of attack for scaling this beast for the Firefox 3 onslaught:
- Move the API (services.addons.mozilla.org) to a separate docroot with its own read-only slaves and more aggressive caching policies that are separate from the main AMO
- Optimize client code to reduce the number of requests needed to retrieve data, and employ local caching for redundant content or content that doesn't change much over time
- Offload even more traffic onto read-only slaves
- Upgrade CakePHP to the latest 1.1.x stable branch, which optimizes auto-generated queries quite a bit (thanks to clouserw for researching this)
- Refactor how we pull localized strings from our database
- Optimize our search performance on AMO and the API
- Switch default CakePHP data source to read-only slaves
- Find ways to use memcache at higher levels (caching larger objects instead of at just query level)
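As a rough illustration of what "read-only slaves by default" means in the list above, here is a small Python sketch that sends reads to the slaves in round-robin order and everything else to the master. It is only a sketch under assumptions: the ReadWriteRouter class and its host_for method are invented for this example and are not CakePHP's data source API, and the check for reads is deliberately naive.

```python
import itertools
from typing import Iterator, List


class ReadWriteRouter:
    """Pick a database host per query: slaves for reads, master for writes."""

    def __init__(self, master: str, slaves: List[str]) -> None:
        if not slaves:
            raise ValueError("at least one read-only slave is required")
        self._master = master
        self._slaves: Iterator[str] = itertools.cycle(slaves)

    def host_for(self, sql: str) -> str:
        # Naive classification: anything starting with SELECT is a read.
        # A real router would also handle cases such as SELECT ... FOR UPDATE
        # and reads that must see a write that was just made.
        is_read = sql.lstrip().upper().startswith("SELECT")
        return next(self._slaves) if is_read else self._master


# Host names taken from this post.
router = ReadWriteRouter("mrdb03", ["db04", "db04-2"])
print(router.host_for("SELECT id, name FROM addons"))
print(router.host_for("UPDATE addons SET name = 'x' WHERE id = 1"))
```

Making the slaves the default means new code has to opt in to the master deliberately, which is the safer failure mode for a read-heavy API.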
Once again it was a great team effort to get things running smoothly. Thanks to IT for helping us troubleshoot this. We’ll continue to build on this experience to ensure better reliability in future releases.
Looking back at the last three days, the Firefox 3 Beta 3 release was a success in more ways than one. It showed everyone what the web can do, but it also helped us wrap our heads around the API and how much traffic it generates. All of this will make for a better Firefox 3.0 release.