AMO was updated on Thursday, March 22nd, around 8pm. Overnight, we watched the web infrastructure to ensure that AMO could withstand peak load times, but this morning, near peak time, cluster load levels became too high and we were forced to roll back yet again to avoid affecting other critical applications.
The good news:
- Our database bottleneck is now nonexistent
- The app servers are fine during off-peak times
- During this short period, we received as much feedback as, if not more than, during the previously announced two-week beta window
The bad news:
- App server load at peak times is unacceptable
- Traffic on web nodes has more than doubled as a result of absorbing releases.mozilla.org traffic
Next steps:
- Profile the application to look for new pain points (already done; no obvious culprits)
- Move public add-ons .xpi traffic back over to releases.mozilla.org
- For policy reasons, only sandbox files will be served from the webheads
- Remove locale strings from all image URLs to improve the cache hit rate for images
- Reassess and redeploy at the earliest reasonable time
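
To illustrate why removing locale strings from image URLs helps caching, here is a minimal Python sketch. It is not AMO's actual code: the URL layout (a locale as the first path segment, with an "images" segment further down), the locale list, and the function name are all assumptions made for the example. The idea is that the same image requested under different locales collapses to one URL, so a cache only needs to hold one copy.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical locale list; a real site would use its own supported locales.
KNOWN_LOCALES = {"en-US", "de", "fr", "ja", "pt-BR"}

# Assumed marker segment that identifies an image path.
IMAGE_SEGMENT = "images"


def strip_locale_from_image_url(url: str) -> str:
    """Drop a leading locale segment from image URLs; leave other URLs alone."""
    parts = urlsplit(url)
    segments = parts.path.split("/")
    # For an absolute path, segments[0] is "" and segments[1] is the first
    # path segment, which is where this sketch assumes the locale lives.
    if (
        len(segments) > 2
        and segments[1] in KNOWN_LOCALES
        and IMAGE_SEGMENT in segments[2:]
    ):
        del segments[1]
        return urlunsplit(parts._replace(path="/".join(segments)))
    return url


if __name__ == "__main__":
    english = "https://addons.example.org/en-US/firefox/images/preview/1.png"
    german = "https://addons.example.org/de/firefox/images/preview/1.png"
    # Both locale variants resolve to the same cacheable URL.
    print(strip_locale_from_image_url(english))
    print(strip_locale_from_image_url(german))
```

With locale-prefixed URLs, each locale produces a distinct cache entry for identical image bytes; after stripping, all locales share one entry. Pages that are actually localized keep their locale segment because the function only touches paths containing the image marker.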
Because life isn’t complete without colorful graphs, here is a graph showing database CPU (load reflected this too) going down dramatically:
But the bad news is that our app nodes were angry:
This is largely due to a dramatic increase in overall traffic that was moved onto the cluster from releases.mozilla.org:
This traffic graph is what leads us to believe that offloading file transfers back onto releases.mozilla.org will give Apache the breathing room it needs. We will post an update as soon as we can verify that.
Thanks again, everyone, for supporting us as we work through these issues. As for the negative comments, they serve as good motivation, too.