As morgamic wrote on Friday, the Remora team has had a long nine days trying to get the new version of AMO out the door and make all its improvements available to our users. As we discovered to our chagrin, our performance testing on the stage environment did not map well onto real-world results, and the outcome was unacceptable load on the app cluster: so unacceptable, in fact, that pretty much everything sharing that database was killed by it.
With the assistance of IT (and especially of oremj, who wrote and operated a load replay tool to help us make sure we were testing the right mix of requests), we’ve been able to gather more data on where we were too slow, and since Friday we’ve been able to make some rather significant improvements to our impact on the database. Specifically, if I may be permitted a nerdy digression: on Friday, a single application server was able to generate a load of 7.0 on a single database server, but in our tests today, that same application server was only able to generate a load of 0.04 (sic). We have more performance options still in our quiver if we need them, but because they all carry a non-trivial risk of regression, we are hoping that this most recent work gets us to an acceptable level and lets us hold off on additional optimizations until after the initial release.
Over the course of the week, and even the last few days, we’ve seen that the performance improvements we made did indeed cause us to have regressions, and while our test suite was able to catch some of them, it did not catch all of them. We want to get more user (including add-on developer) testing to help us prove out the system, and also to collect more of the great feedback that’s already been so helpful in improving the site. Also, we have learned all too well that running tests on our stage setup is not a surefire predictor of performance on the production cluster.
And if that weren’t enough, we now have a snapshot of the existing AMO database that’s more than a week old, and is missing many changes since that time. We don’t want to casually throw away the work of our community or reviewers, so we are faced with the need to re-migrate the database — and incorporate changes that people have made on preview.amo to update their add-ons for the new capabilities of Remora as well.
So what’s next?
Measure twice, cut once
First, we have a maintenance window with IT right now in which we will not be deploying Remora, but running full-scale load testing against the preview installation. This should give us a pretty reliable picture of how Remora will perform “in the wild”, and will give us additional data on where our remaining hot spots are. If these results show that we’re still substantially worse than the current AMO in terms of impact on the cluster, we’ll have to circle back and pick another appropriate performance arrow to shoot. We are optimistic, though, that our 17,400% improvement will put us in a pretty good position here.
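For the curious, the 17,400% figure can be reproduced from the two load numbers quoted earlier (7.0 on Friday, 0.04 today). Here is a minimal sketch in Python, assuming the percentage expresses how much higher the old load was relative to the new one; the function and variable names are purely illustrative.

```python
def percent_higher(before: float, after: float) -> float:
    """Return how much higher `before` is than `after`, as a percentage of `after`."""
    return (before - after) / after * 100


# Load figures quoted in this post: one app server against one database server.
before_load = 7.0   # Friday
after_load = 0.04   # today's tests

ratio = before_load / after_load                    # how many times lower the load is now
percent = percent_higher(before_load, after_load)   # old load relative to new load

print(f"Load is roughly {ratio:.0f}x lower; the old load was about {percent:,.0f}% higher.")
```

Put the other way around, the new load is a bit under 1% of the old one, which is the number that matters for everything else sharing the database.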
Retrace our steps
Once we’re confident that we’re in scoring range in terms of performance, we’re going to lock down the current AMO system such that new add-ons and updates can’t be submitted, and we’ll direct developers to the new site for submission of new add-ons and updates. This will introduce perhaps a few weeks of lag into the update process, but it turns out (alas) that the current system’s review-scaling problems are such that updates sometimes have to wait for at least that long today, so it’s not as shocking a situation as it might be if we were coming from a more efficient system. For high-priority updates (security updates, or updates to our most popular add-ons), we’ll be instructing people to file bugs, and we’ll handle the updates by hand on the current site. It’ll be a bit painful, but we don’t want to find ourselves in this re-migration situation again before we release, which would be even more painful.
Tell us what you really think
We’ll also be driving more users from the current site and other parts of the community to the new site, to get their feedback and to get developers to work on updating their categorization and summaries, for example. This should help us catch a lot of simple polish bugs, as well as improving our general confidence in the app significantly.
Make a list, check it twice
Based on that feedback from users and developers, and our own additional testing over the next week and a half (again, assuming that the performance groundhog sees his shadow tonight), we’ll construct a release-blocker list, and from that an end-game release schedule.
What’s this all about, anyway?
And then we’ll have Remora live as addons.mozilla.org. This will give us some powerful tools for bringing more of the Mozilla community into the add-ons world, including:
- deep support for localization, including translation of add-on descriptions, version notes, and user reviews
- a more inclusive and transparent review and nomination model for making sure that the add-ons that we put in front of 80M users are up to the challenge, and that users get what they expect with clear descriptions and meaningful ratings
- a discussion system for users and developers to communicate with each other, helping to provide a better channel for feedback and support
- a more robust code base for even more improvements in the future
Fool me once, shame on…don’t be fooled again
Or: “didn’t you tell a bunch of reporters that you were releasing on February 12th?”
It’s embarrassing and frustrating to miss deadlines (as I know from very extensive experience!), and painful to not be able to put your hard work out into the world where it can help people. The entire Remora team has been working almost literally around the clock for the last few weeks, and very hard for half a year before that, and we would like nothing in the world more than to deliver Remora to our users.
Well, one thing in the world that we’d like more: to be confident when we release it that it’s up to the daunting task of representing add-ons to many, many millions of Firefox users, and that it’ll be a release that we’re appropriately proud of. We’d rather be late than damage the world’s impression of Mozilla or add-ons, and we’d be doing a disservice to the exact users we’re working to help if we were to push it out before we were confident it was ready.
It’s been a tremendous learning experience building an application with such huge exposure to the world, and a rewarding one, though not one that we would necessarily want to repeat every quarter. We’re ready to correct our course and move confidently towards the finish line, or perhaps towards another metaphor of completion entirely.
Thanks to everyone for their patience and support, and especially:
- to the IT team for their support and creativity in helping us find a path to better performance,
- to our community of testers (I’m looking at you, Wladimir) for not only helping us identify problems but also suggesting solutions to them,
- to our localizers, for their patience with our early disorganization and frequent string changes.
For the (somewhat exhausted but still pretty upbeat) Remora team,