Remora update and plan

As morgamic wrote on Friday, the Remora team have all had a long 9 days trying to get the new version of AMO out the door and make all its improvements available to our users. As we discovered to our chagrin, our performance testing on the stage environment did not map well onto real-world results, and the result was unacceptable load on the app cluster: unacceptable to the extent that pretty much everything sharing that database was killed by it, in fact.

With the assistance of IT (and especially of oremj, who wrote and operated a load replay tool to help us make sure we were testing the right mix of requests), we’ve been able to gather more data on where we were too slow, and since Friday we’ve been able to make some rather significant improvements to our impact on the database. Specifically, if I may be permitted a nerdy digression: on Friday, a single application server was able to generate a load of 7.0 on a single database server, but in our tests today, that same application server was only able to generate a load of 0.04 (sic). We have more performance options still in our quiver if we need them, but because they are all non-trivial in terms of risk of regression, we are hoping that we’ll have reached an acceptable level with this most recent work and will be able to hold off on additional optimizations until after the initial release.

Over the course of the week, and even the last few days, we’ve seen that the performance improvements we made did indeed cause us to have regressions, and while our test suite was able to catch some of them, it did not catch all of them. We want to get more user (including add-on developer) testing to help us prove out the system, and also to collect more of the great feedback that’s already been so helpful in improving the site. Also, we have learned all too well that running tests on our stage setup is not a surefire predictor of performance on the production cluster.

And if that weren’t enough, we now have a snapshot of the existing AMO database that’s more than a week old, and is missing many changes since that time. We don’t want to casually throw away the work of our community or reviewers, so we are faced with the need to re-migrate the database — and incorporate changes that people have made on preview.amo to update their add-ons for the new capabilities of Remora as well.

So what’s next?

Measure twice, cut once

First, we have a maintenance window with IT right now in which we will not be deploying Remora, but running full-scale load testing against the preview installation. This should give us a pretty reliable picture of how Remora will perform “in the wild”, and will give us additional data on where our remaining hot spots are. If these results show that we’re still substantially worse than the current AMO in terms of impact on the cluster, we’ll have to circle back and pick another appropriate performance arrow to shoot. We are optimistic, though, that our 17,400% improvement will put us in a pretty good position here.
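As a quick sanity check on the 17,400% figure, here is a small Python sketch of the arithmetic. It assumes the figure means the drop in load measured relative to the new, lower load; the function name is made up for illustration, and the 7.0 and 0.04 values are the database load numbers quoted earlier in this post.

```python
def relative_improvement_percent(before: float, after: float) -> float:
    """Return how much larger `before` is than `after`, as a percentage of `after`."""
    if after <= 0:
        raise ValueError("after must be positive")
    return (before - after) / after * 100.0


# Load figures quoted in the post (single app server against a single DB server).
load_friday = 7.0
load_today = 0.04

# Rounded to whole numbers to hide floating-point noise.
print(round(relative_improvement_percent(load_friday, load_today)))  # 17400
print(round(load_friday / load_today))  # 175, i.e. roughly 175x less load
```

Put another way, the same application server now generates roughly 1/175th of the database load it did on Friday.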

Retrace our steps

Once we’re confident that we’re in scoring range in terms of performance, we’re going to lock down the current AMO system such that new add-ons and updates can’t be submitted, and we’ll direct developers to the new site for submission of new add-ons and updates. This will introduce perhaps a few weeks of lag into the update process, but it turns out (alas) that the current system’s review-scaling problems are such that updates sometimes have to wait for at least that long today, so it’s not as shocking a situation as it might be if we were coming from a more efficient system. For high-priority updates (security updates, or updates to our most popular add-ons), we’ll be instructing people to file bugs, and we’ll handle the updates by hand on the current site. It’ll be a bit painful, but we don’t want to find ourselves in this re-migration situation again before we release, which would be even more painful.

Tell us what you really think

We’ll also be driving more users from the current site and other parts of the community to the new site, to get their feedback and to get developers to work on updating their categorization and summaries, for example. This should help us catch a lot of simple polish bugs, as well as improving our general confidence in the app significantly.

Make a list, check it twice

Based on their feedback and our own additional testing over the next week and a half (again, assuming that the performance groundhog sees his shadow tonight), we’ll construct a release-blocker list, and from that an end-game release schedule.

What’s this all about, anyway?

And then we’ll have Remora live as addons.mozilla.org. This will give us some powerful tools for bringing more of the Mozilla community into the add-ons world, including:

  • deep support for localization, including translation of add-on descriptions, version notes, and user reviews
  • a more inclusive and transparent review and nomination model for making sure that the add-ons that we put in front of 80M users are up to the challenge, and that users get what they expect with clear descriptions and meaningful ratings
  • a discussion system for users and developers to communicate with each other, helping to provide a better channel for feedback and support
  • a more robust code base for even more improvements in the future

Fool me once, shame on…don’t be fooled again

Or: “didn’t you tell a bunch of reporters that you were releasing on February 12th?”

It’s embarrassing and frustrating to miss deadlines (as I know from very extensive experience!), and painful to not be able to put your hard work out into the world where it can help people. The entire Remora team has been working almost literally around the clock for the last few weeks, and very hard for half a year before that, and we would like nothing in the world more than to deliver it to the world.

Well, one thing in the world that we’d like more: to be confident when we release it that it’s up to the daunting task of representing add-ons to many, many millions of Firefox users, and that it’ll be a release that we’re appropriately proud of. We’d rather be late than damage the world’s impression of Mozilla or add-ons, and we’d be doing a disservice to the exact users we’re working to help if we were to push it out before we were confident it was ready.

It’s been a tremendous learning experience building an application with such a huge exposure to the world, and a rewarding one, though not one that we would necessarily want to do every quarter. We’re ready to correct our course and move confidently towards the finish line, or perhaps towards another metaphor of completion entirely.

Thanks to everyone for their patience and support, and especially:

  • to the IT team for their support and creativity in helping us find a path to better performance,
  • to our community of testers (I’m looking at you, Wladimir) for not only helping us identify problems but also suggesting solutions to them,
  • to our localizers, for their patience with our early disorganization and frequent string changes.

For the (somewhat exhausted but still pretty upbeat) Remora team,

    Mike Shaver

9 responses

  1. Jim Plush wrote:

    I think very few developers can truly appreciate the impact that millions of daily users can bring. Would you mind sharing what brought the load down from 7.0? bad query? hardware? numerous issues?

  2. Frédéric Wenzel wrote:

    Jim: Sure. We made performance improvements in a few places. Mainly it’s using caching for commonly queried data, and we also analyzed the queries that put the most stress on the database server and replaced them with semantically equivalent yet more MySQL-friendly counterparts. In addition to that, we analyzed very carefully how our caching hardware reacted to different parts of the site and made sure it behaved as expected.

  3. Joost wrote:

    Take your time, just make sure it’s tip top before releasing it 🙂

  4. Herodot wrote:

    Just take your time. Everything’s gonna be perfect.

  5. Bryan wrote:

    Keep up the good work guys!

    It kind of feels like trying to finish a marathon while dragging a horse the entire way, doesn’t it?

    Stay positive! Rest and food and a life away from the computer is sure to follow!

  6. john wrote:

    There’s no need to hurry. Take the time you need!!! Users have the (old) addons working and that’s enough!!! If you take the time needed, things will work out better.

    so… go slow – take all the time you need (take a few days of break if needed) … I’m sure users understand!!! 🙂

  7. Shaun wrote:

    Hurry up, already! :-p Lol, jking. 😉

  8. Benjamin Blanco wrote:

    Yeah, you can take your time. ^.^

    Besides, if people really want an addon they can usually go to the site of the person who made it and get it there. Although this is a bit out of the way, it’ll work until you’re done.

    I’ve no idea what you’re doing (because I just saw that you were moving to another site or something a day or two ago), but I’m sure it’s well worth the wait.

  9. Hugo Heden wrote:

    Come on, Mike, realize that it is no hurry — it’s a lot more important that you get things stable and all, than for it all to be rushed out the door!

    It can’t really be business critical for anyone that you meet this deadline!? Or what am I missing?

    Don’t burn yourself out! 🙂