This Thursday and Friday we attempted to push updates to re-partition our crash report database and optimize the reporting tool to take advantage of it. This was the deployment of bug 432450 and a fix for bug 444749, among others.
Our first attempt suffered a network timeout, which required an eleven-hour restore and a re-run. The re-run on Friday used a socket connection instead, but it would still have required an additional 1-3 days of downtime, well outside our originally announced window. Consequently, we rolled the database back to its contents as of 6:55PM PDT, January 29. Reports have since resumed processing.
We plan on doing the following:
- Set up a complete replica of production so we can test this process end-to-end. Our dry runs were done on a staging database roughly 1/5 the size of production. We anticipated the migration would scale as O(n), but on the production server performance was closer to O(n^2), so we did not anticipate the full extent of the timeouts or how much downtime would be needed. We are setting up a staging database from a recent dump (once we gather the hardware for it) so that future updates can be rehearsed at full scale.
- Push a "now+" partitioning script. The work in bug 432450 includes, alongside a complex migration script for old data, logic that creates new partitions automatically for incoming reports. Since we don't want to keep adding to the old database schema, we will push these updates so that new reports are properly partitioned from now on. Pros: within a week or two, queries over new data will be speedy and we won't be struggling with timeouts. Cons: we aren't migrating the last 4 weeks of data, so we will not see a performance increase when querying reports older than the date of the repartitioning.
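To make the scaling surprise above concrete, here is a small illustrative sketch (not the real migration code) of why an O(n^2) migration blows past an estimate calibrated on a staging database 1/5 the size of production. The row counts are made up; only the 1:5 size ratio comes from this post.

```python
def linear_cost(rows):
    """Expected cost model: one unit of work per row, O(n)."""
    return rows

def quadratic_cost(rows):
    """Observed cost model: work grows with the square of the row count, O(n^2)."""
    return rows * rows

staging_rows = 1_000_000            # hypothetical staging size
production_rows = 5 * staging_rows  # production is ~5x larger

# Under O(n), the production run should take about 5x the staging dry run.
assert linear_cost(production_rows) // linear_cost(staging_rows) == 5

# Under O(n^2), it takes about 25x instead, which is how a short dry run
# turned into a projected 1-3 days of downtime on production.
assert quadratic_cost(production_rows) // quadratic_cost(staging_rows) == 25
```

This is why the dry run on a full-size replica matters: the ratio between staging and production cost is itself a measurement of how the process scales.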
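The "now+" idea can be sketched as follows. This is a hypothetical illustration, not the actual bug 432450 code: each incoming report is routed to a weekly partition, and the partition is created on first use. The `reports_YYYYMMDD` naming and the weekly granularity are assumptions for the example.

```python
from datetime import date, timedelta

def partition_name(report_date: date) -> str:
    """Name the partition after the Monday of the report's week."""
    monday = report_date - timedelta(days=report_date.weekday())
    return f"reports_{monday:%Y%m%d}"

# Stand-in for the database's catalog of existing child tables.
existing_partitions = set()

def route_report(report_date: date) -> str:
    """Ensure the weekly partition exists, then return where the row goes."""
    name = partition_name(report_date)
    if name not in existing_partitions:
        # In a real database this is where the DDL would run, e.g. creating
        # a child table with a CHECK constraint on the date range in
        # PostgreSQL-style inheritance partitioning.
        existing_partitions.add(name)
    return name
```

Because new partitions are created on demand, the script only helps reports that arrive after it is deployed, which is exactly the pro/con trade-off described above.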
We would like to push the partitioning script (without the migration of old data) on Thursday. We will announce the exact time as soon as we know it.
Long term, we are already in the process of seeking additional resources to help examine our database configuration and systems architecture. We will have more updates on that process in the future.
Our team wants this work deployed as much as everyone else does. Thanks to everyone for their patience as we work through these issues.