Socorro Dumps Wave Good-bye to the Relational Database

Let’s say we’ve got some twenty-five million chunks of data ranging in size from one kilobyte to several megabytes. Let’s also say that we only rarely need to access this data, but when we do, we need it fast. Would your first choice be to save this data in a relational database?

That’s the situation that we’ve got in Socorro right now. Each time we catch a crash coming in from the field, we process it and save a “cooked” version of the dump in the database. We also save some details about the crash in other tables so that we can generate some aggregate statistics.
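
To make that split concrete, here’s roughly the shape of what one crash contributes to the database today. The field names and values below are purely illustrative, not our actual schema:

```python
# Purely illustrative -- these names and values are invented for the
# example and are not the real Socorro schema.
crash = {
    # Small details kept in ordinary tables and used for aggregate statistics.
    "uuid": "0bba929e-7f5f-4a41-a6b5-c4a2d0090401",
    "product": "Firefox",
    "version": "3.0.8",
    "signature": "nsSomeClass::SomeMethod(int)",
    # The "cooked" dump: one large blob per crash, anywhere from one
    # kilobyte to several megabytes, read back only when someone asks
    # for this specific crash in the UI.
    "cooked_dump": "...processed stack and module data...",
}
```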

It’s that cooked dump that’s causing some concern. The only time we ever access that data is when someone requests that specific crash through the Socorro UI. Considering that these cooked dumps take up nearly three quarters of our database’s storage, we get very little value for the cost: they inflate the hardware requirements for our database, make backups take too long, and complicate any future database replication plans we might consider.

We’re about to migrate our instance of Socorro to shiny new 64-bit hardware. Moving these great drifts of cooked dumps would take hours and could mean more than a day of downtime for production. We don’t want that.

It’s time for a great migration. All those dumps are going to leave the database. We’re spooling them out into a file system storage scheme and, at the same time, reformatting them into JSON. In the next version of Socorro, when a user requests a dump by UUID, Apache will serve it directly from the file system as a compressed JSON file. The client will decompress it and, through JavaScript magic, give the same display that we’ve got now.
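
Here’s a rough sketch of what that storage scheme could look like. This is just an illustration of the idea; the path layout, file naming, and function names below are assumptions for the example, not the final Socorro code:

```python
# A minimal sketch of the file system scheme, not the shipped code:
# the nesting depth, file naming and exact JSON contents are assumptions.
import gzip
import json
import os

def uuid_to_path(root, uuid):
    # Fan the files out into nested directories keyed on the leading
    # characters of the UUID so no single directory grows enormous.
    return os.path.join(root, uuid[:2], uuid[2:4], uuid + ".json.gz")

def save_cooked_dump(root, uuid, cooked_dump):
    """Write one cooked dump to the file system as compressed JSON."""
    path = uuid_to_path(root, uuid)
    directory = os.path.dirname(path)
    if not os.path.isdir(directory):
        os.makedirs(directory)
    with gzip.open(path, "wb") as gz_file:
        gz_file.write(json.dumps(cooked_dump).encode("utf-8"))
```

On the serving side, Apache only has to map the UUID in the request onto that same path and hand back the .json.gz file as a static file; the JavaScript in the client does the decompression and rendering.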

There are some future benefits to moving this data into a file system format. Think about all of this data sitting there in a Hadoop-friendly format, waiting for a future data mining project. We’ve nothing specific planned, but we’ve got the first step done.

We’re hoping to get the data migration done within the week. New versions of the processing programs will have to be deployed, as well as the changes to the Web application. Once that’s done, we can proceed to the deployment of our fancy new hardware.

4 responses

  1. jb wrote on :

    It would be interesting if you posted the format that you dump it in.

  2. Ludovic wrote on :

    I like the Hadoop idea – but maintaining a Hadoop cluster for that is a bit too much imo.

  3. Jeff Balogh wrote on :

    From the title, I thought you were going to be dropping some couchdb hotness on us! It’s awesome that you’re able to push the dumps up the stack and take care of it with Apache. How are you making sure that everything goes smoothly with the migration? What kind of testing are you doing?

    Good luck!

  4. Fred wrote on :

    That sounds like a very good idea. Considering the dumps are rarely needed and, in particular, are not part of any queries but rather BLOB information, they are just keeping the RDBMS from doing what it is supposed to do.