Let’s say we’ve got some twenty-five million chunks of data ranging in size from one K to several meg. Let’s also say that we only rarely need to access this data, but when we do, we need it fast. Would your first choice be to save this data in a relational database?
That’s the situation that we’ve got in Socorro right now. Each time we catch a crash coming in from the field, we process it and save a “cooked” version of the dump in the database. We also save some details about the crash in other tables so that we can generate some aggregate statistics.
It’s that cooked dump that’s causing some concern. The only time we ever access that data is when someone requests that specific crash through the Socorro UI. Considering that these cooked crashes account for nearly three quarters of our database’s storage needs, that’s a lot of overhead for very little return. They inflate the hardware requirements for our database, make backups take too long and complicate any future database replication plans that we might consider.
We’re about to migrate our instance of Socorro to shiny new 64-bit hardware. Moving these great drifts of cooked dumps would take hours and potentially necessitate more than a day of downtime for production. We don’t want that.
It’s time for a great migration. All those dumps are going to leave the database. We’re spooling them out into a file system storage scheme. At the same time, we’re reformatting them into JSON. In the next version of Socorro, when a user requests their dump by UUID, it will be served by Apache directly from the file system as a compressed JSON file. The client will decompress it and, through JavaScript magic, give the same display that we’ve got now.
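Here’s a minimal sketch of what that file system scheme might look like, assuming a layout where the first few characters of the UUID fan the files out across nested directories so no single directory gets overloaded. The function names, the two-level directory split, the DUMP_ROOT path and the .jsonz suffix are illustrative assumptions, not necessarily what Socorro will actually use:

    import gzip
    import json
    import os

    # Hypothetical root for the migrated dumps -- not Socorro's real path.
    DUMP_ROOT = "/var/socorro/cooked"

    def uuid_to_path(root, uuid):
        """Spread files across nested directories keyed on the first
        characters of the UUID so no single directory grows too large."""
        return os.path.join(root, uuid[0:2], uuid[2:4], uuid + ".jsonz")

    def store_cooked_dump(root, uuid, cooked_dump):
        """Write one cooked dump to disk as a gzip-compressed JSON file."""
        path = uuid_to_path(root, uuid)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with gzip.open(path, "wt", encoding="utf-8") as f:
            json.dump(cooked_dump, f)
        return path

With a layout like that, serving a request is just a matter of mapping the UUID in the URL back to a path and streaming the file, with no database round trip at all.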
There are some future benefits to moving this data into a file system format. Think about all of this data sitting there in a Hadoop-friendly format waiting for a future data mining project. We’ve nothing specific planned, but we’ve got the first step done.
We’re hoping to get the data migration done within the week. New versions of the processing programs will have to be deployed, as well as the changes to the Web application. Once that’s done, we can proceed to the deployment of our fancy new hardware.