Let’s say we’ve got some twenty-five million chunks of data ranging in size from one kilobyte to several megabytes. Let’s also say that we only rarely need to access this data, but when we do, we need it fast. Would your first choice be to save this data in a relational database?
That’s the situation that we’ve got in Socorro right now. Each time we catch a crash coming in from the field, we process it and save a “cooked” version of the dump in the database. We also save some details about the crash in other tables so that we can generate some aggregate statistics.
It’s that cooked dump that’s causing some concern. The only time we ever access that data is when someone requests that specific crash through the Socorro UI. Considering that these cooked crashes take up nearly three quarters of our database’s storage, that’s a lot of expense for very little value. They inflate the hardware requirements for the database, make backups take too long, and complicate any future database replication plans we might consider.
We’re about to migrate our instance of Socorro to shiny new 64-bit hardware. Moving these great drifts of cooked dumps would take hours and could mean more than a day of downtime for production. We don’t want that.
There are some future benefits to moving this data into a file system format, too. Think about all of this data sitting there in a Hadoop-friendly format, waiting for a future data mining project. We’ve nothing specific planned, but we’d have the first step done.
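For the curious, here’s a rough sketch of the sort of file system layout we have in mind. The function names and the directory sharding scheme are purely illustrative, not the actual Socorro code; the point is just to show how a cooked dump could be saved and fetched by its crash ID without the database ever touching the blob.

```python
import os

def dump_path(storage_root, crash_id):
    # Hypothetical sharding scheme: use the first characters of the crash ID
    # as intermediate directories so no single directory grows unmanageably large.
    return os.path.join(storage_root, crash_id[0:2], crash_id[2:4],
                        crash_id + ".dump")

def save_cooked_dump(storage_root, crash_id, cooked_bytes):
    # Write the cooked dump to the file system instead of a database table.
    path = dump_path(storage_root, crash_id)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(cooked_bytes)

def load_cooked_dump(storage_root, crash_id):
    # Fetch a cooked dump when someone asks for that specific crash in the UI.
    with open(dump_path(storage_root, crash_id), "rb") as f:
        return f.read()
```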
We’re hoping to get the data migration done within the week. New versions of the processing programs will have to be deployed, along with the changes to the Web application. Once that’s done, we can move on to deploying our fancy new hardware.