Three Weeks with the New Socorro File System

Three weeks ago today, we deployed the new Socorro file system into production. It was the first in a series of engineered improvements to the Socorro codebase. By “engineered”, I mean that it was the first major improvement to the code that wasn’t done during an emergency with a gun to our heads. For the previous half year, we’d been reactive instead of proactive.

The new file system has performed quite well. The most outward expression of this improvement is the speed at which priority jobs are processed.

A priority job is any submitted crash for which someone has requested a report. There can be a backlog of submitted crashes, and it might take from several minutes to several hours for the processing programs to get around to a particular job. If someone requests a particular crash, we’ve got a way for that job to jump the queue for immediate processing. Prior to the new file system, the biggest hurdle to processing a job quickly was simply finding it. There was no index to assist in finding a job quickly.

The new file system changed that. All entries are indexed as they’re inserted. To see how it’s done, see my previous blog posting. This gives us very fast access to any crash dump, which translates into response times of thirty to ninety seconds for priority job requests. Try it. Considering the volume of crashes we get, it’s amazing that we can zero in and process a crash so quickly.

The last three weeks haven’t all been champagne and fireworks. We had a scare about forty-eight hours after deployment. The automatic indexing scheme uses a radix algorithm to spread crash dumps evenly through a branching file system structure. During design, we chose to make this structure four levels deep. Each level did a 256-way split of the directory tree. That translates into 256^4 possible directories, or about 4.3 billion. Once a directory was created, we never retired it, thinking that it would be faster to reuse old directories than to bother destroying and creating them all the time. At the rate that we received new files, we calculated that it would take years to clog up the file system. We banked on the assumption that we had at least 4.3 billion inodes available in the file system.

It was a bad assumption. It turns out that we’re using some sort of black-box storage system with variable-sized inodes. We didn’t have 4.3G inodes available; we had only 64M. Back into reactive coding as performance art, we took twenty-four hours to brainstorm, code, and deploy a solution. Changing the number of levels from four to three was an obvious way to reduce our footprint: 256^3 is only 16M. The number of levels of our radix directory structure is now a configuration option. The trick was making four days of data stored with four levels compatible with new data being collected with fewer levels. I managed that by encoding the number of levels into the uuid of each crash.

Next time you see a crash uuid, take a look at the digits. The seventh digit from the right end will tell you how deep your crash is stored in the file system. If it’s ‘0’, then you’re stored four levels deep. Any other digit is to be taken literally: ‘2’ – two levels, ‘3’ – three levels. This crazy scheme lets the depth be switchable at run time. If directories are getting too crowded, we can raise the depth. If we start running out of inodes, we can lower the depth.
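Here’s a rough sketch, in hypothetical Python rather than the real Socorro code, of how that digit might be written and read back:

    DEPTH_DIGIT_POSITION = 7   # seventh digit from the right end of the uuid

    def embed_depth(uuid, depth):
        # Illustrative: stamp the storage depth into the uuid at save time.
        # '0' is the special case meaning the original four-level depth.
        digit = '0' if depth == 4 else str(depth)
        return uuid[:-DEPTH_DIGIT_POSITION] + digit + uuid[-DEPTH_DIGIT_POSITION + 1:]

    def depth_from_uuid(uuid):
        # Illustrative: recover the depth; any digit other than '0'
        # is taken literally.
        digit = uuid[-DEPTH_DIGIT_POSITION]
        return 4 if digit == '0' else int(digit)

Old dumps keep their ‘0’ and resolve to the original four-level paths, while anything written after the change carries its own depth, so both generations coexist in the same tree.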

Great thanks to Frank Griswold for the coding and to Aravind for not throwing knives at me.