Socorro’s File System Storage

lars


As the scope and depth of the Socorro/Breakpad project have evolved over the last nine months, the most stable requirement has been a file system as the initial server-side storage for submitted crash dumps. The file system gets used as an ad hoc hierarchical database, but it isn’t optimized for the kinds of lookups we need to do. Unfortunately, the original implementation had no index by name, leaving us to search for files (something like running find over NFS across 9 million entries). We patched, and then patched the patches, as the magnitude of our scaling problem erupted in front of us. Now that the emergencies are over and server-side throttling is in place, we can revisit the design and engineer a proper interface to the file system.

Indexing by a single key is simple in a file system: the path is the key to a file. Indexing by a second key requires links, and we chose symbolic links as our pointers to indicate relationships between files. Symbolic links have an interesting property: if the path they point to isn’t too long, it is stored right in the inode, so reading the target path from a symbolic link is fast. We created two radix hierarchies, one for name and one for date. We store our UUID-named data files on the name branch. Then we have symbolic links span the gap between the branches to indicate the time interval in which each file was submitted.
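As a rough sketch, the two-branch layout might look something like this in Python. The directory names, the two-character radix, and the five-minute date buckets are my assumptions for illustration, not the actual JsonDumpStorage code:

```python
import os
import tempfile
from datetime import datetime, timezone

# Stand-in for the real file system root (hypothetical layout).
ROOT = tempfile.mkdtemp()

def name_path(uuid):
    # Radix on the leading characters of the UUID keeps directories small.
    return os.path.join(ROOT, "name", uuid[0:2], uuid[2:4])

def date_path(when):
    # Date branch: year/month/day/hour/five-minute bucket.
    bucket = "%02d" % (when.minute - when.minute % 5)
    return os.path.join(ROOT, "date", when.strftime("%Y/%m/%d/%H"), bucket)

def store(uuid, payload, when=None):
    when = when or datetime.now(timezone.utc)
    npath = name_path(uuid)
    os.makedirs(npath, exist_ok=True)
    with open(os.path.join(npath, uuid + ".dump"), "wb") as f:
        f.write(payload)
    dpath = date_path(when)
    os.makedirs(dpath, exist_ok=True)
    # The symbolic link spans the gap: date branch -> name branch.
    os.symlink(npath, os.path.join(dpath, uuid))
```

Because the symlink target is a short relative-to-root path, the “target stored in the inode” fast path applies on most file systems.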

[Diagram: The New Socorro File System Layout]

Our applications always use these two hierarchies in the same order. On first reading the file system, we’re looking only for dumps that we haven’t seen before. We walk the hierarchy on the date side, and every symbolic link we encounter represents a new crash dump: we pass the UUID on to the next application for processing and then remove the symbolic link. That guarantees that any link we find on the date side is always new. From then on, we will only need to look up that UUID by name, and the remaining hierarchy on the name branch is optimized for exactly that.
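That first-pass traversal could be sketched like this, assuming the hypothetical layout above; the helper name is mine, not Socorro’s actual API:

```python
import os

def new_uuids(root):
    """Walk the date branch; every symbolic link found there is a crash
    dump we have not seen before.  Yield its UUID (the link's name),
    then remove the link so the date branch only ever holds unprocessed
    dumps.  Hypothetical sketch, not the real JsonDumpStorage code."""
    date_root = os.path.join(root, "date")
    for dirpath, dirnames, filenames in os.walk(date_root):
        # Symlinks to directories show up in dirnames; os.walk does not
        # follow them by default, so removing them mid-walk is safe.
        for entry in dirnames + filenames:
            full = os.path.join(dirpath, entry)
            if os.path.islink(full):
                yield entry
                os.unlink(full)
```

A second call over the same tree yields nothing, which is exactly the “anything on the date side is new” invariant.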

We also use this hierarchy for long-term storage of crash dumps, under a different file system root. Using the name branch, we can look up files by UUID very quickly even when the number of stored files is huge. When it’s time to retire old data that is no longer needed, we can walk the date branch to find and delete only the items older than a threshold.
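Retirement by date can be sketched the same way; here the lexicographic ordering of the YYYY/MM/DD radix does the work. Again, this is a hypothetical helper under the layout assumed above, not the real API:

```python
import os
import shutil

def prune_before(root, cutoff):
    """Delete long-term-storage entries submitted before `cutoff` (a
    "YYYY/MM/DD" string).  Day directories on the date branch sort
    lexicographically, so plain string comparison is enough.
    Hypothetical sketch, not Socorro's actual code."""
    date_root = os.path.join(root, "date")
    for year in sorted(os.listdir(date_root)):
        for month in sorted(os.listdir(os.path.join(date_root, year))):
            for day in sorted(os.listdir(os.path.join(date_root, year, month))):
                if "/".join((year, month, day)) >= cutoff:
                    continue
                day_dir = os.path.join(date_root, year, month, day)
                # Follow each symlink across to the name branch and
                # remove the stored dump it points to.
                for dirpath, dirnames, filenames in os.walk(day_dir):
                    for entry in dirnames + filenames:
                        full = os.path.join(dirpath, entry)
                        if os.path.islink(full):
                            shutil.rmtree(os.path.realpath(full),
                                          ignore_errors=True)
                # Then drop the whole day's subtree on the date branch.
                shutil.rmtree(day_dir)
```

Whole subtrees are compared against the cutoff, so nothing newer than the threshold is ever touched.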

We expect this new system to speed up priority processing significantly. The uniform API embodied by our new JsonDumpStorage class will increase reliability and consistency across the applications that use the file system.

For more information, see: SocorroFileSystem
For progress information on deployment: Bug 458798

4 responses

  1. Fred Wenzel wrote on:

    I said it before, but anyway: very nicely engineered. I share your opinion: this should become pretty darn fast. It should also be possible to divide this tree up across mount points if necessary (by the way, another advantage of using symlinks: with hard links you’d be in trouble there).

  2. jmdesp wrote on:

    Hmm, in that situation I myself would reconsider whether there really is no way to get rid of the double indexing :-) However smart you are, it makes things much more complex.

    I think if the initial part of the UUID were time-based, you could have only one index. Random UUIDs generate an automatically balanced tree, but you have enough crashes, from all around the world, that any five-minute period is pretty well balanced too.

  3. K Lars Lohn wrote on:

    Of course we considered several mechanisms, but settled on this one because it is a compromise in complexity. This brief blog post doesn’t tell the whole story: we have two use cases for this structure with slightly different needs, and this implementation covers them both efficiently. I really wanted to avoid having two different code bases for the two use cases.

    The complicating factor in having a single index is the first use case, outlined just after the diagram. We’re using the date key as what the relational database world calls a partial index. Walking that index shows us only new entries that we have not previously seen. Had we implemented this with a single index, we would be forced to have some way to mark an entry as new. Scanning for new entries would then degrade into a search requiring us to walk the whole index, considering each entry. We’ve already seen that this doesn’t scale. Granted, there are some tricks we could do to make such a search more efficient, but they would take the code base further away from our needs in the second use case.

    After much consideration, I found this algorithm to be an acceptable compromise. I, of course, reserve the right to totally change my mind after we’ve seen this code perform in production…

  4. Online Data Storage wrote on:

    Yeah that double-indexing could come back to haunt users :/
    -jack