06
Dec 13

reading binary structures with python

Last week, I wanted to parse some Mach-O files with Python.  “Oh sure,” you think, “just use the struct module and this will be a breeze.”  I have, however, tried to do that:

class MyBinaryBlob:
    def __init__(self, buf, offset):
        self.f1, self.f2 = struct.unpack_from("BB", buf, offset)

and such an approach involves a great deal of copy-and-pasted code.  And if you have some variable-length fields mixed in with fixed-length fields, using struct breaks down very quickly. And if you have to write out the fields to a file, things get even more messy. For this experiment, I wanted to do things in a more declarative style.

The desire was that I could say something like:

class MyBinaryBlob:
    field_names = ["f1", "f2"]
    field_kinds = ["uint8_t", "uint8_t"]

and all the necessary code to parse the appropriate fields out of a binary buffer would spring into existence. (Automagically having the code to write these objects to a buffer would be great, too.) And if a binary object contained something that would be naturally interpreted as a Python list, then I could write a minimal amount of code to do that during initialization of the object as well. I also wanted inheritance to work correctly, so that if I wrote:

class ExtendedBlob(MyBinaryBlob):
    field_names = ["f3", "f4"]
    field_kinds = ["int32_t", "int32_t"]

ExtendedBlob should wind up with four fields once it is initialized.

At first, I wrote things like:

def field_reader(fmt):
    size = struct.calcsize(fmt)
    def reader_sub(buf, offset):
        return struct.unpack_from(fmt, buf, offset)[0], size
    return reader_sub

fi = field_reader("i")
fI = field_reader("I")
fB = field_reader("B")

def initialize_slots(obj, buf, offset, slot_names, field_specs):
    total = 0
    for slot, reader in zip(slot_names, field_specs):
        x, size = reader(buf, offset + total)
        setattr(obj, slot, x)
        total += size

class MyBinaryBlob:
    field_names = ["f1", "f2"]
    field_specs = [fB, fB]

    def __init__(self, buf, offset):
        initialize_slots(self, buf, offset, self.field_names, self.field_specs)

Fields return their size to make it straightforward to add variable-sized fields, not just fixed-width fields that can be parsed by struct.unpack_from. This worked out OK, but I was writing out a lot of copy-and-paste constructors, which was undesirable. Inheritance was also a little weird, since the natural implementation looked like:

class ExtendedBlob(MyBinaryBlob):
    field_names = ["f3", "f4"]
    field_specs = [fi, fi]

    def __init__(self, buf, offset):
        super(ExtendedBlob, self).__init__(buf, offset)
        initialize_slots(self, buf, offset, self.field_names, self.field_specs)

but that second initialize_slots call needs to start reading at the offset resulting from reading MyBinaryBlob‘s fields. I fixed this by storing a _total_size member in the objects and modifying initialize_slots:

def initialize_slots(obj, buf, offset, slot_names, field_specs):
    total = obj._total_size
    for slot, reader in zip(slot_names, field_specs):
        x, size = reader(buf, offset + total)
        setattr(obj, slot, x)
        total += size
    obj._total_size = total

which worked out well enough.

I realized that if I wanted to use this framework for writing binary blobs, I’d need to construct “bare” objects without an existing buffer to read them from. To do this, there had to be some static method on the class for parsing things out of a buffer. @staticmethod couldn’t be used in this case, because the code inside the method didn’t know what class it was being invoked on. But @classmethod, which received the invoking class as its first argument, seemed to fit the bill.

After some more experimentation, I wound up with a base class, BinaryObject:

class BinaryObject(object):
    field_names = []
    field_specs = []

    def __init__(self):
        self._total_size = 0

    def initialize_slots(self, buf, offset, slot_names, field_specs):
        total = self._total_size
        for slot, reader in zip(slot_names, field_specs):
            x, size = reader(buf, offset + total)
            setattr(self, slot, x)
            total += size
        self._total_size = total

    @classmethod
    def from_buf(cls, buf, offset):
        # Determine our inheritance path back to BinaryObject
        inheritance_chain = []
        pos = cls
        while pos != BinaryObject:
            inheritance_chain.append(pos)
            bases = pos.__bases__
            assert len(bases) == 1
            pos = bases[0]
        inheritance_chain.reverse()

        # Determine all the field names and specs that we need to read.
        all_field_names = itertools.chain(*[c.field_names
                                            for c in inheritance_chain])
        all_field_specs = itertools.chain(*[c.field_specs
                                            for c in inheritance_chain])

        # Create the actual object and populate its fields.
        obj = cls()
        obj.initialize_slots(buf, offset, all_field_names, all_field_specs)
        return obj

Inspecting the inheritance hierarchy at runtime makes for some very compact code. (The single-inheritance assertion could probably be relaxed to an assertion that all superclasses except the first do not have field_names or field_specs class members; such a relaxation would make behavior-modifying mixins work well with this scheme.) Now my classes all looked like:

class MyBinaryBlob(BinaryObject):
    field_names = ["f1", "f2"]
    field_specs = [fB, fB]

class ExtendedBlob(MyBinaryBlob):
    field_names = ["f3", "f4"]
    field_specs = [fi, fi]

blob1 = MyBinaryBlob.from_buf(buf, offset)
blob2 = ExtendedBlob.from_buf(buf, offset)

with a pleasing lack of code duplication.  Any code for writing can be written once in the BinaryObject class using a similar inspection of the inheritance chain.

But how does parsing additional things during construction work? Well, subclasses can define their own from_buf methods:

class ExtendedBlobWithList(BinaryObject):
    field_names = ["n_objs"]
    field_specs = [fI]

    @classmethod
    def from_buf(cls, buf, offset):
        obj = BinaryObject.from_buf.__func__(cls, buf, offset)
        # do extra initialization here
        for i in range(obj.n_objs):
            ...
        return obj

The trick here is that calling obj = BinaryObject.from_buf(buf, offset) wouldn’t do the right thing: that would only parse any members that BinaryObject had, and return an object of type BinaryObject instead of one of type ExtendedBlobWithList. Instead, we call BinaryObject.from_buf.__func__, which is the original, undecorated function, and pass the cls with which we were invoked, which is ExtendedBlobWithList, to do basic parsing of the fields. After that’s done, we can do our own specialized parsing, probably with SomeOtherBlob.from_buf or similar. (The _total_size member also comes in handy here, since you know exactly where to start parsing additional members.) You can even define from_buf methods that parse a bit, determine what class they should really be constructing, and construct an object of that type instead:

R_SCATTERED = 0x80000000

class Relocation(BinaryObject):
    field_names = ["_bits1", "_bits2"]
    field_specs = [fI, fI];
    __slots__ = field_names

    @classmethod
    def from_buf(cls, buf, offset):
        obj = BinaryObject.from_buf.__func__(Relocation, buf, offset)

        # OK, now for the decoding of what we just got back.
        if obj._bits1 & R_SCATTERED:
            return ScatteredRelocationInfo.from_buf(buf, offset)
        else:
            return RelocationInfo.from_buf(buf, offset)

This hides any detail about file formats in the parsing code, where it belongs.

Overall, I’m pretty happy with this scheme; it’s a lot more pleasant than bare struct.unpack_from calls scattered about.