Last week, I wanted to parse some Mach-O files with Python. “Oh sure,” you think, “just use the struct module and this will be a breeze.” I have, however, tried exactly that before:
```python
class MyBinaryBlob:
    def __init__(self, buf, offset):
        self.f1, self.f2 = struct.unpack_from("BB", buf, offset)
```
and such an approach involves a great deal of copy-and-pasted code. If you have variable-length fields mixed in with fixed-length fields, struct breaks down very quickly, and if you also have to write the fields back out to a file, things get even messier. For this experiment, I wanted to do things in a more declarative style.
I wanted to be able to say something like:
```python
class MyBinaryBlob:
    field_names = ["f1", "f2"]
    field_kinds = ["uint8_t", "uint8_t"]
```
and all the necessary code to parse the appropriate fields out of a binary buffer would spring into existence. (Automagically having the code to write these objects to a buffer would be great, too.) And if a binary object contained something that would be naturally interpreted as a Python list, then I could write a minimal amount of code to do that during initialization of the object as well. I also wanted inheritance to work correctly, so that if I wrote:
```python
class ExtendedBlob(MyBinaryBlob):
    field_names = ["f3", "f4"]
    field_kinds = ["int32_t", "int32_t"]
```
ExtendedBlob should wind up with four fields once it is initialized.
At first, I wrote things like:
```python
def field_reader(fmt):
    size = struct.calcsize(fmt)
    def reader_sub(buf, offset):
        return struct.unpack_from(fmt, buf, offset)[0], size
    return reader_sub

fi = field_reader("i")
fI = field_reader("I")
fB = field_reader("B")

def initialize_slots(obj, buf, offset, slot_names, field_specs):
    total = 0
    for slot, reader in zip(slot_names, field_specs):
        x, size = reader(buf, offset + total)
        setattr(obj, slot, x)
        total += size

class MyBinaryBlob:
    field_names = ["f1", "f2"]
    field_specs = [fB, fB]

    def __init__(self, buf, offset):
        initialize_slots(self, buf, offset, self.field_names, self.field_specs)
```
Fields return their size to make it straightforward to add variable-sized fields, not just fixed-width fields that can be parsed by struct.unpack_from. This worked out OK, but I was writing a lot of copy-and-pasted constructors, which was undesirable. Inheritance was also a little weird, since the natural implementation looked like:
```python
class ExtendedBlob(MyBinaryBlob):
    field_names = ["f3", "f4"]
    field_specs = [fi, fi]

    def __init__(self, buf, offset):
        super(ExtendedBlob, self).__init__(buf, offset)
        initialize_slots(self, buf, offset, self.field_names, self.field_specs)
```
but that second initialize_slots call needs to start reading at the offset resulting from reading MyBinaryBlob’s fields. I fixed this by storing a _total_size member in the objects and modifying initialize_slots:
```python
def initialize_slots(obj, buf, offset, slot_names, field_specs):
    total = obj._total_size
    for slot, reader in zip(slot_names, field_specs):
        x, size = reader(buf, offset + total)
        setattr(obj, slot, x)
        total += size
    obj._total_size = total
```
which worked out well enough.
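Variable-sized fields then slot in without touching initialize_slots at all, because every reader returns a (value, size) pair. As an illustration (the length-prefixed-string reader and its f_pstr name are my own, not from the original code):

```python
import struct

def f_pstr(buf, offset):
    # Hypothetical variable-length reader: a uint8 length prefix
    # followed by that many raw bytes. Returns (value, bytes consumed),
    # the same convention as the fixed-width readers above.
    (n,) = struct.unpack_from("B", buf, offset)
    value = bytes(buf[offset + 1:offset + 1 + n])
    return value, 1 + n

buf = b"\x05hello\x03abc"
v1, size1 = f_pstr(buf, 0)      # b"hello", consuming 6 bytes
v2, size2 = f_pstr(buf, size1)  # b"abc", consuming 4 bytes
```

Since the size consumed is data-dependent, such a reader can sit anywhere in a field_specs list next to the fixed-width ones.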
I realized that if I wanted to use this framework for writing binary blobs, I’d need to construct “bare” objects without an existing buffer to read them from. To do this, there had to be some static method on the class for parsing things out of a buffer. @staticmethod couldn’t be used in this case, because the code inside the method wouldn’t know what class it was being invoked on. But @classmethod, which receives the invoking class as its first argument, fit the bill.
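A tiny demonstration of that property (my example, not from the post): the class passed to a @classmethod is whichever class it was invoked on, which is exactly what a polymorphic constructor needs.

```python
class Base:
    @classmethod
    def make(cls):
        # cls is the class this was invoked on, not necessarily Base.
        return cls()

class Derived(Base):
    pass

obj = Derived.make()
print(type(obj).__name__)  # prints "Derived"
```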
After some more experimentation, I wound up with a base class, BinaryObject:
```python
class BinaryObject(object):
    field_names = []
    field_specs = []

    def __init__(self):
        self._total_size = 0

    def initialize_slots(self, buf, offset, slot_names, field_specs):
        total = self._total_size
        for slot, reader in zip(slot_names, field_specs):
            x, size = reader(buf, offset + total)
            setattr(self, slot, x)
            total += size
        self._total_size = total

    @classmethod
    def from_buf(cls, buf, offset):
        # Determine our inheritance path back to BinaryObject.
        inheritance_chain = []
        pos = cls
        while pos != BinaryObject:
            inheritance_chain.append(pos)
            bases = pos.__bases__
            assert len(bases) == 1
            pos = bases[0]
        inheritance_chain.reverse()

        # Determine all the field names and specs that we need to read.
        all_field_names = itertools.chain(*[c.field_names for c in inheritance_chain])
        all_field_specs = itertools.chain(*[c.field_specs for c in inheritance_chain])

        # Create the actual object and populate its fields.
        obj = cls()
        obj.initialize_slots(buf, offset, all_field_names, all_field_specs)
        return obj
```
Inspecting the inheritance hierarchy at runtime makes for some very compact code. (The single-inheritance assertion could probably be relaxed to an assertion that all superclasses except the first do not have field_names or field_specs class members; such a relaxation would make behavior-modifying mixins work well with this scheme.) Now my classes all looked like:
```python
class MyBinaryBlob(BinaryObject):
    field_names = ["f1", "f2"]
    field_specs = [fB, fB]

class ExtendedBlob(MyBinaryBlob):
    field_names = ["f3", "f4"]
    field_specs = [fi, fi]

blob1 = MyBinaryBlob.from_buf(buf, offset)
blob2 = ExtendedBlob.from_buf(buf, offset)
```
with a pleasing lack of code duplication. Any code for writing can be written once in the BinaryObject class using a similar inspection of the inheritance chain.
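To make that writing side concrete, here is one possible shape (my sketch, not the post’s code): store a (reader, writer) pair per field, derived from a single struct format, and have BinaryObject walk the same inheritance chain to pack fields back out. The to_buf name and the field_codec helper are my own inventions.

```python
import struct

def field_codec(fmt):
    # A (reader, writer) pair derived from one struct format string.
    size = struct.calcsize(fmt)
    def read(buf, offset):
        return struct.unpack_from(fmt, buf, offset)[0], size
    def write(value):
        return struct.pack(fmt, value)
    return read, write

cB = field_codec("B")
ci = field_codec("<i")  # explicit little-endian int32

class BinaryObject(object):
    field_names = []
    field_specs = []

    @classmethod
    def _inheritance_chain(cls):
        # Walk from cls back up to BinaryObject, oldest ancestor first.
        chain = []
        pos = cls
        while pos != BinaryObject:
            chain.append(pos)
            pos = pos.__bases__[0]
        chain.reverse()
        return chain

    @classmethod
    def from_buf(cls, buf, offset):
        obj = cls()
        total = 0
        for c in cls._inheritance_chain():
            for name, (read, _) in zip(c.field_names, c.field_specs):
                x, size = read(buf, offset + total)
                setattr(obj, name, x)
                total += size
        obj._total_size = total
        return obj

    def to_buf(self):
        # Pack every inherited field, oldest ancestor first.
        out = bytearray()
        for c in self._inheritance_chain():
            for name, (_, write) in zip(c.field_names, c.field_specs):
                out += write(getattr(self, name))
        return bytes(out)

class MyBinaryBlob(BinaryObject):
    field_names = ["f1", "f2"]
    field_specs = [cB, cB]

class ExtendedBlob(MyBinaryBlob):
    field_names = ["f3", "f4"]
    field_specs = [ci, ci]

buf = bytes([1, 2]) + struct.pack("<ii", 3, 4)
blob = ExtendedBlob.from_buf(buf, 0)
round_tripped = blob.to_buf()  # identical to buf
```

Because reading and writing both derive from the same chain walk, a class definition stays a pure declaration of its fields.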
But how does parsing additional things during construction work? Well, subclasses can define their own from_buf methods:
```python
class ExtendedBlobWithList(BinaryObject):
    field_names = ["n_objs"]
    field_specs = [fI]

    @classmethod
    def from_buf(cls, buf, offset):
        obj = BinaryObject.from_buf.__func__(cls, buf, offset)
        # do extra initialization here
        for i in range(obj.n_objs):
            ...
        return obj
```
The trick here is that calling obj = BinaryObject.from_buf(buf, offset) wouldn’t do the right thing: that would only parse any members that BinaryObject had, and return an object of type BinaryObject instead of one of type ExtendedBlobWithList. Instead, we call BinaryObject.from_buf.__func__, which is the original, undecorated function, and pass the cls with which we were invoked, which is ExtendedBlobWithList, to do basic parsing of the fields. After that’s done, we can do our own specialized parsing, probably with SomeOtherBlob.from_buf or similar. (The _total_size member also comes in handy here, since you know exactly where to start parsing additional members.) You can even define from_buf methods that parse a bit, determine what class they should really be constructing, and construct an object of that type instead:
```python
R_SCATTERED = 0x80000000

class Relocation(BinaryObject):
    field_names = ["_bits1", "_bits2"]
    field_specs = [fI, fI]
    __slots__ = field_names

    @classmethod
    def from_buf(cls, buf, offset):
        obj = BinaryObject.from_buf.__func__(Relocation, buf, offset)

        # OK, now for the decoding of what we just got back.
        if obj._bits1 & R_SCATTERED:
            return ScatteredRelocationInfo.from_buf(buf, offset)
        else:
            return RelocationInfo.from_buf(buf, offset)
```
This hides any detail about file formats in the parsing code, where it belongs.
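To show the _total_size point from above with runnable code, here is a self-contained sketch (all class names and the explicit little-endian formats are mine, standing in for the framework): a header carrying a count, followed by that many trailing items.

```python
import struct

class Header(object):
    # Minimal stand-in for a BinaryObject subclass with one uint32 field.
    @classmethod
    def from_buf(cls, buf, offset):
        obj = cls()
        (obj.n_objs,) = struct.unpack_from("<I", buf, offset)
        obj._total_size = 4
        return obj

class Item(object):
    # Hypothetical trailing record: a single uint16.
    @classmethod
    def from_buf(cls, buf, offset):
        obj = cls()
        (obj.value,) = struct.unpack_from("<H", buf, offset)
        obj._total_size = 2
        return obj

class HeaderWithList(Header):
    @classmethod
    def from_buf(cls, buf, offset):
        # Same trick as in the post: call the undecorated function with
        # our own cls so the result has type HeaderWithList.
        obj = Header.from_buf.__func__(cls, buf, offset)
        # _total_size tells us exactly where the trailing items begin.
        pos = offset + obj._total_size
        obj.items = []
        for _ in range(obj.n_objs):
            item = Item.from_buf(buf, pos)
            obj.items.append(item)
            pos += item._total_size
        obj._total_size = pos - offset
        return obj

buf = struct.pack("<IHH", 2, 10, 20)
h = HeaderWithList.from_buf(buf, 0)
# h.n_objs == 2; the items carry 10 and 20; h._total_size == 8
```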
Overall, I’m pretty happy with this scheme; it’s a lot more pleasant than bare struct.unpack_from calls scattered about.
Have you looked at the ctypes module? It includes Structure and Union types that take a list of fields (one list of pairs, rather than two parallel lists). It supports composition: Structure and Union types can be used as fields in other Structure and Union types.
I hadn’t looked at ctypes, no. Glancing over the page, it is not immediately obvious to me how you’d read a ctypes.Structure or similar out of a byte buffer. And it looks like ctypes only deals with native endian datatypes, whereas struct.unpack and friends are willing to read datatypes in whatever endianness you like.
There are types for big-endian and little-endian structures and unions, in addition to the default native-endian.
As for how to parse them from a buffer, just call .from_buffer, .from_buffer_copy, or .from_address, depending on what you need.
Example:
```python
>>> ctypes.c_uint.from_buffer_copy("\x00\x00\x10\x00")
c_uint(1048576L)
>>> class Point(ctypes.Structure):
...     _fields_ = (("x", ctypes.c_int), ("y", ctypes.c_int))
...
>>> p = Point.from_buffer_copy("\x01\x00\x00\x00\x00\x01\x00\x00")
>>> p.x
1
>>> p.y
256
```
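(The big-endian variant mentioned above works the same way. This snippet is mine, written for Python 3, where from_buffer_copy takes bytes rather than str:)

```python
import ctypes

class PointBE(ctypes.BigEndianStructure):
    # Same layout as Point above, but fields decode big-endian.
    _fields_ = (("x", ctypes.c_int), ("y", ctypes.c_int))

p = PointBE.from_buffer_copy(b"\x00\x00\x00\x01\x00\x00\x01\x00")
print(p.x, p.y)  # prints "1 256"
```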