{"id":261,"date":"2013-12-06T10:25:06","date_gmt":"2013-12-06T15:25:06","guid":{"rendered":"http:\/\/blog.mozilla.org\/nfroyd\/?p=261"},"modified":"2013-12-06T10:25:06","modified_gmt":"2013-12-06T15:25:06","slug":"reading-binary-structures-with-python","status":"publish","type":"post","link":"https:\/\/blog.mozilla.org\/nfroyd\/2013\/12\/06\/reading-binary-structures-with-python\/","title":{"rendered":"reading binary structures with python"},"content":{"rendered":"<p>Last week, I wanted to parse some <a href=\"https:\/\/developer.apple.com\/library\/mac\/documentation\/developertools\/conceptual\/MachORuntime\/Reference\/reference.html\">Mach-O files<\/a> with Python.\u00a0 &#8220;Oh sure,&#8221; you think, &#8220;just use <a href=\"http:\/\/docs.python.org\/3\/library\/struct.html\">the <code>struct<\/code> module<\/a> and this will be a breeze.&#8221;\u00a0 I have, however, tried to do that:<\/p>\n<pre>class MyBinaryBlob:\r\n    def __init__(self, buf, offset):\r\n        self.f1, self.f2 = struct.unpack_from(\"BB\", buf, offset)<\/pre>\n<p>and such an approach involves a great deal of copy-and-pasted code.\u00a0 And if you have some variable-length fields mixed in with fixed-length fields, using <code>struct<\/code> breaks down very quickly. And if you have to write out the fields to a file, things get even more messy. For this experiment, I wanted to do things in a more declarative style.<\/p>\n<p>The desire was that I could say something like:<\/p>\n<pre>class MyBinaryBlob:\r\n    field_names = [\"f1\", \"f2\"]\r\n    field_kinds = [\"uint8_t\", \"uint8_t\"]<\/pre>\n<p>and all the necessary code to parse the appropriate fields out of a binary buffer would spring into existence. (Automagically having the code to write these objects to a buffer would be great, too.) And if a binary object contained something that would be naturally interpreted as a Python list, then I could write a minimal amount of code to do that during initialization of the object as well. 
I also wanted inheritance to work correctly, so that if I wrote:<\/p>\n<pre>class ExtendedBlob(MyBinaryBlob):\r\n    field_names = [\"f3\", \"f4\"]\r\n    field_kinds = [\"int32_t\", \"int32_t\"]<\/pre>\n<p><code>ExtendedBlob<\/code> should wind up with four fields once it is initialized.<\/p>\n<p>At first, I wrote things like:<\/p>\n<pre>def field_reader(fmt):\r\n    size = struct.calcsize(fmt)\r\n    def reader_sub(buf, offset):\r\n        return struct.unpack_from(fmt, buf, offset)[0], size\r\n    return reader_sub\r\n\r\nfi = field_reader(\"i\")\r\nfI = field_reader(\"I\")\r\nfB = field_reader(\"B\")\r\n\r\ndef initialize_slots(obj, buf, offset, slot_names, field_specs):\r\n    total = 0\r\n    for slot, reader in zip(slot_names, field_specs):\r\n        x, size = reader(buf, offset + total)\r\n        setattr(obj, slot, x)\r\n        total += size\r\n\r\nclass MyBinaryBlob:\r\n    field_names = [\"f1\", \"f2\"]\r\n    field_specs = [fB, fB]\r\n\r\n    def __init__(self, buf, offset):\r\n        initialize_slots(self, buf, offset, self.field_names, self.field_specs)<\/pre>\n<p>Each field reader returns the number of bytes it consumed along with the parsed value, which makes it straightforward to add variable-sized fields, not just fixed-width fields that can be parsed by <code>struct.unpack_from<\/code>. This worked out OK, but I was writing out a lot of copy-and-paste constructors, which was undesirable. Inheritance was also a little weird, since the natural implementation looked like:<\/p>\n<pre>class ExtendedBlob(MyBinaryBlob):\r\n    field_names = [\"f3\", \"f4\"]\r\n    field_specs = [fi, fi]\r\n\r\n    def __init__(self, buf, offset):\r\n        super(ExtendedBlob, self).__init__(buf, offset)\r\n        initialize_slots(self, buf, offset, self.field_names, self.field_specs)<\/pre>\n<p>but that second <code>initialize_slots<\/code> call needs to start reading at the offset resulting from reading <code>MyBinaryBlob<\/code>&#8217;s fields. 
I fixed this by storing a <code>_total_size<\/code> member in the objects and modifying <code>initialize_slots<\/code>:<\/p>\n<pre>def initialize_slots(obj, buf, offset, slot_names, field_specs):\r\n    total = obj._total_size\r\n    for slot, reader in zip(slot_names, field_specs):\r\n        x, size = reader(buf, offset + total)\r\n        setattr(obj, slot, x)\r\n        total += size\r\n    obj._total_size = total<\/pre>\n<p>which worked out well enough.<\/p>\n<p>I realized that if I wanted to use this framework for writing binary blobs, I&#8217;d need to construct &#8220;bare&#8221; objects without an existing buffer to read them from. To do this, there had to be some static method on the class for parsing things out of a buffer. <code><a href=\"http:\/\/docs.python.org\/3\/library\/functions.html#staticmethod\">@staticmethod<\/a><\/code> couldn&#8217;t be used in this case, because the code inside the method didn&#8217;t know what class it was being invoked on. But <code><a href=\"http:\/\/docs.python.org\/3\/library\/functions.html#classmethod\">@classmethod<\/a><\/code>, which received the invoking class as its first argument, seemed to fit the bill.<\/p>\n<p>After some more experimentation, I wound up with a base class, <code>BinaryObject<\/code>:<\/p>\n<pre>class BinaryObject(object):\r\n    field_names = []\r\n    field_specs = []\r\n\r\n    def __init__(self):\r\n        self._total_size = 0\r\n\r\n    def initialize_slots(self, buf, offset, slot_names, field_specs):\r\n        total = self._total_size\r\n        for slot, reader in zip(slot_names, field_specs):\r\n            x, size = reader(buf, offset + total)\r\n            setattr(self, slot, x)\r\n            total += size\r\n        self._total_size = total\r\n\r\n    @classmethod\r\n    def from_buf(cls, buf, offset):\r\n        # Determine our inheritance path back to BinaryObject\r\n        inheritance_chain = []\r\n        pos = cls\r\n        while pos != BinaryObject:\r\n            
inheritance_chain.append(pos)\r\n            bases = pos.__bases__\r\n            assert len(bases) == 1\r\n            pos = bases[0]\r\n        inheritance_chain.reverse()\r\n\r\n        # Determine all the field names and specs that we need to read.\r\n        all_field_names = itertools.chain(*[c.field_names\r\n                                            for c in inheritance_chain])\r\n        all_field_specs = itertools.chain(*[c.field_specs\r\n                                            for c in inheritance_chain])\r\n\r\n        # Create the actual object and populate its fields.\r\n        obj = cls()\r\n        obj.initialize_slots(buf, offset, all_field_names, all_field_specs)\r\n        return obj<\/pre>\n<p>Inspecting the inheritance hierarchy at runtime makes for some very compact code. (The single-inheritance assertion could probably be relaxed to an assertion that all superclasses except the first do not have <code>field_names<\/code> or <code>field_specs<\/code> class members; such a relaxation would make behavior-modifying mixins work well with this scheme.) Now my classes all looked like:<\/p>\n<pre>class MyBinaryBlob(BinaryObject):\r\n    field_names = [\"f1\", \"f2\"]\r\n    field_specs = [fB, fB]\r\n\r\nclass ExtendedBlob(MyBinaryBlob):\r\n    field_names = [\"f3\", \"f4\"]\r\n    field_specs = [fi, fi]\r\n\r\nblob1 = MyBinaryBlob.from_buf(buf, offset)\r\nblob2 = ExtendedBlob.from_buf(buf, offset)<\/pre>\n<p>with a pleasing lack of code duplication.\u00a0 Any code for writing can be written once in the <code>BinaryObject<\/code> class using a similar inspection of the inheritance chain.<\/p>\n<p>But how does parsing additional things during construction work? 
Well, subclasses can define their own <code>from_buf<\/code> methods:<\/p>\n<pre>class ExtendedBlobWithList(BinaryObject):\r\n    field_names = [\"n_objs\"]\r\n    field_specs = [fI]\r\n\r\n    @classmethod\r\n    def from_buf(cls, buf, offset):\r\n        obj = BinaryObject.from_buf.__func__(cls, buf, offset)\r\n        # do extra initialization here\r\n        for i in range(obj.n_objs):\r\n            ...\r\n        return obj<\/pre>\n<p>The trick here is that calling <code>obj = BinaryObject.from_buf(buf, offset)<\/code> wouldn&#8217;t do the right thing: that would only parse any members that <code>BinaryObject<\/code> had, and return an object of type <code>BinaryObject<\/code> instead of one of type <code>ExtendedBlobWithList<\/code>. Instead, we call <code>BinaryObject.from_buf.__func__<\/code>, which is the original, undecorated function, and pass the <code>cls<\/code> with which we were invoked, which is <code>ExtendedBlobWithList<\/code>, to do basic parsing of the fields. After that&#8217;s done, we can do our own specialized parsing, probably with <code>SomeOtherBlob.from_buf<\/code> or similar. (The <code>_total_size<\/code> member also comes in handy here, since you know exactly where to start parsing additional members.) 
You can even define <code>from_buf<\/code> methods that parse a bit, determine what class they should really be constructing, and construct an object of that type instead:<\/p>\n<pre>R_SCATTERED = 0x80000000\r\n\r\nclass Relocation(BinaryObject):\r\n    field_names = [\"_bits1\", \"_bits2\"]\r\n    field_specs = [fI, fI]\r\n    __slots__ = field_names\r\n\r\n    @classmethod\r\n    def from_buf(cls, buf, offset):\r\n        obj = BinaryObject.from_buf.__func__(Relocation, buf, offset)\r\n\r\n        # OK, now for the decoding of what we just got back.\r\n        if obj._bits1 &amp; R_SCATTERED:\r\n            return ScatteredRelocationInfo.from_buf(buf, offset)\r\n        else:\r\n            return RelocationInfo.from_buf(buf, offset)<\/pre>\n<p>This hides any detail about file formats in the parsing code, where it belongs.<\/p>\n<p>Overall, I&#8217;m pretty happy with this scheme; it&#8217;s a lot more pleasant than bare <code>struct.unpack_from<\/code> calls scattered about.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Last week, I wanted to parse some Mach-O files with Python.\u00a0 &#8220;Oh sure,&#8221; you think, &#8220;just use the struct module and this will be a breeze.&#8221;\u00a0 I have, however, tried to do that: class MyBinaryBlob: def __init__(self, buf, offset): self.f1, self.f2 = struct.unpack_from(&#8220;BB&#8221;, buf, offset) and such an approach involves a great deal of copy-and-pasted 
[&hellip;]<\/p>\n","protected":false},"author":320,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"_links":{"self":[{"href":"https:\/\/blog.mozilla.org\/nfroyd\/wp-json\/wp\/v2\/posts\/261"}],"collection":[{"href":"https:\/\/blog.mozilla.org\/nfroyd\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.mozilla.org\/nfroyd\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.mozilla.org\/nfroyd\/wp-json\/wp\/v2\/users\/320"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.mozilla.org\/nfroyd\/wp-json\/wp\/v2\/comments?post=261"}],"version-history":[{"count":0,"href":"https:\/\/blog.mozilla.org\/nfroyd\/wp-json\/wp\/v2\/posts\/261\/revisions"}],"wp:attachment":[{"href":"https:\/\/blog.mozilla.org\/nfroyd\/wp-json\/wp\/v2\/media?parent=261"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.mozilla.org\/nfroyd\/wp-json\/wp\/v2\/categories?post=261"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.mozilla.org\/nfroyd\/wp-json\/wp\/v2\/tags?post=261"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}