Writing a binary parser in Python: NumPy vs. Construct

kinow @ Apr 14, 2017 19:21:03

Some time ago I worked with researchers to write a parser for an old data format. The data was generated by a device (a radiosonde) in a binary format specific to the vendor (Vaisala).

One of the researchers told me someone had written a parser for his work, and shared it on GitHub. To be honest, that was my first time parsing binary data with Python. I had done that before with C, C++, Perl, and Java, but never with Python.

The code on GitHub used NumPy and looked similar to this one.

import numpy as np

parse_header = np.dtype([('field_a', 'b1'), ('field_b', '17b1')])

with open('input.dat', 'rb') as f:
    header = np.fromfile(f, dtype=parse_header, count=1)
    # ...

And it did work fine. But in the end, after contacting the author and letting him know what I was about to do, I used his code as a reference, together with an old specification document for the format, and wrote a parser with Construct.
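For anyone who wants to try the NumPy approach without the original data file, here is a minimal sketch that parses a header from in-memory bytes. The layout (one flag byte followed by 17 bytes), the `'u1'` field types, and the sample data are assumptions for illustration, not the real format.

```python
import numpy as np

# Hypothetical header layout: one unsigned flag byte, then 17 unsigned bytes.
header_dtype = np.dtype([('field_a', 'u1'), ('field_b', 'u1', (17,))])

# Made-up sample data standing in for the first 18 bytes of a file.
raw = bytes([1]) + bytes(range(17))

# frombuffer reads from bytes in memory, much like fromfile reads from a file.
header = np.frombuffer(raw, dtype=header_dtype, count=1)

flag = int(header['field_a'][0])            # the single flag byte as an int
reserved = header['field_b'][0].tobytes()   # the 17 bytes that follow it

print(header_dtype.itemsize, flag, len(reserved))
```

The structured dtype describes the whole record in one place, so each field can be read by name afterwards.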

From Construct’s website:

Construct is a powerful declarative parser (and builder) for binary data.

This is what the code with Construct looked like.

from construct import *

parse_header = Struct("parse_header",
    Enum(Byte("file_ready"),
        READY = 1,
        NOT_READY = 0,
        _default_ = "UNKNOWN",
    ),
    Bytes("reserved", 17)
)

# ...

parse_contents = Struct("parse_contents",
    parse_header,
    Range(mincount=1, maxcout=5, subcon=pre_data),
    OptionalGreedyRange(detailed_data)
)

with open('input.dat', 'rb') as f:
    parse_results = parse_contents.parse_stream(f)
    # ...

Writing the parser with NumPy or Construct would achieve the same result. In the end this came down to personal preference, and to my point of view as a software engineer. This is the description of NumPy.

NumPy is the fundamental package for scientific computing with Python.

NumPy is a project tailored for scientific computing, with a focus on linear algebra, N-dimensional arrays, and so on. While it contains code that can parse binary data, the footprint it adds to a project that includes it as a dependency is quite big.

The parser written with NumPy wasn’t using even 5% of the NumPy code base. Probably less than 1%. An update to NumPy could still break the application, even if that update only added some new matrix operation that relied on an external dependency missing from the system.

In Java something similar happens with Google Guava. While I use it sometimes, most of the time I find myself using one of the Apache Commons libraries, or another dependency with just what I need, to avoid including unnecessary code in my application.

If you prefer to use NumPy that’s fine too :-) I just had enough time to rewrite the parser instead of using the NumPy version (it took a couple of hours). In other cases it may still make sense to use another tool or library, even if it was not made specifically for the job ¯\_(ツ)_/¯

♥ Open Source