Fuzzing the Arrow C++ IPC implementation
Published 31 Mar 2020
By Antoine Pitrou (apitrou)
Apache Arrow aims to allow fast and seamless data interchange between heterogeneous runtimes and environments. Whether using the columnar IPC stream protocol, the Flight RPC layer, the Feather file format, the Plasma shared object store, or any application-specific data distribution mechanism, Arrow IPC implementations may try to decode data from untrusted input. While it is acceptable to report an error in that case, Arrow shouldn’t crash or engage in risky behaviour while reading such data.
To validate the robustness of the Arrow C++ IPC reader (which also underlies the Python, C/GLib, R and Ruby bindings), we successfully submitted the Arrow project to OSS-Fuzz, a continuous fuzzing initiative for critical open source projects, provided by Google.
What is being fuzzed
As of this writing, the RecordBatchStreamReader and RecordBatchFileReader C++ classes are being fuzzed by feeding them data generated by the fuzzer. When a record batch is successfully read by one of those classes, the fuzzing setup then validates it using RecordBatch::ValidateFull. This method can either succeed or fail, but it shouldn’t crash.
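The read-then-validate pattern described above can be illustrated with a minimal, self-contained sketch. It deliberately does not use the Arrow API: ToyBatch, ReadToyBatch and ValidateToyBatch are hypothetical stand-ins for a record batch, an IPC reader and RecordBatch::ValidateFull, and the byte layout is an invented toy format. Only the LLVMFuzzerTestOneInput entry point is real: it is the standard libFuzzer interface, one of the fuzzing engines that OSS-Fuzz runs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical toy format, NOT Arrow IPC:
//   [u32 little-endian count][count x i32 little-endian offsets]
struct ToyBatch {
  uint32_t declared_count = 0;
  std::vector<int32_t> offsets;
};

// Loads a little-endian u32 byte by byte, so alignment does not matter.
static uint32_t LoadU32(const uint8_t* p) {
  return static_cast<uint32_t>(p[0]) | (static_cast<uint32_t>(p[1]) << 8) |
         (static_cast<uint32_t>(p[2]) << 16) | (static_cast<uint32_t>(p[3]) << 24);
}

// Reader (stand-in for an IPC reader): decodes untrusted bytes and returns
// false on malformed input rather than reading past the end of the buffer.
bool ReadToyBatch(const uint8_t* data, size_t size, ToyBatch* out) {
  if (size < 4) return false;
  const uint32_t count = LoadU32(data);
  // Compare the declared count with the bytes actually present;
  // dividing instead of multiplying avoids integer overflow.
  if (count > (size - 4) / 4) return false;
  out->declared_count = count;
  out->offsets.clear();
  out->offsets.reserve(count);
  for (size_t i = 0; i < count; ++i) {
    out->offsets.push_back(static_cast<int32_t>(LoadU32(data + 4 + 4 * i)));
  }
  return true;
}

// Validator (stand-in for a full validation pass): offsets must be
// non-negative and non-decreasing. Bad data yields false, never a crash.
bool ValidateToyBatch(const ToyBatch& batch) {
  if (batch.offsets.size() != batch.declared_count) return false;
  int32_t previous = 0;
  for (int32_t offset : batch.offsets) {
    if (offset < previous) return false;
    previous = offset;
  }
  return true;
}

// libFuzzer entry point: called repeatedly with fuzzer-generated bytes.
extern "C" int LLVMFuzzerTestOneInput(const uint8_t* data, size_t size) {
  ToyBatch batch;
  if (ReadToyBatch(data, size, &batch)) {
    // A successful read must leave the batch internally consistent;
    // if this ever fires, the fuzzer reports it as a bug in the reader.
    assert(batch.offsets.size() == batch.declared_count);
    // Validation may succeed or fail; its result is deliberately ignored.
    (void)ValidateToyBatch(batch);
  }
  return 0;  // malformed input is an expected outcome, not a failure
}
```

Compiled with clang and -fsanitize=fuzzer,address, a target of this shape lets the sanitizer flag any out-of-bounds access that the reader or the validator performs on malformed input, while cleanly rejected inputs simply return.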
By ensuring that reading a record batch from IPC, then validating it, always shows deterministic behaviour, we hope to make it relatively safe to ingest Arrow IPC data coming from untrusted sources.
(Of course, it is still recommended for security-critical applications to use cryptographic means of authentication and integrity control, for example by enabling TLS with the Flight RPC protocol.)
How we help the fuzzer find problems
Fuzzing is a brute-force process that tries to devise invalid data to exercise an implementation’s response. By default, the fuzzer does not know anything about the data representation expected by the program under test. Fuzzing can therefore be extremely inefficient, testing tons of uninteresting variations while missing critical ones.
To help guide the fuzzing process, we added a seed corpus of valid Arrow IPC files with various data types. By starting from this data and mutating it to find invalid variations, OSS-Fuzz was able to find tens of issues with data validation. All of them have been fixed. As of this writing, no new issue in the IPC layer has been found since March 4th, 2020.
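Why a seed corpus helps can be sketched with a toy experiment. The code below is an illustration under stated assumptions, not how OSS-Fuzz or libFuzzer actually mutate inputs (real engines are coverage-guided and combine many mutation strategies). PassesHeaderCheck and the "TOY1" magic are hypothetical and unrelated to the Arrow IPC format; they stand in for whatever early check an input has to pass before the interesting decoding code is reached.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical early check standing in for "the input got past the header":
// the first four bytes must spell "TOY1". Not part of Arrow.
bool PassesHeaderCheck(const std::vector<uint8_t>& input) {
  static const uint8_t kMagic[4] = {'T', 'O', 'Y', '1'};
  if (input.size() < 4) return false;
  for (size_t i = 0; i < 4; ++i) {
    if (input[i] != kMagic[i]) return false;
  }
  return true;
}

struct MutationStats {
  size_t total = 0;        // mutated inputs tried
  size_t past_header = 0;  // mutated inputs that still pass the header check
};

// Tries every single-byte substitution of a valid seed and counts how many
// mutated inputs still get past the header check.
MutationStats MutateEveryByte(const std::vector<uint8_t>& seed) {
  MutationStats stats;
  for (size_t pos = 0; pos < seed.size(); ++pos) {
    for (int value = 0; value < 256; ++value) {
      if (value == seed[pos]) continue;  // skip the unmodified seed
      std::vector<uint8_t> mutated = seed;
      mutated[pos] = static_cast<uint8_t>(value);
      ++stats.total;
      if (PassesHeaderCheck(mutated)) ++stats.past_header;
    }
  }
  return stats;
}
```

For an 8-byte seed that starts with the magic, every mutation of the four payload bytes still passes the check, so half of the mutated inputs reach the code behind the header. An 8-byte input drawn uniformly at random would pass the same check with probability 2^-32.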
What comes next
Of course, we still monitor OSS-Fuzz for any new problem that could be found in the C++ IPC implementation. Such problems might appear, for example, when adding features to the Arrow IPC format.
We have started fuzzing the Parquet C++ implementation. Several issues have been found and fixed, but more are still coming. We hope to stabilize the situation in the next month or two.
The tensor and sparse tensor IPC read paths are not being exercised yet. They will be once a motivated core developer wants to own the topic.