Apache Arrow nanoarrow 0.2 Release


Published 22 Jun 2023
By The Apache Arrow PMC (pmc)

The Apache Arrow team is pleased to announce the 0.2.0 release of Apache Arrow nanoarrow. This initial release covers 19 resolved issues from 6 contributors.

Release Highlights

See the Changelog for a detailed list of contributions to this release.

IPC stream support

This release includes support for reading schemas and record batches serialized using the Arrow IPC format. Based on the flatcc flatbuffers implementation, the nanoarrow IPC read support is implemented as an optional extension to the core nanoarrow library. The easiest way to get started is with the ArrowArrayStream provider using one of the built-in ArrowIpcInputStream implementations:

#include <stdio.h>
#include <stdbool.h>

#include "nanoarrow_ipc.h"

int main(int argc, char* argv[]) {
  FILE* file_ptr = freopen(NULL, "rb", stdin);

  struct ArrowIpcInputStream input;
  NANOARROW_RETURN_NOT_OK(ArrowIpcInputStreamInitFile(&input, file_ptr, false));

  struct ArrowArrayStream stream;
  NANOARROW_RETURN_NOT_OK(ArrowIpcArrayStreamReaderInit(&stream, &input, NULL));

  struct ArrowSchema schema;
  NANOARROW_RETURN_NOT_OK(stream.get_schema(&stream, &schema));

  struct ArrowArray array;
  while (true) {
    NANOARROW_RETURN_NOT_OK(stream.get_next(&stream, &array));
    if (array.release == NULL) {
      break;
    }
  }

  return 0;
}

Facilities for advanced usage are also provided via the low-level ArrowIpcDecoder, which takes care of the details of deserializing the flatbuffer headers and assembling buffers into an ArrowArray. The current implementation can read schema and record batch messages that contain any Arrow type that supported by the C data interface. The initial version can read both big and little endian streams originating from platforms of either endian. Dictionary encoding and compression are not currently supported.

Getting started with nanoarrow

Early users of the nanoarrow C library respectfully noted that it was difficult to know where to begin. This release includes improvements in reference documentation but also includes a long-form tutorial for those just getting started with or considering adopting the library. You can find the tutorial at the nanoarrow documentation site.

C library

The nanoarrow 0.2.0 release also includes a number of bugfixes and improvements to the core C library, many of which were identified as a result of usage connected with development of the IPC extension.

  • Helpers for extracting/appending Decimal128 and Decimal256 elements from/to arrays were added.
  • The C library can now perform “full” validation to validate untrusted input (e.g., serialized IPC from the wire).
  • The C library can now perform “minimal” validation that performs all checks that do not access any buffer data. This feature was added to facilitate future support for the ArrowDeviceArray that was recently added as the Arrow C device interface.
  • Release verification on Ubuntu (x86_64, arm64), Fedora (x86_64), Archlinux (x86_64), Centos 7 (x86_64, arm64), Alpine (x86_64, arm64, s390x), Windows (x86_64), and MacOS (x86_64) were added to the continuous integration system. Linux verification is implemented using docker compose to facilitate local checks when developing features that may affect a specific platform.

R bindings

The nanoarrow R bindings are distributed as the nanoarrow package on CRAN. The 0.2.0 release of the R bindings includes improvements in type support, improvements in stability, and features required for the forthcoming release of Arrow Database Connectivity (ADBC) R bindings. Notably:

  • Support for conversion of union arrays to R objects was added to facilitate support for an ADBC function that returns such an array.
  • Support for adding an R-level finalizer to an ArrowArrayStream was added to facilitate safely wrapping a stream resulting from an ADBC call at the R level.

Python bindings?

The nanoarrow 0.2.0 release does not include Python bindings, but improvements to the unreleased draft bindings were added to facilitate discussion among Python developers regarding the useful scope of a potential future nanoarrow Python package. If one of those developers is you, feel free to open an issue or send a post to the developer mailing list to engage in the discussion.

Contributors

This initial release consists of contributions from 6 contributors in addition to the invaluable advice and support of the Apache Arrow developer mailing list. Special thanks to David Li for reviewing nearly every PR in this release!

$ git shortlog -sn d91c35b33c7b6ff94f5f929384879352e241ed71..apache-arrow-nanoarrow-0.2.0 | grep -v "GitHub Actions"
    61  Dewey Dunnington
     2  Dirk Eddelbuettel
     2  Joris Van den Bossche
     2  Kirill Müller
     2  William Ayd
     1  David Li