Apache Arrow nanoarrow 0.4.0 Release

Published 29 Jan 2024
By The Apache Arrow PMC (pmc)

The Apache Arrow team is pleased to announce the 0.4.0 release of Apache Arrow nanoarrow. This release covers 46 resolved issues from 5 contributors.

Release Highlights

The primary focus of the nanoarrow 0.4.0 release was testing, stability, and code quality. Notably, an implementation of the C data interface integration test protocol was added to ensure data produced or consumed by nanoarrow can be consumed or produced by other Arrow implementations.

Apache Arrow nanoarrow 0.4.0 also contains experimental Python bindings to the C library for the purposes of community testing and feedback while a more stable set of bindings is prepared (targeted for 0.5.0).

See the Changelog for a detailed list of contributions to this release.

Breaking Changes

Changes included in the nanoarrow 0.4.0 release will not break most downstream code; however, several changes in the C library may result in additional compiler warnings that could cause downstream build failures for projects with strict compiler warning policies.

First, in debug mode (i.e., when NANOARROW_DEBUG is defined), an ignored return value for functions that return ArrowErrorCode now issues a compiler warning for compilers that support an "unused result" attribute. Ignoring the return value of these functions is a common error, and return values that are not equal to NANOARROW_OK should be propagated as soon as possible. The C library provides tools to check return values in a readable way. Notably:

NANOARROW_RETURN_NOT_OK() can be used in a wrapper function that also returns ArrowErrorCode.
NANOARROW_THROW_NOT_OK() can be used from C++ code that includes nanoarrow.hpp and is prepared to handle exceptions.
NANOARROW_ASSERT_OK() can be used to check for NANOARROW_OK only in debug mode (i.e., errors are silently ignored in release mode, while debug builds crash with a message indicating the location of the error).

Of these, the first or second is preferred. The Getting Started with nanoarrow in C/C++ tutorial includes examples and advice for handling errors emanating from the nanoarrow C library.

Second, in debug mode (i.e., when NANOARROW_DEBUG is defined), the appropriate attribute was added to check the format string passed to ArrowErrorSet() against the provided arguments. Correct code will be unaffected by this change; however, actual arguments that do not match the format string (e.g., an int64_t that is passed to ArrowErrorSet() with a format string of "%d") should be cast to the appropriate C type (e.g., int) or the format string should be fixed to support the type of the actual argument (e.g., using "%" PRId64).

Third, functions in the C library that do not take ownership of or modify input are now properly marked as const. For example, ArrowArrayViewGetIntUnsafe() previously accepted a struct ArrowArrayView* and now accepts a const struct ArrowArrayView*. This change makes it more difficult to accidentally modify input intended to be read-only and improves usability from C++. Downstream projects that get a new warning about discarding a const qualifier may need to adjust variable declarations or formal parameter types; however, most projects should be unaffected by this change.

C/C++

The nanoarrow 0.4.0 release includes a number of bugfixes and improvements to the core C library and C++ helpers.

An implementation of the C data interface integration test was added, including a reader/writer for the Arrow integration testing JSON format. This was used to improve test coverage of the IPC reader and to add nanoarrow as a participating member of integration testing in the CI job that runs in the main Arrow repository.
The C library now supports a wider range of extended compiler warnings to make it easier to vendor in projects with strict compiler warning policies.
C++ helpers were improved to support const-correctness. As a result, UniqueSchema, UniqueArray, UniqueArrayView, and UniqueBuffer now work with a wider variety of C++ containers (e.g., std::unordered_map).

R bindings

The nanoarrow R bindings are distributed as the nanoarrow package on CRAN. The 0.4.0 release of the R bindings includes improvements in type support and stability. Notably:

Improved documentation for low-level users of nanoarrow who produce or consume ArrowArray, ArrowSchema, and/or ArrowArrayStream structures from C or C++ code in other R packages.
Improved conversion of list()s to support more types when the arrow R package is not available.
Added more implementations of as_nanoarrow_array_stream() to support more object types from the arrow R package.
Added conversion from Arrow integer arrays to character().

Python bindings

The nanoarrow 0.4.0 release is the first release that contains Python bindings to the nanoarrow C library! These initial Python bindings are experimental and are provided to solicit an initial round of feedback from the Arrow community. Like the nanoarrow C library and R bindings, they provide tools to facilitate the use of the Arrow C Data and Arrow C Stream interfaces.

You can install the initial release of the Python bindings from PyPI. The nanoarrow Python package has been submitted to conda-forge and should be available once the recipe has been reviewed.

pip install nanoarrow

The initial release of the Python bindings contains repr()s that print human-readable representations of structures in the Arrow C Data and Stream interfaces.

import nanoarrow as na
import pyarrow as pa
na.c_schema(pa.decimal128(10, 3))

<nanoarrow.c_lib.CSchema decimal128(10, 3)>
- format: 'd:10,3'
- name: ''
- flags: 2
- metadata: NULL
- dictionary: NULL
- children[0]:
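
To help read the format and flags fields printed above, here is a minimal pure-Python sketch (standard library only; it does not use the nanoarrow API) that interprets them according to the Arrow C Data interface conventions: a decimal format string has the form 'd:precision,scale' with an optional third bit-width component, and flags is a bitmask in which the value 2 means "nullable". The helper names parse_decimal_format and decode_flags are hypothetical and exist only for this illustration.

```python
# Flag bit values as defined by the Arrow C Data interface.
ARROW_FLAG_DICTIONARY_ORDERED = 1
ARROW_FLAG_NULLABLE = 2
ARROW_FLAG_MAP_KEYS_SORTED = 4


def parse_decimal_format(fmt):
    """Split a decimal format string such as 'd:10,3' into its parts."""
    prefix, _, params = fmt.partition(":")
    if prefix != "d" or not params:
        raise ValueError(f"not a decimal format string: {fmt!r}")

    parts = [int(p) for p in params.split(",")]
    if len(parts) == 2:
        # Two components: the bit width defaults to 128.
        precision, scale = parts
        bit_width = 128
    elif len(parts) == 3:
        precision, scale, bit_width = parts
    else:
        raise ValueError(f"unexpected decimal parameters: {params!r}")

    return {"precision": precision, "scale": scale, "bit_width": bit_width}


def decode_flags(flags):
    """Return the names of the flag bits that are set in an ArrowSchema."""
    names = []
    if flags & ARROW_FLAG_DICTIONARY_ORDERED:
        names.append("dictionary_ordered")
    if flags & ARROW_FLAG_NULLABLE:
        names.append("nullable")
    if flags & ARROW_FLAG_MAP_KEYS_SORTED:
        names.append("map_keys_sorted")
    return names


# Values copied from the repr of na.c_schema(pa.decimal128(10, 3)).
decimal_info = parse_decimal_format("d:10,3")
flag_names = decode_flags(2)
print(decimal_info)
print(flag_names)
```

In practice the bindings do this interpretation for you; the sketch is only meant to show what the raw fields encode.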

na.c_array(pa.array(["one", "two", "three", None]))

<nanoarrow.c_lib.CArray string>
- length: 4
- offset: 0
- null_count: 1
- buffers: (2939032895680, 2939032895616, 2939032895744)
- dictionary: NULL
- children[0]:

In addition to Arrow C Data interface wrappers, the initial nanoarrow Python bindings expose wrappers for a few nanoarrow C library types like the ArrowArrayView that can be used to interpret the content of the raw structures.

na.c_array_view(pa.array(["one", "two", "three", None]))

<nanoarrow.c_lib.CArrayView>
- storage_type: 'string'
- length: 4
- offset: 0
- null_count: 1
- buffers[3]:
  - <bool validity[1 b] 11100000>
  - <int32 data_offset[20 b] 0 3 6 11 11>
  - <string data[11 b] b'onetwothree'>
- dictionary: NULL
- children[0]:
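
To make the buffers[3] output above concrete, here is a minimal pure-Python sketch (standard library only, not the nanoarrow API) of how the Arrow columnar format combines a validity bitmap, int32 offsets, and a data buffer to represent a string array. The buffer contents are rebuilt by hand from the printed repr. The sketch assumes the repr lists validity bits in element order (so the single validity byte is 0b00000111) and that offsets are little-endian; the function name decode_string_array is hypothetical.

```python
import struct

# Buffers rebuilt by hand from the CArrayView repr (assumptions noted above).
validity = bytes([0b00000111])                  # elements 0..2 valid, element 3 null
offsets = struct.pack("<5i", 0, 3, 6, 11, 11)   # length + 1 int32 offsets (20 bytes)
data = b"onetwothree"                           # concatenated UTF-8 bytes (11 bytes)


def decode_string_array(validity, offsets, data, length):
    """Decode a non-sliced (offset 0) Arrow string array into a Python list."""
    offs = struct.unpack(f"<{length + 1}i", offsets)
    values = []
    for i in range(length):
        # Validity bits are packed least-significant bit first within each byte.
        is_valid = (validity[i // 8] >> (i % 8)) & 1
        if is_valid:
            # Element i spans data[offs[i]:offs[i + 1]].
            values.append(data[offs[i]:offs[i + 1]].decode("utf-8"))
        else:
            values.append(None)
    return values


values = decode_string_array(validity, offsets, data, length=4)
print(values)  # ['one', 'two', 'three', None]
```

The null element contributes no bytes to the data buffer, which is why the last two offsets are both 11.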

Finally, the initial bindings contain a user-facing "data type" class. The Schema, like its C Data interface counterpart, can represent a pyarrow.Schema, a pyarrow.Field, or a pyarrow.DataType.

na.int32()

Schema(INT32)

na.struct({"col1": na.int32()})

Schema(STRUCT, fields=[Schema(INT32, name='col1')])

The next release of nanoarrow for Python will include a user-facing Array class among other improvements and features based on community feedback! For a more in-depth review of the initial Python bindings, see the Getting started in Python guide and the Python API reference.

Contributors

This release consists of contributions from 5 contributors in addition to the invaluable advice and support of the Apache Arrow developer mailing list.

$ git shortlog -sn 798a1b8f096c84e2b6f887427649f1cb496412b2..apache-arrow-nanoarrow-0.4.0 | grep -v "GitHub Actions"
Dewey Dunnington
William Ayd
Dirk Eddelbuettel
Joris Van den Bossche
eitsupi