Apache Arrow nanoarrow 0.4.0 Release
Published 29 Jan 2024
By The Apache Arrow PMC (pmc)
The Apache Arrow team is pleased to announce the 0.4.0 release of Apache Arrow nanoarrow. This release covers 46 resolved issues from 5 contributors.
Release Highlights
The primary focus of the nanoarrow 0.4.0 release was testing, stability, and code quality. Notably, an implementation of the C data interface integration test protocol was added to ensure data produced or consumed by nanoarrow can be consumed or produced by other Arrow implementations.
Apache Arrow nanoarrow 0.4.0 also contains experimental Python bindings to the C library for the purposes of community testing and feedback while a more stable set of bindings is prepared (targeted for 0.5.0).
See the Changelog for a detailed list of contributions to this release.
Breaking Changes
Changes included in the nanoarrow 0.4.0 release will not break most downstream code; however, several changes in the C library may result in additional compiler warnings that could cause downstream build failures for projects with strict compiler warning policies.
First, in debug mode (i.e., when NANOARROW_DEBUG is defined), an ignored return value for functions that return ArrowErrorCode now issues a compiler warning for compilers that support an “unused result” attribute. Ignoring the return value of these functions is a common error, and return values that are not equal to NANOARROW_OK should be propagated as soon as possible. The C library provides tools to check return values in a readable way. Notably:
- NANOARROW_RETURN_NOT_OK() can be used in a wrapper function that also returns ArrowErrorCode.
- NANOARROW_THROW_NOT_OK() can be used from C++ code that includes nanoarrow.hpp and is prepared to handle exceptions.
- NANOARROW_ASSERT_OK() can be used to check for NANOARROW_OK only in debug mode (i.e., errors are silently ignored in release mode and cause a crash in debug mode with a message indicating the location of the error).
Of these, the first or second is preferred. The Getting Started with nanoarrow in C/C++ tutorial includes examples and advice for handling errors emanating from the nanoarrow C library.
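The early-return pattern behind the first option can be sketched without the library. The snippet below is a standalone illustration under stated assumptions, not nanoarrow's implementation: ErrorCode, SKETCH_OK, SKETCH_RETURN_NOT_OK(), append_value(), and append_all() are hypothetical stand-ins chosen for this example.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for ArrowErrorCode and NANOARROW_OK. */
typedef int ErrorCode;
#define SKETCH_OK 0

/* Hypothetical macro illustrating the early-return idea: evaluate the
 * expression once and return from the calling function on any failure. */
#define SKETCH_RETURN_NOT_OK(EXPR)                  \
  do {                                              \
    const ErrorCode sketch_rc_ = (EXPR);            \
    if (sketch_rc_ != SKETCH_OK) return sketch_rc_; \
  } while (0)

/* A fallible step: fails with EINVAL for negative input. */
static ErrorCode append_value(int* total, int value) {
  if (value < 0) return EINVAL;
  *total += value;
  return SKETCH_OK;
}

/* A wrapper that also returns an error code, so the first failure is
 * propagated to the caller instead of being ignored. */
static ErrorCode append_all(int* total, const int* values, size_t n) {
  for (size_t i = 0; i < n; i++) {
    SKETCH_RETURN_NOT_OK(append_value(total, values[i]));
  }
  return SKETCH_OK;
}
```

Because the wrapper returns as soon as one step fails, the caller sees the original error code and no later steps run on inconsistent state.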
Second, in debug mode (i.e., when NANOARROW_DEBUG is defined), the appropriate attribute was added to check the format string passed to ArrowErrorSet() against the provided arguments. Correct code will be unaffected by this change; however, actual arguments that do not match the format string (e.g., an int64_t that is passed to ArrowErrorSet() with a format string of "%d") should be cast to the appropriate C type (e.g., int) or the format string should be fixed to support the type of the actual argument (e.g., using "%" PRId64).
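Both fixes can be illustrated with a small standalone sketch. It assumes a GCC- or Clang-style printf format attribute; struct SketchError, SKETCH_CHECK_PRINTF(), sketch_error_set(), report_length(), and report_index() are hypothetical names for this example and are not part of nanoarrow.

```c
#include <inttypes.h>
#include <stdarg.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for an error struct with a fixed-size message. */
struct SketchError {
  char message[128];
};

/* Ask GCC/Clang to check the format string against the arguments;
 * expands to nothing on compilers without the attribute. */
#if defined(__GNUC__)
#define SKETCH_CHECK_PRINTF(fmt_idx, first_arg) \
  __attribute__((format(printf, fmt_idx, first_arg)))
#else
#define SKETCH_CHECK_PRINTF(fmt_idx, first_arg)
#endif

static void sketch_error_set(struct SketchError* error, const char* fmt, ...)
    SKETCH_CHECK_PRINTF(2, 3);

/* Format a message into the error struct (truncating if it is too long). */
static void sketch_error_set(struct SketchError* error, const char* fmt, ...) {
  va_list args;
  va_start(args, fmt);
  vsnprintf(error->message, sizeof(error->message), fmt, args);
  va_end(args);
}

/* Option 1: fix the format string so it matches the int64_t argument. */
static void report_length(struct SketchError* error, int64_t length) {
  sketch_error_set(error, "Expected length >= 0 but found %" PRId64, length);
}

/* Option 2: cast the argument to the type the format string expects
 * (only appropriate when the value is known to fit in an int). */
static void report_index(struct SketchError* error, int64_t index) {
  sketch_error_set(error, "Invalid index %d", (int)index);
}
```

With the attribute in place, passing an int64_t directly to a "%d" conversion is typically reported under -Wformat (part of -Wall in GCC and Clang), which is the kind of new warning described above.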
Third, functions in the C library that do not take ownership of or modify input are now properly marked as const. For example, ArrowArrayViewGetIntUnsafe() previously accepted a struct ArrowArrayView* and now accepts a const struct ArrowArrayView*. This change makes it more difficult to accidentally modify input intended to be read-only and improves usability from C++. Downstream projects that get a new warning about discarding a const qualifier may need to adjust variable declarations or formal parameter types; however, most projects should be unaffected by this change.
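The kind of adjustment involved can be sketched with a standalone example. Everything here is hypothetical and only mirrors the shape of the change: struct SketchView, sketch_view_get(), sketch_view_data(), and sketch_view_sum() are illustration names, not nanoarrow functions.

```c
#include <stdint.h>

/* Hypothetical stand-in for a read-only view over int64 values. */
struct SketchView {
  const int64_t* values;
  int64_t length;
};

/* The getter does not modify the view, so it accepts a const pointer. */
static int64_t sketch_view_get(const struct SketchView* view, int64_t i) {
  return view->values[i];
}

/* An accessor that hands out read-only data. */
static const int64_t* sketch_view_data(const struct SketchView* view) {
  return view->values;
}

/* Downstream helper with const-correct declarations: the formal parameter
 * is const, and the local variable receiving read-only data is const too.
 * Declaring `data` as a plain int64_t* would discard the qualifier and
 * draw a compiler diagnostic. */
static int64_t sketch_view_sum(const struct SketchView* view) {
  const int64_t* data = sketch_view_data(view);
  int64_t sum = 0;
  for (int64_t i = 0; i < view->length; i++) {
    sum += data[i];
  }
  return sum;
}
```

Callers that hold a non-const pointer can still pass it to these functions, since adding const in a call is always allowed; only code that tries to drop the qualifier needs to change.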
C/C++
The nanoarrow 0.4.0 release includes a number of bugfixes and improvements to the core C library and C++ helpers.
- An implementation of the C data interface integration test was added, including a reader/writer for the Arrow integration testing JSON format. This was used to improve test coverage of the IPC reader and to add nanoarrow as a participating member of integration testing in the CI job that runs in the main Arrow repository.
- The C library now supports a wider range of extended compiler warnings to make it easier to vendor in projects with strict compiler warning policies.
- C++ helpers were improved to support const-correctness. As a result, UniqueSchema, UniqueArray, UniqueArrayView, and UniqueBuffer now work with a wider variety of C++ wrappers (e.g., std::unordered_map).
R bindings
The nanoarrow R bindings are distributed as the nanoarrow package on CRAN. The 0.4.0 release of the R bindings includes improvements in type support and stability. Notably:
- Documentation was improved for low-level users of nanoarrow that are producing or consuming ArrowArray, ArrowSchema, and/or ArrowArrayStream structures from C or C++ code in other R packages.
- Improved conversion of list()s to support more types when the arrow R package is not available.
- Added more implementations of as_nanoarrow_array_stream() to support more object types from the arrow R package.
- Added conversion from Arrow integer arrays to character().
Python bindings
The nanoarrow 0.4.0 release is the first release that contains Python bindings to the nanoarrow C library! These initial Python bindings are experimental and are provided to solicit an initial round of feedback from the Arrow community. Like the nanoarrow C library and R bindings, they provide tools to facilitate the use of the Arrow C Data and Arrow C Stream interfaces.
You can install the initial release of the Python bindings from PyPI. The nanoarrow
Python package has been submitted to conda-forge and should be available once the
recipe has been reviewed.
pip install nanoarrow
The initial release of the Python bindings contains repr()s to print out human-readable representations of structures in the Arrow C Data and Stream interfaces.
import nanoarrow as na
import pyarrow as pa
na.c_schema(pa.decimal128(10, 3))
<nanoarrow.c_lib.CSchema decimal128(10, 3)>
- format: 'd:10,3'
- name: ''
- flags: 2
- metadata: NULL
- dictionary: NULL
- children[0]:
na.c_array(pa.array(["one", "two", "three", None]))
<nanoarrow.c_lib.CArray string>
- length: 4
- offset: 0
- null_count: 1
- buffers: (2939032895680, 2939032895616, 2939032895744)
- dictionary: NULL
- children[0]:
In addition to Arrow C Data interface wrappers, the initial nanoarrow Python bindings expose wrappers for a few nanoarrow C library types like the ArrowArrayView that can be used to interpret the content of the raw structures.
na.c_array_view(pa.array(["one", "two", "three", None]))
<nanoarrow.c_lib.CArrayView>
- storage_type: 'string'
- length: 4
- offset: 0
- null_count: 1
- buffers[3]:
- <bool validity[1 b] 11100000>
- <int32 data_offset[20 b] 0 3 6 11 11>
- <string data[11 b] b'onetwothree'>
- dictionary: NULL
- children[0]:
Finally, the initial bindings contain a user-facing “data type” class. The Schema, like its C Data interface counterpart, can represent a pyarrow.Schema, a pyarrow.Field, or a pyarrow.DataType.
na.int32()
Schema(INT32)
na.struct({"col1": na.int32()})
Schema(STRUCT, fields=[Schema(INT32, name='col1')])
The next release of nanoarrow for Python will include a user-facing Array class among other improvements and features based on community feedback! For a more in-depth review of the initial Python bindings, see the Getting started in Python guide and the Python API reference.
Contributors
This release consists of contributions from 5 contributors in addition to the invaluable advice and support of the Apache Arrow developer mailing list.
$ git shortlog -sn 798a1b8f096c84e2b6f887427649f1cb496412b2..apache-arrow-nanoarrow-0.4.0 | grep -v "GitHub Actions"
35 Dewey Dunnington
3 William Ayd
2 Dirk Eddelbuettel
2 Joris Van den Bossche
1 eitsupi