Apache Arrow nanoarrow 0.6.0 Release


Published 07 Oct 2024
By The Apache Arrow PMC (pmc)

The Apache Arrow team is pleased to announce the 0.6.0 release of Apache Arrow nanoarrow. This release covers 114 resolved issues from 10 contributors.

Release Highlights

  • Run End Encoding support
  • StringView support
  • IPC Write support
  • DLPack/device support
  • IPC/Device available from CMake/Meson as feature flags

See the Changelog for a detailed list of contributions to this release.

Breaking Changes

Most changes included in the nanoarrow 0.6.0 release will not break downstream code; however, two changes with respect to packaging and distribution may require users to update the code used to bring nanoarrow in as a dependency.

In nanoarrow 0.5.0 and earlier, the bundled single-file amalgamation was included in the dist/ subdirectory or could be generated using a specially-crafted CMake command. The nanoarrow 0.6.0 release removes the pre-compiled includes and migrates the code used to generate it to Python. This setup is less confusing for contributors (whose editors would frequently jump into the wrong nanoarrow.h) and is a less confusing use of CMake. Users can generate the dist/ subdirectory as it previously existed with:

python ci/scripts/bundle.py \
  --source-output-dir=dist \
  --include-output-dir=dist \
  --header-namespace= \
  --with-device \
  --with-ipc \
  --with-testing \
  --with-flatcc

Second, the Arrow IPC and ArrowDeviceArray implementations previously lived in the extensions/ subdirectory of the repository. This was helpful during the initial development of these features; however, the nanoarrow 0.6.0 release added the requisite feature coverage and testing such that the appropriate home for them is now the main src/ directory. As such, one can now build nanoarrow with IPC and/or device support using:

cmake -S . -B build -DNANOARROW_IPC=ON -DNANOARROW_DEVICE=ON

Features

Float16, StringView, and REE Support

The nanoarrow 0.6.0 release adds support for Arrow’s float16 (half float), string view, and run-end encoding support. The C library supports building float16 arrays using ArrowArrayAppendDouble() and supports building string view and binary view arrays using ArrowArrayAppendString() and/or ArrowArrayAppendBytes() and supports consuming using ArrowArrayViewGetStringUnsafe() and/or ArrowArrayViewGetBytesUnsafe(). R and Python users can request a string view or float16 type when building an array, and conversion back to R/Python strings is suppored.

# pip install nanoarrow
# conda install nanoarrow -c conda-forge
import nanoarrow as na

na.Array(["abc", "def", None], na.string_view())
#> nanoarrow.Array<string_view>[3]
#> 'abc'
#> 'def'
#> None
na.Array([1, 2, 3], na.float16())
#> nanoarrow.Array<half_float>[3]
#> 1.0
#> 2.0
#> 3.0
# install.packages("nanoarrow")
library(nanoarrow)

as_nanoarrow_array(c("abc", "def", NA), schema = na_string_view()) |>
  convert_array()
#> [1] "abc" "def" NA
as_nanoarrow_array(c(1, 2, 3), schema = na_half_float()) |>
  convert_array()
#> [1] 1 2 3

Creating/consuming run-end encoding arrays by element is not yet supported in C, R, or Python; however, arrays can be built or consumed by assembling the correct array/buffer structure in C.

Thank you to cocoa-xu for adding float16 and run-end encoding support and thank you to WillAyd for adding string view support!

IPC Write Support

The nanoarrow library has supported reading Arrow IPC streams since 0.4.0; however, could not write streams of its own. The nanoarrow 0.6.0 release adds support for stream writing from C using the ArrowIpcWriter and stream writing from R and Python:

import io
import nanoarrow as na
from nanoarrow import ipc

out = io.BytesIO()
with ipc.StreamWriter.from_writable(out) as writer:
    writer.write_stream(ipc.InputStream.example())

out.seek(0)
na.ArrayStream.from_readable(out).read_all()
#> nanoarrow.Array<non-nullable struct<some_col: int32>>[3]
#> {'some_col': 1}
#> {'some_col': 2}
#> {'some_col': 3}
library(nanoarrow)

tf <- tempfile()
nycflights13::flights |> write_nanoarrow(tf)

read_nanoarrow(tf) |> tibble::as_tibble()
#> # A tibble: 336,776 × 19
#>     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#>    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
#>  1  2013     1     1      517            515         2      830            819
#>  2  2013     1     1      533            529         4      850            830
#>  3  2013     1     1      542            540         2      923            850
#>  4  2013     1     1      544            545        -1     1004           1022
#>  5  2013     1     1      554            600        -6      812            837
#>  6  2013     1     1      554            558        -4      740            728
#>  7  2013     1     1      555            600        -5      913            854
#>  8  2013     1     1      557            600        -3      709            723
#>  9  2013     1     1      557            600        -3      838            846
#> 10  2013     1     1      558            600        -2      753            745
#> # ℹ 336,766 more rows
#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#> #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#> #   hour <dbl>, minute <dbl>, time_hour <dttm>

As a result of the IPC write support, nanoarrow now joins the Arrow IPC integration tests to ensure compatability across implementations. With the exception of arrow-rs due to a bug in the Rust flatbuffers implementation, nanoarrow is now tested against all participating Arrow implementations with every commit.

A huge thank you to bkietz for implementing this support and the tests (which included multiple bugfixes and identification of inconsistencies of flatbuffer verification in C, Rust, and C++!).

DLPack/CUDA Support

The nanoarrow 0.6.0 release includes improved support for the Arrow C Device data interface. In particular, the CUDA device implementation was improved to more efficiently coordinate synchronization when copying arrays to/from the GPU and migrated to use the driver API for wider compatibility. The nanoarrow Python bindings have limited support for creating ArrowDeviceArray wrappers that implement the __arrow_c_device_array__ protocol from anything that implements DLPack:

# Currently requires:
# export NANOARROW_PYTHON_CUDA=/usr/local/cuda
# pip install --force-reinstall --no-binary=":all:" nanoarrow
import nanoarrow as na
from nanoarrow import device
import cupy as cp

device.c_device_array(cp.array([1, 2, 3]))
#> <nanoarrow.device.CDeviceArray>
#> - device_type: CUDA <2>
#> - device_id: 0
#> - array: <nanoarrow.c_array.CArray int64>
#>   - length: 3
#>   - offset: 0
#>   - null_count: 0
#>   - buffers: (0, 133980798058496)
#>   - dictionary: NULL
#>   - children[0]:

darray = device.c_device_array(cp.array([1, 2, 3]))
cp.from_dlpack(darray.array.view().buffer(1))
#> array([1, 2, 3])

Thank you to AlenkaF, shwina, and danepitkin for their contributions to and review of this feature!

Build System Support for IPC/Device

Lastly, the CMake build system was refactored to enable FetchContent to work in an even wider variety of develop/build/install scenarios. In most cases, CMake-based projects should be able to add the nanoarrow C library with device and/or IPC support as a dependency with:

include(FetchContent)

# If required:
# set(NANOARROW_IPC ON)
# set(NANOARROW_DEVICE ON)
fetchcontent_declare(nanoarrow
                     URL "https://www.apache.org/dyn/closer.lua?action=download&filename=arrow/nanoarrow-0.6.0/apache-arrow-0.6.0.tar.gz")
fetchcontent_makeavailable(nanoarrow)

add_executable(some_target ...)
target_link_libraries(some_target
                      PRIVATE
                      nanoarrow::nanoarrow
                      # If needed
                      # nanoarrow::nanoarrow_ipc
                      # nanoarrow::nanoarrow_device
                      )

Linking against nanoarrow installed via cmake --install and located via find_package() is also supported.

Users of the Meson build system can install the latest nanoarrow with:

mkdir subprojects
meson wrap install nanoarrow

…and declared as a dependency with:

nanoarrow_dep = dependency('nanoarrow')
example_exec = executable('example_meson_minimal_app',
                          'src/app.cc',
                          dependencies: [nanoarrow_dep])

Contributors

This release consists of contributions from 10 contributors in addition to the invaluable advice and support of the Apache Arrow community.

$ git shortlog -sn apache-arrow-nanoarrow-0.6.0.dev..apache-arrow-nanoarrow-0.6.0 | grep -v "GitHub Actions"
    64  Dewey Dunnington
    19  William Ayd
    16  Benjamin Kietzman
     5  Cocoa
     2  Abhishek Singh
     1  Ashwin Srinath
     1  Dane Pitkin
     1  Jacob Wujciak-Jens
     1  Matt Topol
     1  Tao Zuhong