Apache Arrow nanoarrow 0.6.0 Release
Published 07 Oct 2024 by The Apache Arrow PMC (pmc)
The Apache Arrow team is pleased to announce the 0.6.0 release of Apache Arrow nanoarrow. This release covers 114 resolved issues from 10 contributors.
Release Highlights
- Run End Encoding support
- StringView support
- IPC Write support
- DLPack/device support
- IPC/Device available from CMake/Meson as feature flags
See the Changelog for a detailed list of contributions to this release.
Breaking Changes
Most changes included in the nanoarrow 0.6.0 release will not break downstream code; however, two changes with respect to packaging and distribution may require users to update the code used to bring nanoarrow in as a dependency.
First, in nanoarrow 0.5.0 and earlier, the bundled single-file amalgamation was included in
the dist/ subdirectory or could be generated using a specially-crafted CMake command. The
nanoarrow 0.6.0 release removes the pre-generated dist/ files and migrates the code used to
generate them to Python. This setup is less confusing for contributors (whose editors would
frequently jump into the wrong nanoarrow.h) and is a more straightforward use of CMake. Users
can generate the dist/ subdirectory as it previously existed with:
python ci/scripts/bundle.py \
--source-output-dir=dist \
--include-output-dir=dist \
--header-namespace= \
--with-device \
--with-ipc \
--with-testing \
--with-flatcc
Second, the Arrow IPC and ArrowDeviceArray implementations previously lived in the extensions/
subdirectory of the repository. This was helpful during the initial development of these
features; however, the nanoarrow 0.6.0 release adds enough feature coverage and testing
that the appropriate home for them is now the main src/ directory. As such, nanoarrow can
now be built with IPC and/or device support using:
cmake -S . -B build -DNANOARROW_IPC=ON -DNANOARROW_DEVICE=ON
Features
Float16, StringView, and REE Support
The nanoarrow 0.6.0 release adds support for Arrow’s float16 (half float), string view,
and run-end encoded types. The C library supports building float16 arrays using
ArrowArrayAppendDouble(), building string view and binary view arrays using
ArrowArrayAppendString() and/or ArrowArrayAppendBytes(), and consuming them using
ArrowArrayViewGetStringUnsafe() and/or ArrowArrayViewGetBytesUnsafe(). R and
Python users can request a string view or float16 type when building an array, and
conversion back to R/Python strings is supported.
# pip install nanoarrow
# conda install nanoarrow -c conda-forge
import nanoarrow as na
na.Array(["abc", "def", None], na.string_view())
#> nanoarrow.Array<string_view>[3]
#> 'abc'
#> 'def'
#> None
na.Array([1, 2, 3], na.float16())
#> nanoarrow.Array<half_float>[3]
#> 1.0
#> 2.0
#> 3.0
# install.packages("nanoarrow")
library(nanoarrow)
as_nanoarrow_array(c("abc", "def", NA), schema = na_string_view()) |>
  convert_array()
#> [1] "abc" "def" NA
as_nanoarrow_array(c(1, 2, 3), schema = na_half_float()) |>
  convert_array()
#> [1] 1 2 3
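For readers unfamiliar with the string view type, the following is a minimal pure-Python sketch of the 16-byte "view" layout described in the Arrow columnar format: values of 12 bytes or fewer are stored inline, and longer values store a 4-byte prefix plus a buffer index and offset into a separate data buffer. The helper names (encode_view, build_views, decode_view) are hypothetical and are not part of the nanoarrow API. The sketch ignores nulls and always uses a single data buffer.

```python
import struct

INLINE_MAX = 12  # values up to 12 bytes live inside the 16-byte view itself


def encode_view(data: bytes, buffer_index: int, offset: int) -> bytes:
    """Pack one 16-byte view (hypothetical helper, not a nanoarrow function)."""
    if len(data) <= INLINE_MAX:
        # int32 length, then the bytes themselves, zero-padded to 12 bytes
        return struct.pack("<i12s", len(data), data)
    # int32 length, 4-byte prefix, int32 buffer index, int32 offset
    return struct.pack("<i4sii", len(data), data[:4], buffer_index, offset)


def build_views(values):
    """Return (views, data_buffer) for a list of non-null Python strings."""
    views = bytearray()
    data_buffer = bytearray()
    for value in values:
        raw = value.encode("utf-8")
        views += encode_view(raw, 0, len(data_buffer))
        if len(raw) > INLINE_MAX:
            # Only long values occupy space in the data buffer
            data_buffer += raw
    return bytes(views), bytes(data_buffer)


def decode_view(view: bytes, data_buffers) -> str:
    """Resolve one 16-byte view back to a Python string."""
    (length,) = struct.unpack_from("<i", view, 0)
    if length <= INLINE_MAX:
        return view[4:4 + length].decode("utf-8")
    _prefix, buffer_index, offset = struct.unpack_from("<4sii", view, 4)
    return data_buffers[buffer_index][offset:offset + length].decode("utf-8")


views, data = build_views(["abc", "a string longer than twelve bytes"])
print(len(views))  # two views of 16 bytes each
print(decode_view(views[16:32], [data]))
```

The fixed-width views are what allow short strings to be read without touching a second buffer, which is the main motivation usually given for the type.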
Creating/consuming run-end encoded arrays by element is not yet supported in C, R, or Python; however, arrays can be built or consumed by assembling the correct array/buffer structure in C.
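As an illustration of the structure involved, a run-end encoded array in the Arrow columnar format has two children: run_ends (the exclusive logical end index of each run) and values (one value per run). The following is a minimal pure-Python sketch of that idea using plain lists. The function names are hypothetical and this is not nanoarrow code. Runs are detected with Python equality, so the input is assumed to hold elements of a single type.

```python
from bisect import bisect_right


def run_end_encode(values):
    """Return (run_ends, run_values) for a Python list (illustrative only)."""
    run_ends, run_values = [], []
    for i, value in enumerate(values):
        if run_values and run_values[-1] == value:
            # Same value as the current run: extend the run by one element
            run_ends[-1] = i + 1
        else:
            # New run: record its value and its (exclusive) logical end
            run_ends.append(i + 1)
            run_values.append(value)
    return run_ends, run_values


def get_element(run_ends, run_values, i):
    """Look up logical element i with a binary search over run_ends."""
    length = run_ends[-1] if run_ends else 0
    if not 0 <= i < length:
        raise IndexError(i)
    # The first run whose end is greater than i contains element i
    return run_values[bisect_right(run_ends, i)]


run_ends, run_values = run_end_encode(["a", "a", "a", "b", "b", None])
print(run_ends)    # [3, 5, 6]
print(run_values)  # ['a', 'b', None]
print(get_element(run_ends, run_values, 3))  # b
```

Note that nulls appear as entries in the values child rather than in a validity bitmap on the parent array.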
Thank you to cocoa-xu for adding float16 and run-end encoding support and thank you to WillAyd for adding string view support!
IPC Write Support
The nanoarrow library has supported reading Arrow IPC streams since 0.4.0; however, it
could not write streams of its own. The nanoarrow 0.6.0 release adds support for stream
writing from C using the ArrowIpcWriter and for stream writing from R and Python:
import io
import nanoarrow as na
from nanoarrow import ipc
out = io.BytesIO()
with ipc.StreamWriter.from_writable(out) as writer:
    writer.write_stream(ipc.InputStream.example())
out.seek(0)
na.ArrayStream.from_readable(out).read_all()
#> nanoarrow.Array<non-nullable struct<some_col: int32>>[3]
#> {'some_col': 1}
#> {'some_col': 2}
#> {'some_col': 3}
library(nanoarrow)
tf <- tempfile()
nycflights13::flights |> write_nanoarrow(tf)
read_nanoarrow(tf) |> tibble::as_tibble()
#> # A tibble: 336,776 × 19
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
#> 1 2013 1 1 517 515 2 830 819
#> 2 2013 1 1 533 529 4 850 830
#> 3 2013 1 1 542 540 2 923 850
#> 4 2013 1 1 544 545 -1 1004 1022
#> 5 2013 1 1 554 600 -6 812 837
#> 6 2013 1 1 554 558 -4 740 728
#> 7 2013 1 1 555 600 -5 913 854
#> 8 2013 1 1 557 600 -3 709 723
#> 9 2013 1 1 557 600 -3 838 846
#> 10 2013 1 1 558 600 -2 753 745
#> # ℹ 336,766 more rows
#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#> # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#> # hour <dbl>, minute <dbl>, time_hour <dttm>
As a result of the IPC write support, nanoarrow now participates in the Arrow IPC integration tests to ensure compatibility across implementations. With the exception of arrow-rs, due to a bug in the Rust flatbuffers implementation, nanoarrow is now tested against all participating Arrow implementations with every commit.
A huge thank you to bkietz for implementing this support and the tests (which included multiple bugfixes and the identification of inconsistencies in flatbuffer verification in C, Rust, and C++!).
DLPack/CUDA Support
The nanoarrow 0.6.0 release includes improved support for the Arrow C Device data interface.
In particular, the CUDA device implementation was improved to more efficiently coordinate
synchronization when copying arrays to/from the GPU and was migrated to use the driver API
for wider compatibility. The nanoarrow Python bindings have limited support for creating
ArrowDeviceArray wrappers that implement the __arrow_c_device_array__ protocol
from anything that implements DLPack:
# Currently requires:
# export NANOARROW_PYTHON_CUDA=/usr/local/cuda
# pip install --force-reinstall --no-binary=":all:" nanoarrow
import nanoarrow as na
from nanoarrow import device
import cupy as cp
device.c_device_array(cp.array([1, 2, 3]))
#> <nanoarrow.device.CDeviceArray>
#> - device_type: CUDA <2>
#> - device_id: 0
#> - array: <nanoarrow.c_array.CArray int64>
#> - length: 3
#> - offset: 0
#> - null_count: 0
#> - buffers: (0, 133980798058496)
#> - dictionary: NULL
#> - children[0]:
darray = device.c_device_array(cp.array([1, 2, 3]))
cp.from_dlpack(darray.array.view().buffer(1))
#> array([1, 2, 3])
Thank you to AlenkaF, shwina, and danepitkin for their contributions to and review of this feature!
Build System Support for IPC/Device
Lastly, the CMake build system was refactored to enable FetchContent to work in an even
wider variety of develop/build/install scenarios. In most cases, CMake-based projects should
be able to add the nanoarrow C library with device and/or IPC support as a dependency with:
include(FetchContent)
# If required:
# set(NANOARROW_IPC ON)
# set(NANOARROW_DEVICE ON)
fetchcontent_declare(nanoarrow
URL "https://www.apache.org/dyn/closer.lua?action=download&filename=arrow/nanoarrow-0.6.0/apache-arrow-0.6.0.tar.gz")
fetchcontent_makeavailable(nanoarrow)
add_executable(some_target ...)
target_link_libraries(some_target
PRIVATE
nanoarrow::nanoarrow
# If needed
# nanoarrow::nanoarrow_ipc
# nanoarrow::nanoarrow_device
)
Linking against nanoarrow installed via cmake --install and located via find_package()
is also supported.
Users of the Meson build system can install the latest nanoarrow with:
mkdir subprojects
meson wrap install nanoarrow
…and declare it as a dependency with:
nanoarrow_dep = dependency('nanoarrow')
example_exec = executable('example_meson_minimal_app',
'src/app.cc',
dependencies: [nanoarrow_dep])
Contributors
This release consists of contributions from 10 contributors in addition to the invaluable advice and support of the Apache Arrow community.
$ git shortlog -sn apache-arrow-nanoarrow-0.6.0.dev..apache-arrow-nanoarrow-0.6.0 | grep -v "GitHub Actions"
64 Dewey Dunnington
19 William Ayd
16 Benjamin Kietzman
5 Cocoa
2 Abhishek Singh
1 Ashwin Srinath
1 Dane Pitkin
1 Jacob Wujciak-Jens
1 Matt Topol
1 Tao Zuhong