Getting started with nanoarrow
==============================
This tutorial provides a short example of writing a C++ library that
exposes an Arrow-based API and uses nanoarrow to implement a simple text
file reader/writer. In general, nanoarrow can help you write a library
or application that:
- exposes an Arrow-based API to read from a data source or format,
- exposes an Arrow-based API to write to a data source or format,
- exposes one or more compute functions that operate on and produce
  data in the form of Arrow arrays, and/or
- exposes an extension type implementation.
Because Arrow has bindings in many languages, you or others can easily
bind or use your tool in higher-level runtimes like R, Java, C++,
Python, Rust, Julia, Go, or Ruby, among others.
The nanoarrow library is not the only way that an Arrow-based API can be
implemented: Arrow C++, Rust, and Go are all excellent choices and can
compile into static libraries that are C-linkable from other languages;
however, existing Arrow implementations produce relatively large static
libraries and can present complex build-time or run-time linking
requirements depending on the implementation and features used. If the
set of libraries you’re working with already provides the conveniences
you require, nanoarrow may provide all the functionality you need.
Now that we’ve talked about why you might want to build a library with
nanoarrow…let’s build one!
.. note::

   This tutorial also goes over some of the basic structure of writing a C++ library.
   If you already know how to do this, feel free to scroll to the code examples provided
   below or take a look at the
   `final example project `__.

The library
-----------
The library we’ll write in this tutorial is a simple text processing
library that splits and reassembles lines of text. It will be able to:
- Read text from a buffer into an ``ArrowArray`` as one element per
  line, and
- Write elements of an ``ArrowArray`` into a buffer, inserting line
  breaks after every element.
For the sake of argument, we’ll call it ``linesplitter``.
The development environment
---------------------------
There are many excellent IDEs that can be used to develop C and C++
libraries. For this tutorial, we will use
`VSCode `__ and
`CMake `__. You’ll need both installed to follow
along: VSCode can be downloaded from the official site for most
platforms; CMake is typically installed via your favourite package
manager (e.g., ``brew install cmake``, ``apt-get install cmake``,
``dnf install cmake``, etc.). You will also need a C and C++ compiler:
on macOS these can be installed using ``xcode-select --install``; on
Linux you will need the packages that provide ``gcc``, ``g++``, and
``make`` (e.g., ``apt-get install build-essential``); on Windows you
will need to install `Visual
Studio `__ and CMake from
the official download pages.
Once you have VSCode installed, ensure you have the **CMake Tools** and
**C/C++** extensions installed. Once your environment is set up, create
a folder called ``linesplitter`` and open it using **File -> Open
Folder**.
The interface
-------------
We’ll expose the interface to our library as a header called
``linesplitter.h``. To ensure the definitions are only included once in
any given source file, we’ll add the following line at the top:
.. code:: cpp

   #pragma once

Then, we need the `Arrow C Data
interface `__
itself, since it provides the type definitions that are recognized by
other Arrow implementations on which our API will be built. It’s
designed to be copied and pasted in this way: there’s no need to put it
in another file or include something from another project.
.. code:: cpp

   #include <stdint.h>

   #ifndef ARROW_C_DATA_INTERFACE
   #define ARROW_C_DATA_INTERFACE

   #define ARROW_FLAG_DICTIONARY_ORDERED 1
   #define ARROW_FLAG_NULLABLE 2
   #define ARROW_FLAG_MAP_KEYS_SORTED 4

   struct ArrowSchema {
     // Array type description
     const char* format;
     const char* name;
     const char* metadata;
     int64_t flags;
     int64_t n_children;
     struct ArrowSchema** children;
     struct ArrowSchema* dictionary;

     // Release callback
     void (*release)(struct ArrowSchema*);
     // Opaque producer-specific data
     void* private_data;
   };

   struct ArrowArray {
     // Array data description
     int64_t length;
     int64_t null_count;
     int64_t offset;
     int64_t n_buffers;
     int64_t n_children;
     const void** buffers;
     struct ArrowArray** children;
     struct ArrowArray* dictionary;

     // Release callback
     void (*release)(struct ArrowArray*);
     // Opaque producer-specific data
     void* private_data;
   };

   #endif  // ARROW_C_DATA_INTERFACE

Next, we’ll provide definitions for the functions we’ll implement below:
.. code:: cpp

   // Builds an ArrowArray of type string that will contain one element for each line
   // in src and places it into out.
   //
   // On success, returns {0, ""}; on error, returns {<errno code>, <error message>}
   std::pair<int, std::string> linesplitter_read(const std::string& src,
                                                 struct ArrowArray* out);

   // Concatenates all elements of a string ArrowArray inserting a newline between
   // elements.
   //
   // On success, returns {0, <result>}; on error, returns {<errno code>, <error message>}
   std::pair<int, std::string> linesplitter_write(struct ArrowArray* input);

.. note::

   You may notice that we don't include or mention nanoarrow in any way in the header
   that is exposed to users. Because nanoarrow is designed to be vendored and is not
   distributed as a system library, it is not safe for users of your library to
   ``#include "nanoarrow.h"`` because it might conflict with another library that does
   the same (with possibly a different version of nanoarrow).

Arrow C data/nanoarrow interface basics
---------------------------------------
Now that we’ve seen the functions we need to implement and the Arrow
types exposed in the C data interface, let’s unpack a few basics about
using the Arrow C data interface and a few conventions used in the
nanoarrow implementation.
First, let’s discuss the ``ArrowSchema`` and the ``ArrowArray``. You can
think of an ``ArrowSchema`` as an expression of a data type, whereas an
``ArrowArray`` is the data itself. These structures accommodate nested
types: columns are encoded in the ``children`` member of each. You
always need to know the data type of an ``ArrowArray`` before accessing
its contents. In our case we only operate on arrays of one type
(“string”) and document that in our interface; for functions that
operate on more than one type of array you will need to accept an
``ArrowSchema`` and inspect it (e.g., using nanoarrow’s helper
functions).
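As a minimal sketch of what inspecting a schema can look like without any
helper library, the check below reads the ``format`` member directly. In the
Arrow C data interface, the format string ``u`` denotes a UTF-8 string array
with 32-bit offsets. The name ``is_string_schema()`` is hypothetical (it is not
part of nanoarrow), and the struct is repeated only so that the snippet stands
alone; in the real project you would include ``linesplitter.h`` instead.

```cpp
#include <cstdint>
#include <cstring>

// Same layout as the ArrowSchema definition in linesplitter.h, repeated here
// only so that this sketch compiles on its own.
struct ArrowSchema {
  const char* format;
  const char* name;
  const char* metadata;
  int64_t flags;
  int64_t n_children;
  struct ArrowSchema** children;
  struct ArrowSchema* dictionary;
  void (*release)(struct ArrowSchema*);
  void* private_data;
};

// Hypothetical helper: returns true if schema is valid (i.e., not released)
// and describes a UTF-8 string array with 32-bit offsets (format "u").
static bool is_string_schema(const ArrowSchema* schema) {
  return schema != nullptr && schema->release != nullptr &&
         schema->format != nullptr && std::strcmp(schema->format, "u") == 0;
}
```

A function that accepts more than one type would branch on the format string
in a similar way before touching the corresponding ``ArrowArray``.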
Second, let’s discuss error handling. You may have noticed in the
function definitions above that the first element of the returned pair
is an ``int``, which is an errno-compatible error code or ``0`` to
indicate success. Functions in nanoarrow that need to communicate more
detailed error information accept an ``ArrowError*`` argument (which can
be ``NULL`` if the caller does not care about the extra information).
Any nanoarrow function that might fail communicates errors in this way.
To avoid verbose code like the following:
.. code:: c

   int init_string_non_null(struct ArrowSchema* schema) {
     // schema is already a pointer, so it is passed as-is
     int code = ArrowSchemaInitFromType(schema, NANOARROW_TYPE_STRING);
     if (code != NANOARROW_OK) {
       return code;
     }

     schema->flags &= ~ARROW_FLAG_NULLABLE;
     return NANOARROW_OK;
   }

…you can use the ``NANOARROW_RETURN_NOT_OK()`` macro:
.. code:: c

   int init_string_non_null(struct ArrowSchema* schema) {
     // schema is already a pointer, so it is passed as-is
     NANOARROW_RETURN_NOT_OK(ArrowSchemaInitFromType(schema, NANOARROW_TYPE_STRING));
     schema->flags &= ~ARROW_FLAG_NULLABLE;
     return NANOARROW_OK;
   }

This works as long as your internal functions that use nanoarrow also
return ``int`` and/or accept an ``ArrowError*`` argument. This usually means
that there is an outer function that presents a more idiomatic interface
(e.g., returning ``std::optional<>`` or throwing an exception) and an
inner function that uses nanoarrow-style error handling. Embracing
``NANOARROW_RETURN_NOT_OK()`` is key to happiness when using the
nanoarrow library.
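To make the inner/outer split concrete without depending on nanoarrow, here is
a small self-contained sketch. ``EXAMPLE_RETURN_NOT_OK()`` is a hypothetical
stand-in that behaves in the same spirit as ``NANOARROW_RETURN_NOT_OK()`` (it
is not nanoarrow's definition), and the ``sum_digits*()`` functions are made up
purely for illustration.

```cpp
#include <cerrno>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for NANOARROW_RETURN_NOT_OK(): evaluate the expression
// once and return early from the enclosing function if it is non-zero.
#define EXAMPLE_RETURN_NOT_OK(EXPR)               \
  do {                                            \
    const int example_code_ = (EXPR);             \
    if (example_code_ != 0) return example_code_; \
  } while (0)

// Inner helper: errno-compatible int return value, result via out-parameter
static int parse_digit(char c, int* out) {
  if (c < '0' || c > '9') return EINVAL;
  *out = c - '0';
  return 0;
}

// Inner function: propagates any failure from parse_digit() to its caller
// and only writes to out on success
static int sum_digits_internal(const std::string& src, int* out) {
  int total = 0;
  for (char c : src) {
    int digit = 0;
    EXAMPLE_RETURN_NOT_OK(parse_digit(c, &digit));
    total += digit;
  }

  *out = total;
  return 0;
}

// Outer function: presents a more idiomatic C++ interface by translating
// a non-zero error code into an exception
static int sum_digits(const std::string& src) {
  int out = 0;
  const int code = sum_digits_internal(src, &out);
  if (code != 0) {
    throw std::invalid_argument("sum_digits(): input contains a non-digit character");
  }
  return out;
}
```

The outer function is the only place that needs to know how errors are
presented to users; everything below it can keep returning ``int``.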
Third, let’s discuss memory management. Because nanoarrow is implemented
in C and provides a C interface, the library by default uses C-style
memory management (i.e., if you allocate it, you clean it up). This is
unnecessary when you have C++ at your disposal, so nanoarrow also
provides a C++ header (``nanoarrow.hpp``) with
``std::unique_ptr<>``-like wrappers around anything that requires
explicit clean up. Whereas in C you might have to write code like this:
.. code:: c

   struct ArrowSchema schema;
   struct ArrowArray array;

   // Ok: if this fails and we return here, array has not been initialized yet
   NANOARROW_RETURN_NOT_OK(ArrowSchemaInitFromType(&schema, NANOARROW_TYPE_STRING));

   // Verbose: if this fails, we need to release schema before returning
   // or it will leak.
   int code = ArrowArrayInitFromSchema(&array, &schema, NULL);
   if (code != NANOARROW_OK) {
     schema.release(&schema);
     return code;
   }

…using the ``nanoarrow.hpp`` types we can do:
.. code:: cpp

   nanoarrow::UniqueSchema schema;
   nanoarrow::UniqueArray array;

   NANOARROW_RETURN_NOT_OK(ArrowSchemaInitFromType(schema.get(), NANOARROW_TYPE_STRING));
   NANOARROW_RETURN_NOT_OK(ArrowArrayInitFromSchema(array.get(), schema.get(), NULL));

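If you are curious why the wrappers make early returns safe, the following is a
simplified sketch of the idea: a holder whose destructor invokes the ``release``
callback if the structure is still valid. ``ScopedSchema``,
``example_release()``, ``example_init_string()``, and ``use_schema()`` are
hypothetical names used only for illustration; this is not how
``nanoarrow::UniqueSchema`` is actually implemented, and the struct is repeated
so that the snippet stands alone.

```cpp
#include <cerrno>
#include <cstdint>

// Same layout as the ArrowSchema definition in linesplitter.h
struct ArrowSchema {
  const char* format;
  const char* name;
  const char* metadata;
  int64_t flags;
  int64_t n_children;
  struct ArrowSchema** children;
  struct ArrowSchema* dictionary;
  void (*release)(struct ArrowSchema*);
  void* private_data;
};

// Hypothetical RAII holder: releases the schema (if any) when it goes out of scope
class ScopedSchema {
 public:
  ScopedSchema() = default;
  ScopedSchema(const ScopedSchema&) = delete;
  ScopedSchema& operator=(const ScopedSchema&) = delete;
  ~ScopedSchema() {
    // A null release callback means "nothing to release"
    if (schema_.release != nullptr) {
      schema_.release(&schema_);
    }
  }

  ArrowSchema* get() { return &schema_; }

 private:
  ArrowSchema schema_{};  // zero-initialized, so release starts out null
};

// Counts release calls so that the behaviour can be observed
static int g_release_count = 0;

static void example_release(ArrowSchema* schema) {
  g_release_count++;
  // The C data interface requires the callback to mark the struct as released
  schema->release = nullptr;
}

// Hypothetical producer that fills in a string schema with a release callback
static int example_init_string(ArrowSchema* schema) {
  *schema = ArrowSchema{};
  schema->format = "u";
  schema->flags = 2;  // ARROW_FLAG_NULLABLE
  schema->release = &example_release;
  return 0;
}

// The schema is released on both the early-return and the normal path
static int use_schema(bool fail_early) {
  ScopedSchema schema;
  example_init_string(schema.get());
  if (fail_early) return EINVAL;
  return 0;
}
```

Because cleanup lives in the destructor, every ``return`` statement (including
the ones hidden inside ``NANOARROW_RETURN_NOT_OK()``) releases whatever was
initialized up to that point.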
Building the library
--------------------
Our library implementation will live in ``linesplitter.cc``. Before
writing the actual implementations, let’s add just enough to our project
that we can build it using VSCode’s C/C++/CMake integration:
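.. code:: cpp

   #include <cerrno>
   #include <cstdint>
   #include <sstream>
   #include <string>
   #include <utility>

   #include "nanoarrow/nanoarrow.hpp"

   #include "linesplitter.h"

   // Stub implementations: just enough for the project to compile and link
   std::pair<int, std::string> linesplitter_read(const std::string& src,
                                                 struct ArrowArray* out) {
     return {ENOTSUP, ""};
   }

   std::pair<int, std::string> linesplitter_write(struct ArrowArray* input) {
     return {ENOTSUP, ""};
   }
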
We also need a ``CMakeLists.txt`` file that tells CMake and VSCode what
to build. CMake has a lot of options and can scale to coordinate very
large projects; however we only need a few lines to leverage VSCode’s
integration.
.. code:: cmake

   project(linesplitter)

   set(CMAKE_CXX_STANDARD 11)

   include(FetchContent)

   FetchContent_Declare(
     nanoarrow
     URL https://github.com/apache/arrow-nanoarrow/releases/download/apache-arrow-nanoarrow-0.2.0/apache-arrow-nanoarrow-0.2.0.tar.gz
     URL_HASH SHA512=38a100ae5c36a33aa330010eb27b051cff98671e9c82fff22b1692bb77ae61bd6dc2a52ac6922c6c8657bd4c79a059ab26e8413de8169eeed3c9b7fdb216c817)

   FetchContent_MakeAvailable(nanoarrow)

   add_library(linesplitter linesplitter.cc)
   target_link_libraries(linesplitter PRIVATE nanoarrow)

After saving ``CMakeLists.txt``, you may have to close and re-open the
``linesplitter`` directory in VSCode to activate the CMake integration.
From the command palette (i.e., Control/Command-Shift-P), choose
**CMake: Build**. If all went well, you should see a few lines of output
indicating progress towards building and linking ``linesplitter``.
.. note::

   Depending on your version of CMake you might also see a few warnings. This CMakeLists.txt
   is intentionally minimal and as such does not attempt to silence them.

.. note::

   If you're not using VSCode, you can accomplish the equivalent task in a terminal
   with ``mkdir build && cd build && cmake .. && cmake --build .``.

Building an ArrowArray
----------------------
The input for our ``linesplitter_read()`` function is an
``std::string``, which we’ll iterate over and add each detected line as
its own element. First, we’ll define a function for the core logic of
detecting the number of characters until the next ``\n`` or
end-of-string.
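.. code:: cpp

   // Returns the number of bytes before the next '\n', or src.size_bytes if
   // src contains no newline
   static int64_t find_newline(const ArrowStringView& src) {
     for (int64_t i = 0; i < src.size_bytes; i++) {
       if (src.data[i] == '\n') {
         return i;
       }
     }

     return src.size_bytes;
   }
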
The next function we’ll define is an internal function that uses
nanoarrow-style error handling. This uses the ``ArrowArrayAppend*()``
family of functions provided by nanoarrow to build the array:
.. code:: cpp

   static int linesplitter_read_internal(const std::string& src, ArrowArray* out,
                                         ArrowError* error) {
     nanoarrow::UniqueArray tmp;
     NANOARROW_RETURN_NOT_OK(ArrowArrayInitFromType(tmp.get(), NANOARROW_TYPE_STRING));
     NANOARROW_RETURN_NOT_OK(ArrowArrayStartAppending(tmp.get()));

     ArrowStringView src_view = {src.data(), static_cast<int64_t>(src.size())};
     ArrowStringView line_view;
     int64_t next_newline = -1;
     // find_newline() returns src_view.size_bytes when no newline remains, so the
     // final line is appended and size_bytes then becomes -1, which ends the loop
     while ((next_newline = find_newline(src_view)) >= 0) {
       line_view = {src_view.data, next_newline};
       NANOARROW_RETURN_NOT_OK(ArrowArrayAppendString(tmp.get(), line_view));
       // Skip past the line and its newline character
       src_view.data += next_newline + 1;
       src_view.size_bytes -= next_newline + 1;
     }

     NANOARROW_RETURN_NOT_OK(ArrowArrayFinishBuildingDefault(tmp.get(), error));

     // Hand the result to the caller only once everything has succeeded
     ArrowArrayMove(tmp.get(), out);
     return NANOARROW_OK;
   }

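To see which element boundaries this kind of loop produces without building an
Arrow array, here is a sketch of the same splitting logic using only the
standard library. ``split_lines()`` is a hypothetical name, and a
``std::vector<std::string>`` stands in for the array being built; it is meant
to mirror the behaviour traced from the loop above, not to replace it.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for the read loop: one element per line, where the
// text after the last '\n' (possibly empty) is always its own element.
static std::vector<std::string> split_lines(const std::string& src) {
  std::vector<std::string> lines;
  std::string::size_type start = 0;
  while (true) {
    const std::string::size_type newline = src.find('\n', start);
    if (newline == std::string::npos) {
      // No newline left: the remainder is the final element
      lines.push_back(src.substr(start));
      break;
    }

    lines.push_back(src.substr(start, newline - start));
    start = newline + 1;  // skip past the newline character
  }

  return lines;
}
```

Under this logic, input that ends with a newline yields a trailing empty
element, and empty input yields a single empty element. Since
``linesplitter_write()`` adds a line break after every element, text that has
been read and written back is not always byte-identical to the original.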
Finally, we define a wrapper that corresponds to the outer function
definition.
.. code:: cpp

   std::pair<int, std::string> linesplitter_read(const std::string& src, ArrowArray* out) {
     ArrowError error;
     // Start with an empty message in case the failing call did not set one
     error.message[0] = '\0';
     int code = linesplitter_read_internal(src, out, &error);
     if (code != NANOARROW_OK) {
       return {code, std::string(ArrowErrorMessage(&error))};
     } else {
       return {NANOARROW_OK, ""};
     }
   }

Reading an ArrowArray
---------------------
The input for our ``linesplitter_write()`` function is an
``ArrowArray*`` like the one we create in ``linesplitter_read()``. Just
as nanoarrow provides helpers to build arrays, it also provides helpers
to read them via the ``ArrowArrayView*()`` family of functions. Again,
we first define an internal function that uses nanoarrow-style error
handling:
.. code:: cpp

   static int linesplitter_write_internal(ArrowArray* input, std::stringstream& out,
                                          ArrowError* error) {
     nanoarrow::UniqueArrayView input_view;
     ArrowArrayViewInitFromType(input_view.get(), NANOARROW_TYPE_STRING);
     NANOARROW_RETURN_NOT_OK(ArrowArrayViewSetArray(input_view.get(), input, error));

     ArrowStringView item;
     for (int64_t i = 0; i < input->length; i++) {
       if (ArrowArrayViewIsNull(input_view.get(), i)) {
         // Null elements are written as empty lines
         out << "\n";
       } else {
         item = ArrowArrayViewGetStringUnsafe(input_view.get(), i);
         out << std::string(item.data, item.size_bytes) << "\n";
       }
     }

     return NANOARROW_OK;
   }

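It can help to know roughly what the view helpers save us from writing. In the
Arrow columnar format, a string array with 32-bit offsets carries three
buffers: a validity bitmap, an ``int32_t`` offsets buffer with ``length + 1``
entries, and a data buffer holding the bytes of all elements back to back. The
sketch below reads one element directly from those buffers. The name
``get_string_unchecked()`` is hypothetical, the validity bitmap is ignored
(elements are assumed to be non-null), and this is not nanoarrow's
implementation; the struct is repeated so that the snippet stands alone.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Same layout as the ArrowArray definition in linesplitter.h
struct ArrowArray {
  int64_t length;
  int64_t null_count;
  int64_t offset;
  int64_t n_buffers;
  int64_t n_children;
  const void** buffers;
  struct ArrowArray** children;
  struct ArrowArray* dictionary;
  void (*release)(struct ArrowArray*);
  void* private_data;
};

// Hypothetical helper: copies element i of a string array, assuming the
// element is not null and i is within [0, array->length).
static std::string get_string_unchecked(const ArrowArray* array, int64_t i) {
  // buffers[0] is the validity bitmap (ignored in this sketch)
  const int32_t* offsets = static_cast<const int32_t*>(array->buffers[1]);
  const char* data = static_cast<const char*>(array->buffers[2]);

  // The logical index must be shifted by the array's offset
  const int64_t j = array->offset + i;
  const int32_t begin = offsets[j];
  const int32_t end = offsets[j + 1];
  return std::string(data + begin, static_cast<std::size_t>(end - begin));
}
```

For example, the elements ``"line1"``, ``""``, and ``"line3"`` would be stored
as the offsets ``{0, 5, 5, 10}`` and the data bytes ``line1line3``.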
Then, provide an outer wrapper that corresponds to the outer function
definition.
.. code:: cpp

   std::pair<int, std::string> linesplitter_write(ArrowArray* input) {
     std::stringstream out;
     ArrowError error;
     // Start with an empty message in case the failing call did not set one
     error.message[0] = '\0';
     int code = linesplitter_write_internal(input, out, &error);
     if (code != NANOARROW_OK) {
       return {code, std::string(ArrowErrorMessage(&error))};
     } else {
       return {NANOARROW_OK, out.str()};
     }
   }

Testing
-------
We have an implementation, but does it work? Unlike higher-level
runtimes like R and Python, we can’t just open a prompt and type some
code to find out. For C and C++ libraries, the
`googletest `__
framework provides a quick and easy way to do this that scales nicely as
the complexity of your project grows.
First, we’ll add a stub test and some CMake to get going. In
``linesplitter_test.cc``, add the following:
.. code:: cpp

   #include <gtest/gtest.h>

   #include "nanoarrow/nanoarrow.hpp"

   #include "linesplitter.h"

   // Placeholder test: confirms that the test executable builds and runs
   TEST(Linesplitter, LinesplitterRoundtrip) {
     EXPECT_EQ(4, 4);
   }

Then, add the following to your ``CMakeLists.txt``:
.. code:: cmake

   FetchContent_Declare(
     googletest
     URL https://github.com/google/googletest/archive/refs/tags/v1.13.0.zip
   )
   FetchContent_MakeAvailable(googletest)

   enable_testing()

   add_executable(linesplitter_test linesplitter_test.cc)
   target_link_libraries(linesplitter_test linesplitter GTest::gtest_main)

   include(GoogleTest)
   gtest_discover_tests(linesplitter_test)

After you’re done, build the project again using the **CMake: Build**
command from the command palette. If all goes well, choose **CMake:
Refresh Tests** and then **Test: Run All Tests** from the command
palette to run them! You should see some output indicating that tests
ran successfully, or you can use VSCode’s “Testing” panel to visually
inspect which tests passed.
.. note::

   If you're not using VSCode, you can accomplish the equivalent task in a terminal
   with ``cd build && ctest .``.

Now we’re ready to fill in the test! Our two functions happen to round
trip (apart from the trailing newline added by ``linesplitter_write()``),
so a useful first test might be to check that they do.
.. code:: cpp

   TEST(Linesplitter, LinesplitterRoundtrip) {
     // Read three lines into an array
     nanoarrow::UniqueArray out;
     auto result = linesplitter_read("line1\nline2\nline3", out.get());
     ASSERT_EQ(result.first, 0);
     ASSERT_EQ(result.second, "");

     ASSERT_EQ(out->length, 3);

     // Check each element of the array
     nanoarrow::UniqueArrayView out_view;
     ArrowArrayViewInitFromType(out_view.get(), NANOARROW_TYPE_STRING);
     ASSERT_EQ(ArrowArrayViewSetArray(out_view.get(), out.get(), nullptr), 0);
     ArrowStringView item;

     item = ArrowArrayViewGetStringUnsafe(out_view.get(), 0);
     ASSERT_EQ(std::string(item.data, item.size_bytes), "line1");

     item = ArrowArrayViewGetStringUnsafe(out_view.get(), 1);
     ASSERT_EQ(std::string(item.data, item.size_bytes), "line2");

     item = ArrowArrayViewGetStringUnsafe(out_view.get(), 2);
     ASSERT_EQ(std::string(item.data, item.size_bytes), "line3");

     // Write the array back out: every element is followed by a newline
     auto result2 = linesplitter_write(out.get());
     ASSERT_EQ(result2.first, 0);
     ASSERT_EQ(result2.second, "line1\nline2\nline3\n");
   }

Writing tests in this way also opens up a relatively straightforward
debug path via the **CMake: Set Debug target** and **CMake: Debug**
commands. If the first thing that happens when you run your test
is a crash, running the tests with the debugger turned on will
automatically pause at the line of code that caused the crash. For more
fine-tuned debugging, you can set breakpoints and step through code.
Summary
-------
This tutorial covered the basics of writing and testing a C++ library
exposing an Arrow-based API implemented using the nanoarrow C library.