Getting started with nanoarrow in C/C++#
This tutorial provides a short example of writing a C++ library that exposes an Arrow-based API and uses nanoarrow to implement a simple text file reader/writer. In general, nanoarrow can help you write a library or application that:
exposes an Arrow-based API to read from a data source or format,
exposes an Arrow-based API to write to a data source or format,
exposes one or more compute functions that operate on and produce data in the form of Arrow arrays, and/or
exposes an extension type implementation.
Because Arrow has bindings in many languages, you or others can easily bind or use your tool in higher-level runtimes like R, Java, C++, Python, Rust, Julia, Go, or Ruby, among others.
The nanoarrow library is not the only way to implement an Arrow-based API: Arrow C++, Rust, and Go are all excellent choices and can compile into static libraries that are C-linkable from other languages. However, existing Arrow implementations produce relatively large static libraries and can present complex build-time or run-time linking requirements depending on the implementation and features used. If the set of libraries you’re working with already provides the conveniences you require, nanoarrow may provide all the functionality you need.
Now that we’ve talked about why you might want to build a library with nanoarrow…let’s build one!
Note
This tutorial also goes over some of the basic structure of writing a C++ library. If you already know how to do this, feel free to scroll to the code examples provided below or take a look at the final example project.
The library#
The library we’ll write in this tutorial is a simple text processing library that splits and reassembles lines of text. It will be able to:
Read text from a buffer into an ArrowArray as one element per line, and
Write elements of an ArrowArray into a buffer, inserting line breaks after every element.
For the sake of argument, we’ll call it linesplitter.
The development environment#
There are many excellent IDEs that can be used to develop C and C++ libraries. For this tutorial, we will use VSCode and CMake. You’ll need both installed to follow along: VSCode can be downloaded from the official site for most platforms; CMake is typically installed via your favourite package manager (e.g., brew install cmake, apt-get install cmake, dnf install cmake, etc.). You will also need a C and C++ compiler: on macOS these can be installed using xcode-select --install; on Linux you will need the packages that provide gcc, g++, and make (e.g., apt-get install build-essential); on Windows you will need to install Visual Studio and CMake from the official download pages.
Once you have VSCode installed, ensure you have the CMake Tools and C/C++ extensions installed. Once your environment is set up, create a folder called linesplitter and open it using File -> Open Folder.
The interface#
We’ll expose the interface to our library as a header called linesplitter.h. To ensure the definitions are only included once in any given source file, we’ll add the following line at the top:
#pragma once
Then, we need the Arrow C Data interface itself, since it provides the type definitions that are recognized by other Arrow implementations and on which our API will be built. It’s designed to be copied and pasted in this way: there’s no need to put it in another file or include something from another project.
#include <stdint.h>
#include <string>
#include <utility>
#ifndef ARROW_C_DATA_INTERFACE
#define ARROW_C_DATA_INTERFACE
#define ARROW_FLAG_DICTIONARY_ORDERED 1
#define ARROW_FLAG_NULLABLE 2
#define ARROW_FLAG_MAP_KEYS_SORTED 4
struct ArrowSchema {
  // Array type description
  const char* format;
  const char* name;
  const char* metadata;
  int64_t flags;
  int64_t n_children;
  struct ArrowSchema** children;
  struct ArrowSchema* dictionary;
  // Release callback
  void (*release)(struct ArrowSchema*);
  // Opaque producer-specific data
  void* private_data;
};
struct ArrowArray {
  // Array data description
  int64_t length;
  int64_t null_count;
  int64_t offset;
  int64_t n_buffers;
  int64_t n_children;
  const void** buffers;
  struct ArrowArray** children;
  struct ArrowArray* dictionary;
  // Release callback
  void (*release)(struct ArrowArray*);
  // Opaque producer-specific data
  void* private_data;
};
#endif // ARROW_C_DATA_INTERFACE
Next, we’ll provide definitions for the functions we’ll implement below:
// Builds an ArrowArray of type string that will contain one element for each line
// in src and places it into out.
//
// On success, returns {0, ""}; on error, returns {<errno code>, <error message>}
std::pair<int, std::string> linesplitter_read(const std::string& src,
                                              struct ArrowArray* out);
// Concatenates all elements of a string ArrowArray, inserting a newline after
// each element.
//
// On success, returns {0, <result>}; on error, returns {<errno code>, <error message>}
std::pair<int, std::string> linesplitter_write(struct ArrowArray* input);
Note
You may notice that we don’t include or mention nanoarrow in any way in the header that is exposed to users. Because nanoarrow is designed to be vendored and is not distributed as a system library, it is not safe for users of your library to #include "nanoarrow.h" because it might conflict with another library that does the same (with possibly a different version of nanoarrow).
Arrow C data/nanoarrow interface basics#
Now that we’ve seen the functions we need to implement and the Arrow types exposed in the C data interface, let’s unpack a few basics about using the Arrow C data interface and a few conventions used in the nanoarrow implementation.
First, let’s discuss the ArrowSchema and the ArrowArray. You can think of an ArrowSchema as an expression of a data type, whereas an ArrowArray is the data itself. These structures accommodate nested types: columns are encoded in the children member of each. You always need to know the data type of an ArrowArray before accessing its contents. In our case we only operate on arrays of one type (“string”) and document that in our interface; for functions that operate on more than one type of array you will need to accept an ArrowSchema and inspect it (e.g., using nanoarrow’s helper functions).
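To make the idea of inspecting an ArrowSchema concrete, here is a minimal sketch that checks the format string by hand. It relies only on the ArrowSchema definition from the C data interface (repeated so the example stands alone) and on the documented format strings "u" (utf8 string), "i" (int32), and "+s" (struct). The helper names schema_is_string(), schema_is_struct_of_strings(), and make_borrowed_schema() are hypothetical and are not part of nanoarrow; nanoarrow's own helpers are the more robust choice in real code.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Copy of the ArrowSchema definition from the Arrow C data interface.
struct ArrowSchema {
  const char* format;
  const char* name;
  const char* metadata;
  int64_t flags;
  int64_t n_children;
  struct ArrowSchema** children;
  struct ArrowSchema* dictionary;
  void (*release)(struct ArrowSchema*);
  void* private_data;
};

// Hypothetical helper: true if the schema describes a utf8 string array
// with 32-bit offsets (format string "u").
static bool schema_is_string(const ArrowSchema* schema) {
  assert(schema != nullptr);
  return schema->format != nullptr && std::strcmp(schema->format, "u") == 0;
}

// Hypothetical helper: true if the schema is a struct ("+s") whose columns,
// stored in the children member, are all strings.
static bool schema_is_struct_of_strings(const ArrowSchema* schema) {
  assert(schema != nullptr);
  if (schema->format == nullptr || std::strcmp(schema->format, "+s") != 0) {
    return false;
  }
  for (int64_t i = 0; i < schema->n_children; i++) {
    if (!schema_is_string(schema->children[i])) {
      return false;
    }
  }
  return true;
}

// Release callback that owns nothing; it only marks the schema as released.
static void noop_release(ArrowSchema* schema) { schema->release = nullptr; }

// Hypothetical helper for illustration only: a schema that borrows a
// string literal as its format and owns no memory.
static ArrowSchema make_borrowed_schema(const char* format) {
  ArrowSchema schema{};  // zero-initialize every member
  schema.format = format;
  schema.name = "";
  schema.release = &noop_release;
  return schema;
}
```

A function that accepts more than one type could call a check like schema_is_string() first and return an errno-compatible code such as EINVAL when the type is not one it supports.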
Second, let’s discuss error handling. You may have noticed in the function definitions above that the first element of the returned pair is an int, which is an errno-compatible error code or 0 to indicate success. Functions in nanoarrow that need to communicate more detailed error information accept an ArrowError* argument (which can be NULL if the caller does not care about the extra information). Any nanoarrow function that might fail communicates errors in this way. To avoid verbose code like the following:
int init_string_non_null(struct ArrowSchema* schema) {
  // schema is already a pointer, so pass it through directly
  int code = ArrowSchemaInitFromType(schema, NANOARROW_TYPE_STRING);
  if (code != NANOARROW_OK) {
    return code;
  }
  schema->flags &= ~ARROW_FLAG_NULLABLE;
  return NANOARROW_OK;
}
…you can use the NANOARROW_RETURN_NOT_OK() macro:
int init_string_non_null(struct ArrowSchema* schema) {
  NANOARROW_RETURN_NOT_OK(ArrowSchemaInitFromType(schema, NANOARROW_TYPE_STRING));
  schema->flags &= ~ARROW_FLAG_NULLABLE;
  return NANOARROW_OK;
}
This works as long as your internal functions that use nanoarrow also return int and/or accept an ArrowError* argument. This usually means that there is an outer function that presents a more idiomatic interface (e.g., returning std::optional<> or throwing an exception) and an inner function that uses nanoarrow-style error handling. Embracing NANOARROW_RETURN_NOT_OK() is key to happiness when using the nanoarrow library.
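The inner/outer split can be sketched without nanoarrow at all. In the example below, SKETCH_RETURN_NOT_OK() is a hypothetical stand-in that only imitates the early-return behaviour described above (it is not nanoarrow's actual macro definition), and check_not_empty(), check_short(), validate_internal(), and validate() are made-up functions for illustration.

```cpp
#include <cassert>
#include <cerrno>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for NANOARROW_RETURN_NOT_OK(): evaluate the
// expression once and return its value from the enclosing function if it
// is non-zero.
#define SKETCH_RETURN_NOT_OK(EXPR)              \
  do {                                          \
    const int sketch_code_ = (EXPR);            \
    if (sketch_code_ != 0) return sketch_code_; \
  } while (0)

// Inner helpers use errno-compatible return codes (0 means success).
static int check_not_empty(const std::string& value) {
  return value.empty() ? EINVAL : 0;
}

static int check_short(const std::string& value) {
  return value.size() > 8 ? ERANGE : 0;
}

// Inner function: chains fallible calls and stops at the first failure.
static int validate_internal(const std::string& value) {
  SKETCH_RETURN_NOT_OK(check_not_empty(value));
  SKETCH_RETURN_NOT_OK(check_short(value));
  return 0;
}

// Outer function: presents a more idiomatic C++ interface by translating
// a non-zero code into an exception.
static void validate(const std::string& value) {
  const int code = validate_internal(value);
  if (code != 0) {
    throw std::runtime_error("validate() failed with code " +
                             std::to_string(code));
  }
}
```

Keeping the translation to exceptions (or to a std::pair, as linesplitter does) in one outer function means every inner function can stay a flat sequence of checked calls.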
Third, let’s discuss memory management. Because nanoarrow is implemented in C and provides a C interface, the library by default uses C-style memory management (i.e., if you allocate it, you clean it up). This is unnecessary when you have C++ at your disposal, so nanoarrow also provides a C++ header (nanoarrow.hpp) with std::unique_ptr<>-like wrappers around anything that requires explicit clean up. Whereas in C you might have to write code like this:
struct ArrowSchema schema;
struct ArrowArray array;
// Ok: if this fails and we return early, there is nothing to release yet
NANOARROW_RETURN_NOT_OK(ArrowSchemaInitFromType(&schema, NANOARROW_TYPE_STRING));
// Verbose: if this fails, we need to release schema before returning
// or it will leak.
int code = ArrowArrayInitFromSchema(&array, &schema, NULL);
if (code != NANOARROW_OK) {
  ArrowSchemaRelease(&schema);
  return code;
}
…using the nanoarrow.hpp types we can do:
nanoarrow::UniqueSchema schema;
nanoarrow::UniqueArray array;
NANOARROW_RETURN_NOT_OK(ArrowSchemaInitFromType(schema.get(), NANOARROW_TYPE_STRING));
NANOARROW_RETURN_NOT_OK(ArrowArrayInitFromSchema(array.get(), schema.get(), NULL));
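To illustrate why a std::unique_ptr<>-like wrapper removes the manual clean-up, here is a minimal sketch of the idea built only on the ArrowArray definition from the C data interface. ScopedArray is a hypothetical class written for this example and is not how nanoarrow::UniqueArray is necessarily implemented; init_fake_array() and counting_release() are likewise made up so that releases can be observed.

```cpp
#include <cassert>
#include <cstdint>

// Copy of the ArrowArray definition from the Arrow C data interface.
struct ArrowArray {
  int64_t length;
  int64_t null_count;
  int64_t offset;
  int64_t n_buffers;
  int64_t n_children;
  const void** buffers;
  struct ArrowArray** children;
  struct ArrowArray* dictionary;
  void (*release)(struct ArrowArray*);
  void* private_data;
};

// Hypothetical owner: calls the release callback, if any, when it goes out
// of scope. A null release pointer means there is nothing to release.
class ScopedArray {
 public:
  ScopedArray() : array_{} {}
  ~ScopedArray() { reset(); }
  ScopedArray(const ScopedArray&) = delete;
  ScopedArray& operator=(const ScopedArray&) = delete;

  ArrowArray* get() { return &array_; }

  void reset() {
    if (array_.release != nullptr) {
      array_.release(&array_);
    }
  }

 private:
  ArrowArray array_;
};

// Fake producer used only to observe how often release is called.
static int g_release_count = 0;

static void counting_release(ArrowArray* array) {
  g_release_count++;
  array->release = nullptr;  // the callback must mark the array as released
}

static void init_fake_array(ArrowArray* out) {
  *out = ArrowArray{};
  out->release = &counting_release;
}

// Every return path releases the array exactly once via the destructor.
static int demo_early_return(bool fail) {
  ScopedArray array;
  init_fake_array(array.get());
  if (fail) {
    return 1;  // no explicit clean-up needed here
  }
  return 0;
}
```

The same reasoning is what lets the nanoarrow.hpp version above use NANOARROW_RETURN_NOT_OK() on every line without leaking the schema.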
Building the library#
Our library implementation will live in linesplitter.cc. Before writing the actual implementations, let’s add just enough to our project that we can build it using VSCode’s C/C++/CMake integration:
#include <cerrno>
#include <cstdint>
#include <sstream>
#include <string>
#include <utility>
#include "nanoarrow/nanoarrow.hpp"
#include "linesplitter.h"
std::pair<int, std::string> linesplitter_read(const std::string& src,
                                              struct ArrowArray* out) {
  return {ENOTSUP, ""};
}
std::pair<int, std::string> linesplitter_write(struct ArrowArray* input) {
  return {ENOTSUP, ""};
}
We also need a CMakeLists.txt file that tells CMake and VSCode what to build. CMake has a lot of options and can scale to coordinate very large projects; however, we only need a few lines to leverage VSCode’s integration.
project(linesplitter)
set(CMAKE_CXX_STANDARD 11)
include(FetchContent)
FetchContent_Declare(
  nanoarrow
  URL https://github.com/apache/arrow-nanoarrow/releases/download/apache-arrow-nanoarrow-0.2.0/apache-arrow-nanoarrow-0.2.0.tar.gz
  URL_HASH SHA512=38a100ae5c36a33aa330010eb27b051cff98671e9c82fff22b1692bb77ae61bd6dc2a52ac6922c6c8657bd4c79a059ab26e8413de8169eeed3c9b7fdb216c817)
FetchContent_MakeAvailable(nanoarrow)
add_library(linesplitter linesplitter.cc)
target_link_libraries(linesplitter PRIVATE nanoarrow)
After saving CMakeLists.txt, you may have to close and re-open the linesplitter directory in VSCode to activate the CMake integration. From the command palette (i.e., Control/Command-Shift-P), choose CMake: Build. If all went well, you should see a few lines of output indicating progress towards building and linking linesplitter.
Note
Depending on your version of CMake you might also see a few warnings. This CMakeLists.txt is intentionally minimal and as such does not attempt to silence them.
Note
If you’re not using VSCode, you can accomplish the equivalent task in a terminal by running: mkdir build && cd build && cmake .. && cmake --build .
Building an ArrowArray#
The input for our linesplitter_read() function is an std::string, which we’ll iterate over, adding each detected line as its own element. First, we’ll define a function for the core logic of detecting the number of characters until the next \n or end-of-string.
static int64_t find_newline(const ArrowStringView& src) {
  for (int64_t i = 0; i < src.size_bytes; i++) {
    if (src.data[i] == '\n') {
      return i;
    }
  }
  return src.size_bytes;
}
The next function we’ll define is an internal function that uses nanoarrow-style error handling. This uses the ArrowArrayAppend*() family of functions provided by nanoarrow to build the array:
static int linesplitter_read_internal(const std::string& src, ArrowArray* out,
                                      ArrowError* error) {
  nanoarrow::UniqueArray tmp;
  NANOARROW_RETURN_NOT_OK(ArrowArrayInitFromType(tmp.get(), NANOARROW_TYPE_STRING));
  NANOARROW_RETURN_NOT_OK(ArrowArrayStartAppending(tmp.get()));
  ArrowStringView src_view = {src.data(), static_cast<int64_t>(src.size())};
  ArrowStringView line_view;
  int64_t next_newline = -1;
  // find_newline() returns src_view.size_bytes when no newline remains, so the
  // loop only ends once size_bytes has gone negative (after the final line)
  while ((next_newline = find_newline(src_view)) >= 0) {
    line_view = {src_view.data, next_newline};
    NANOARROW_RETURN_NOT_OK(ArrowArrayAppendString(tmp.get(), line_view));
    // Skip past the line and its newline
    src_view.data += next_newline + 1;
    src_view.size_bytes -= next_newline + 1;
  }
  NANOARROW_RETURN_NOT_OK(ArrowArrayFinishBuildingDefault(tmp.get(), error));
  // Hand the result to the caller only after everything has succeeded
  ArrowArrayMove(tmp.get(), out);
  return NANOARROW_OK;
}
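The edge cases of this loop are easier to see in a plain standard-library sketch. The function below, split_lines_sketch(), is a hypothetical stand-in that collects lines into a std::vector<std::string> instead of an ArrowArray; as far as I can tell from tracing the loop above, it splits in the same way, including for empty input and for input that ends in a newline.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for the splitting loop: one element per line, where
// the text after the last newline (possibly empty) is always an element.
static std::vector<std::string> split_lines_sketch(const std::string& src) {
  std::vector<std::string> lines;
  std::size_t start = 0;
  while (true) {
    const std::size_t end = src.find('\n', start);
    if (end == std::string::npos) {
      // No newline left: the remainder is the final line
      lines.push_back(src.substr(start));
      break;
    }
    lines.push_back(src.substr(start, end - start));
    start = end + 1;  // skip past the newline
  }
  return lines;
}
```

Under this splitting rule an empty input produces a single empty element, and "a\n" produces two elements ("a" and an empty string), which is worth keeping in mind when writing tests.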
Finally, we define a wrapper that corresponds to the outer function definition.
std::pair<int, std::string> linesplitter_read(const std::string& src, ArrowArray* out) {
  ArrowError error;
  // Start with an empty message in case a failing call does not set one
  error.message[0] = '\0';
  int code = linesplitter_read_internal(src, out, &error);
  if (code != NANOARROW_OK) {
    return {code, std::string(ArrowErrorMessage(&error))};
  } else {
    return {NANOARROW_OK, ""};
  }
}
Reading an ArrowArray#
The input for our linesplitter_write() function is an ArrowArray* like the one we create in linesplitter_read(). Just as nanoarrow provides helpers to build arrays, it also provides helpers to read them via the ArrowArrayView*() family of functions. Again, we first define an internal function that uses nanoarrow-style error handling:
static int linesplitter_write_internal(ArrowArray* input, std::stringstream& out,
                                       ArrowError* error) {
  nanoarrow::UniqueArrayView input_view;
  ArrowArrayViewInitFromType(input_view.get(), NANOARROW_TYPE_STRING);
  NANOARROW_RETURN_NOT_OK(ArrowArrayViewSetArray(input_view.get(), input, error));
  ArrowStringView item;
  for (int64_t i = 0; i < input->length; i++) {
    if (ArrowArrayViewIsNull(input_view.get(), i)) {
      out << "\n";
    } else {
      item = ArrowArrayViewGetStringUnsafe(input_view.get(), i);
      out << std::string(item.data, item.size_bytes) << "\n";
    }
  }
  return NANOARROW_OK;
}
Then, provide an outer wrapper that corresponds to the outer function definition.
std::pair<int, std::string> linesplitter_write(ArrowArray* input) {
  std::stringstream out;
  ArrowError error;
  // Start with an empty message in case a failing call does not set one
  error.message[0] = '\0';
  int code = linesplitter_write_internal(input, out, &error);
  if (code != NANOARROW_OK) {
    return {code, std::string(ArrowErrorMessage(&error))};
  } else {
    return {NANOARROW_OK, out.str()};
  }
}
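It can help to know roughly what the array view helpers are reading. In the Arrow columnar format, a string array has three buffers: a validity bitmap (one bit per element, least-significant bit first), int32 offsets with length + 1 entries, and the concatenated character data. The sketch below models those buffers with standard containers; StringBuffersSketch, is_null(), get_string(), and join_lines() are hypothetical names for this example, and the code ignores the array offset field for simplicity.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical plain-C++ model of the three buffers of a string array.
struct StringBuffersSketch {
  std::vector<uint8_t> validity;  // bit i set means element i is valid; empty means no nulls
  std::vector<int32_t> offsets;   // element i spans data[offsets[i], offsets[i + 1])
  std::string data;               // all element bytes, back to back
};

static bool is_null(const StringBuffersSketch& b, int64_t i) {
  if (b.validity.empty()) {
    return false;
  }
  return (b.validity[i / 8] & (1 << (i % 8))) == 0;
}

static std::string get_string(const StringBuffersSketch& b, int64_t i) {
  assert(i >= 0 && i + 1 < static_cast<int64_t>(b.offsets.size()));
  const int32_t start = b.offsets[i];
  const int32_t end = b.offsets[i + 1];
  return b.data.substr(start, end - start);
}

// Same output rule as linesplitter_write_internal(): a newline after every
// element, with null elements written as empty lines.
static std::string join_lines(const StringBuffersSketch& b) {
  std::string out;
  const int64_t length = static_cast<int64_t>(b.offsets.size()) - 1;
  for (int64_t i = 0; i < length; i++) {
    if (!is_null(b, i)) {
      out += get_string(b, i);
    }
    out += "\n";
  }
  return out;
}
```

For example, the array ["line1", null, "line3"] would be modelled with a validity byte of 0x05, offsets {0, 5, 5, 10}, and data "line1line3". In real code the ArrowArrayView*() functions handle this layout, including the array offset, for you.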
Testing#
We have an implementation, but does it work? Unlike higher-level runtimes like R and Python, we can’t just open a prompt and type some code to find out. For C and C++ libraries, the googletest framework provides a quick and easy way to do this that scales nicely as the complexity of your project grows.
First, we’ll add a stub test and some CMake to get going. In linesplitter_test.cc, add the following:
#include <gtest/gtest.h>
#include "nanoarrow/nanoarrow.hpp"
#include "linesplitter.h"
TEST(Linesplitter, LinesplitterRoundtrip) {
  EXPECT_EQ(4, 4);
}
Then, add the following to your CMakeLists.txt:
FetchContent_Declare(
  googletest
  URL https://github.com/google/googletest/archive/refs/tags/v1.13.0.zip
)
FetchContent_MakeAvailable(googletest)
enable_testing()
add_executable(linesplitter_test linesplitter_test.cc)
target_link_libraries(linesplitter_test linesplitter GTest::gtest_main)
include(GoogleTest)
gtest_discover_tests(linesplitter_test)
After you’re done, build the project again using the CMake: Build command from the command palette. If all goes well, choose CMake: Refresh Tests and then Test: Run All Tests from the command palette to run them! You should see some output indicating that tests ran successfully, or you can use VSCode’s “Testing” panel to visually inspect which tests passed.
Note
If you’re not using VSCode, you can accomplish the equivalent task in a terminal by running: cd build && ctest .
Now we’re ready to fill in the test! Our two functions happen to round trip (apart from the trailing newline that linesplitter_write() adds), so a useful first test might be to check that they do.
TEST(Linesplitter, LinesplitterRoundtrip) {
  nanoarrow::UniqueArray out;
  auto result = linesplitter_read("line1\nline2\nline3", out.get());
  ASSERT_EQ(result.first, 0);
  ASSERT_EQ(result.second, "");
  ASSERT_EQ(out->length, 3);
  nanoarrow::UniqueArrayView out_view;
  ArrowArrayViewInitFromType(out_view.get(), NANOARROW_TYPE_STRING);
  ASSERT_EQ(ArrowArrayViewSetArray(out_view.get(), out.get(), nullptr), 0);
  ArrowStringView item;
  item = ArrowArrayViewGetStringUnsafe(out_view.get(), 0);
  ASSERT_EQ(std::string(item.data, item.size_bytes), "line1");
  item = ArrowArrayViewGetStringUnsafe(out_view.get(), 1);
  ASSERT_EQ(std::string(item.data, item.size_bytes), "line2");
  item = ArrowArrayViewGetStringUnsafe(out_view.get(), 2);
  ASSERT_EQ(std::string(item.data, item.size_bytes), "line3");
  auto result2 = linesplitter_write(out.get());
  ASSERT_EQ(result2.first, 0);
  ASSERT_EQ(result2.second, "line1\nline2\nline3\n");
}
Writing tests in this way also opens up a relatively straightforward debug path via the CMake: Set Debug target and CMake: Debug commands. If the first thing that happens when you run your test is a crash, running the tests with the debugger turned on will automatically pause at the line of code that caused the crash. For more fine-tuned debugging, you can set breakpoints and step through code.
Summary#
This tutorial covered the basics of writing and testing a C++ library exposing an Arrow-based API implemented using the nanoarrow C library.