Memory Management¶
Buffers¶
To avoid passing around raw data pointers with varying and non-obvious
lifetime rules, Arrow provides a generic abstraction called arrow::Buffer.
A Buffer encapsulates a pointer and data size, and generally also ties its
lifetime to that of an underlying provider (in other words, a Buffer should
always point to valid memory until its destruction). Buffers are untyped:
they simply denote a physical memory area regardless of its intended meaning
or interpretation.
Buffers may be allocated by Arrow itself, or by third-party routines. For example, it is possible to pass the data of a Python bytestring as an Arrow buffer, keeping the Python object alive as necessary.
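The keep-alive idea can be illustrated without Arrow at all. The following is a minimal sketch, not Arrow's implementation: ExternalView and WrapString are hypothetical names, and a std::shared_ptr<std::string> stands in for a foreign owner such as a Python bytestring.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <string>

// Hypothetical illustration type (not part of Arrow): an untyped view of
// memory that shares ownership of the object providing that memory.
struct ExternalView {
  std::shared_ptr<const uint8_t> data;  // aliasing pointer, keeps the owner alive
  std::size_t size = 0;
};

// Wrap the bytes of a string without copying them. The aliasing constructor
// of std::shared_ptr makes `data` point into the string while sharing
// ownership of the string object itself.
inline ExternalView WrapString(std::shared_ptr<std::string> owner) {
  assert(owner != nullptr);
  const auto* bytes = reinterpret_cast<const uint8_t*>(owner->data());
  const std::size_t size = owner->size();
  return ExternalView{std::shared_ptr<const uint8_t>(owner, bytes), size};
}
```

In this sketch the string is destroyed only once the last view referring to it has been released, which is the same guarantee a Buffer gives about its underlying provider.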
In addition, buffers come in various flavours: mutable or not, resizable or not. Generally, you will hold a mutable buffer when building up a piece of data, then it will be frozen as an immutable container such as an array.
Note
Some buffers may point to non-CPU memory, such as GPU-backed memory provided by a CUDA context. If you’re writing a GPU-aware application, you will need to be careful not to interpret a GPU memory pointer as a CPU-reachable pointer, or vice-versa.
Accessing Buffer Memory¶
Buffers provide fast access to the underlying memory using the size() and
data() accessors (or mutable_data() for writable access to a mutable
buffer).
Slicing¶
It is possible to make zero-copy slices of buffers, to obtain a buffer
referring to some contiguous subset of the underlying data. This is done
by calling the arrow::SliceBuffer() and arrow::SliceMutableBuffer()
functions.
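To show why slicing needs no copy, here is a small sketch using only the standard library. It is not how Arrow implements slices: ByteSlice, WholeSlice and Slice are hypothetical names, and a shared std::vector stands in for the parent buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical illustration type (not part of Arrow): a read-only window
// onto bytes owned by a shared parent allocation.
struct ByteSlice {
  std::shared_ptr<const std::vector<uint8_t>> parent;  // keeps the bytes alive
  std::size_t offset = 0;
  std::size_t length = 0;

  const uint8_t* data() const { return parent->data() + offset; }
  std::size_t size() const { return length; }
};

// Make a slice covering the whole parent allocation.
inline ByteSlice WholeSlice(std::shared_ptr<const std::vector<uint8_t>> parent) {
  assert(parent != nullptr);
  const std::size_t length = parent->size();
  return ByteSlice{std::move(parent), 0, length};
}

// Zero-copy sub-slice: only the offset and length change, and the result
// shares ownership of the same parent.
inline ByteSlice Slice(const ByteSlice& source, std::size_t offset,
                       std::size_t length) {
  assert(offset <= source.length && length <= source.length - offset);
  return ByteSlice{source.parent, source.offset + offset, length};
}
```

Slicing a slice simply adds the offsets, so every slice in this sketch keeps pointing into the original allocation.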
Allocating a Buffer¶
You can allocate a buffer yourself by calling one of the
arrow::AllocateBuffer() or arrow::AllocateResizableBuffer() overloads:
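arrow::Result<std::unique_ptr<arrow::Buffer>> maybe_buffer = arrow::AllocateBuffer(4096);
if (!maybe_buffer.ok()) {
   // ... handle allocation error
}
// Moving the unique_ptr out of the Result transfers ownership to the shared_ptr
std::shared_ptr<arrow::Buffer> buffer = *std::move(maybe_buffer);
uint8_t* buffer_data = buffer->mutable_data();
// Copy 11 bytes (no terminating NUL) into the start of the 4096-byte buffer
memcpy(buffer_data, "hello world", 11);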
Allocating a buffer this way ensures it is 64-byte aligned and padded as recommended by the Arrow memory specification.
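As a rough illustration of what "64-byte aligned and padded" means, the sketch below rounds a requested size up to a multiple of 64 and obtains storage at a 64-byte boundary. It is not Arrow's allocator: PaddedSize and AllocateAlignedPadded are hypothetical names, and it assumes a C++17 standard library that provides std::aligned_alloc (which some platforms, notably MSVC, do not).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

constexpr std::size_t kAlignment = 64;  // bytes

// Round a requested size up to the next multiple of 64 bytes.
constexpr std::size_t PaddedSize(std::size_t size) {
  return (size + kAlignment - 1) / kAlignment * kAlignment;
}

// Allocate 64-byte-aligned storage large enough for `size` bytes plus
// padding. Returns nullptr on failure; release the memory with std::free().
inline uint8_t* AllocateAlignedPadded(std::size_t size) {
  assert(size > 0);
  // std::aligned_alloc requires the size to be a multiple of the alignment,
  // which the padded size always is.
  void* raw = std::aligned_alloc(kAlignment, PaddedSize(size));
  return static_cast<uint8_t*>(raw);
}
```

With these definitions, an 11-byte request such as "hello world" occupies one full 64-byte block.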
Building a Buffer¶
You can also allocate and build a Buffer incrementally, using the
arrow::BufferBuilder API:
arrow::BufferBuilder builder;
// Resize() and Append() return an arrow::Status that real code should check
builder.Resize(11);  // reserve enough space for 11 bytes
builder.Append("hello ", 6);
builder.Append("world", 5);
auto maybe_buffer = builder.Finish();
if (!maybe_buffer.ok()) {
   // ... handle buffer allocation error
}
std::shared_ptr<arrow::Buffer> buffer = *maybe_buffer;
If a Buffer is meant to contain values of a given fixed-width type (for
example the 32-bit offsets of a List array), it can be more convenient to
use the template arrow::TypedBufferBuilder API:
arrow::TypedBufferBuilder<int32_t> builder;
// Reserve() and Append() return an arrow::Status that real code should check
builder.Reserve(2);  // reserve enough space for two int32_t values
builder.Append(0x12345678);
builder.Append(-0x76543210);  // must fit in an int32_t
auto maybe_buffer = builder.Finish();
if (!maybe_buffer.ok()) {
   // ... handle buffer allocation error
}
std::shared_ptr<arrow::Buffer> buffer = *maybe_buffer;
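Because buffers are untyped, a typed builder only has to append sizeof(T) raw bytes per value. The sketch below shows that idea with the standard library alone; FixedWidthBytesBuilder and ValueAt are hypothetical names, and this is not the implementation of arrow::TypedBufferBuilder.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical illustration type: appends fixed-width values to an untyped,
// growable byte sequence.
template <typename T>
class FixedWidthBytesBuilder {
  static_assert(std::is_trivially_copyable<T>::value,
                "T must be trivially copyable");

 public:
  // Reserve room for `count` more values without changing the length.
  void Reserve(std::size_t count) {
    bytes_.reserve(bytes_.size() + count * sizeof(T));
  }

  // Append one value as sizeof(T) raw bytes in host byte order.
  void Append(T value) {
    const std::size_t old_size = bytes_.size();
    bytes_.resize(old_size + sizeof(T));
    std::memcpy(bytes_.data() + old_size, &value, sizeof(T));
  }

  // Number of values appended so far.
  std::size_t length() const { return bytes_.size() / sizeof(T); }

  // Hand over the accumulated bytes and reset the builder.
  std::vector<uint8_t> Finish() {
    std::vector<uint8_t> out = std::move(bytes_);
    bytes_.clear();
    return out;
  }

 private:
  std::vector<uint8_t> bytes_;
};

// Read the value at position `index` back out of the raw bytes.
template <typename T>
T ValueAt(const std::vector<uint8_t>& bytes, std::size_t index) {
  assert((index + 1) * sizeof(T) <= bytes.size());
  T value;
  std::memcpy(&value, bytes.data() + index * sizeof(T), sizeof(T));
  return value;
}
```

Two appended int32_t values therefore yield an 8-byte result, and nothing in the bytes themselves records that they were written as 32-bit integers.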
Memory Pools¶
When allocating a Buffer using the Arrow C++ API, the buffer’s underlying
memory is allocated by an arrow::MemoryPool instance. Usually this
will be the process-wide default memory pool, but many Arrow APIs allow
you to pass another MemoryPool instance for their internal allocations.
Memory pools are used for large long-lived data such as array buffers. Other data, such as small C++ objects and temporary workspaces, usually goes through the regular C++ allocators.
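The role of a memory pool can be sketched as a thin layer over an allocator that also supports resizing and keeps statistics. The class below is a hypothetical illustration (TrackingPool is not an Arrow type and does not reproduce the arrow::MemoryPool interface); it forwards to the C allocator and assumes callers report allocation sizes back to it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hypothetical illustration type: forwards to the C allocator and keeps a
// running total of the bytes currently allocated through it.
class TrackingPool {
 public:
  // Returns nullptr on failure.
  uint8_t* Allocate(std::size_t size) {
    assert(size > 0);
    auto* ptr = static_cast<uint8_t*>(std::malloc(size));
    if (ptr != nullptr) bytes_allocated_ += size;
    return ptr;
  }

  // Grow or shrink an allocation, possibly moving it. Callers pass the old
  // size because this pool stores no per-allocation metadata. On failure the
  // original allocation is left untouched and nullptr is returned.
  uint8_t* Reallocate(uint8_t* ptr, std::size_t old_size, std::size_t new_size) {
    assert(new_size > 0);
    auto* moved = static_cast<uint8_t*>(std::realloc(ptr, new_size));
    if (moved != nullptr) {
      bytes_allocated_ = bytes_allocated_ - old_size + new_size;
    }
    return moved;
  }

  void Free(uint8_t* ptr, std::size_t size) {
    std::free(ptr);
    bytes_allocated_ -= size;
  }

  std::size_t bytes_allocated() const { return bytes_allocated_; }

 private:
  std::size_t bytes_allocated_ = 0;
};
```

Routing large allocations through one object like this is what makes it possible to swap the underlying allocator or to observe memory usage in a single place.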
Default Memory Pool¶
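The default memory pool depends on how Arrow C++ was compiled: it is backed by jemalloc, mimalloc, or the system allocator, depending on which of these were enabled at build time.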
Overriding the Default Memory Pool¶
One can override the above selection algorithm by setting the
ARROW_DEFAULT_MEMORY_POOL environment variable to one of the following
values: jemalloc, mimalloc or system. This variable is inspected
once when Arrow C++ is loaded in memory (for example when the Arrow C++ DLL
is loaded).
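The "inspected once" behaviour can be pictured with a short sketch. This is not Arrow's code: PoolBackend, ParseBackendName and RequestedBackend are hypothetical names, and treating an unset or unrecognized value as "no override" is an assumption of the sketch rather than documented behaviour.

```cpp
#include <cstdlib>
#include <optional>
#include <string>

// Hypothetical enumeration of the three documented values.
enum class PoolBackend { kJemalloc, kMimalloc, kSystem };

// Map a variable value to a backend. A missing or unknown value yields
// std::nullopt, which this sketch takes to mean "keep the built-in default".
inline std::optional<PoolBackend> ParseBackendName(const char* value) {
  if (value == nullptr) return std::nullopt;
  const std::string name(value);
  if (name == "jemalloc") return PoolBackend::kJemalloc;
  if (name == "mimalloc") return PoolBackend::kMimalloc;
  if (name == "system") return PoolBackend::kSystem;
  return std::nullopt;
}

// Read the environment variable on first use and cache the answer for the
// lifetime of the process; later changes to the environment are ignored.
inline std::optional<PoolBackend> RequestedBackend() {
  static const std::optional<PoolBackend> cached =
      ParseBackendName(std::getenv("ARROW_DEFAULT_MEMORY_POOL"));
  return cached;
}
```

The practical consequence is that the variable has to be set before the process starts (or at least before the library is loaded); changing it afterwards has no effect.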
STL Integration¶
If you wish to use an Arrow memory pool to allocate the data of STL containers,
you can do so using the arrow::stl::allocator wrapper.
Conversely, you can also use an STL allocator to allocate Arrow memory,
using the arrow::stl::STLMemoryPool class. However, this may be less
performant, as STL allocators don’t provide a resizing operation.
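To make the wrapper idea concrete, the sketch below shows a minimal STL-compatible allocator that forwards to a pool object. It is an illustration only: CountingPool and PoolAllocator are hypothetical names, not arrow::MemoryPool or arrow::stl::allocator. Note that the allocator interface consists solely of allocate and deallocate, which is why a container cannot ask the pool to resize a block in place.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>
#include <vector>

// Hypothetical stand-in for a memory pool.
class CountingPool {
 public:
  void* Allocate(std::size_t size) {
    void* ptr = std::malloc(size);
    if (ptr == nullptr) throw std::bad_alloc();
    bytes_allocated_ += size;
    return ptr;
  }
  void Free(void* ptr, std::size_t size) {
    std::free(ptr);
    bytes_allocated_ -= size;
  }
  std::size_t bytes_allocated() const { return bytes_allocated_; }

 private:
  std::size_t bytes_allocated_ = 0;
};

// Minimal allocator satisfying the standard Allocator requirements by
// forwarding every request to a CountingPool.
template <typename T>
class PoolAllocator {
 public:
  using value_type = T;

  explicit PoolAllocator(CountingPool* pool) : pool_(pool) {
    assert(pool != nullptr);
  }
  // Converting constructor, needed when a container rebinds the allocator.
  template <typename U>
  PoolAllocator(const PoolAllocator<U>& other) : pool_(other.pool()) {}

  T* allocate(std::size_t n) {
    return static_cast<T*>(pool_->Allocate(n * sizeof(T)));
  }
  void deallocate(T* ptr, std::size_t n) { pool_->Free(ptr, n * sizeof(T)); }

  CountingPool* pool() const { return pool_; }

 private:
  CountingPool* pool_;
};

// Two allocators are interchangeable when they use the same pool.
template <typename T, typename U>
bool operator==(const PoolAllocator<T>& a, const PoolAllocator<U>& b) {
  return a.pool() == b.pool();
}
template <typename T, typename U>
bool operator!=(const PoolAllocator<T>& a, const PoolAllocator<U>& b) {
  return !(a == b);
}
```

A std::vector<int, PoolAllocator<int>> constructed with such an allocator draws all of its element storage from the pool and returns it when the vector is destroyed.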
Devices¶
Many Arrow applications only access host (CPU) memory. However, in some cases it is desirable to handle on-device memory (such as on-board memory on a GPU) as well as host memory.
Arrow represents the CPU and other devices using the arrow::Device
abstraction. The associated class arrow::MemoryManager specifies how to
allocate on a given device. Each device has a default memory manager, but
additional instances may be constructed (for example, wrapping a custom
arrow::MemoryPool on the CPU).
Device-Agnostic Programming¶
If you receive a Buffer from third-party code, you can query whether it is
CPU-readable by calling its is_cpu() method.
You can also view the Buffer on a given device, in a generic way, by calling
arrow::Buffer::View() or arrow::Buffer::ViewOrCopy(). This will
be a no-operation if the source and destination devices are identical.
Otherwise, a device-dependent mechanism will attempt to construct a memory
address for the destination device that gives access to the buffer contents.
Actual device-to-device transfer may happen lazily, when reading the buffer
contents.
Similarly, if you want to do I/O on a buffer without assuming a CPU-readable
buffer, you can call arrow::Buffer::GetReader() and
arrow::Buffer::GetWriter().
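For example, to get an on-CPU view or copy of an arbitrary buffer, you can do:
std::shared_ptr<arrow::Buffer> arbitrary_buffer = ... ;
// ViewOrCopy() returns an arrow::Result, so check it before unwrapping
auto maybe_cpu_buffer = arrow::Buffer::ViewOrCopy(
    arbitrary_buffer, arrow::default_cpu_memory_manager());
if (!maybe_cpu_buffer.ok()) { /* ... handle view or copy error */ }
std::shared_ptr<arrow::Buffer> cpu_buffer = *maybe_cpu_buffer;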