Faster, scalable memory allocations in Apache Arrow with jemalloc


Published 20 Jul 2018
By Uwe Korn (uwe)

With the release of the 0.9 version of Apache Arrow, we have switched our default allocator for array buffers from the system allocator to jemalloc on macOS and Linux. This applies to the C++/GLib/Python implementations of Arrow. Switching the default allocator is normally done to avoid problems with many small, frequent (de)allocations. In Arrow, by contrast, we normally deal with large in-memory datasets. While jemalloc provides good strategies for avoiding RAM fragmentation for allocations smaller than a memory page (4KB), it also provides functionality that improves performance for allocations spanning several memory pages.

Outside of Apache Arrow, jemalloc powers the infrastructure of Facebook (which is also where most of its development happens). It is also the default allocator in Rust and helps Redis reduce memory fragmentation on Linux (“Allocator”).

One allocation specialty that we require in Arrow is that memory should be 64-byte aligned, so that we can get the most performance out of SIMD instruction sets like AVX. While most modern SIMD instructions also work on unaligned memory, their performance is much better on aligned memory. To get the best performance for our analytical applications, we want all memory to be allocated such that SIMD performance is maximized.

For aligned allocations, the standard APIs provide the aligned_alloc(size_t alignment, size_t size) function (C11) and posix_memalign(void **memptr, size_t alignment, size_t size) (POSIX), both of which allocate new, aligned memory. Neither of them caters for expansion of an existing allocation. While the realloc function can often expand allocations without moving them physically, it does not guarantee that the alignment is kept in the case the allocation is moved.

When Arrow was built without jemalloc enabled, this resulted in copying the data on each expansion of an allocation. To reduce the number of memory copies, we use jemalloc’s *allocx()-APIs to create, modify and free aligned allocations. One of the typical tasks where this gives us a major speedup is the incremental construction of an Arrow table that consists of several columns. We often don’t know the size of the table in advance and need to expand our allocations as the data is loaded.

To incrementally build a vector, doubling the allocation on each expansion, we would use the following C code with the standard aligned allocation APIs:

#include <stdlib.h>
#include <string.h>

size_t size = 128 * 1024;
void* ptr = aligned_alloc(64, size);
for (int i = 0; i < 10; i++) {
  size_t new_size = size * 2;
  // aligned_alloc cannot expand in place: allocate a new aligned
  // buffer, copy the old contents over, and free the old buffer.
  void* ptr2 = aligned_alloc(64, new_size);
  memcpy(ptr2, ptr, size);
  free(ptr);
  ptr = ptr2;
  size = new_size;
}
free(ptr);

With jemalloc’s special APIs, we are able to omit the explicit call to memcpy. In the case where a memory expansion cannot be done in place, the copy is still performed internally by the allocator, but it is no longer needed on every expansion. This simplifies our user code to:

#include <jemalloc/jemalloc.h>

size_t size = 128 * 1024;
void* ptr = mallocx(size, MALLOCX_ALIGN(64));
for (int i = 0; i < 10; i++) {
  size *= 2;
  // rallocx keeps the 64-byte alignment and expands in place when the
  // underlying pages allow it; otherwise it moves (and copies) for us.
  ptr = rallocx(ptr, size, MALLOCX_ALIGN(64));
}
dallocx(ptr, MALLOCX_ALIGN(64));

To see the real-world benefits of using jemalloc, we look at the benchmarks in Arrow C++. There we have modeled a typical use case: incrementally building up an array of primitive values. During the build-up of the array, we don’t know the number of elements in the final array, so we need to continuously expand the memory region in which the data is stored. The code for this benchmark is part of the builder benchmark in the Arrow C++ sources as BuildPrimitiveArrayNoNulls.

Runtimes without jemalloc:

BM_BuildPrimitiveArrayNoNulls/repeats:3                 636726 us   804.114MB/s
BM_BuildPrimitiveArrayNoNulls/repeats:3                 621345 us   824.019MB/s
BM_BuildPrimitiveArrayNoNulls/repeats:3                 625008 us    819.19MB/s
BM_BuildPrimitiveArrayNoNulls/repeats:3_mean            627693 us   815.774MB/s
BM_BuildPrimitiveArrayNoNulls/repeats:3_median          625008 us    819.19MB/s
BM_BuildPrimitiveArrayNoNulls/repeats:3_stddev            8034 us   10.3829MB/s

Runtimes with jemalloc:

BM_BuildPrimitiveArrayNoNulls/repeats:3                 630881 us   811.563MB/s
BM_BuildPrimitiveArrayNoNulls/repeats:3                 352891 us   1.41687GB/s
BM_BuildPrimitiveArrayNoNulls/repeats:3                 351039 us   1.42434GB/s
BM_BuildPrimitiveArrayNoNulls/repeats:3_mean            444937 us   1.21125GB/s
BM_BuildPrimitiveArrayNoNulls/repeats:3_median          352891 us   1.41687GB/s
BM_BuildPrimitiveArrayNoNulls/repeats:3_stddev          161035 us   371.335MB/s

The benchmark was run three times for each configuration to see the performance differences. The first run in each configuration yielded the same performance, but in all subsequent runs, the version using jemalloc was about twice as fast. In these cases, the memory region used for constructing the array could be expanded in place without moving the data around. This was possible because memory pages were assigned to the process that were unused but not yet reclaimed by the operating system. Without jemalloc, we cannot make use of them, simply because the default allocator has no API that provides aligned reallocation.