CUDA Integration

Arrow is not limited to CPU buffers (located in the computer’s main memory, also named “host memory”). It also has provisions for accessing buffers located on a CUDA-capable GPU device (in “device memory”).


Note: this functionality is optional and must be enabled when Arrow is built. If the build provided by your package manager does not enable it, you may have to build Arrow yourself.
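
A quick way to find out whether the installed build includes this optional feature is to try importing the CUDA module. The helper below is a minimal sketch: the name `arrow_cuda_available` is hypothetical, and it assumes that a build without CUDA support surfaces as an `ImportError`. A successful import only shows that the module is present; it does not guarantee that a usable GPU or driver exists.

```python
def arrow_cuda_available():
    """Return True if pyarrow's optional CUDA module can be imported."""
    try:
        # Assumption: this import fails with ImportError when pyarrow is
        # missing or was built without CUDA support.
        from pyarrow import cuda  # noqa: F401
    except ImportError:
        return False
    return True


print("Arrow CUDA support:", arrow_cuda_available())
```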

CUDA Contexts

A CUDA context represents access to a particular CUDA-capable device. For example, the following creates a CUDA context for CUDA device number 0:

>>> from pyarrow import cuda
>>> ctx = cuda.Context(0)

CUDA Buffers

A CUDA buffer can be created by copying data from host memory to the memory of a CUDA device, using the Context.buffer_from_data() method. The source data can be any Python buffer-like object, including Arrow buffers:

>>> import numpy as np
>>> arr = np.arange(4, dtype=np.int32)
>>> arr.nbytes
>>> cuda_buf = ctx.buffer_from_data(arr)
>>> type(cuda_buf)
>>> cuda_buf.size     # The buffer's size in bytes
>>> cuda_buf.address  # The buffer's address in device memory
>>> cuda_buf.context.device_number

Conversely, you can copy a CUDA buffer back to host memory, getting a regular CPU buffer:

>>> buf = cuda_buf.copy_to_host()
>>> type(buf)
>>> np.frombuffer(buf, dtype=np.int32)
array([0, 1, 2, 3], dtype=int32)


Many Arrow functions expect a CPU buffer but will not check the buffer’s actual type. You will get a crash if you pass a CUDA buffer to such a function:

>>> import pyarrow as pa
>>> pa.py_buffer(b"x" * 16).equals(cuda_buf)
Segmentation fault
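
One way to guard against this is to check where a buffer lives before handing it to CPU-only code. The sketch below uses a hypothetical helper name, `ensure_host_buffer`, and assumes the object follows the pyarrow buffer interface: an `is_cpu` attribute that is false for CUDA buffers, and a `copy_to_host()` method on CUDA buffers. Objects without an `is_cpu` attribute (such as `bytes`) are treated as host data.

```python
def ensure_host_buffer(buf):
    """Return a CPU-accessible buffer, copying from the device if needed.

    Assumes `buf` follows the pyarrow buffer interface: an `is_cpu`
    attribute, plus `copy_to_host()` on device buffers.
    """
    # Objects lacking `is_cpu` (e.g. bytes) are assumed to be host data.
    if getattr(buf, "is_cpu", True):
        return buf
    # Device buffer: make a host copy that CPU-only functions can read.
    return buf.copy_to_host()
```

With such a helper, the comparison above could be written as `pa.py_buffer(b"x" * 16).equals(ensure_host_buffer(cuda_buf))`, at the cost of a device-to-host copy.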

Numba Integration

There is not much you can do directly with Arrow CUDA buffers from Python, but they support interoperation with Numba, a JIT compiler which can turn Python code into optimized CUDA kernels.

Arrow to Numba

First let’s define a Numba CUDA kernel operating on an int32 array. Here, we will simply increment each array element (assuming the array is writable):

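import numba.cuda

@numba.cuda.jit
def increment_by_one(an_array):
    # Absolute index of this thread in the 1D launch grid
    pos = numba.cuda.grid(1)
    # Threads beyond the end of the array do nothing
    if pos < an_array.size:
        an_array[pos] += 1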
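
To see what each GPU thread does, the kernel's indexing can be emulated on the CPU. This is only an illustrative sketch with a hypothetical function name: it walks through the blocks and threads of a 1D launch sequentially, computes the same position that `numba.cuda.grid(1)` yields for a 1D grid (block index times block size plus thread index), and applies the same bounds check.

```python
def emulate_increment(values, blocks, threads_per_block):
    """Sequentially emulate a 1D launch of the increment kernel."""
    out = list(values)
    for block_idx in range(blocks):
        for thread_idx in range(threads_per_block):
            # Same value numba.cuda.grid(1) returns for a 1D launch
            pos = block_idx * threads_per_block + thread_idx
            # Threads past the end of the data do nothing
            if pos < len(out):
                out[pos] += 1
    return out


print(emulate_increment([0, 1, 2, 3], blocks=16, threads_per_block=16))
# prints [1, 2, 3, 4]
```

On a real device the threads run in parallel rather than in a loop, but each element is still touched by exactly one thread.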

Then we need to wrap our CUDA buffer into a Numba “device array” with the right array metadata (shape, strides and datatype). This is necessary so that Numba can identify the array’s characteristics and compile the kernel with the appropriate type declarations.

In this case the metadata can simply be taken from the original NumPy array. Note that the GPU data isn’t copied, just pointed to:

>>> from numba.cuda.cudadrv.devicearray import DeviceNDArray
>>> device_arr = DeviceNDArray(arr.shape, arr.strides, arr.dtype, gpu_data=cuda_buf.to_numba())
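
When the original NumPy array is not at hand, the strides of a contiguous array can be derived from its shape and element size. The helper below is a sketch with a hypothetical name, and it assumes C-contiguous (row-major) layout, which is what `np.arange` produces.

```python
def c_contiguous_strides(shape, itemsize):
    """Byte strides of a C-contiguous (row-major) array."""
    strides = []
    step = itemsize
    # The last dimension varies fastest, so walk the shape backwards
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))


print(c_contiguous_strides((4,), 4))    # (4,)  four int32 values
print(c_contiguous_strides((2, 3), 4))  # (12, 4)
```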

(Ideally, we could define an Arrow array in CPU memory, copy it to CUDA memory without losing type information, and then invoke the Numba kernel on it without constructing the DeviceNDArray by hand. This is not yet possible.)

Finally, we can run the Numba CUDA kernel on the Numba device array (here launched with 16 blocks of 16 threads each):

>>> increment_by_one[16, 16](device_arr)
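
The launch configuration `[16, 16]` starts 256 threads for a 4-element array; the surplus threads are harmless because of the bounds check in the kernel. For larger arrays, a common approach is to fix the number of threads per block and round the number of blocks up so that every element is covered. A minimal sketch, with a hypothetical helper name:

```python
def blocks_needed(n_elements, threads_per_block):
    """Smallest number of blocks whose threads cover n_elements."""
    # Ceiling division using integer arithmetic
    return (n_elements + threads_per_block - 1) // threads_per_block


print(blocks_needed(4, 16))      # 1
print(blocks_needed(1000, 256))  # 4
```

The kernel could then be launched as `increment_by_one[blocks_needed(arr.size, 256), 256](device_arr)`, where 256 threads per block is an arbitrary choice for illustration.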

And the results can be checked by copying back the CUDA buffer to CPU memory:

>>> np.frombuffer(cuda_buf.copy_to_host(), dtype=np.int32)
array([1, 2, 3, 4], dtype=int32)

Numba to Arrow

Conversely, a Numba-created device array can be viewed as an Arrow CUDA buffer, using the CudaBuffer.from_numba() factory method.

For the sake of example, let’s first create a Numba device array:

>>> arr = np.arange(10, 14, dtype=np.int32)
>>> arr
array([10, 11, 12, 13], dtype=int32)
>>> device_arr = numba.cuda.to_device(arr)

Then we can create a CUDA buffer pointing to the device array’s memory. We don’t need to pass a CUDA context explicitly this time: the appropriate CUDA context is automatically retrieved and adapted from the Numba object.

>>> cuda_buf = cuda.CudaBuffer.from_numba(device_arr.gpu_data)
>>> cuda_buf.size
>>> cuda_buf.address
>>> cuda_buf.context.device_number

Of course, we can copy the CUDA buffer back to host memory:

>>> np.frombuffer(cuda_buf.copy_to_host(), dtype=np.int32)
array([10, 11, 12, 13], dtype=int32)

See also

Documentation for Numba’s CUDA support.