CUDA support

CUDA Contexts

class CudaDeviceManager

Public Functions

Status GetContext(int device_number, std::shared_ptr<CudaContext> *out)

Get the CUDA driver context for a particular device.

Parameters
  • [in] device_number: the CUDA device

  • [out] out: cached context

Status GetSharedContext(int device_number, void *handle, std::shared_ptr<CudaContext> *out)

Get the shared CUDA driver context for a particular device.

Parameters
  • [in] device_number: the CUDA device

  • [in] handle: CUDA context handle created by another library

  • [out] out: shared context

Status AllocateHost(int device_number, int64_t nbytes, std::shared_ptr<CudaHostBuffer> *out)

Allocate host memory with fast access to the given GPU device.

Parameters
  • [in] device_number: the CUDA device

  • [in] nbytes: number of bytes

  • [out] out: the allocated buffer
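The three calls above can be combined into a short sketch. The header path, the `ARROW_RETURN_NOT_OK` macro, and the choice of device 0 are illustrative assumptions; error handling of your own may differ:

```cpp
#include <arrow/gpu/cuda_api.h>

using arrow::Status;
using namespace arrow::cuda;

Status UseDeviceManager() {
  // The device manager is a process-wide singleton.
  CudaDeviceManager* manager;
  ARROW_RETURN_NOT_OK(CudaDeviceManager::GetInstance(&manager));

  // Get (and cache) the driver context for device 0.
  std::shared_ptr<CudaContext> context;
  ARROW_RETURN_NOT_OK(manager->GetContext(0, &context));

  // Allocate 1 KiB of host memory with fast access to device 0.
  std::shared_ptr<CudaHostBuffer> host_buffer;
  ARROW_RETURN_NOT_OK(manager->AllocateHost(0, 1024, &host_buffer));
  return Status::OK();
}
```

Note this sketch requires a CUDA-capable device at runtime; all three calls return an error `Status` otherwise.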

class CudaContext : public std::enable_shared_from_this<CudaContext>

Friendlier interface to the CUDA driver API.

Public Functions

Status Allocate(int64_t nbytes, std::shared_ptr<CudaBuffer> *out)

Allocate CUDA memory on GPU device for this context.

Return

Status

Parameters
  • [in] nbytes: number of bytes

  • [out] out: the allocated buffer

Status View(uint8_t *data, int64_t nbytes, std::shared_ptr<CudaBuffer> *out)

Create a view of CUDA memory on GPU device of this context.

Return

Status

Note

The caller is responsible for allocating and freeing the memory as well as ensuring that the memory belongs to the CUDA context that this CudaContext instance holds.

Parameters
  • [in] data: the starting device address

  • [in] nbytes: number of bytes

  • [out] out: the view buffer

Status OpenIpcBuffer(const CudaIpcMemHandle &ipc_handle, std::shared_ptr<CudaBuffer> *out)

Open existing CUDA IPC memory handle.

Return

Status

Parameters
  • [in] ipc_handle: opaque pointer to CUipcMemHandle (driver API)

  • [out] out: a CudaBuffer referencing the IPC segment

Status CloseIpcBuffer(CudaBuffer *buffer)

Close device memory that was mapped via an IPC handle.

Return

Status

Parameters
  • [in] buffer: a CudaBuffer previously opened with OpenIpcBuffer

Status Synchronize(void)

Block until all device tasks are completed.

void *handle() const

Expose CUDA context handle to other libraries.

int device_number() const

Return device number.

Status GetDeviceAddress(uint8_t *addr, uint8_t **devaddr)

Return the device address that is reachable from kernels running in the context.

The device address is defined as a memory address accessible by the device. While it is often a device memory address, it can also be a host memory address, for instance when the memory is allocated as host memory (using cudaMallocHost or cudaHostAlloc), as managed memory (using cudaMallocManaged), or when the host memory is page-locked (using cudaHostRegister).

Return

Status

Parameters
  • [in] addr: device or host memory address

  • [out] devaddr: the device address

Status Free(void *device_ptr, int64_t nbytes)

Release CUDA memory on GPU device for this context.

Return

Status

Parameters
  • [in] device_ptr: the buffer address

  • [in] nbytes: number of bytes
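A minimal allocation lifecycle for a context might be sketched as follows. The header path and `ARROW_RETURN_NOT_OK` macro are assumptions, and the 4 KiB size is an illustrative choice:

```cpp
#include <arrow/gpu/cuda_api.h>

using arrow::Status;
using namespace arrow::cuda;

Status AllocateLifecycle(const std::shared_ptr<CudaContext>& context) {
  // Allocate 4 KiB of device memory owned by this context.
  std::shared_ptr<CudaBuffer> buffer;
  ARROW_RETURN_NOT_OK(context->Allocate(4096, &buffer));

  // ... launch device work against the buffer's address here ...

  // Block until all device tasks in this context have completed.
  ARROW_RETURN_NOT_OK(context->Synchronize());

  // The CudaBuffer releases its memory on destruction; Free() is for
  // device memory managed manually rather than through a CudaBuffer.
  return Status::OK();
}
```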

Device and Host Buffers

class CudaBuffer : public arrow::Buffer

An Arrow buffer located on a GPU device.

Be careful using this in any Arrow code which may not be GPU-aware.

Public Functions

Status CopyToHost(const int64_t position, const int64_t nbytes, void *out) const

Copy memory from GPU device to CPU host.

Return

Status

Parameters
  • [in] position: start position inside buffer to copy bytes from

  • [in] nbytes: number of bytes to copy

  • [out] out: start address of the host memory area to copy to

Status CopyFromHost(const int64_t position, const void *data, int64_t nbytes)

Copy memory to device at position.

Return

Status

Parameters
  • [in] position: start position to copy bytes to

  • [in] data: the host data to copy

  • [in] nbytes: number of bytes to copy
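A host-to-device round trip using CopyFromHost and CopyToHost might look like this sketch (the buffer is assumed to be at least as large as the source data; header path and error-check macro are assumptions):

```cpp
#include <arrow/gpu/cuda_api.h>

using arrow::Status;
using namespace arrow::cuda;

Status RoundTrip(const std::shared_ptr<CudaBuffer>& buffer) {
  const char src[] = "hello, device";

  // Copy the host bytes to offset 0 of the device buffer.
  ARROW_RETURN_NOT_OK(buffer->CopyFromHost(0, src, sizeof(src)));

  // Copy them back into host memory; dst then holds the same bytes.
  char dst[sizeof(src)];
  ARROW_RETURN_NOT_OK(buffer->CopyToHost(0, sizeof(src), dst));
  return Status::OK();
}
```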

Status CopyFromDevice(const int64_t position, const void *data, int64_t nbytes)

Copy memory from device to device at position.

Return

Status

Note

It is assumed that both source and destination device memories have been allocated within the same context.

Parameters
  • [in] position: start position inside buffer to copy bytes to

  • [in] data: start address of the device memory area to copy from

  • [in] nbytes: number of bytes to copy

Status CopyFromAnotherDevice(const std::shared_ptr<CudaContext> &src_ctx, const int64_t position, const void *data, int64_t nbytes)

Copy memory from another device to device at position.

Return

Status

Parameters
  • [in] src_ctx: context of the source device memory

  • [in] position: start position inside buffer to copy bytes to

  • [in] data: start address of the other device's memory area to copy from

  • [in] nbytes: number of bytes to copy

virtual Status ExportForIpc(std::shared_ptr<CudaIpcMemHandle> *handle)

Expose this device buffer as IPC memory which can be used in other processes.

Return

Status

Note

After calling this function, this device memory will not be freed when the CudaBuffer is destructed.

Parameters
  • [out] handle: the exported IPC handle
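ExportForIpc and OpenIpcBuffer are meant to be used from two different processes. A sketch of both sides follows; how the handle travels between processes is elided, and the header path and error-check macro are assumptions:

```cpp
#include <arrow/gpu/cuda_api.h>

using arrow::Status;
using namespace arrow::cuda;

// Exporting process: publish a device buffer over IPC.
Status ExportBuffer(const std::shared_ptr<CudaBuffer>& buffer,
                    std::shared_ptr<CudaIpcMemHandle>* handle) {
  // After this call the device memory is not freed when `buffer` dies.
  ARROW_RETURN_NOT_OK(buffer->ExportForIpc(handle));
  // ... serialize *handle and send it to the peer process ...
  return Status::OK();
}

// Importing process: map the exported segment into this context.
Status ImportBuffer(const std::shared_ptr<CudaContext>& context,
                    const CudaIpcMemHandle& handle) {
  std::shared_ptr<CudaBuffer> buffer;
  ARROW_RETURN_NOT_OK(context->OpenIpcBuffer(handle, &buffer));
  // ... use the buffer, then unmap the IPC segment when done:
  return context->CloseIpcBuffer(buffer.get());
}
```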

Public Static Functions

static Status FromBuffer(std::shared_ptr<Buffer> buffer, std::shared_ptr<CudaBuffer> *out)

Convert a generic Arrow buffer back into a CudaBuffer.

Return

Status

Note

This function returns an error if the buffer isn’t backed by GPU memory.

Parameters
  • [in] buffer: buffer to convert

  • [out] out: conversion result

Status arrow::cuda::AllocateCudaHostBuffer(int device_number, const int64_t size, std::shared_ptr<CudaHostBuffer> *out)

Allocate CUDA-accessible memory on CPU host.

Return

Status

Parameters
  • [in] device_number: device to expose host memory

  • [in] size: number of bytes

  • [out] out: the allocated buffer

class CudaHostBuffer : public arrow::MutableBuffer

Device-accessible CPU memory created using cudaHostAlloc.

Device Memory Input / Output

class CudaBufferReader : public arrow::io::BufferReader

File interface for zero-copy read from CUDA buffers.

Note: Reads return pointers to device memory. This means you must be careful using this interface with any Arrow code which may expect to do anything other than pointer arithmetic on the returned buffers.

Public Functions

Status Read(int64_t nbytes, int64_t *bytes_read, void *buffer)

Read bytes into pre-allocated host memory.

Parameters
  • [in] nbytes: number of bytes to read

  • [out] bytes_read: actual number of bytes read

  • [out] buffer: pre-allocated memory to write into

Status Read(int64_t nbytes, std::shared_ptr<Buffer> *out)

Zero-copy read from device memory.

Return

Status

Parameters
  • [in] nbytes: number of bytes to read

  • [out] out: a Buffer referencing device memory
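The two Read overloads differ in where the bytes end up: the first copies to host memory, the second returns a zero-copy view of device memory. A sketch of both, assuming the reader wraps a populated device buffer (header path and macro are assumptions):

```cpp
#include <arrow/gpu/cuda_api.h>

using arrow::Status;
using namespace arrow::cuda;

Status ReadBothWays(CudaBufferReader* reader) {
  // Overload 1: copy up to 64 bytes from device memory into host memory.
  uint8_t host_bytes[64];
  int64_t bytes_read = 0;
  ARROW_RETURN_NOT_OK(reader->Read(64, &bytes_read, host_bytes));

  // Overload 2: zero-copy. `view` references device memory, so only
  // pointer arithmetic on view->data() is safe from the CPU.
  std::shared_ptr<arrow::Buffer> view;
  ARROW_RETURN_NOT_OK(reader->Read(64, &view));
  return Status::OK();
}
```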

class CudaBufferWriter : public arrow::io::WritableFile

File interface for writing to CUDA buffers, with optional buffering.

Public Functions

Status Close()

Close writer and flush buffered bytes to GPU.

Status Flush()

Flush buffered bytes to GPU.

Status SetBufferSize(const int64_t buffer_size)

Set CPU buffer size to limit calls to cudaMemcpy.

By default, writes are unbuffered.

Return

Status

Parameters
  • [in] buffer_size: the size of CPU buffer to allocate

int64_t buffer_size() const

Returns size of host (CPU) buffer, 0 for unbuffered.

int64_t num_bytes_buffered() const

Returns number of bytes buffered on host.
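With buffering enabled, small writes accumulate in the host buffer and reach the device in fewer, larger cudaMemcpy calls. A sketch of buffered writing (the 1 MiB buffer size is an illustrative choice; header path and macro are assumptions):

```cpp
#include <arrow/gpu/cuda_api.h>

using arrow::Status;
using namespace arrow::cuda;

Status BufferedWrite(const std::shared_ptr<CudaBuffer>& device_buffer,
                     const uint8_t* data, int64_t nbytes) {
  CudaBufferWriter writer(device_buffer);

  // Stage writes in a 1 MiB host buffer to limit cudaMemcpy calls.
  ARROW_RETURN_NOT_OK(writer.SetBufferSize(1 << 20));
  ARROW_RETURN_NOT_OK(writer.Write(data, nbytes));

  // Close() flushes any bytes still buffered on the host to the GPU.
  return writer.Close();
}
```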

CUDA IPC

class CudaIpcMemHandle

Public Functions

Status Serialize(MemoryPool *pool, std::shared_ptr<Buffer> *out) const

Write CudaIpcMemHandle to a Buffer.

Return

Status

Parameters
  • [in] pool: a MemoryPool to allocate memory from

  • [out] out: the serialized buffer

Public Static Functions

static Status FromBuffer(const void *opaque_handle, std::shared_ptr<CudaIpcMemHandle> *handle)

Create a CudaIpcMemHandle from an opaque buffer (e.g. from another process).

Return

Status

Parameters
  • [in] opaque_handle: a CUipcMemHandle as a const void*

  • [out] handle: the CudaIpcMemHandle instance

Status arrow::cuda::SerializeRecordBatch(const RecordBatch &batch, CudaContext *ctx, std::shared_ptr<CudaBuffer> *out)

Write record batch message to GPU device memory.

Return

Status

Parameters
  • [in] batch: record batch to write

  • [in] ctx: CudaContext to allocate device memory from

  • [out] out: the returned device buffer which contains the record batch message

Status arrow::cuda::ReadRecordBatch(const std::shared_ptr<Schema> &schema, const std::shared_ptr<CudaBuffer> &buffer, MemoryPool *pool, std::shared_ptr<RecordBatch> *out)

ReadRecordBatch specialized to handle metadata on CUDA device.

Parameters
  • [in] schema: the Schema for the record batch

  • [in] buffer: a CudaBuffer containing the complete IPC message

  • [in] pool: a MemoryPool to use for allocating space for the metadata

  • [out] out: the reconstructed RecordBatch, with device pointers
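These two functions can be combined into a device-memory round trip for a record batch. A sketch, assuming a valid batch and context (header paths and the error-check macro are assumptions):

```cpp
#include <arrow/api.h>
#include <arrow/gpu/cuda_api.h>

using arrow::Status;
using namespace arrow::cuda;

Status RoundTripBatch(const arrow::RecordBatch& batch,
                      const std::shared_ptr<CudaContext>& context,
                      std::shared_ptr<arrow::RecordBatch>* out) {
  // Serialize the record batch message into device memory.
  std::shared_ptr<CudaBuffer> device_buffer;
  ARROW_RETURN_NOT_OK(
      SerializeRecordBatch(batch, context.get(), &device_buffer));

  // Reconstruct the batch: metadata is read into host memory from the
  // given pool, while the column data still points at device memory.
  return ReadRecordBatch(batch.schema(), device_buffer,
                         arrow::default_memory_pool(), out);
}
```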