The Arrow PyCapsule Interface#
Warning
The Arrow PyCapsule Interface should be considered experimental.
Rationale#
The C data interface, C stream interface, and C device interface allow moving Arrow data between different implementations of Arrow. However, these interfaces don’t specify how Python libraries should expose these structs to other libraries. Before this interface existed, many libraries simply provided export to PyArrow data structures, using the _import_from_c and _export_to_c methods. However, this always required PyArrow to be installed. In addition, those APIs could cause memory leaks if handled improperly.
This interface allows any library to export Arrow data structures to other libraries that understand the same protocol.
Goals#
- Standardize the PyCapsule objects that represent ArrowSchema, ArrowArray, ArrowArrayStream, ArrowDeviceArray and ArrowDeviceArrayStream.
- Define standard methods that export Arrow data into such capsule objects, so that any Python library wanting to accept Arrow data as input can call the corresponding method instead of hardcoding support for specific Arrow producers.
Non-goals#
Standardize what public APIs should be used for import. This is left up to individual libraries.
PyCapsule Standard#
When exporting Arrow data through Python, the C Data Interface / C Stream Interface structures should be wrapped in capsules. Capsules avoid invalid access by attaching a name to the pointer and avoid memory leaks by attaching a destructor. Thus, they are much safer than passing pointers as integers.
PyCapsule allows for a name to be associated with the capsule, allowing consumers to verify that the capsule contains the expected kind of data. To make sure Arrow structures are recognized, the following names must be used:
C Interface Type | PyCapsule Name |
---|---|
ArrowSchema | arrow_schema |
ArrowArray | arrow_array |
ArrowArrayStream | arrow_array_stream |
ArrowDeviceArray | arrow_device_array |
ArrowDeviceArrayStream | arrow_device_array_stream |
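As a rough illustration of the name check, the sketch below uses `ctypes.pythonapi` to call the CPython functions `PyCapsule_New` and `PyCapsule_IsValid` from pure Python. The helper `is_arrow_capsule`, the `CAPSULE_NAMES` mapping and the dummy buffer are assumptions made for this example and are not part of the specification; a real producer would wrap an actual Arrow struct and attach a destructor.

```python
import ctypes

# Capsule names required by the specification, keyed by C struct name.
CAPSULE_NAMES = {
    "ArrowSchema": b"arrow_schema",
    "ArrowArray": b"arrow_array",
    "ArrowArrayStream": b"arrow_array_stream",
    "ArrowDeviceArray": b"arrow_device_array",
    "ArrowDeviceArrayStream": b"arrow_device_array_stream",
}

_api = ctypes.pythonapi
_api.PyCapsule_New.restype = ctypes.py_object
_api.PyCapsule_New.argtypes = [ctypes.c_void_p, ctypes.c_char_p, ctypes.c_void_p]
_api.PyCapsule_IsValid.restype = ctypes.c_int
_api.PyCapsule_IsValid.argtypes = [ctypes.py_object, ctypes.c_char_p]


def is_arrow_capsule(obj, struct_name):
    """Hypothetical helper: True if obj is a capsule named for struct_name."""
    return bool(_api.PyCapsule_IsValid(obj, CAPSULE_NAMES[struct_name]))


# Demo only: a dummy buffer stands in for a real ArrowSchema struct, and no
# destructor is attached. The name bytes live in CAPSULE_NAMES, so they
# outlive the capsule as PyCapsule_New requires.
_dummy = ctypes.create_string_buffer(8)
demo_capsule = _api.PyCapsule_New(
    ctypes.addressof(_dummy), CAPSULE_NAMES["ArrowSchema"], None
)
```

A consumer written this way rejects both non-capsule objects and capsules that carry a different name before it ever touches the pointer.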
Lifetime Semantics#
The exported PyCapsules should have a destructor that calls the release callback of the Arrow struct, if it is not already null. This prevents a memory leak in case the capsule was never passed to another consumer.
If the capsule has been passed to a consumer, the consumer should have moved the data and marked the release callback as null, so there isn’t a risk of releasing data the consumer is using. Read more in the C Data Interface specification.
In case of a device struct, the above-mentioned release callback is the release member of the embedded ArrowArray structure. Read more in the C Device Interface specification.
Just like in the C Data Interface, the PyCapsule objects defined here can only be consumed once.
For an example of a PyCapsule with a destructor, see Create a PyCapsule.
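The ownership rules above can be modelled in a few lines of pure Python. This is only a toy model under stated assumptions: `ModelStruct`, `capsule_destructor` and `consume` are hypothetical stand-ins for the C struct, the capsule destructor and a consumer's move operation, and no real capsule or Arrow memory is involved.

```python
class ModelStruct:
    """Hypothetical stand-in for an Arrow C struct with a release callback."""

    def __init__(self, log):
        self.log = log
        # Non-null while the struct still owns its data.
        self.release = self._release

    def _release(self):
        self.log.append("released")
        # A released struct marks its callback as null.
        self.release = None


def capsule_destructor(struct):
    """Model of the capsule destructor: release only if not already null."""
    if struct.release is not None:
        struct.release()


def consume(struct):
    """Model of a consumer: move the contents, then null the source callback."""
    moved = ModelStruct(struct.log)
    struct.release = None
    return moved
```

If the capsule is dropped without being consumed, the destructor releases the data. If it was consumed first, the destructor finds a null callback and does nothing, and the consumer's copy is released exactly once later.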
Export Protocol#
The interface consists of three separate protocols:
- ArrowSchemaExportable, which defines the __arrow_c_schema__ method.
- ArrowArrayExportable, which defines the __arrow_c_array__ method.
- ArrowStreamExportable, which defines the __arrow_c_stream__ method.
Two additional protocols are defined for the Device interface:
- ArrowDeviceArrayExportable, which defines the __arrow_c_device_array__ method.
- ArrowDeviceStreamExportable, which defines the __arrow_c_device_stream__ method.
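Because the protocols are defined purely by method names, a consumer can discover what an object supports with attribute lookups. The sketch below is one possible way to do that; `supported_exports` and the `FakeTable` class are hypothetical names used only for illustration.

```python
# The five dunder methods defined by the export protocols.
EXPORT_METHODS = (
    "__arrow_c_schema__",
    "__arrow_c_array__",
    "__arrow_c_stream__",
    "__arrow_c_device_array__",
    "__arrow_c_device_stream__",
)


def supported_exports(obj):
    """Hypothetical helper: list the export methods that obj implements."""
    return [name for name in EXPORT_METHODS if callable(getattr(obj, name, None))]


class FakeTable:
    """Stand-in producer; real implementations would return PyCapsules."""

    def __arrow_c_schema__(self):
        return "schema capsule placeholder"

    def __arrow_c_stream__(self, requested_schema=None):
        return "stream capsule placeholder"
```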
ArrowSchema Export#
Schemas, fields, and data types can implement the method __arrow_c_schema__.
- __arrow_c_schema__(self)#
  Export the object as an ArrowSchema.
  - Returns:
    A PyCapsule containing a C ArrowSchema representation of the object. The capsule must have a name of "arrow_schema".
ArrowArray Export#
Arrays and record batches (contiguous tables) can implement the method __arrow_c_array__.
- __arrow_c_array__(self, requested_schema=None)#
  Export the object as a pair of ArrowSchema and ArrowArray structures.
  - Parameters:
    requested_schema (PyCapsule or None) – A PyCapsule containing a C ArrowSchema representation of a requested schema. Conversion to this schema is best-effort. See Schema Requests.
  - Returns:
    A pair of PyCapsules containing a C ArrowSchema and ArrowArray, respectively. The schema capsule should have the name "arrow_schema" and the array capsule should have the name "arrow_array".
Libraries supporting the Device interface can implement a __arrow_c_device_array__ method on those objects, which works the same as __arrow_c_array__ except for returning an ArrowDeviceArray structure instead of an ArrowArray structure:
- __arrow_c_device_array__(self, requested_schema=None, **kwargs)#
  Export the object as a pair of ArrowSchema and ArrowDeviceArray structures.
  - Parameters:
    requested_schema (PyCapsule or None) – A PyCapsule containing a C ArrowSchema representation of a requested schema. Conversion to this schema is best-effort. See Schema Requests.
    kwargs – Additional keyword arguments should only be accepted if they have a default value of None, to allow for future addition of new keywords. See Device Support for more details.
  - Returns:
    A pair of PyCapsules containing a C ArrowSchema and ArrowDeviceArray, respectively. The schema capsule should have the name "arrow_schema" and the array capsule should have the name "arrow_device_array".
ArrowStream Export#
Tables / DataFrames and streams can implement the method __arrow_c_stream__.
- __arrow_c_stream__(self, requested_schema=None)#
  Export the object as an ArrowArrayStream.
  - Parameters:
    requested_schema (PyCapsule or None) – A PyCapsule containing a C ArrowSchema representation of a requested schema. Conversion to this schema is best-effort. See Schema Requests.
  - Returns:
    A PyCapsule containing a C ArrowArrayStream representation of the object. The capsule must have a name of "arrow_array_stream".
Libraries supporting the Device interface can implement a __arrow_c_device_stream__ method on those objects, which works the same as __arrow_c_stream__ except for returning an ArrowDeviceArrayStream structure instead of an ArrowArrayStream structure:
- __arrow_c_device_stream__(self, requested_schema=None, **kwargs)#
  Export the object as an ArrowDeviceArrayStream.
  - Parameters:
    requested_schema (PyCapsule or None) – A PyCapsule containing a C ArrowSchema representation of a requested schema. Conversion to this schema is best-effort. See Schema Requests.
    kwargs – Additional keyword arguments should only be accepted if they have a default value of None, to allow for future addition of new keywords. See Device Support for more details.
  - Returns:
    A PyCapsule containing a C ArrowDeviceArrayStream representation of the object. The capsule must have a name of "arrow_device_array_stream".
Schema Requests#
In some cases, there might be multiple possible Arrow representations of the same data. For example, a library might have a single integer type, but Arrow has multiple integer types with different sizes and sign. As another example, Arrow has several possible encodings for an array of strings: 32-bit offsets, 64-bit offsets, string view, and dictionary-encoded. A sequence of strings could export to any one of these Arrow representations.
To allow the caller to request a specific representation, the __arrow_c_array__() and __arrow_c_stream__() methods take an optional requested_schema parameter. This parameter is a PyCapsule containing an ArrowSchema.
The callee should attempt to provide the data in the requested schema. However, if the callee cannot provide the data in the requested schema, it may return the data with the same schema as if None had been passed as requested_schema.
If the caller requests a schema that is not compatible with the data, say requesting a schema with a different number of fields, the callee should raise an exception. The requested schema mechanism is only meant to negotiate between different representations of the same data and not to allow arbitrary schema transformations.
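The negotiation rules can be sketched as a small decision function. This is an illustrative model, not an implementation: `SchemaInfo` and `negotiate_schema` are hypothetical, and a schema is reduced to its C Data Interface format string (for example `"u"` for utf8 with 32-bit offsets, `"U"` for large utf8, `"vu"` for utf8 view) plus a child count.

```python
from typing import FrozenSet, NamedTuple, Optional


class SchemaInfo(NamedTuple):
    """Hypothetical summary of an ArrowSchema: format string and child count."""

    format: str
    n_children: int = 0


def negotiate_schema(
    default: SchemaInfo,
    producible_formats: FrozenSet[str],
    requested: Optional[SchemaInfo] = None,
) -> SchemaInfo:
    """Pick the schema to export, following the best-effort rules."""
    if requested is None:
        return default
    # A different number of fields is not another representation of the
    # same data, so this is an error rather than a fallback.
    if requested.n_children != default.n_children:
        raise ValueError("requested schema is not compatible with the data")
    if requested.format in producible_formats:
        return requested
    # Cannot honor the request: behave as if None had been passed.
    return default
```

Because the fallback is silent, a caller that needs a specific representation has to inspect the schema it actually receives.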
Device Support#
The PyCapsule interface provides cross-hardware support through the C device interface. This means it is possible to exchange data on non-CPU devices (e.g. CUDA GPUs) and to inspect which device the exchanged data lives on.
For exchanging the data structures, this interface has two sets of protocol methods: the standard CPU-only versions (__arrow_c_array__() and __arrow_c_stream__()) and the equivalent device-aware versions (__arrow_c_device_array__() and __arrow_c_device_stream__()).
CPU-only producers may implement either only the standard CPU-only protocol methods, or both the CPU-only and the device-aware methods. The absence of the device version methods implies CPU-only data. CPU-only consumers are encouraged to be able to consume both versions of the protocol.
For a device-aware producer whose data structures can only reside in non-CPU memory, it is recommended to only implement the device version of the protocol (e.g. only add __arrow_c_device_array__, and not add __arrow_c_array__).
Producers that have data structures that can live on either CPU or non-CPU devices can implement both versions of the protocol, but the CPU-only versions (__arrow_c_array__() and __arrow_c_stream__()) should be guaranteed to contain valid pointers for CPU memory (thus, when trying to export non-CPU data, either raise an error or make a copy to CPU memory).
Producing the ArrowDeviceArray and ArrowDeviceArrayStream structures is expected to not involve any cross-device copying of data.
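On the consumer side, supporting both versions of the protocol can be as simple as trying the device-aware method first and falling back to the CPU-only one. The sketch below assumes a hypothetical helper `export_array_capsules` and stand-in producer classes; it only selects the method and does not import the resulting structs.

```python
def export_array_capsules(obj, requested_schema=None):
    """Hypothetical helper: return (kind, schema_capsule, array_capsule)."""
    if hasattr(obj, "__arrow_c_device_array__"):
        schema, array = obj.__arrow_c_device_array__(requested_schema=requested_schema)
        # A CPU-only consumer would still need to check the device type
        # stored in the ArrowDeviceArray before reading any buffers.
        return "device", schema, array
    if hasattr(obj, "__arrow_c_array__"):
        # No device method: the data is implied to live in CPU memory.
        schema, array = obj.__arrow_c_array__(requested_schema=requested_schema)
        return "cpu", schema, array
    raise TypeError(f"{type(obj).__name__} does not export Arrow array data")


class CpuOnlyProducer:
    """Stand-in producer implementing only the CPU-only method."""

    def __arrow_c_array__(self, requested_schema=None):
        return ("schema placeholder", "array placeholder")


class DeviceAwareProducer(CpuOnlyProducer):
    """Stand-in producer implementing both versions of the protocol."""

    def __arrow_c_device_array__(self, requested_schema=None, **kwargs):
        return ("schema placeholder", "device array placeholder")
```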
The device-aware methods (__arrow_c_device_array__() and __arrow_c_device_stream__()) should accept additional keyword arguments (**kwargs), provided they have a default value of None. This allows for future addition of new optional keywords, where the default value for such a new keyword will always be None. The implementor is responsible for raising a NotImplementedError for any additional keyword passed by the user that is not recognised. For example:
def __arrow_c_device_array__(self, requested_schema=None, **kwargs):
    non_default_kwargs = [
        name for name, value in kwargs.items() if value is not None
    ]
    if non_default_kwargs:
        raise NotImplementedError(
            f"Received unsupported keyword argument(s): {non_default_kwargs}"
        )
    ...
Protocol Typehints#
The following typehints can be copied into your library to annotate that a function accepts an object implementing one of these protocols.
from typing import Tuple, Protocol
from typing_extensions import Self

class ArrowSchemaExportable(Protocol):
    def __arrow_c_schema__(self) -> object: ...

class ArrowArrayExportable(Protocol):
    def __arrow_c_array__(
        self,
        requested_schema: object | None = None
    ) -> Tuple[object, object]:
        ...

class ArrowStreamExportable(Protocol):
    def __arrow_c_stream__(
        self,
        requested_schema: object | None = None
    ) -> object:
        ...

class ArrowDeviceArrayExportable(Protocol):
    def __arrow_c_device_array__(
        self,
        requested_schema: object | None = None,
        **kwargs,
    ) -> Tuple[object, object]:
        ...

class ArrowDeviceStreamExportable(Protocol):
    def __arrow_c_device_stream__(
        self,
        requested_schema: object | None = None,
        **kwargs,
    ) -> object:
        ...
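As a usage sketch, one of these protocols can annotate a function parameter and, if decorated with `typing.runtime_checkable`, also back an `isinstance` check. The decorator is an addition made for this example and is not part of the typehints above; `export_capsules` and `FakeProducer` are hypothetical names.

```python
from typing import Optional, Protocol, Tuple, runtime_checkable


@runtime_checkable
class ArrowArrayExportable(Protocol):
    def __arrow_c_array__(
        self, requested_schema: Optional[object] = None
    ) -> Tuple[object, object]: ...


def export_capsules(data: ArrowArrayExportable) -> Tuple[object, object]:
    """Hypothetical entry point that accepts any Arrow array producer."""
    # isinstance() on a runtime-checkable protocol only checks that the
    # method exists; it does not validate the signature or return value.
    if not isinstance(data, ArrowArrayExportable):
        raise TypeError(
            f"{type(data).__name__} does not implement __arrow_c_array__"
        )
    return data.__arrow_c_array__()


class FakeProducer:
    """Stand-in producer; a real one would return PyCapsules."""

    def __arrow_c_array__(self, requested_schema=None):
        return ("schema capsule placeholder", "array capsule placeholder")
```

Static type checkers accept `FakeProducer` here without any inheritance, since protocols are matched structurally.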
Examples#
Create a PyCapsule#
To create a PyCapsule, use the PyCapsule_New function. The function must be passed a destructor function that will be called to release the data the capsule points to. It must first call the release callback if it is not null, then free the struct.
Below is the code to create a PyCapsule for an ArrowSchema. The code for ArrowArray and ArrowArrayStream is similar.
#include <Python.h>
#include <stdlib.h>  // malloc, free
// struct ArrowSchema comes from the Arrow C Data Interface definitions.

void ReleaseArrowSchemaPyCapsule(PyObject* capsule) {
    struct ArrowSchema* schema =
        (struct ArrowSchema*)PyCapsule_GetPointer(capsule, "arrow_schema");
    if (schema->release != NULL) {
        schema->release(schema);
    }
    free(schema);
}

PyObject* ExportArrowSchemaPyCapsule() {
    struct ArrowSchema* schema =
        (struct ArrowSchema*)malloc(sizeof(struct ArrowSchema));
    // Fill in ArrowSchema fields
    // ...
    return PyCapsule_New(schema, "arrow_schema", ReleaseArrowSchemaPyCapsule);
}
The equivalent in Cython:

cimport cpython
from libc.stdlib cimport malloc, free

cdef void release_arrow_schema_py_capsule(object schema_capsule):
    cdef ArrowSchema* schema = <ArrowSchema*>cpython.PyCapsule_GetPointer(
        schema_capsule, 'arrow_schema'
    )
    if schema.release != NULL:
        schema.release(schema)

    free(schema)

cdef object export_arrow_schema_py_capsule():
    cdef ArrowSchema* schema = <ArrowSchema*>malloc(sizeof(ArrowSchema))
    # It's recommended to immediately wrap the struct in a capsule, so
    # if subsequent lines raise an exception memory will not be leaked.
    schema.release = NULL
    capsule = cpython.PyCapsule_New(
        <void*>schema, 'arrow_schema', release_arrow_schema_py_capsule
    )
    # Fill in ArrowSchema fields:
    # schema.format = ...
    # ...
    return capsule
Consume a PyCapsule#
To consume a PyCapsule, use the PyCapsule_GetPointer function to get the pointer to the underlying struct. Import the struct using your system’s Arrow C Data Interface import function. Only after that should the capsule be freed.
The example below shows how to consume a PyCapsule for an ArrowSchema. The code for ArrowArray and ArrowArrayStream is similar.
#include <Python.h>

// If the capsule is not an ArrowSchema, will return NULL and set an exception.
struct ArrowSchema* GetArrowSchemaPyCapsule(PyObject* capsule) {
    return PyCapsule_GetPointer(capsule, "arrow_schema");
}

The equivalent in Cython:

cimport cpython

cdef ArrowSchema* get_arrow_schema_py_capsule(object capsule) except NULL:
    return <ArrowSchema*>cpython.PyCapsule_GetPointer(capsule, 'arrow_schema')
Backwards Compatibility with PyArrow#
When interacting with PyArrow, the PyCapsule interface should be preferred over the _export_to_c and _import_from_c methods. However, many libraries will want to support a range of PyArrow versions. This can be done via duck typing.
For example, if your library had an import method such as:
import pyarrow as pa

# OLD METHOD
def from_arrow(arr: pa.Array):
    array_import_ptr = make_array_import_ptr()
    schema_import_ptr = make_schema_import_ptr()
    arr._export_to_c(array_import_ptr, schema_import_ptr)
    return import_c_data(array_import_ptr, schema_import_ptr)
You can rewrite this method to support both PyArrow and other libraries that implement the PyCapsule interface:
# NEW METHOD
def from_arrow(arr):
    # Newer versions of PyArrow as well as other libraries with Arrow data
    # implement this method, so prefer it over _export_to_c.
    if hasattr(arr, "__arrow_c_array__"):
        schema_ptr, array_ptr = arr.__arrow_c_array__()
        return import_c_capsule_data(schema_ptr, array_ptr)
    elif isinstance(arr, pa.Array):
        # Deprecated method, used for older versions of PyArrow
        array_import_ptr = make_array_import_ptr()
        schema_import_ptr = make_schema_import_ptr()
        arr._export_to_c(array_import_ptr, schema_import_ptr)
        return import_c_data(array_import_ptr, schema_import_ptr)
    else:
        raise TypeError(f"Cannot import {type(arr)} as Arrow array data.")
You may also wish to accept objects implementing the protocol in your constructors. For example, in PyArrow, the array() and record_batch() constructors accept any object that implements the __arrow_c_array__() method protocol. Similarly, PyArrow’s schema() constructor accepts any object that implements the __arrow_c_schema__() method.
Now if your library has an export to PyArrow function, such as:
# OLD METHOD
def to_arrow(self) -> pa.Array:
    array_export_ptr = make_array_export_ptr()
    schema_export_ptr = make_schema_export_ptr()
    self.export_c_data(array_export_ptr, schema_export_ptr)
    return pa.Array._import_from_c(array_export_ptr, schema_export_ptr)
You can rewrite this function to use the PyCapsule interface by passing your object to the array() constructor, which accepts any object that implements the protocol. An easy way to check if the PyArrow version is new enough to support this is to check whether pa.Array has the __arrow_c_array__ method.
import pyarrow as pa

# NEW METHOD
def to_arrow(self) -> pa.Array:
    # PyArrow added support for constructing arrays from objects implementing
    # __arrow_c_array__ in the same version it added the method for its own
    # arrays. So we can use hasattr to check if the method is available as
    # a proxy for checking the PyArrow version.
    if hasattr(pa.Array, "__arrow_c_array__"):
        return pa.array(self)
    else:
        array_export_ptr = make_array_export_ptr()
        schema_export_ptr = make_schema_export_ptr()
        self.export_c_data(array_export_ptr, schema_export_ptr)
        return pa.Array._import_from_c(array_export_ptr, schema_export_ptr)
Comparison with Other Protocols#
Comparison to DataFrame Interchange Protocol#
The DataFrame Interchange Protocol is another protocol in Python that allows for the sharing of data between libraries. This protocol is complementary to the DataFrame Interchange Protocol. Many of the objects that implement this protocol will also implement the DataFrame Interchange Protocol.
This protocol is specific to Arrow-based data structures, while the DataFrame Interchange Protocol allows non-Arrow data frames and arrays to be shared as well. Because of this, these PyCapsules can support Arrow-specific features such as nested columns.
This protocol is also much more minimal than the DataFrame Interchange Protocol. It just handles data export, rather than defining accessors for details like number of rows or columns.
In summary, if you are implementing this protocol, you should also consider implementing the DataFrame Interchange Protocol.
Comparison to __arrow_array__ protocol#
The __arrow_array__ protocol (see “Controlling conversion to pyarrow.Array with the __arrow_array__ protocol”) is a dunder method that defines how PyArrow should import an object as an Arrow array. Unlike this protocol, it is specific to PyArrow and isn’t used by other libraries. It is also limited to arrays and does not support schemas, tabular structures, or streams.