nanoarrow for Python#
The nanoarrow Python package provides bindings to the nanoarrow C library. Like the nanoarrow C library, it provides tools to facilitate the use of the Arrow C Data and Arrow C Stream interfaces.
Installation#
Python bindings for nanoarrow are not yet available on PyPI. You can install via URL (requires a C compiler):
python -m pip install "git+https://github.com/apache/arrow-nanoarrow.git#egg=nanoarrow&subdirectory=python"
If you can import the namespace, you’re good to go!
import nanoarrow as na
Low-level C library bindings#
The Arrow C Data and Arrow C Stream interfaces comprise three structures: the ArrowSchema, which represents the data type of an array; the ArrowArray, which represents the values of an array; and the ArrowArrayStream, which represents zero or more ArrowArrays with a common ArrowSchema.
Schemas#
Use nanoarrow.c_schema() to convert an object to an ArrowSchema and wrap it as a Python object. This works for any object implementing the Arrow PyCapsule Interface (e.g., pyarrow.Schema, pyarrow.DataType, and pyarrow.Field).
import pyarrow as pa
schema = na.c_schema(pa.decimal128(10, 3))
schema
<nanoarrow.c_lib.CSchema decimal128(10, 3)>
- format: 'd:10,3'
- name: ''
- flags: 2
- metadata: NULL
- dictionary: NULL
- children[0]:
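The flags value of 2 in the output above is a bit field. As a minimal sketch, assuming the flag constants defined by the Arrow C Data interface specification (ARROW_FLAG_DICTIONARY_ORDERED = 1, ARROW_FLAG_NULLABLE = 2, ARROW_FLAG_MAP_KEYS_SORTED = 4), such a value could be decoded in plain Python. The helper decode_schema_flags() is hypothetical and not part of nanoarrow.

```python
# Flag bits as defined by the Arrow C Data interface specification.
ARROW_FLAGS = {
    1: "DICTIONARY_ORDERED",
    2: "NULLABLE",
    4: "MAP_KEYS_SORTED",
}


def decode_schema_flags(flags):
    """Return the names of the flag bits set in `flags`.

    Hypothetical helper for illustration; not a nanoarrow function.
    """
    return [name for bit, name in ARROW_FLAGS.items() if flags & bit]


print(decode_schema_flags(2))  # the decimal128 schema shown above
print(decode_schema_flags(0))  # a schema with no flags set
```

Under this reading, flags: 2 means the field is nullable.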
You can extract the fields of a CSchema object one at a time or parse it into a view to extract deserialized parameters.
na.c_schema_view(schema)
<nanoarrow.c_lib.CSchemaView>
- type: 'decimal128'
- storage_type: 'decimal128'
- decimal_bitwidth: 128
- decimal_precision: 10
- decimal_scale: 3
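The parameters in the view above are encoded in the schema's format string ('d:10,3'). The sketch below shows, without nanoarrow, how a decimal format string could be parsed. It assumes the Arrow C Data interface convention d:precision,scale with an optional third bit-width component that defaults to 128. The function parse_decimal_format() is a hypothetical helper, not nanoarrow's own parser.

```python
def parse_decimal_format(fmt):
    """Parse an Arrow decimal format string such as 'd:10,3'.

    Hypothetical helper for illustration; not a nanoarrow function.
    """
    if not fmt.startswith("d:"):
        raise ValueError(f"not a decimal format string: {fmt!r}")

    parts = [int(part) for part in fmt[2:].split(",")]
    if len(parts) == 2:
        precision, scale = parts
        bitwidth = 128  # bit width is implied when omitted
    elif len(parts) == 3:
        precision, scale, bitwidth = parts
    else:
        raise ValueError(f"unexpected number of parameters in {fmt!r}")

    return {
        "decimal_bitwidth": bitwidth,
        "decimal_precision": precision,
        "decimal_scale": scale,
    }


print(parse_decimal_format("d:10,3"))
```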
Advanced users can allocate an empty CSchema and populate its contents by passing its ._addr() to a schema-exporting function.
schema = na.allocate_c_schema()
pa.int32()._export_to_c(schema._addr())
schema
<nanoarrow.c_lib.CSchema int32>
- format: 'i'
- name: ''
- flags: 2
- metadata: NULL
- dictionary: NULL
- children[0]:
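To make the "allocate, then export into an address" pattern more concrete, here is a rough ctypes sketch that does not use nanoarrow or pyarrow. The struct layout follows struct ArrowSchema from the Arrow C Data interface specification. The producer toy_export_int32() is hypothetical, and it skips the release callback that a real producer is required to install.

```python
import ctypes


class ArrowSchema(ctypes.Structure):
    """Field layout of struct ArrowSchema (Arrow C Data interface)."""


ArrowSchema._fields_ = [
    ("format", ctypes.c_char_p),
    ("name", ctypes.c_char_p),
    ("metadata", ctypes.c_char_p),
    ("flags", ctypes.c_int64),
    ("n_children", ctypes.c_int64),
    ("children", ctypes.POINTER(ctypes.POINTER(ArrowSchema))),
    ("dictionary", ctypes.POINTER(ArrowSchema)),
    ("release", ctypes.c_void_p),  # a function pointer in C; simplified here
    ("private_data", ctypes.c_void_p),
]

# Module-level constants keep the C strings alive while the struct points at them.
_FORMAT_INT32 = b"i"
_EMPTY_NAME = b""


def toy_export_int32(addr):
    """Hypothetical producer: fill the ArrowSchema located at address `addr`."""
    out = ArrowSchema.from_address(addr)
    out.format = _FORMAT_INT32
    out.name = _EMPTY_NAME
    out.flags = 2  # ARROW_FLAG_NULLABLE
    out.n_children = 0
    # A real producer must also set `release`; this toy version leaves it NULL.


schema = ArrowSchema()  # zero-initialized, like a freshly allocated struct
toy_export_int32(ctypes.addressof(schema))
print(schema.format, schema.name, schema.flags, schema.n_children)
```

The consumer owns the memory for the struct itself, while the producer fills in its fields. This is why the exporting function needs only an integer address.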
The CSchema object cleans up after itself: when the object is deleted, the underlying ArrowSchema is released.
Arrays#
You can use nanoarrow.c_array() to convert an array-like object to an ArrowArray, wrap it as a Python object, and attach a schema that can be used to interpret its contents. This works for any object implementing the Arrow PyCapsule Interface (e.g., pyarrow.Array, pyarrow.RecordBatch).
array = na.c_array(pa.array(["one", "two", "three", None]))
array
<nanoarrow.c_lib.CArray string>
- length: 4
- offset: 0
- null_count: 1
- buffers: (2939032895680, 2939032895616, 2939032895744)
- dictionary: NULL
- children[0]:
You can extract the fields of a CArray one at a time or parse it into a view to extract deserialized content:
na.c_array_view(array)
<nanoarrow.c_lib.CArrayView>
- storage_type: 'string'
- length: 4
- offset: 0
- null_count: 1
- buffers[3]:
  - <bool validity[1 b] 11100000>
  - <int32 data_offset[20 b] 0 3 6 11 11>
  - <string data[11 b] b'onetwothree'>
- dictionary: NULL
- children[0]:
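The three buffers in the view above are enough to reconstruct the original values. The sketch below rebuilds them in plain Python without nanoarrow. It assumes the Arrow columnar layout for a string array: a validity bitmap with the least-significant bit first, little-endian int32 offsets, and UTF-8 data. The function decode_string_array() is a hypothetical helper, and the buffer contents are recreated by hand to match the printed output.

```python
import struct

# Buffers recreated by hand to match the view shown above.
validity = bytes([0b00000111])                 # elements 0, 1, 2 valid; 3 is null
offsets = struct.pack("<5i", 0, 3, 6, 11, 11)  # length + 1 little-endian int32s
data = b"onetwothree"


def decode_string_array(validity, offsets, data, length):
    """Rebuild Python values from the buffers of an Arrow string array.

    Hypothetical helper for illustration; not a nanoarrow function.
    """
    starts = struct.unpack(f"<{length + 1}i", offsets)
    values = []
    for i in range(length):
        # Bit i of the bitmap (LSB first within each byte) marks validity.
        is_valid = (validity[i // 8] >> (i % 8)) & 1
        if is_valid:
            values.append(data[starts[i]:starts[i + 1]].decode("utf-8"))
        else:
            values.append(None)
    return values


print(decode_string_array(validity, offsets, data, length=4))
```

The printed bitmap 11100000 lists the bits in element order, so the first three elements are valid and the fourth is null, which matches null_count: 1.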
As with the CSchema, you can allocate an empty CArray and pass its address (from ._addr()) to an array-exporting function.
array = na.allocate_c_array()
pa.array([1, 2, 3])._export_to_c(array._addr(), array.schema._addr())
array.length
3
Array streams#
You can use nanoarrow.c_array_stream() to convert an object representing a sequence of CArrays with a common CSchema to an ArrowArrayStream and wrap it as a Python object. This works for any object implementing the Arrow PyCapsule Interface (e.g., pyarrow.RecordBatchReader).
pa_array_child = pa.array([1, 2, 3], pa.int32())
pa_array = pa.record_batch([pa_array_child], names=["some_column"])
reader = pa.RecordBatchReader.from_batches(pa_array.schema, [pa_array])
array_stream = na.c_array_stream(reader)
array_stream
<nanoarrow.c_lib.CArrayStream>
- get_schema(): <nanoarrow.c_lib.CSchema struct>
  - format: '+s'
  - name: ''
  - flags: 0
  - metadata: NULL
  - dictionary: NULL
  - children[1]:
    'some_column': <nanoarrow.c_lib.CSchema int32>
      - format: 'i'
      - name: 'some_column'
      - flags: 2
      - metadata: NULL
      - dictionary: NULL
      - children[0]:
You can pull the next array from the stream using .get_next() or use it like an iterator. The .get_next() method will raise StopIteration when there are no more arrays in the stream.
for array in array_stream:
    print(array)
<nanoarrow.c_lib.CArray struct>
- length: 3
- offset: 0
- null_count: 0
- buffers: (0,)
- dictionary: NULL
- children[1]:
  'some_column': <nanoarrow.c_lib.CArray int32>
    - length: 3
    - offset: 0
    - null_count: 0
    - buffers: (0, 2939033026688)
    - dictionary: NULL
    - children[0]:
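The relationship between .get_next() and iteration can be illustrated with a small stand-in class. This is only a sketch of the protocol described above, and ToyArrayStream is hypothetical rather than part of nanoarrow. Because .get_next() raises StopIteration when exhausted, the iterator's __next__ can simply delegate to it.

```python
class ToyArrayStream:
    """Hypothetical stand-in for a stream of arrays; not part of nanoarrow."""

    def __init__(self, arrays):
        self._pending = list(arrays)

    def get_next(self):
        # Signal exhaustion the same way the text describes for .get_next().
        if not self._pending:
            raise StopIteration
        return self._pending.pop(0)

    def __iter__(self):
        return self

    def __next__(self):
        return self.get_next()


stream = ToyArrayStream([[1, 2, 3], [4, 5]])
first = stream.get_next()  # pull one array explicitly
rest = list(stream)        # consume the remainder by iterating

try:
    stream.get_next()
    exhausted = False
except StopIteration:
    exhausted = True

print(first, rest, exhausted)
```

As the sketch shows, a stream of this kind is consumed once: after iterating to the end, further calls to .get_next() raise StopIteration.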
You can also get the address of a freshly-allocated stream to pass to a suitable exporting function:
array_stream = na.allocate_c_array_stream()
reader._export_to_c(array_stream._addr())
array_stream
<nanoarrow.c_lib.CArrayStream>
- get_schema(): <nanoarrow.c_lib.CSchema struct>
  - format: '+s'
  - name: ''
  - flags: 0
  - metadata: NULL
  - dictionary: NULL
  - children[1]:
    'some_column': <nanoarrow.c_lib.CSchema int32>
      - format: 'i'
      - name: 'some_column'
      - flags: 2
      - metadata: NULL
      - dictionary: NULL
      - children[0]:
Development#
Python bindings for nanoarrow are managed with setuptools. This means you can build the project using:
git clone https://github.com/apache/arrow-nanoarrow.git
cd arrow-nanoarrow/python
pip install -e .
Tests use pytest:
# Install dependencies
pip install -e ".[test]"
# Run tests
pytest -vvx