nanoarrow for Python
====================
The nanoarrow Python package provides bindings to the nanoarrow C
library. Like the nanoarrow C library, it provides tools to facilitate
the use of the Arrow C Data and Arrow C Stream interfaces.
Installation
------------
Python bindings for nanoarrow are not yet available on PyPI. You can
install via URL (requires a C compiler):

.. code:: bash

   python -m pip install "git+https://github.com/apache/arrow-nanoarrow.git#egg=nanoarrow&subdirectory=python"
If you can import the namespace, you’re good to go!

.. code:: python

   import nanoarrow as na
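A script that should fail early with a helpful message can check for the
package before importing it. The following is a minimal sketch using only
the standard library; the helper name ``is_importable`` is hypothetical and
not part of nanoarrow.

```python
import importlib.util


def is_importable(module_name):
    """Return True if the import system can locate ``module_name``.

    Hypothetical helper for illustration; it does not import the module.
    """
    return importlib.util.find_spec(module_name) is not None


if not is_importable("nanoarrow"):
    print("nanoarrow is not installed; see the pip command in the Installation section")
```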
Low-level C library bindings
----------------------------
The Arrow C Data and Arrow C Stream interfaces comprise three
structures: the ``ArrowSchema``, which represents the data type of an
array; the ``ArrowArray``, which represents the values of an array; and
the ``ArrowArrayStream``, which represents zero or more ``ArrowArray``\ s
with a common ``ArrowSchema``.
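To make the layout concrete, the following sketch mirrors the
``ArrowSchema`` struct from the Arrow C Data interface specification using
the standard-library ``ctypes`` module. It is illustrative only: nanoarrow
does not expose this class, and the Python class name is an assumption made
for the example.

```python
import ctypes


class ArrowSchema(ctypes.Structure):
    """ctypes mirror of ``struct ArrowSchema`` (illustration only)."""


# Fields are assigned after the class exists because the struct refers to itself.
ArrowSchema._fields_ = [
    ("format", ctypes.c_char_p),  # data type encoded as a string, e.g. b"i" for int32
    ("name", ctypes.c_char_p),
    ("metadata", ctypes.c_char_p),  # binary-encoded key/value pairs, or NULL
    ("flags", ctypes.c_int64),
    ("n_children", ctypes.c_int64),
    ("children", ctypes.POINTER(ctypes.POINTER(ArrowSchema))),
    ("dictionary", ctypes.POINTER(ArrowSchema)),
    # Set by the producer; called by the consumer to free the producer's resources.
    ("release", ctypes.CFUNCTYPE(None, ctypes.POINTER(ArrowSchema))),
    ("private_data", ctypes.c_void_p),
]

# A zero-initialized struct has a NULL release callback, which is how the
# specification marks a structure as released.
empty = ArrowSchema()
print(bool(empty.release))
```

The ``ArrowArray`` and ``ArrowArrayStream`` structs follow the same pattern:
plain data members plus a ``release`` callback owned by the producer.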
Schemas
~~~~~~~
Use ``nanoarrow.c_schema()`` to convert an object to an ``ArrowSchema``
and wrap it as a Python object. This works for any object implementing
the Arrow PyCapsule Interface
(e.g., ``pyarrow.Schema``, ``pyarrow.DataType``, and ``pyarrow.Field``).

.. code:: python

   import pyarrow as pa
   schema = na.c_schema(pa.decimal128(10, 3))
   schema

::

   - format: 'd:10,3'
   - name: ''
   - flags: 2
   - metadata: NULL
   - dictionary: NULL
   - children[0]:
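The ``flags`` field is a bit field. The sketch below decodes it using the
flag values defined by the Arrow C Data interface specification
(``ARROW_FLAG_DICTIONARY_ORDERED = 1``, ``ARROW_FLAG_NULLABLE = 2``,
``ARROW_FLAG_MAP_KEYS_SORTED = 4``). The ``decode_flags`` helper is
hypothetical and not part of nanoarrow.

```python
# Flag values from the Arrow C Data interface specification
ARROW_FLAG_DICTIONARY_ORDERED = 1
ARROW_FLAG_NULLABLE = 2
ARROW_FLAG_MAP_KEYS_SORTED = 4

_FLAG_NAMES = {
    ARROW_FLAG_DICTIONARY_ORDERED: "dictionary_ordered",
    ARROW_FLAG_NULLABLE: "nullable",
    ARROW_FLAG_MAP_KEYS_SORTED: "map_keys_sorted",
}


def decode_flags(flags):
    """Return the names of the bits set in an ArrowSchema ``flags`` value."""
    return [name for bit, name in _FLAG_NAMES.items() if flags & bit]


print(decode_flags(2))  # ['nullable']
```

Under these definitions, the ``flags: 2`` shown above means the field is
nullable.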
You can extract the fields of a ``CSchema`` object one at a time or
parse it into a view to extract deserialized parameters.

.. code:: python

   na.c_schema_view(schema)

::

   - type: 'decimal128'
   - storage_type: 'decimal128'
   - decimal_bitwidth: 128
   - decimal_precision: 10
   - decimal_scale: 3
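The decimal parameters in the view come from the schema's format string. As
a rough illustration of that parsing step (not nanoarrow's implementation),
the hypothetical helper below handles only the decimal format
``d:precision,scale[,bitwidth]`` described in the Arrow C Data interface
specification, where an omitted bit width means 128.

```python
def parse_decimal_format(fmt):
    """Parse an Arrow decimal format string such as 'd:10,3'.

    Hypothetical helper for illustration; only decimal formats are handled.
    """
    if not fmt.startswith("d:"):
        raise ValueError(f"not a decimal format string: {fmt!r}")

    parts = [int(part) for part in fmt[2:].split(",")]
    if len(parts) == 2:
        precision, scale = parts
        bitwidth = 128  # the specification's default when no width is given
    elif len(parts) == 3:
        precision, scale, bitwidth = parts
    else:
        raise ValueError(f"malformed decimal format string: {fmt!r}")

    return {
        "decimal_bitwidth": bitwidth,
        "decimal_precision": precision,
        "decimal_scale": scale,
    }


print(parse_decimal_format("d:10,3"))
# {'decimal_bitwidth': 128, 'decimal_precision': 10, 'decimal_scale': 3}
```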
Advanced users can allocate an empty ``CSchema`` and populate its
contents by passing its ``._addr()`` to a schema-exporting function.

.. code:: python

   schema = na.allocate_c_schema()
   pa.int32()._export_to_c(schema._addr())
   schema

::

   - format: 'i'
   - name: ''
   - flags: 2
   - metadata: NULL
   - dictionary: NULL
   - children[0]:
The ``CSchema`` object cleans up after itself: when the object is
deleted, the underlying ``ArrowSchema`` is released.
Arrays
~~~~~~
You can use ``nanoarrow.c_array()`` to convert an array-like object to
an ``ArrowArray``, wrap it as a Python object, and attach a schema that
can be used to interpret its contents. This works for any object
implementing the Arrow PyCapsule Interface
(e.g., ``pyarrow.Array``, ``pyarrow.RecordBatch``).

.. code:: python

   array = na.c_array(pa.array(["one", "two", "three", None]))
   array

::

   - length: 4
   - offset: 0
   - null_count: 1
   - buffers: (2939032895680, 2939032895616, 2939032895744)
   - dictionary: NULL
   - children[0]:
You can extract the fields of a ``CArray`` one at a time or parse it
into a view to extract deserialized content:

.. code:: python

   na.c_array_view(array)

::

   - storage_type: 'string'
   - length: 4
   - offset: 0
   - null_count: 1
   - buffers[3]:
     -
     -
     -
   - dictionary: NULL
   - children[0]:
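In the Arrow columnar format, the three buffers of a string array are a
validity bitmap (one bit per element, least-significant bit first),
``int32`` offsets, and the concatenated UTF-8 data. The sketch below
rebuilds the values of the array above from hand-written buffers. It uses
only the standard library, assumes little-endian offsets, and the helper
name ``decode_string_array`` is hypothetical.

```python
import struct


def decode_string_array(validity, offsets, data, length):
    """Decode the buffers of an Arrow string array into a Python list.

    Hypothetical helper for illustration; assumes offset 0 and a
    non-NULL validity bitmap.
    """
    # length + 1 little-endian int32 offsets delimit each element in `data`
    offs = struct.unpack(f"<{length + 1}i", offsets[: 4 * (length + 1)])

    values = []
    for i in range(length):
        is_valid = (validity[i // 8] >> (i % 8)) & 1
        if is_valid:
            values.append(data[offs[i]:offs[i + 1]].decode("utf-8"))
        else:
            values.append(None)
    return values


validity = bytes([0b00000111])  # elements 0-2 are valid, element 3 is null
offsets = struct.pack("<5i", 0, 3, 6, 11, 11)
data = b"onetwothree"

print(decode_string_array(validity, offsets, data, 4))
# ['one', 'two', 'three', None]
```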
As with the ``CSchema``, you can allocate an empty ``CArray`` and pass
its address, obtained with ``_addr()``, to other array-exporting
functions.

.. code:: python

   array = na.allocate_c_array()
   pa.array([1, 2, 3])._export_to_c(array._addr(), array.schema._addr())
   array.length

::

   3
Array streams
~~~~~~~~~~~~~
You can use ``nanoarrow.c_array_stream()`` to convert an object
representing a sequence of ``CArray``\ s with a common ``CSchema`` to an
``ArrowArrayStream`` and wrap it as a Python object. This works for any
object implementing the Arrow PyCapsule Interface
(e.g., ``pyarrow.RecordBatchReader``).

.. code:: python

   pa_array_child = pa.array([1, 2, 3], pa.int32())
   pa_array = pa.record_batch([pa_array_child], names=["some_column"])
   reader = pa.RecordBatchReader.from_batches(pa_array.schema, [pa_array])
   array_stream = na.c_array_stream(reader)
   array_stream

::

   - get_schema():
     - format: '+s'
     - name: ''
     - flags: 0
     - metadata: NULL
     - dictionary: NULL
     - children[1]:
       'some_column':
         - format: 'i'
         - name: 'some_column'
         - flags: 2
         - metadata: NULL
         - dictionary: NULL
         - children[0]:
You can pull the next array from the stream using ``.get_next()`` or use
it like an iterator. The ``.get_next()`` method will raise
``StopIteration`` when there are no more arrays in the stream.

.. code:: python

   for array in array_stream:
       print(array)

::

   - length: 3
   - offset: 0
   - null_count: 0
   - buffers: (0,)
   - dictionary: NULL
   - children[1]:
     'some_column':
       - length: 3
       - offset: 0
       - null_count: 0
       - buffers: (0, 2939033026688)
       - dictionary: NULL
       - children[0]:
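The relationship between ``.get_next()`` and iteration can be illustrated
with a small pure-Python stand-in. The class below is hypothetical and is
not how nanoarrow implements its stream wrapper; it only shows how a
``get_next()`` method that raises ``StopIteration`` plugs into Python's
iterator protocol.

```python
class FakeArrayStream:
    """Hypothetical stand-in for a stream wrapper (illustration only)."""

    def __init__(self, arrays):
        self._arrays = iter(arrays)

    def get_next(self):
        # next() raises StopIteration once the underlying iterator is exhausted
        return next(self._arrays)

    def __iter__(self):
        return self

    def __next__(self):
        # Delegating to get_next() lets a for loop stop on StopIteration
        return self.get_next()


stream = FakeArrayStream(["batch 0", "batch 1"])
print(stream.get_next())  # batch 0
print(list(stream))  # ['batch 1']
```

Because both paths consume the same underlying sequence, an array pulled
with ``get_next()`` is not seen again by a later loop over the stream.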
You can also get the address of a freshly-allocated stream to pass to a
suitable exporting function:

.. code:: python

   array_stream = na.allocate_c_array_stream()
   reader._export_to_c(array_stream._addr())
   array_stream

::

   - get_schema():
     - format: '+s'
     - name: ''
     - flags: 0
     - metadata: NULL
     - dictionary: NULL
     - children[1]:
       'some_column':
         - format: 'i'
         - name: 'some_column'
         - flags: 2
         - metadata: NULL
         - dictionary: NULL
         - children[0]:
Development
-----------
Python bindings for nanoarrow are managed with setuptools. This
means you can build the project using:

.. code:: shell

   git clone https://github.com/apache/arrow-nanoarrow.git
   cd arrow-nanoarrow/python
   pip install -e .
Tests use pytest:

.. code:: shell

   # Install dependencies (quoted so shells such as zsh do not expand the brackets)
   pip install -e ".[test]"

   # Run tests
   pytest -vvx