nanoarrow for Python
====================
The nanoarrow Python package provides bindings to the nanoarrow C
library. Like the nanoarrow C library, it provides tools to facilitate
the use of the Arrow C Data and Arrow C Stream interfaces.
Installation
------------
The nanoarrow Python bindings are available from PyPI and conda-forge:
.. code:: shell
pip install nanoarrow
conda install nanoarrow -c conda-forge
Development versions (based on the ``main`` branch) are also available:
.. code:: shell
pip install --extra-index-url https://pypi.fury.io/arrow-nightlies/ \
--prefer-binary --pre nanoarrow
If you can import the namespace, you’re good to go!
.. code:: python
import nanoarrow as na
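To check an environment without importing the package, the standard library can report whether a distribution is installed and at which version. The sketch below is not part of nanoarrow: the helper name ``installed_version`` is hypothetical, and the distribution name ``nanoarrow`` is assumed from the install commands above.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# "nanoarrow" is the distribution name assumed from the pip command above
version = installed_version("nanoarrow")
if version is None:
    print("nanoarrow is not installed")
else:
    print(f"nanoarrow {version} is installed")
```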
Data types, arrays, and array streams
-------------------------------------
The Arrow C Data and Arrow C Stream interfaces consist of three
structures: the ``ArrowSchema``, which represents the data type of an
array; the ``ArrowArray``, which represents the values of an array; and
the ``ArrowArrayStream``, which represents zero or more ``ArrowArray``\ s
with a common ``ArrowSchema``. These concepts map to the
``nanoarrow.Schema``, ``nanoarrow.Array``, and ``nanoarrow.ArrayStream``
classes in the Python package.
.. code:: python
na.int32()
::
int32
.. code:: python
na.Array([1, 2, 3], na.int32())
::
nanoarrow.Array<int32>[3]
1
2
3
The ``nanoarrow.Array`` can accommodate arrays with any number of
chunks, reflecting the reality that many array containers (e.g.,
``pyarrow.ChunkedArray``, ``polars.Series``) support this.
.. code:: python
chunked = na.Array.from_chunks([[1, 2, 3], [4, 5, 6]], na.int32())
chunked
::
nanoarrow.Array<int32>[6]
1
2
3
4
5
6
Whereas chunks of an ``Array`` are always fully materialized when the
object is constructed, the chunks of an ``ArrayStream`` have not
necessarily been resolved yet.
.. code:: python
stream = na.ArrayStream(chunked)
stream
::
nanoarrow.ArrayStream<int32>
.. code:: python
with stream:
    for chunk in stream:
        print(chunk)
::
nanoarrow.Array<int32>[3]
1
2
3
nanoarrow.Array<int32>[3]
4
5
6
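The contrast between a fully materialized ``Array`` and a lazily resolved ``ArrayStream`` is similar to the contrast between a Python list and a generator. The sketch below uses plain Python only (no nanoarrow calls); ``make_chunks`` and ``produced`` are hypothetical names introduced for illustration.

```python
from typing import Iterator, List

produced: List[int] = []  # records which chunks have been computed so far


def make_chunks() -> Iterator[List[int]]:
    """Hypothetical chunk producer: each chunk is computed only when requested."""
    for i, chunk in enumerate([[1, 2, 3], [4, 5, 6]]):
        produced.append(i)
        yield chunk


# Stream-like: creating the generator computes nothing yet
lazy = make_chunks()
assert produced == []

# Pulling one chunk resolves only that chunk
first = next(lazy)
assert first == [1, 2, 3] and produced == [0]

# Array-like: collecting every chunk materializes all of them up front
materialized = [first, *lazy]
assert produced == [0, 1]
print(materialized)  # [[1, 2, 3], [4, 5, 6]]
```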
The ``nanoarrow.ArrayStream`` also provides an interface to nanoarrow’s
Arrow IPC reader:
.. code:: python
url = "https://github.com/apache/arrow-experiments/raw/main/data/arrow-commits/arrow-commits.arrows"
na.ArrayStream.from_url(url)
::
nanoarrow.ArrayStream
These objects implement the Arrow PyCapsule interface for both
producing and consuming and are interchangeable with ``pyarrow``
objects in many cases:
.. code:: python
import pyarrow as pa
pa.field(na.int32())
::
pyarrow.Field<: int32>
.. code:: python
pa.chunked_array(chunked)
::
[
[
1,
2,
3
],
[
4,
5,
6
]
]
.. code:: python
pa.array(chunked.chunk(1))
::
[
4,
5,
6
]
.. code:: python
na.Array(pa.array([10, 11, 12]))
::
nanoarrow.Array<int64>[3]
10
11
12
.. code:: python
na.Schema(pa.string())
::
string
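The interchange shown above relies on duck typing: under the Arrow PyCapsule interface, a producer exposes one or more of the ``__arrow_c_schema__``, ``__arrow_c_array__``, and ``__arrow_c_stream__`` methods, and a consumer looks for them. The helper below is a hypothetical sketch (not a nanoarrow or pyarrow function) that reports which of these methods an object provides. ``FakeProducer`` is a stand-in that returns no real capsules.

```python
from typing import List

# Method names defined by the Arrow PyCapsule interface
PROTOCOL_METHODS = ("__arrow_c_schema__", "__arrow_c_array__", "__arrow_c_stream__")


def arrow_protocols(obj: object) -> List[str]:
    """List the Arrow PyCapsule protocol methods that obj exposes."""
    return [name for name in PROTOCOL_METHODS if callable(getattr(obj, name, None))]


class FakeProducer:
    """Stand-in for illustration only; a real producer would return PyCapsules."""

    def __arrow_c_schema__(self):
        raise NotImplementedError("illustration only")

    def __arrow_c_array__(self, requested_schema=None):
        raise NotImplementedError("illustration only")


print(arrow_protocols(FakeProducer()))  # ['__arrow_c_schema__', '__arrow_c_array__']
print(arrow_protocols([1, 2, 3]))       # []
```

With nanoarrow or pyarrow installed, the same helper could be applied to objects such as ``na.int32()`` or ``pa.array([1, 2, 3])`` to see which parts of the interface they implement.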
Low-level C library bindings
----------------------------
The nanoarrow Python package also provides lower-level wrappers around
Arrow C interface structures. You can create these using
``nanoarrow.c_schema()``, ``nanoarrow.c_array()``, and
``nanoarrow.c_array_stream()``.
Schemas
~~~~~~~
Use ``nanoarrow.c_schema()`` to convert an object to an ``ArrowSchema``
and wrap it as a Python object. This works for any object implementing
the Arrow PyCapsule Interface
(e.g., ``pyarrow.Schema``, ``pyarrow.DataType``, and ``pyarrow.Field``).
.. code:: python
na.c_schema(pa.decimal128(10, 3))
::
- format: 'd:10,3'
- name: ''
- flags: 2
- metadata: NULL
- dictionary: NULL
- children[0]:
Using ``c_schema()`` is a good fit for testing and for ephemeral schema
objects that are being passed from one library to another. To extract
the fields of a schema in a more convenient form, use ``Schema()``:
.. code:: python
schema = na.Schema(pa.decimal128(10, 3))
schema.precision, schema.scale
::
(10, 3)
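The ``format`` and ``flags`` fields printed by ``c_schema()`` above follow the encoding defined by the Arrow C Data interface. A decimal type is written as ``d:precision,scale`` with an optional third bit-width component (128 when omitted), and ``flags`` is a bit field in which the value 2 marks a nullable field. The sketch below decodes both with the standard library only. The function names are hypothetical and are not part of nanoarrow.

```python
from typing import List, Tuple

# Flag bits defined by the Arrow C Data interface
ARROW_FLAG_DICTIONARY_ORDERED = 1
ARROW_FLAG_NULLABLE = 2
ARROW_FLAG_MAP_KEYS_SORTED = 4


def parse_decimal_format(fmt: str) -> Tuple[int, int, int]:
    """Parse 'd:precision,scale[,bitwidth]' into (precision, scale, bitwidth)."""
    if not fmt.startswith("d:"):
        raise ValueError(f"not a decimal format string: {fmt!r}")
    parts = [int(p) for p in fmt[2:].split(",")]
    if len(parts) == 2:
        parts.append(128)  # bit width defaults to 128 when omitted
    if len(parts) != 3:
        raise ValueError(f"malformed decimal format string: {fmt!r}")
    precision, scale, bitwidth = parts
    return precision, scale, bitwidth


def decode_flags(flags: int) -> List[str]:
    """Return the names of the flag bits that are set."""
    names = {
        ARROW_FLAG_DICTIONARY_ORDERED: "dictionary_ordered",
        ARROW_FLAG_NULLABLE: "nullable",
        ARROW_FLAG_MAP_KEYS_SORTED: "map_keys_sorted",
    }
    return [name for bit, name in names.items() if flags & bit]


print(parse_decimal_format("d:10,3"))  # (10, 3, 128)
print(decode_flags(2))                 # ['nullable']
```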
The ``CSchema`` object cleans up after itself: when the object is
deleted, the underlying ``ArrowSchema`` is released.
Arrays
~~~~~~
You can use ``nanoarrow.c_array()`` to convert an array-like object to
an ``ArrowArray``, wrap it as a Python object, and attach a schema that
can be used to interpret its contents. This works for any object
implementing the Arrow PyCapsule Interface
(e.g., ``pyarrow.Array``, ``pyarrow.RecordBatch``).
.. code:: python
na.c_array(["one", "two", "three", None], na.string())
::
- length: 4
- offset: 0
- null_count: 1
- buffers: (4754305168, 4754307808, 4754310464)
- dictionary: NULL
- children[0]:
Using ``c_array()`` is a good fit for testing and for ephemeral array
objects that are being passed from one library to another. For a
higher-level interface, use ``Array()``:
.. code:: python
array = na.Array(["one", "two", "three", None], na.string())
array.to_pylist()
::
['one', 'two', 'three', None]
.. code:: python
array.buffers
::
(nanoarrow.c_lib.CBufferView(bool[1 b] 11100000),
nanoarrow.c_lib.CBufferView(int32[20 b] 0 3 6 11 11),
nanoarrow.c_lib.CBufferView(string[11 b] b'onetwothree'))
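The three buffers shown above are the validity bitmap, the int32 offsets, and the UTF-8 data of the string array. As a rough illustration of how they fit together, the sketch below decodes equivalent raw bytes using only the standard library. It assumes the Arrow columnar layout (least-significant-bit-first bitmap, little-endian offsets). ``decode_string_array`` is a hypothetical helper, not a nanoarrow function.

```python
import struct
from typing import List, Optional


def decode_string_array(
    validity: Optional[bytes], offsets: bytes, data: bytes, length: int
) -> List[Optional[str]]:
    """Decode Arrow string-array buffers into a list of Python values."""
    # length + 1 little-endian int32 offsets delimit the slices of `data`
    offs = struct.unpack(f"<{length + 1}i", offsets[: 4 * (length + 1)])
    out: List[Optional[str]] = []
    for i in range(length):
        # Bit i of the bitmap (least-significant bit first) is 1 for valid slots
        valid = validity is None or (validity[i // 8] >> (i % 8)) & 1
        out.append(data[offs[i] : offs[i + 1]].decode("utf-8") if valid else None)
    return out


validity = bytes([0b00000111])                 # displayed above as 11100000
offsets = struct.pack("<5i", 0, 3, 6, 11, 11)  # 20 bytes
data = b"onetwothree"                          # 11 bytes
print(decode_string_array(validity, offsets, data, 4))
# ['one', 'two', 'three', None]
```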
Advanced users can create arrays directly from buffers using
``c_array_from_buffers()``:
.. code:: python
na.c_array_from_buffers(
    na.string(),
    2,
    [None, na.c_buffer([0, 3, 6], na.int32()), b"abcdef"]
)
::
- length: 2
- offset: 0
- null_count: 0
- buffers: (0, 5002908320, 4999694624)
- dictionary: NULL
- children[0]:
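The offsets and data passed to ``c_array_from_buffers()`` above can be derived from a list of strings. The sketch below does this in plain Python. ``string_buffers`` is a hypothetical helper, and it handles only non-null values so that, as in the example above, no validity buffer is needed.

```python
from typing import List, Tuple


def string_buffers(values: List[str]) -> Tuple[List[int], bytes]:
    """Return (offsets, data) describing `values` in Arrow string layout."""
    offsets = [0]
    chunks = []
    for value in values:
        encoded = value.encode("utf-8")
        chunks.append(encoded)
        # Each offset is the running byte length of the data buffer
        offsets.append(offsets[-1] + len(encoded))
    return offsets, b"".join(chunks)


offsets, data = string_buffers(["abc", "def"])
print(offsets, data)  # [0, 3, 6] b'abcdef'
```

Under these assumptions, the result could be supplied as ``[None, na.c_buffer(offsets, na.int32()), data]`` with a length of ``len(offsets) - 1``, mirroring the call shown above.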
Array streams
~~~~~~~~~~~~~
You can use ``nanoarrow.c_array_stream()`` to convert an object
representing a sequence of ``CArray``\ s with a common ``CSchema`` to an
``ArrowArrayStream`` and wrap it as a Python object. This works for any
object implementing the Arrow PyCapsule Interface
(e.g., ``pyarrow.RecordBatchReader``, ``pyarrow.ChunkedArray``).
.. code:: python
pa_batch = pa.record_batch({"col1": [1, 2, 3]})
reader = pa.RecordBatchReader.from_batches(pa_batch.schema, [pa_batch])
array_stream = na.c_array_stream(reader)
array_stream
::
- get_schema(): struct<col1: int64>
You can pull the next array from the stream using ``.get_next()`` or use
it like an iterator. The ``.get_next()`` method will raise
``StopIteration`` when there are no more arrays in the stream.
.. code:: python
for array in array_stream:
    print(array)
::
<nanoarrow.c_lib.CArray struct<col1: int64>>
- length: 3
- offset: 0
- null_count: 0
- buffers: (0,)
- dictionary: NULL
- children[1]:
'col1':
- length: 3
- offset: 0
- null_count: 0
- buffers: (0, 2642948588352)
- dictionary: NULL
- children[0]:
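The relationship between ``.get_next()`` and iteration described above can be mimicked in plain Python. The ``ToyStream`` class below is a hypothetical stand-in, not nanoarrow's implementation. It only illustrates a ``get_next()`` that raises ``StopIteration`` when the stream is exhausted, with iteration built on top of it.

```python
from typing import Iterator, List


class ToyStream:
    """Hypothetical stand-in for a stream of array chunks."""

    def __init__(self, chunks: List[list]) -> None:
        self._chunks = list(chunks)
        self._pos = 0

    def get_next(self) -> list:
        # Raise StopIteration once every chunk has been handed out
        if self._pos >= len(self._chunks):
            raise StopIteration
        chunk = self._chunks[self._pos]
        self._pos += 1
        return chunk

    def __iter__(self) -> Iterator[list]:
        return self

    def __next__(self) -> list:
        # Iteration simply delegates to get_next()
        return self.get_next()


toy = ToyStream([[1, 2, 3], [4, 5, 6]])
print(toy.get_next())  # [1, 2, 3]
print(list(toy))       # [[4, 5, 6]]
```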
Use ``ArrayStream()`` for a higher-level interface:
.. code:: python
reader = pa.RecordBatchReader.from_batches(pa_batch.schema, [pa_batch])
na.ArrayStream(reader).read_all()
::
nanoarrow.Array<non-nullable struct<col1: int64>>[3]
{'col1': 1}
{'col1': 2}
{'col1': 3}
Development
-----------
Python bindings for nanoarrow are managed with setuptools. This
means you can build the project using:
.. code:: shell
git clone https://github.com/apache/arrow-nanoarrow.git
cd arrow-nanoarrow/python
pip install -e .
Tests use pytest:
.. code:: shell
# Install dependencies
pip install -e ".[test]"
# Run tests
pytest -vvx
CMake is currently required to ensure that the vendored copy of
nanoarrow in the Python package stays in sync with the nanoarrow sources
in the working tree.