nanoarrow for Python
====================

The nanoarrow Python package provides bindings to the nanoarrow C
library. Like the nanoarrow C library, it provides tools to facilitate
the use of the Arrow C Data and Arrow C Stream interfaces.

Installation
------------

The nanoarrow Python bindings are available from PyPI and conda-forge:

.. code:: shell

   pip install nanoarrow
   conda install nanoarrow -c conda-forge

Development versions (based on the ``main`` branch) are also available:

.. code:: shell

   pip install --extra-index-url https://pypi.fury.io/arrow-nightlies/ \
       --prefer-binary --pre nanoarrow

If you can import the namespace, you’re good to go!

.. code:: python

   import nanoarrow as na

Data types, arrays, and array streams
-------------------------------------

The Arrow C Data and Arrow C Stream interfaces consist of three
structures: the ``ArrowSchema``, which represents the data type of an
array; the ``ArrowArray``, which represents the values of an array; and
the ``ArrowArrayStream``, which represents zero or more
``ArrowArray``\ s with a common ``ArrowSchema``. These concepts map to
``nanoarrow.Schema``, ``nanoarrow.Array``, and ``nanoarrow.ArrayStream``
in the Python package.

.. code:: python

   na.int32()

::

   int32

.. code:: python

   na.Array([1, 2, 3], na.int32())

::

   nanoarrow.Array[3]
   1
   2
   3

The ``nanoarrow.Array`` can accommodate arrays with any number of
chunks, reflecting the reality that many array containers (e.g.,
``pyarrow.ChunkedArray``, ``polars.Series``) support this.

.. code:: python

   chunked = na.Array.from_chunks([[1, 2, 3], [4, 5, 6]], na.int32())
   chunked

::

   nanoarrow.Array[6]
   1
   2
   3
   4
   5
   6

Whereas chunks of an ``Array`` are always fully materialized when the
object is constructed, the chunks of an ``ArrayStream`` have not
necessarily been resolved yet.

.. code:: python

   stream = na.ArrayStream(chunked)
   stream

::

   nanoarrow.ArrayStream

.. code:: python

   with stream:
       for chunk in stream:
           print(chunk)

::

   nanoarrow.Array[3]
   1
   2
   3
   nanoarrow.Array[3]
   4
   5
   6

The ``nanoarrow.ArrayStream`` also provides an interface to nanoarrow’s
Arrow IPC reader:

.. code:: python

   url = "https://github.com/apache/arrow-experiments/raw/main/data/arrow-commits/arrow-commits.arrows"
   na.ArrayStream.from_url(url)

::

   nanoarrow.ArrayStream

These objects implement the Arrow PyCapsule interface for both producing
and consuming and are interchangeable with ``pyarrow`` objects in many
cases:

.. code:: python

   import pyarrow as pa

   pa.field(na.int32())

::

   pyarrow.Field<: int32>

.. code:: python

   pa.chunked_array(chunked)

::

   [
     [
       1,
       2,
       3
     ],
     [
       4,
       5,
       6
     ]
   ]

.. code:: python

   pa.array(chunked.chunk(1))

::

   [
     4,
     5,
     6
   ]

.. code:: python

   na.Array(pa.array([10, 11, 12]))

::

   nanoarrow.Array[3]
   10
   11
   12

.. code:: python

   na.Schema(pa.string())

::

   string

Low-level C library bindings
----------------------------

The nanoarrow Python package also provides lower-level wrappers around
Arrow C interface structures. You can create these using
``nanoarrow.c_schema()``, ``nanoarrow.c_array()``, and
``nanoarrow.c_array_stream()``.

Schemas
~~~~~~~

Use ``nanoarrow.c_schema()`` to convert an object to an ``ArrowSchema``
and wrap it as a Python object. This works for any object implementing
the Arrow PyCapsule Interface (e.g., ``pyarrow.Schema``,
``pyarrow.DataType``, and ``pyarrow.Field``).

.. code:: python

   na.c_schema(pa.decimal128(10, 3))

::

   - format: 'd:10,3'
   - name: ''
   - flags: 2
   - metadata: NULL
   - dictionary: NULL
   - children[0]:

Using ``c_schema()`` is a good fit for testing and for ephemeral schema
objects that are being passed from one library to another. To extract
the fields of a schema in a more convenient form, use ``Schema()``:

.. code:: python

   schema = na.Schema(pa.decimal128(10, 3))
   schema.precision, schema.scale

::

   (10, 3)

The ``CSchema`` object cleans up after itself: when the object is
deleted, the underlying ``ArrowSchema`` is released.

Arrays
~~~~~~

You can use ``nanoarrow.c_array()`` to convert an array-like object to
an ``ArrowArray``, wrap it as a Python object, and attach a schema that
can be used to interpret its contents. This works for any object
implementing the Arrow PyCapsule Interface (e.g., ``pyarrow.Array``,
``pyarrow.RecordBatch``).

.. code:: python

   na.c_array(["one", "two", "three", None], na.string())

::

   - length: 4
   - offset: 0
   - null_count: 1
   - buffers: (4754305168, 4754307808, 4754310464)
   - dictionary: NULL
   - children[0]:

Using ``c_array()`` is a good fit for testing and for ephemeral array
objects that are being passed from one library to another. For a
higher-level interface, use ``Array()``:

.. code:: python

   array = na.Array(["one", "two", "three", None], na.string())
   array.to_pylist()

::

   ['one', 'two', 'three', None]

.. code:: python

   array.buffers

::

   (nanoarrow.c_lib.CBufferView(bool[1 b] 11100000),
    nanoarrow.c_lib.CBufferView(int32[20 b] 0 3 6 11 11),
    nanoarrow.c_lib.CBufferView(string[11 b] b'onetwothree'))

Advanced users can create arrays directly from buffers using
``c_array_from_buffers()``:

.. code:: python

   na.c_array_from_buffers(
       na.string(),
       2,
       [None, na.c_buffer([0, 3, 6], na.int32()), b"abcdef"]
   )

::

   - length: 2
   - offset: 0
   - null_count: 0
   - buffers: (0, 5002908320, 4999694624)
   - dictionary: NULL
   - children[0]:

Array streams
~~~~~~~~~~~~~

You can use ``nanoarrow.c_array_stream()`` to convert an object
representing a sequence of ``CArray``\ s with a common ``CSchema`` to an
``ArrowArrayStream`` and wrap it as a Python object. This works for any
object implementing the Arrow PyCapsule Interface (e.g.,
``pyarrow.RecordBatchReader``, ``pyarrow.ChunkedArray``).

.. code:: python

   pa_batch = pa.record_batch({"col1": [1, 2, 3]})
   reader = pa.RecordBatchReader.from_batches(pa_batch.schema, [pa_batch])
   array_stream = na.c_array_stream(reader)
   array_stream

::

   - get_schema(): struct

You can pull the next array from the stream using ``.get_next()`` or use
it like an iterator. The ``.get_next()`` method will raise
``StopIteration`` when there are no more arrays in the stream.

.. code:: python

   for array in array_stream:
       print(array)

::

   - length: 3
   - offset: 0
   - null_count: 0
   - buffers: (0,)
   - dictionary: NULL
   - children[1]:
     'col1':
       - length: 3
       - offset: 0
       - null_count: 0
       - buffers: (0, 2642948588352)
       - dictionary: NULL
       - children[0]:

Use ``ArrayStream()`` for a higher-level interface:

.. code:: python

   reader = pa.RecordBatchReader.from_batches(pa_batch.schema, [pa_batch])
   na.ArrayStream(reader).read_all()

::

   nanoarrow.Array[3]
   {'col1': 1}
   {'col1': 2}
   {'col1': 3}

Development
-----------

Python bindings for nanoarrow are managed with setuptools. This means
you can build the project using:

.. code:: shell

   git clone https://github.com/apache/arrow-nanoarrow.git
   cd arrow-nanoarrow/python
   pip install -e .

Tests use pytest:

.. code:: shell

   # Install dependencies
   pip install -e ".[test]"

   # Run tests
   pytest -vvx

CMake is currently required to ensure that the vendored copy of
nanoarrow in the Python package stays in sync with the nanoarrow sources
in the working tree.
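
After an editable install, a quick smoke check can help confirm that the
package imports and that the compiled bindings respond. The sketch below
is only an illustration: ``smoke_check`` is a hypothetical helper, not
part of nanoarrow, and it relies solely on ``na.Array()``,
``na.int32()``, and ``to_pylist()`` as used earlier in this document. It
returns ``False`` when the package cannot be found instead of raising.

.. code:: python

   import importlib.util


   def smoke_check():
       """Return True if nanoarrow imports and round-trips a small array."""
       # find_spec() returns None when the package is not installed
       if importlib.util.find_spec("nanoarrow") is None:
           return False

       import nanoarrow as na

       values = [1, 2, 3]
       # Build an int32 array and convert it back to Python objects
       return na.Array(values, na.int32()).to_pylist() == values


   print("nanoarrow OK" if smoke_check() else "nanoarrow check failed")

A check like this does not replace the pytest suite; it only gives fast
feedback that the build step produced an importable package.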