nanoarrow for Python
====================

The nanoarrow Python package provides bindings to the nanoarrow C
library. Like the nanoarrow C library, it provides tools to facilitate
the use of the Arrow C Data and Arrow C Stream interfaces.

Installation
------------

Python bindings for nanoarrow are not yet available on PyPI. You can
install via URL (requires a C compiler):

.. code:: bash

    python -m pip install "git+https://github.com/apache/arrow-nanoarrow.git#egg=nanoarrow&subdirectory=python"

If you can import the namespace, you’re good to go!

.. code:: python

    import nanoarrow as na

Low-level C library bindings
----------------------------

The Arrow C Data and Arrow C Stream interfaces comprise three
structures: the ``ArrowSchema``, which represents the data type of an
array; the ``ArrowArray``, which represents the values of an array; and
the ``ArrowArrayStream``, which represents zero or more ``ArrowArray``\ s
with a common ``ArrowSchema``.

Schemas
~~~~~~~

Use ``nanoarrow.c_schema()`` to convert an object to an ``ArrowSchema``
and wrap it as a Python object. This works for any object implementing
the Arrow PyCapsule Interface (e.g., ``pyarrow.Schema``,
``pyarrow.DataType``, and ``pyarrow.Field``).

.. code:: python

    import pyarrow as pa

    schema = na.c_schema(pa.decimal128(10, 3))
    schema

::

    - format: 'd:10,3'
    - name: ''
    - flags: 2
    - metadata: NULL
    - dictionary: NULL
    - children[0]:

You can extract the fields of a ``CSchema`` object one at a time or
parse it into a view to extract deserialized parameters.

.. code:: python

    na.c_schema_view(schema)

::

    - type: 'decimal128'
    - storage_type: 'decimal128'
    - decimal_bitwidth: 128
    - decimal_precision: 10
    - decimal_scale: 3

Advanced users can allocate an empty ``CSchema`` and populate its
contents by passing its ``._addr()`` to a schema-exporting function.

.. code:: python

    schema = na.allocate_c_schema()
    pa.int32()._export_to_c(schema._addr())
    schema

::

    - format: 'i'
    - name: ''
    - flags: 2
    - metadata: NULL
    - dictionary: NULL
    - children[0]:

The ``CSchema`` object cleans up after itself: when the object is
deleted, the underlying ``ArrowSchema`` is released.

Arrays
~~~~~~

Use ``nanoarrow.c_array()`` to convert an array-like object to an
``ArrowArray``, wrap it as a Python object, and attach a schema that can
be used to interpret its contents. This works for any object
implementing the Arrow PyCapsule Interface (e.g., ``pyarrow.Array``,
``pyarrow.RecordBatch``).

.. code:: python

    array = na.c_array(pa.array(["one", "two", "three", None]))
    array

::

    - length: 4
    - offset: 0
    - null_count: 1
    - buffers: (2939032895680, 2939032895616, 2939032895744)
    - dictionary: NULL
    - children[0]:

You can extract the fields of a ``CArray`` one at a time or parse it
into a view to extract deserialized content:

.. code:: python

    na.c_array_view(array)

::

    - storage_type: 'string'
    - length: 4
    - offset: 0
    - null_count: 1
    - buffers[3]:
      -
      -
      -
    - dictionary: NULL
    - children[0]:

As with the ``CSchema``, you can allocate an empty ``CArray`` and pass
its ``._addr()`` (along with the address of its schema) to an
array-exporting function.

.. code:: python

    array = na.allocate_c_array()
    pa.array([1, 2, 3])._export_to_c(array._addr(), array.schema._addr())
    array.length

::

    3

Array streams
~~~~~~~~~~~~~

Use ``nanoarrow.c_array_stream()`` to convert an object representing a
sequence of ``CArray``\ s with a common ``CSchema`` to an
``ArrowArrayStream`` and wrap it as a Python object. This works for any
object implementing the Arrow PyCapsule Interface (e.g.,
``pyarrow.RecordBatchReader``).

.. code:: python

    pa_array_child = pa.array([1, 2, 3], pa.int32())
    pa_array = pa.record_batch([pa_array_child], names=["some_column"])
    reader = pa.RecordBatchReader.from_batches(pa_array.schema, [pa_array])
    array_stream = na.c_array_stream(reader)
    array_stream

::

    - get_schema():
      - format: '+s'
      - name: ''
      - flags: 0
      - metadata: NULL
      - dictionary: NULL
      - children[1]:
        'some_column':
          - format: 'i'
          - name: 'some_column'
          - flags: 2
          - metadata: NULL
          - dictionary: NULL
          - children[0]:

You can pull the next array from the stream using ``.get_next()`` or use
it like an iterator. The ``.get_next()`` method will raise
``StopIteration`` when there are no more arrays in the stream.

.. code:: python

    for array in array_stream:
        print(array)

::

    - length: 3
    - offset: 0
    - null_count: 0
    - buffers: (0,)
    - dictionary: NULL
    - children[1]:
      'some_column':
        - length: 3
        - offset: 0
        - null_count: 0
        - buffers: (0, 2939033026688)
        - dictionary: NULL
        - children[0]:

You can also get the address of a freshly allocated stream to pass to a
suitable exporting function:

.. code:: python

    array_stream = na.allocate_c_array_stream()
    reader._export_to_c(array_stream._addr())
    array_stream

::

    - get_schema():
      - format: '+s'
      - name: ''
      - flags: 0
      - metadata: NULL
      - dictionary: NULL
      - children[1]:
        'some_column':
          - format: 'i'
          - name: 'some_column'
          - flags: 2
          - metadata: NULL
          - dictionary: NULL
          - children[0]:

Development
-----------

Python bindings for nanoarrow are managed with setuptools. This means
you can build the project using:

.. code:: shell

    git clone https://github.com/apache/arrow-nanoarrow.git
    cd arrow-nanoarrow/python
    pip install -e .

Tests use pytest:

.. code:: shell

    # Install dependencies (quoted so the brackets survive shells like zsh)
    pip install -e ".[test]"

    # Run tests
    pytest -vvx
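
As a rough illustration of what a pytest-style check against these
bindings might look like, the sketch below re-asserts two results shown
earlier in this document: the ``'i'`` format string for ``pa.int32()``
and the length of a three-element array. The test function names are
hypothetical and are not taken from the nanoarrow test suite, and the
sketch assumes both ``nanoarrow`` and ``pyarrow`` are installed.

.. code:: python

    import pyarrow as pa

    import nanoarrow as na


    def test_c_schema_format():
        # Hypothetical test: an int32 type should export with format 'i',
        # matching the schema output shown in the Schemas section.
        schema = na.c_schema(pa.int32())
        assert schema.format == "i"


    def test_c_array_length():
        # Hypothetical test: wrapping a three-element pyarrow array should
        # report a length of 3, as in the Arrays section.
        array = na.c_array(pa.array([1, 2, 3]))
        assert array.length == 3

Saved in a file whose name starts with ``test_``, these functions would
be collected by the ``pytest -vvx`` invocation above.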