nanoarrow
=========
The goal of nanoarrow is to provide minimal useful bindings to the
Arrow C Data and Arrow C Stream interfaces using the nanoarrow C
library.
Installation
------------
You can install the released version of nanoarrow from CRAN with:

.. code:: r

   install.packages("nanoarrow")

You can install the development version of nanoarrow from GitHub with:

.. code:: r

   # install.packages("remotes")
   remotes::install_github("apache/arrow-nanoarrow/r")

If you can load the package, you’re good to go!

.. code:: r

   library(nanoarrow)

Example
-------
The Arrow C Data and Arrow C Stream interfaces consist of three
structures: the ``ArrowSchema``, which represents the data type of an
array; the ``ArrowArray``, which represents the values of an array; and
the ``ArrowArrayStream``, which represents zero or more ``ArrowArray``\ s
with a common ``ArrowSchema``. All three can be wrapped by R objects
using the nanoarrow R package.
Schemas
~~~~~~~
Use ``infer_nanoarrow_schema()`` to get the ArrowSchema object that
corresponds to a given R vector type; use ``as_nanoarrow_schema()`` to
convert an object from some other data type representation (e.g., an
arrow R package ``DataType`` like ``arrow::int32()``); or use one of the
``na_XXX()`` functions (e.g., ``na_int64()``) to construct a schema
directly.

.. code:: r

   infer_nanoarrow_schema(1:5)
   #> <nanoarrow_schema int32>
   #>  $ format    : chr "i"
   #>  $ name      : chr ""
   #>  $ metadata  : list()
   #>  $ flags     : int 2
   #>  $ children  : list()
   #>  $ dictionary: NULL

   as_nanoarrow_schema(arrow::schema(col1 = arrow::float64()))
   #> <nanoarrow_schema struct>
   #>  $ format    : chr "+s"
   #>  $ name      : chr ""
   #>  $ metadata  : list()
   #>  $ flags     : int 0
   #>  $ children  :List of 1
   #>   ..$ col1:<nanoarrow_schema double>
   #>   .. ..$ format    : chr "g"
   #>   .. ..$ name      : chr "col1"
   #>   .. ..$ metadata  : list()
   #>   .. ..$ flags     : int 2
   #>   .. ..$ children  : list()
   #>   .. ..$ dictionary: NULL
   #>  $ dictionary: NULL

   na_int64()
   #> <nanoarrow_schema int64>
   #>  $ format    : chr "l"
   #>  $ name      : chr ""
   #>  $ metadata  : list()
   #>  $ flags     : int 2
   #>  $ children  : list()
   #>  $ dictionary: NULL

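As a small illustration of the ``na_XXX()`` constructors, the sketch
below builds a struct schema by hand and reads back a few of its fields.
It assumes that ``na_struct()``, ``na_int32()``, and ``na_double()`` are
available in the installed version of nanoarrow; the column names ``id``
and ``value`` are made up for the example.

```r
library(nanoarrow)

# Build a struct schema from a named list of child schemas.
# The names "id" and "value" are illustrative only.
schema <- na_struct(list(id = na_int32(), value = na_double()))

# Schema fields can be read with `$`, mirroring the printed structure.
schema$format
names(schema$children)
schema$children$value$format
```

Reading fields with ``$`` is convenient for quick checks like this one
without printing the whole schema.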
Arrays
~~~~~~
Use ``as_nanoarrow_array()`` to convert an object to an ArrowArray
object:

.. code:: r

   as_nanoarrow_array(1:5)
   #> <nanoarrow_array int32[5]>
   #>  $ length    : int 5
   #>  $ null_count: int 0
   #>  $ offset    : int 0
   #>  $ buffers   :List of 2
   #>   ..$ :<nanoarrow_buffer validity<bool>[0][0 b]> ``
   #>   ..$ :<nanoarrow_buffer data<int32>[5][20 b]> `1 2 3 4 5`
   #>  $ dictionary: NULL
   #>  $ children  : list()

   as_nanoarrow_array(data.frame(col1 = c(1.1, 2.2)))
   #> <nanoarrow_array struct[2]>
   #>  $ length    : int 2
   #>  $ null_count: int 0
   #>  $ offset    : int 0
   #>  $ buffers   :List of 1
   #>   ..$ :<nanoarrow_buffer validity<bool>[0][0 b]> ``
   #>  $ children  :List of 1
   #>   ..$ col1:<nanoarrow_array double[2]>
   #>   .. ..$ length    : int 2
   #>   .. ..$ null_count: int 0
   #>   .. ..$ offset    : int 0
   #>   .. ..$ buffers   :List of 2
   #>   .. .. ..$ :<nanoarrow_buffer validity<bool>[0][0 b]> ``
   #>   .. .. ..$ :<nanoarrow_buffer data<double>[2][16 b]> `1.1 2.2`
   #>   .. ..$ dictionary: NULL
   #>   .. ..$ children  : list()
   #>  $ dictionary: NULL

You can use ``as.vector()`` or ``as.data.frame()`` to get the R
representation of the object back:

.. code:: r

   array <- as_nanoarrow_array(data.frame(col1 = c(1.1, 2.2)))
   as.data.frame(array)
   #>   col1
   #> 1  1.1
   #> 2  2.2

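The same round trip works for plain vectors via ``as.vector()``. The
following is a minimal sketch, assuming the default conversion maps an
Arrow null back to ``NA``; the input values are arbitrary.

```r
library(nanoarrow)

# Convert an integer vector (including a missing value) to an ArrowArray...
ints <- as_nanoarrow_array(c(1L, NA, 3L))

# ...and convert it back to an R vector.
roundtrip <- as.vector(ints)

# Compare the result with the original input.
identical(roundtrip, c(1L, NA, 3L))
```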
Even though at the C level the ArrowArray is distinct from the
ArrowSchema, at the R level we attach a schema wherever possible. You
can access the attached schema using ``infer_nanoarrow_schema()``:

.. code:: r

   infer_nanoarrow_schema(array)
   #> <nanoarrow_schema struct>
   #>  $ format    : chr "+s"
   #>  $ name      : chr ""
   #>  $ metadata  : list()
   #>  $ flags     : int 0
   #>  $ children  :List of 1
   #>   ..$ col1:<nanoarrow_schema double>
   #>   .. ..$ format    : chr "g"
   #>   .. ..$ name      : chr "col1"
   #>   .. ..$ metadata  : list()
   #>   .. ..$ flags     : int 2
   #>   .. ..$ children  : list()
   #>   .. ..$ dictionary: NULL
   #>  $ dictionary: NULL

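When only a few fields of the attached schema matter, they can be read
individually instead of printing the whole object. This sketch assumes
that ``$`` access on a schema exposes the same fields shown in the
printed output above.

```r
library(nanoarrow)

array <- as_nanoarrow_array(data.frame(col1 = c(1.1, 2.2)))
schema <- infer_nanoarrow_schema(array)

# The attached schema describes a struct with a single double child.
schema$format
names(schema$children)
schema$children$col1$format
```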
Array Streams
~~~~~~~~~~~~~
The easiest way to create an ArrowArrayStream is from a list of arrays
or objects that can be converted to an array using
``as_nanoarrow_array()``:

.. code:: r

   stream <- basic_array_stream(
     list(
       data.frame(col1 = c(1.1, 2.2)),
       data.frame(col1 = c(3.3, 4.4))
     )
   )

You can pull batches from the stream using the ``$get_next()`` method.
Once the stream is exhausted, ``$get_next()`` returns ``NULL``.

.. code:: r

   stream$get_next()
   #> <nanoarrow_array struct[2]>
   #>  $ length    : int 2
   #>  $ null_count: int 0
   #>  $ offset    : int 0
   #>  $ buffers   :List of 1
   #>   ..$ :<nanoarrow_buffer validity<bool>[0][0 b]> ``
   #>  $ children  :List of 1
   #>   ..$ col1:<nanoarrow_array double[2]>
   #>   .. ..$ length    : int 2
   #>   .. ..$ null_count: int 0
   #>   .. ..$ offset    : int 0
   #>   .. ..$ buffers   :List of 2
   #>   .. .. ..$ :<nanoarrow_buffer validity<bool>[0][0 b]> ``
   #>   .. .. ..$ :<nanoarrow_buffer data<double>[2][16 b]> `1.1 2.2`
   #>   .. ..$ dictionary: NULL
   #>   .. ..$ children  : list()
   #>  $ dictionary: NULL

   stream$get_next()
   #> <nanoarrow_array struct[2]>
   #>  $ length    : int 2
   #>  $ null_count: int 0
   #>  $ offset    : int 0
   #>  $ buffers   :List of 1
   #>   ..$ :<nanoarrow_buffer validity<bool>[0][0 b]> ``
   #>  $ children  :List of 1
   #>   ..$ col1:<nanoarrow_array double[2]>
   #>   .. ..$ length    : int 2
   #>   .. ..$ null_count: int 0
   #>   .. ..$ offset    : int 0
   #>   .. ..$ buffers   :List of 2
   #>   .. .. ..$ :<nanoarrow_buffer validity<bool>[0][0 b]> ``
   #>   .. .. ..$ :<nanoarrow_buffer data<double>[2][16 b]> `3.3 4.4`
   #>   .. ..$ dictionary: NULL
   #>   .. ..$ children  : list()
   #>  $ dictionary: NULL

   stream$get_next()
   #> NULL

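Because ``$get_next()`` returns ``NULL`` once the stream is exhausted, a
stream can be drained with an ordinary ``while`` loop. The sketch below
counts batches and rows; the counter names ``n_batches`` and ``n_rows``
are illustrative.

```r
library(nanoarrow)

stream <- basic_array_stream(
  list(
    data.frame(col1 = c(1.1, 2.2)),
    data.frame(col1 = c(3.3, 4.4))
  )
)

n_batches <- 0
n_rows <- 0

# Keep pulling batches until get_next() signals the end with NULL.
while (!is.null(batch <- stream$get_next())) {
  n_batches <- n_batches + 1
  n_rows <- n_rows + batch$length
}
```

Processing one batch at a time like this avoids holding the whole
result in memory at once.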
You can collect all the batches into a single R object by calling
``as.data.frame()`` or ``as.vector()``:

.. code:: r

   stream <- basic_array_stream(
     list(
       data.frame(col1 = c(1.1, 2.2)),
       data.frame(col1 = c(3.3, 4.4))
     )
   )

   as.data.frame(stream)
   #>   col1
   #> 1  1.1
   #> 2  2.2
   #> 3  3.3
   #> 4  4.4

After consuming a stream, you should call its ``$release()`` method as
soon as you can. This lets the implementation of the stream release any
resources (like open files) it may be holding in a more predictable way
than waiting for the garbage collector to clean up the object.
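One way to make sure a stream is always released is to wrap the read in
``tryCatch(..., finally = ...)``. This is a sketch rather than a
prescribed pattern; it assumes that ``stream$release()`` and
``nanoarrow_pointer_is_valid()`` behave as described in the package
documentation.

```r
library(nanoarrow)

stream <- basic_array_stream(list(data.frame(col1 = c(1.1, 2.2))))

# Release the stream even if reading the batch raises an error.
batch <- tryCatch(
  stream$get_next(),
  finally = stream$release()
)

# The batch that was already pulled remains usable after the release.
batch_df <- as.data.frame(batch)

# Check whether the stream pointer is still valid after releasing it.
stream_is_valid <- nanoarrow_pointer_is_valid(stream)
```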
Integration with the arrow package
----------------------------------
The nanoarrow package implements ``as_nanoarrow_schema()``,
``as_nanoarrow_array()``, and ``as_nanoarrow_array_stream()`` for most
arrow package types. Similarly, it implements
``arrow::as_arrow_array()``, ``arrow::as_record_batch()``,
``arrow::as_arrow_table()``, ``arrow::as_record_batch_reader()``,
``arrow::infer_type()``, ``arrow::as_data_type()``, and
``arrow::as_schema()`` for nanoarrow objects such that you can pass
equivalent nanoarrow objects into many arrow functions and vice versa.
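A short round trip between the two packages might look like the sketch
below. It only runs when the arrow package is installed, and it assumes
the int32 type is preserved in both directions.

```r
library(nanoarrow)

if (requireNamespace("arrow", quietly = TRUE)) {
  # nanoarrow -> arrow: pass a nanoarrow array to an arrow generic.
  arrow_array <- arrow::as_arrow_array(as_nanoarrow_array(1:5))

  # arrow -> nanoarrow -> R: convert back and materialize an R vector.
  roundtrip <- as.vector(as_nanoarrow_array(arrow_array))
}
```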