nanoarrow
=========

The goal of nanoarrow is to provide minimal useful bindings to the Arrow C Data and Arrow C Stream interfaces using the nanoarrow C library.

Installation
------------

You can install the released version of nanoarrow from CRAN with:

.. code:: r

   install.packages("nanoarrow")

You can install the development version of nanoarrow from GitHub with:

.. code:: r

   # install.packages("remotes")
   remotes::install_github("apache/arrow-nanoarrow/r")

If you can load the package, you’re good to go!

.. code:: r

   library(nanoarrow)

Example
-------

The Arrow C Data and Arrow C Stream interfaces comprise three structures: the ``ArrowSchema``, which represents the data type of an array; the ``ArrowArray``, which represents the values of an array; and the ``ArrowArrayStream``, which represents zero or more ``ArrowArray``\ s with a common ``ArrowSchema``. All three can be wrapped by R objects using the nanoarrow R package.

Schemas
~~~~~~~

Use ``infer_nanoarrow_schema()`` to get the ArrowSchema object that corresponds to a given R vector type; use ``as_nanoarrow_schema()`` to convert an object from some other data type representation (e.g., an arrow R package ``DataType`` like ``arrow::int32()``); or use the ``na_XXX()`` functions to construct one directly.

.. code:: r

   infer_nanoarrow_schema(1:5)
   #>
   #>  $ format    : chr "i"
   #>  $ name      : chr ""
   #>  $ metadata  : list()
   #>  $ flags     : int 2
   #>  $ children  : list()
   #>  $ dictionary: NULL
   as_nanoarrow_schema(arrow::schema(col1 = arrow::float64()))
   #>
   #>  $ format    : chr "+s"
   #>  $ name      : chr ""
   #>  $ metadata  : list()
   #>  $ flags     : int 0
   #>  $ children  :List of 1
   #>   ..$ col1:
   #>   .. ..$ format    : chr "g"
   #>   .. ..$ name      : chr "col1"
   #>   .. ..$ metadata  : list()
   #>   .. ..$ flags     : int 2
   #>   .. ..$ children  : list()
   #>   .. ..$ dictionary: NULL
   #>  $ dictionary: NULL
   na_int64()
   #>
   #>  $ format    : chr "l"
   #>  $ name      : chr ""
   #>  $ metadata  : list()
   #>  $ flags     : int 2
   #>  $ children  : list()
   #>  $ dictionary: NULL

Arrays
~~~~~~

Use ``as_nanoarrow_array()`` to convert an object to an ArrowArray object:

.. code:: r

   as_nanoarrow_array(1:5)
   #>
   #>  $ length    : int 5
   #>  $ null_count: int 0
   #>  $ offset    : int 0
   #>  $ buffers   :List of 2
   #>   ..$ :[0][0 b]> ``
   #>   ..$ :[5][20 b]> `1 2 3 4 5`
   #>  $ dictionary: NULL
   #>  $ children  : list()
   as_nanoarrow_array(data.frame(col1 = c(1.1, 2.2)))
   #>
   #>  $ length    : int 2
   #>  $ null_count: int 0
   #>  $ offset    : int 0
   #>  $ buffers   :List of 1
   #>   ..$ :[0][0 b]> ``
   #>  $ children  :List of 1
   #>   ..$ col1:
   #>   .. ..$ length    : int 2
   #>   .. ..$ null_count: int 0
   #>   .. ..$ offset    : int 0
   #>   .. ..$ buffers   :List of 2
   #>   .. .. ..$ :[0][0 b]> ``
   #>   .. .. ..$ :[2][16 b]> `1.1 2.2`
   #>   .. ..$ dictionary: NULL
   #>   .. ..$ children  : list()
   #>  $ dictionary: NULL

You can use ``as.vector()`` or ``as.data.frame()`` to get the R representation of the object back:

.. code:: r

   array <- as_nanoarrow_array(data.frame(col1 = c(1.1, 2.2)))
   as.data.frame(array)
   #>   col1
   #> 1  1.1
   #> 2  2.2

Even though the ArrowArray is distinct from the ArrowSchema at the C level, at the R level we attach a schema wherever possible. You can access the attached schema using ``infer_nanoarrow_schema()``:

.. code:: r

   infer_nanoarrow_schema(array)
   #>
   #>  $ format    : chr "+s"
   #>  $ name      : chr ""
   #>  $ metadata  : list()
   #>  $ flags     : int 0
   #>  $ children  :List of 1
   #>   ..$ col1:
   #>   .. ..$ format    : chr "g"
   #>   .. ..$ name      : chr "col1"
   #>   .. ..$ metadata  : list()
   #>   .. ..$ flags     : int 2
   #>   .. ..$ children  : list()
   #>   .. ..$ dictionary: NULL
   #>  $ dictionary: NULL

Array Streams
~~~~~~~~~~~~~

The easiest way to create an ArrowArrayStream is from a list of arrays or objects that can be converted to an array using ``as_nanoarrow_array()``:

.. code:: r

   stream <- basic_array_stream(
     list(
       data.frame(col1 = c(1.1, 2.2)),
       data.frame(col1 = c(3.3, 4.4))
     )
   )

You can pull batches from the stream using the ``$get_next()`` method. Once every batch has been pulled, the next call returns ``NULL``.

.. code:: r

   stream$get_next()
   #>
   #>  $ length    : int 2
   #>  $ null_count: int 0
   #>  $ offset    : int 0
   #>  $ buffers   :List of 1
   #>   ..$ :[0][0 b]> ``
   #>  $ children  :List of 1
   #>   ..$ col1:
   #>   .. ..$ length    : int 2
   #>   .. ..$ null_count: int 0
   #>   .. ..$ offset    : int 0
   #>   .. ..$ buffers   :List of 2
   #>   .. .. ..$ :[0][0 b]> ``
   #>   .. .. ..$ :[2][16 b]> `1.1 2.2`
   #>   .. ..$ dictionary: NULL
   #>   .. ..$ children  : list()
   #>  $ dictionary: NULL
   stream$get_next()
   #>
   #>  $ length    : int 2
   #>  $ null_count: int 0
   #>  $ offset    : int 0
   #>  $ buffers   :List of 1
   #>   ..$ :[0][0 b]> ``
   #>  $ children  :List of 1
   #>   ..$ col1:
   #>   .. ..$ length    : int 2
   #>   .. ..$ null_count: int 0
   #>   .. ..$ offset    : int 0
   #>   .. ..$ buffers   :List of 2
   #>   .. .. ..$ :[0][0 b]> ``
   #>   .. .. ..$ :[2][16 b]> `3.3 4.4`
   #>   .. ..$ dictionary: NULL
   #>   .. ..$ children  : list()
   #>  $ dictionary: NULL
   stream$get_next()
   #> NULL

You can pull all the batches into a single R object by calling ``as.data.frame()`` or ``as.vector()``:

.. code:: r

   stream <- basic_array_stream(
     list(
       data.frame(col1 = c(1.1, 2.2)),
       data.frame(col1 = c(3.3, 4.4))
     )
   )

   as.data.frame(stream)
   #>   col1
   #> 1  1.1
   #> 2  2.2
   #> 3  3.3
   #> 4  4.4

After consuming a stream, you should call its release method as soon as you can. This lets the implementation of the stream release any resources (like open files) it may be holding in a more predictable way than waiting for the garbage collector to clean up the object.

Integration with the arrow package
----------------------------------

The nanoarrow package implements ``as_nanoarrow_schema()``, ``as_nanoarrow_array()``, and ``as_nanoarrow_array_stream()`` for most arrow package types.
Similarly, it implements ``arrow::as_arrow_array()``, ``arrow::as_record_batch()``, ``arrow::as_arrow_table()``, ``arrow::as_record_batch_reader()``, ``arrow::infer_type()``, ``arrow::as_data_type()``, and ``arrow::as_schema()`` for nanoarrow objects such that you can pass equivalent nanoarrow objects into many arrow functions and vice versa.
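
As a minimal sketch of the round trip described above, the following converts an R vector to a nanoarrow array, hands it to the arrow package, and converts it back. It assumes the arrow package is installed; the variable names (``na_array``, ``arrow_array``, ``round_trip``) are illustrative only.

.. code:: r

   library(nanoarrow)

   # Start from a plain R vector wrapped as a nanoarrow array
   na_array <- as_nanoarrow_array(c(1.1, 2.2, 3.3))

   # nanoarrow -> arrow: uses the arrow::as_arrow_array() method for nanoarrow arrays
   # (assumes the arrow package is installed)
   arrow_array <- arrow::as_arrow_array(na_array)

   # arrow -> nanoarrow: uses the as_nanoarrow_array() method for arrow Arrays
   round_trip <- as_nanoarrow_array(arrow_array)

   # Back to an R vector
   as.vector(round_trip)

Because both directions go through the Arrow C Data interface, either package can act as the producer or the consumer of the same array.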