Arrow Flight is a general-purpose client-server framework for high-performance transport of large datasets over network interfaces, built as part of the Apache Arrow project. It makes data transfer efficient in several ways:

  • Flight removes the need for deserialization during data transfer.
  • Flight allows for parallel data streaming.
  • Flight employs optimizations designed to take advantage of Arrow’s columnar format.

The arrow package provides methods for connecting to Flight servers to send and receive data.

Prerequisites

At present, the arrow package in R does not supply an independent implementation of Arrow Flight. Instead, it calls the Flight methods supplied by PyArrow, so it requires both the reticulate package and the PyArrow Python library to be installed. If you are using them for the first time, you can install them like this:

install.packages("reticulate")
arrow::install_pyarrow()
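
Before connecting to a server, it can help to confirm that reticulate can actually see PyArrow. The following is a minimal sketch, assuming reticulate is installed and can locate a Python environment; has_pyarrow is only an illustrative variable name.

```r
library(reticulate)

# TRUE if the Python environment used by reticulate can import pyarrow
has_pyarrow <- py_module_available("pyarrow")

if (!has_pyarrow) {
  message("pyarrow was not found; try arrow::install_pyarrow()")
}
```

If this reports that pyarrow is missing after installation, restarting the R session so that reticulate picks up the new environment is usually worth trying first.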

See the Python integrations article for more details on setting up PyArrow.

Example

The package includes methods for starting a Python-based Flight server, as well as methods for connecting to a Flight server running elsewhere. To illustrate both sides, we'll first start a demo server in one R process:

library(arrow)
demo_server <- load_flight_server("demo_flight_server")
server <- demo_server$DemoFlightServer(port = 8089)
server$serve()

We'll leave that process running, since the call to serve() keeps the server active.

In a different R process, let's connect to the server and put some data in it.

library(arrow)
client <- flight_connect(port = 8089)
flight_put(client, iris, path = "test_data/iris")

Now, in yet another R process, we can connect to the server and pull the data we put there:

library(arrow)
library(dplyr)
client <- flight_connect(port = 8089)
client %>%
  flight_get("test_data/iris") %>%
  group_by(Species) %>%
  summarize(max_petal = max(Petal.Length))

## # A tibble: 3 x 2
##   Species    max_petal
##   <fct>          <dbl>
## 1 setosa           1.9
## 2 versicolor       5.1
## 3 virginica        6.9

Because flight_get() returns an Arrow data structure, you can directly pipe its result into a dplyr workflow. See the article on data wrangling for more information on working with Arrow objects via a dplyr interface.
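
To try the same pipeline without a running server, a local Arrow Table can stand in for the result of flight_get(). This is a sketch under that assumption: arrow_table(iris) is a substitute built in the current session, not data fetched over Flight, and the explicit collect() is added here to pull the summarized result back into R as a tibble.

```r
library(arrow)
library(dplyr)

# Stand-in for flight_get(client, "test_data/iris"): a local Arrow Table
iris_table <- arrow_table(iris)

result <- iris_table %>%
  group_by(Species) %>%
  summarize(max_petal = max(Petal.Length)) %>%
  collect()  # evaluate the query and return a tibble

print(result)
```

The rows may not come back in the same order as in the output above, so sort the collected tibble if the order matters.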

Further reading