Arrow Flight RPC

Arrow Flight is an RPC framework for high-performance data services based on Arrow data. It is built on top of gRPC and the Arrow IPC format.

Flight is organized around streams of Arrow record batches, being either downloaded from or uploaded to another service. A set of metadata methods offers discovery and introspection of streams, as well as the ability to implement application-specific methods.

Methods and message wire formats are defined by Protobuf, enabling interoperability with clients that support gRPC and Arrow separately, but not Flight itself. However, Flight implementations include further optimizations to reduce Protobuf overhead, mostly by avoiding excessive memory copies.

RPC Methods

Flight defines a set of RPC methods for uploading and downloading data, retrieving metadata about a data stream, listing available data streams, and implementing application-specific RPC methods. A Flight service implements some subset of these methods, while a Flight client can call any of them. Thus, one Flight client can connect to any Flight service and perform basic operations.

Data streams are identified by descriptors, which are either a path or an arbitrary binary command. A client that wishes to download the data would:

  1. Construct or acquire a FlightDescriptor for the data set they are interested in. A client may know what descriptor they want already, or they may use methods like ListFlights to discover them.

  2. Call GetFlightInfo(FlightDescriptor) to get a FlightInfo message containing details on where the data is located (as well as other metadata, like the schema and possibly an estimate of the dataset size).

    Flight does not require that data live on the same server as metadata: this call may list other servers to connect to. The FlightInfo message includes a Ticket, an opaque binary token that the server uses to identify the exact data set being requested.

  3. Connect to other servers (if needed).

  4. Call DoGet(Ticket) to get back a stream of Arrow record batches.
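
The download steps above can be sketched with a toy in-memory model. This is only an illustration under stated assumptions, not the API of any Flight library: ToyFlightServer, download, and the example URIs are hypothetical names, the dataclasses merely mirror the Protobuf message names, and a "record batch" is stood in for by a plain Python list.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Tuple

# Hypothetical stand-ins for the Protobuf messages. Field names loosely follow
# the definitions in this document; this is NOT a real Flight library API.


@dataclass(frozen=True)
class FlightDescriptor:
    path: Tuple[str, ...] = ()  # set when the descriptor is a path
    cmd: bytes = b""            # set when the descriptor is a command


@dataclass(frozen=True)
class Ticket:
    ticket: bytes               # opaque to the client


@dataclass
class FlightEndpoint:
    ticket: Ticket
    # Empty list: redeem the ticket on the service that issued it.
    locations: List[str] = field(default_factory=list)


@dataclass
class FlightInfo:
    descriptor: FlightDescriptor
    endpoints: List[FlightEndpoint]
    total_records: int = -1     # -1 means unknown
    total_bytes: int = -1


class ToyFlightServer:
    """In-memory server; a 'record batch' here is just a list of values."""

    def __init__(self, uri: str) -> None:
        self.uri = uri
        self.streams: Dict[bytes, List[list]] = {}           # ticket -> batches
        self.infos: Dict[FlightDescriptor, FlightInfo] = {}  # metadata catalog

    def get_flight_info(self, descriptor: FlightDescriptor) -> FlightInfo:
        return self.infos[descriptor]

    def do_get(self, ticket: Ticket) -> Iterator[list]:
        yield from self.streams[ticket.ticket]


def download(descriptor: FlightDescriptor,
             metadata_server: ToyFlightServer,
             servers_by_uri: Dict[str, ToyFlightServer]) -> List[list]:
    """Steps 2-4: fetch the FlightInfo, then consume every endpoint."""
    info = metadata_server.get_flight_info(descriptor)
    batches: List[list] = []
    for endpoint in info.endpoints:
        # Fall back to the issuing server when no location is listed.
        uri = endpoint.locations[0] if endpoint.locations else metadata_server.uri
        batches.extend(servers_by_uri[uri].do_get(endpoint.ticket))
    return batches


# Metadata lives on one server, the data on another.
meta = ToyFlightServer("grpc://meta.example:1234")
data = ToyFlightServer("grpc://data.example:1234")
data.streams[b"t1"] = [[1, 2], [3]]
desc = FlightDescriptor(path=("sales", "2020"))
meta.infos[desc] = FlightInfo(desc, [FlightEndpoint(Ticket(b"t1"), [data.uri])])

result = download(desc, meta, {meta.uri: meta, data.uri: data})
```

Note that the client never inspects the ticket bytes; it only passes them back to whichever server the endpoint names.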

To upload data, a client would:

  1. Construct or acquire a FlightDescriptor, as before.

  2. Call DoPut(FlightData) and upload a stream of Arrow record batches, including the FlightDescriptor with the first message.
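
The upload flow can be sketched in the same toy style. The names ToyPutServer and upload_messages are hypothetical, payloads are opaque byte strings rather than real Arrow IPC messages, and the toy do_put returns a simple count where the real method returns a stream of PutResult messages.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, List, Optional, Tuple

Path = Tuple[str, ...]  # hypothetical stand-in for a path-type FlightDescriptor


@dataclass
class FlightData:
    data_body: bytes
    flight_descriptor: Optional[Path] = None  # only set on the first message


def upload_messages(path: Path, batches: Iterable[bytes]) -> Iterator[FlightData]:
    """Yield FlightData messages, attaching the descriptor to the first one only."""
    for index, body in enumerate(batches):
        yield FlightData(data_body=body,
                         flight_descriptor=path if index == 0 else None)


class ToyPutServer:
    """In-memory sink that stores uploaded payloads by descriptor."""

    def __init__(self) -> None:
        self.stored: Dict[Path, List[bytes]] = {}

    def do_put(self, stream: Iterable[FlightData]) -> int:
        messages = iter(stream)
        first = next(messages, None)
        if first is None or first.flight_descriptor is None:
            raise ValueError("first FlightData message must carry a FlightDescriptor")
        target = self.stored.setdefault(first.flight_descriptor, [])
        target.append(first.data_body)
        for message in messages:
            target.append(message.data_body)
        return len(target)  # number of payloads now stored for this descriptor


server = ToyPutServer()
count = server.do_put(upload_messages(("sales", "2020"), [b"batch-0", b"batch-1"]))
```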

See Protocol Buffer Definitions for full details on the methods and messages involved.

Authentication

Flight supports application-implemented authentication methods. Authentication, if enabled, has two phases: at connection time, the client and server can exchange any number of messages. Then, the client can provide a token alongside each call, and the server can validate that token.

Applications may use any part of this; for instance, they may ignore the initial handshake and send an externally acquired token on each call, or they may establish trust during the handshake and not validate a token for each call. (Note that the latter is not secure if you choose to deploy a layer 7 load balancer, as is common with gRPC.)
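
As a rough illustration of the two phases, the sketch below models a handshake that exchanges two payloads and returns a token, followed by per-call token validation. Everything here is an assumption made for illustration: ToyAuthServer, the username/password payload order, and the token format are hypothetical, and a real service would not keep plaintext passwords.

```python
import hmac
import secrets
from typing import Dict, Iterable, Iterator


class ToyAuthServer:
    """Hypothetical server showing handshake-time and per-call authentication."""

    def __init__(self, passwords: Dict[str, str]) -> None:
        self._passwords = passwords          # toy only: plaintext passwords
        self._tokens: Dict[bytes, str] = {}  # issued token -> username

    def handshake(self, requests: Iterable[bytes]) -> Iterator[bytes]:
        """Phase 1: consume handshake payloads and reply with a token payload."""
        payloads = iter(requests)
        username = next(payloads).decode()
        password = next(payloads)
        expected = self._passwords.get(username)
        # Constant-time comparison of the supplied and expected passwords.
        if expected is None or not hmac.compare_digest(expected.encode(), password):
            raise PermissionError("UNAUTHENTICATED")
        token = secrets.token_bytes(16)
        self._tokens[token] = username
        yield token

    def call(self, token: bytes, method: str) -> str:
        """Phase 2: validate the token the client sends alongside each call."""
        user = self._tokens.get(token)
        if user is None:
            raise PermissionError("UNAUTHENTICATED")
        return f"{method} ok for {user}"


server = ToyAuthServer({"alice": "s3cret"})
token = next(server.handshake([b"alice", b"s3cret"]))
reply = server.call(token, "DoGet")
```

Because the token is checked on every call, this variant does not depend on all calls reaching the backend that performed the handshake.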

Error Handling

Arrow Flight defines its own set of error codes. The implementation differs between languages (e.g. in C++, Unimplemented is a general Arrow error status while it’s a Flight-specific exception in Java), but the following set is exposed:

Error Code        Description
----------------  -----------
UNKNOWN           An unknown error. The default if no other error applies.
INTERNAL          An error internal to the service implementation occurred.
INVALID_ARGUMENT  The client passed an invalid argument to the RPC.
TIMED_OUT         The operation exceeded a timeout or deadline.
NOT_FOUND         The requested resource (action, data stream) was not found.
ALREADY_EXISTS    The resource already exists.
CANCELLED         The operation was cancelled (either by the client or the server).
UNAUTHENTICATED   The client is not authenticated.
UNAUTHORIZED      The client is authenticated, but does not have permissions for the requested operation.
UNIMPLEMENTED     The RPC is not implemented.
UNAVAILABLE       The server is not available. May be emitted by the client for connectivity reasons.
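
One way an application might carry the codes listed above is sketched here. FlightErrorCode, FlightError, and should_reauthenticate are hypothetical names, not the types exposed by any Flight implementation, and the string values are placeholders rather than wire-level codes.

```python
from enum import Enum


class FlightErrorCode(Enum):
    """The error codes listed above; values are illustrative labels only."""

    UNKNOWN = "UNKNOWN"
    INTERNAL = "INTERNAL"
    INVALID_ARGUMENT = "INVALID_ARGUMENT"
    TIMED_OUT = "TIMED_OUT"
    NOT_FOUND = "NOT_FOUND"
    ALREADY_EXISTS = "ALREADY_EXISTS"
    CANCELLED = "CANCELLED"
    UNAUTHENTICATED = "UNAUTHENTICATED"
    UNAUTHORIZED = "UNAUTHORIZED"
    UNIMPLEMENTED = "UNIMPLEMENTED"
    UNAVAILABLE = "UNAVAILABLE"


class FlightError(Exception):
    """Hypothetical exception type carrying one of the codes above."""

    def __init__(self, code: FlightErrorCode, message: str = "") -> None:
        super().__init__(f"{code.value}: {message}")
        self.code = code


def should_reauthenticate(error: FlightError) -> bool:
    # UNAUTHENTICATED: the client has not proven who it is, so repeating the
    # handshake may help. UNAUTHORIZED: the identity is known but lacks
    # permission, so authenticating again as the same user would not help.
    return error.code is FlightErrorCode.UNAUTHENTICATED
```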

Protocol Buffer Definitions

/*
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 * <p>
 * http://www.apache.org/licenses/LICENSE-2.0
 * <p>
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

syntax = "proto3";

option java_package = "org.apache.arrow.flight.impl";
option go_package = "github.com/apache/arrow/go/flight;flight";
option csharp_namespace = "Apache.Arrow.Flight.Protocol";

package arrow.flight.protocol;

/*
 * A flight service is an endpoint for retrieving or storing Arrow data. A
 * flight service can expose one or more predefined endpoints that can be
 * accessed using the Arrow Flight Protocol. Additionally, a flight service
 * can expose a set of actions that are available.
 */
service FlightService {

  /*
   * Handshake between client and server. Depending on the server, the
   * handshake may be required to determine the token that should be used for
   * future operations. Both request and response are streams to allow multiple
   * round-trips depending on auth mechanism.
   */
  rpc Handshake(stream HandshakeRequest) returns (stream HandshakeResponse) {}

  /*
   * Get a list of available streams given a particular criteria. Most flight
   * services will expose one or more streams that are readily available for
   * retrieval. This api allows listing the streams available for
   * consumption. A user can also provide a criteria. The criteria can limit
   * the subset of streams that can be listed via this interface. Each flight
   * service allows its own definition of how to consume criteria.
   */
  rpc ListFlights(Criteria) returns (stream FlightInfo) {}

  /*
   * For a given FlightDescriptor, get information about how the flight can be
   * consumed. This is a useful interface if the consumer of the interface
   * already can identify the specific flight to consume. This interface can
   * also allow a consumer to generate a flight stream through a specified
   * descriptor. For example, a flight descriptor might be something that
   * includes a SQL statement or a Pickled Python operation that will be
   * executed. In those cases, the descriptor will not be previously available
   * within the list of available streams provided by ListFlights but will be
   * available for consumption for the duration defined by the specific flight
   * service.
   */
  rpc GetFlightInfo(FlightDescriptor) returns (FlightInfo) {}

  /*
   * For a given FlightDescriptor, get the Schema as described in Schema.fbs::Schema
   * This is used when a consumer needs the Schema of flight stream. Similar to
   * GetFlightInfo this interface may generate a new flight that was not previously
   * available in ListFlights.
   */
  rpc GetSchema(FlightDescriptor) returns (SchemaResult) {}

  /*
   * Retrieve a single stream associated with a particular descriptor
   * associated with the referenced ticket. A Flight can be composed of one or
   * more streams where each stream can be retrieved using a separate opaque
   * ticket that the flight service uses for managing a collection of streams.
   */
  rpc DoGet(Ticket) returns (stream FlightData) {}

  /*
   * Push a stream to the flight service associated with a particular
   * flight stream. This allows a client of a flight service to upload a stream
   * of data. Depending on the particular flight service, a client consumer
   * could be allowed to upload a single stream per descriptor or an unlimited
   * number. In the latter, the service might implement a 'seal' action that
   * can be applied to a descriptor once all streams are uploaded.
   */
  rpc DoPut(stream FlightData) returns (stream PutResult) {}

  /*
   * Open a bidirectional data channel for a given descriptor. This
   * allows clients to send and receive arbitrary Arrow data and
   * application-specific metadata in a single logical stream. In
   * contrast to DoGet/DoPut, this is more suited for clients
   * offloading computation (rather than storage) to a Flight service.
   */
  rpc DoExchange(stream FlightData) returns (stream FlightData) {}

  /*
   * Flight services can support an arbitrary number of simple actions in
   * addition to the possible ListFlights, GetFlightInfo, DoGet, DoPut
   * operations that are potentially available. DoAction allows a flight client
   * to do a specific action against a flight service. An action includes
   * opaque request and response objects that are specific to the type action
   * being undertaken.
   */
  rpc DoAction(Action) returns (stream Result) {}

  /*
   * A flight service exposes all of the available action types that it has
   * along with descriptions. This allows different flight consumers to
   * understand the capabilities of the flight service.
   */
  rpc ListActions(Empty) returns (stream ActionType) {}

}

/*
 * The request that a client provides to a server on handshake.
 */
message HandshakeRequest {

  /*
   * A defined protocol version
   */
  uint64 protocol_version = 1;

  /*
   * Arbitrary auth/handshake info.
   */
  bytes payload = 2;
}

message HandshakeResponse {

  /*
   * A defined protocol version
   */
  uint64 protocol_version = 1;

  /*
   * Arbitrary auth/handshake info.
   */
  bytes payload = 2;
}

/*
 * A message for doing simple auth.
 */
message BasicAuth {
  string username = 2;
  string password = 3;
}

message Empty {}

/*
 * Describes an available action, including both the name used for execution
 * along with a short description of the purpose of the action.
 */
message ActionType {
  string type = 1;
  string description = 2;
}

/*
 * A service specific expression that can be used to return a limited set
 * of available Arrow Flight streams.
 */
message Criteria {
  bytes expression = 1;
}

/*
 * An opaque action specific for the service.
 */
message Action {
  string type = 1;
  bytes body = 2;
}

/*
 * An opaque result returned after executing an action.
 */
message Result {
  bytes body = 1;
}

/*
 * Wrap the result of a getSchema call
 */
message SchemaResult {
  // schema of the dataset as described in Schema.fbs::Schema.
  bytes schema = 1;
}

/*
 * The name or tag for a Flight. May be used as a way to retrieve or generate
 * a flight or be used to expose a set of previously defined flights.
 */
message FlightDescriptor {

  /*
   * Describes what type of descriptor is defined.
   */
  enum DescriptorType {

    // Protobuf pattern, not used.
    UNKNOWN = 0;

    /*
     * A named path that identifies a dataset. A path is composed of a string
     * or list of strings describing a particular dataset. This is conceptually
     * similar to a path inside a filesystem.
     */
    PATH = 1;

    /*
     * An opaque command to generate a dataset.
     */
    CMD = 2;
  }

  DescriptorType type = 1;

  /*
   * Opaque value used to express a command. Should only be defined when
   * type = CMD.
   */
  bytes cmd = 2;

  /*
   * List of strings identifying a particular dataset. Should only be defined
   * when type = PATH.
   */
  repeated string path = 3;
}

/*
 * The access coordinates for retrieval of a dataset. With a FlightInfo, a
 * consumer is able to determine how to retrieve a dataset.
 */
message FlightInfo {
  // schema of the dataset as described in Schema.fbs::Schema.
  bytes schema = 1;

  /*
   * The descriptor associated with this info.
   */
  FlightDescriptor flight_descriptor = 2;

  /*
   * A list of endpoints associated with the flight. To consume the whole
   * flight, all endpoints must be consumed.
   */
  repeated FlightEndpoint endpoint = 3;

  // Set these to -1 if unknown.
  int64 total_records = 4;
  int64 total_bytes = 5;
}

/*
 * A particular stream or split associated with a flight.
 */
message FlightEndpoint {

  /*
   * Token used to retrieve this stream.
   */
  Ticket ticket = 1;

  /*
   * A list of URIs where this ticket can be redeemed. If the list is
   * empty, the expectation is that the ticket can only be redeemed on the
   * current service where the ticket was generated.
   */
  repeated Location location = 2;
}

/*
 * A location where a Flight service will accept retrieval of a particular
 * stream given a ticket.
 */
message Location {
  string uri = 1;
}

/*
 * An opaque identifier that the service can use to retrieve a particular
 * portion of a stream.
 */
message Ticket {
  bytes ticket = 1;
}

/*
 * A batch of Arrow data as part of a stream of batches.
 */
message FlightData {

  /*
   * The descriptor of the data. This is only relevant when a client is
   * starting a new DoPut stream.
   */
  FlightDescriptor flight_descriptor = 1;

  /*
   * Header for message data as described in Message.fbs::Message.
   */
  bytes data_header = 2;

  /*
   * Application-defined metadata.
   */
  bytes app_metadata = 3;

  /*
   * The actual batch of Arrow data. Preferably handled with minimal-copies
   * coming last in the definition to help with sidecar patterns (it is
   * expected that some implementations will fetch this field off the wire
   * with specialized code to avoid extra memory copies).
   */
  bytes data_body = 1000;
}

/**
 * The response message associated with the submission of a DoPut.
 */
message PutResult {
  bytes app_metadata = 1;
}
244  * consumer is able to determine how to retrieve a dataset.
245  */
246 message FlightInfo {
247   // schema of the dataset as described in Schema.fbs::Schema.
248   bytes schema = 1;
249 
250   /*
251    * The descriptor associated with this info.
252    */
253   FlightDescriptor flight_descriptor = 2;
254 
255   /*
256    * A list of endpoints associated with the flight. To consume the whole
257    * flight, all endpoints must be consumed.
258    */
259   repeated FlightEndpoint endpoint = 3;
260 
261   // Set these to -1 if unknown.
262   int64 total_records = 4;
263   int64 total_bytes = 5;
264 }
265 
266 /*
267  * A particular stream or split associated with a flight.
268  */
269 message FlightEndpoint {
270 
271   /*
272    * Token used to retrieve this stream.
273    */
274   Ticket ticket = 1;
275 
276   /*
277    * A list of URIs where this ticket can be redeemed. If the list is
278    * empty, the expectation is that the ticket can only be redeemed on the
279    * current service where the ticket was generated.
280    */
281   repeated Location location = 2;
282 }
283 
284 /*
285  * A location where a Flight service will accept retrieval of a particular
286  * stream given a ticket.
287  */
288 message Location {
289   string uri = 1;
290 }
291 
292 /*
293  * An opaque identifier that the service can use to retrieve a particular
294  * portion of a stream.
295  */
296 message Ticket {
297   bytes ticket = 1;
298 }
299 
300 /*
301  * A batch of Arrow data as part of a stream of batches.
302  */
303 message FlightData {
304 
305   /*
306    * The descriptor of the data. This is only relevant when a client is
307    * starting a new DoPut stream.
308    */
309   FlightDescriptor flight_descriptor = 1;
310 
311   /*
312    * Header for message data as described in Message.fbs::Message.
313    */
314   bytes data_header = 2;
315 
316   /*
317    * Application-defined metadata.
318    */
319   bytes app_metadata = 3;
320 
321   /*
322    * The actual batch of Arrow data. Preferably handled with minimal-copies
323    * coming last in the definition to help with sidecar patterns (it is
324    * expected that some implementations will fetch this field off the wire
325    * with specialized code to avoid extra memory copies).
326    */
327   bytes data_body = 1000;
328 }
329 
330 /**
331  * The response message associated with the submission of a DoPut.
332  */
333 message PutResult {
334   bytes app_metadata = 1;
335 }