The Arrow C stream interface

The C stream interface builds on the structures defined in the C data interface and combines them into a higher-level specification so as to ease the communication of streaming data within a single process.

Semantics

An Arrow C stream exposes a streaming source of data chunks, each with the same schema. Chunks are obtained by calling a blocking pull-style iteration function.

Structure definition

The C stream interface is defined by a single struct definition:

#ifndef ARROW_C_STREAM_INTERFACE
#define ARROW_C_STREAM_INTERFACE

struct ArrowArrayStream {
  // Callbacks providing stream functionality
  int (*get_schema)(struct ArrowArrayStream*, struct ArrowSchema* out);
  int (*get_next)(struct ArrowArrayStream*, struct ArrowArray* out);
  const char* (*get_last_error)(struct ArrowArrayStream*);

  // Release callback
  void (*release)(struct ArrowArrayStream*);

  // Opaque producer-specific data
  void* private_data;
};

#endif  // ARROW_C_STREAM_INTERFACE

Note

The canonical guard ARROW_C_STREAM_INTERFACE is meant to avoid duplicate definitions if two projects copy the C stream interface definitions into their own headers, and a third-party project includes headers from both. It is therefore important that this guard is kept exactly as-is when these definitions are copied.

The ArrowArrayStream structure

The ArrowArrayStream provides the required callbacks to interact with a streaming source of Arrow arrays. It has the following fields:

int (*ArrowArrayStream.get_schema)(struct ArrowArrayStream*, struct ArrowSchema *out)

Mandatory. This callback allows the consumer to query the schema of the chunks of data in the stream. The schema is the same for all data chunks.

This callback must NOT be called on a released ArrowArrayStream.

Return value: 0 on success, a non-zero error code otherwise.

int (*ArrowArrayStream.get_next)(struct ArrowArrayStream*, struct ArrowArray *out)

Mandatory. This callback allows the consumer to get the next chunk of data in the stream.

This callback must NOT be called on a released ArrowArrayStream.

Return value: 0 on success, a non-zero error code otherwise.

On success, the consumer must check whether the ArrowArray is marked released. If the ArrowArray is released, then the end of stream has been reached. Otherwise, the ArrowArray contains a valid data chunk.
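Seen from the producer side, the end-of-stream convention could be sketched as follows. This is a minimal illustration, not a reference implementation: the names CountdownState, countdown_chunk_release and countdown_get_next are hypothetical, the chunks are assumed to be non-nullable int32 arrays (the matching get_schema is not shown), and the struct definitions are reproduced only so that the sketch compiles on its own.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

// Reproduced from the C data and C stream interfaces so the sketch compiles
// standalone; a real project would include its own copy of these definitions.
#ifndef ARROW_C_DATA_INTERFACE
#define ARROW_C_DATA_INTERFACE
struct ArrowSchema {
  const char* format;
  const char* name;
  const char* metadata;
  int64_t flags;
  int64_t n_children;
  struct ArrowSchema** children;
  struct ArrowSchema* dictionary;
  void (*release)(struct ArrowSchema*);
  void* private_data;
};
struct ArrowArray {
  int64_t length;
  int64_t null_count;
  int64_t offset;
  int64_t n_buffers;
  int64_t n_children;
  const void** buffers;
  struct ArrowArray** children;
  struct ArrowArray* dictionary;
  void (*release)(struct ArrowArray*);
  void* private_data;
};
#endif  // ARROW_C_DATA_INTERFACE

#ifndef ARROW_C_STREAM_INTERFACE
#define ARROW_C_STREAM_INTERFACE
struct ArrowArrayStream {
  int (*get_schema)(struct ArrowArrayStream*, struct ArrowSchema* out);
  int (*get_next)(struct ArrowArrayStream*, struct ArrowArray* out);
  const char* (*get_last_error)(struct ArrowArrayStream*);
  void (*release)(struct ArrowArrayStream*);
  void* private_data;
};
#endif  // ARROW_C_STREAM_INTERFACE

// Hypothetical producer state: how many chunks are still to be emitted.
struct CountdownState {
  int remaining_chunks;
};

// Release callback for chunks: frees the buffers, then marks the array released.
static void countdown_chunk_release(struct ArrowArray* array) {
  free((void*)array->buffers[1]);
  free((void*)array->buffers);
  array->release = NULL;
}

static int countdown_get_next(struct ArrowArrayStream* stream,
                              struct ArrowArray* out) {
  struct CountdownState* state = (struct CountdownState*)stream->private_data;
  if (state->remaining_chunks == 0) {
    out->release = NULL;  // end of stream: success with a released array
    return 0;
  }
  const int64_t length = 4;
  const void** buffers = (const void**)malloc(2 * sizeof(void*));
  int32_t* values = (int32_t*)malloc((size_t)length * sizeof(int32_t));
  if (buffers == NULL || values == NULL) {
    free((void*)buffers);
    free(values);
    return ENOMEM;
  }
  for (int64_t i = 0; i < length; i++) values[i] = (int32_t)i;
  buffers[0] = NULL;    // no validity bitmap: the chunk contains no nulls
  buffers[1] = values;  // int32 values
  // Fields not listed here (null_count, offset, children, ...) are zeroed.
  *out = (struct ArrowArray){.length = length,
                             .n_buffers = 2,
                             .buffers = buffers,
                             .release = countdown_chunk_release};
  state->remaining_chunks--;
  return 0;
}
```

A consumer distinguishes the two successful outcomes only by inspecting out->release after the call, which is why the producer must set it to NULL explicitly at end of stream.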

const char *(*ArrowArrayStream.get_last_error)(struct ArrowArrayStream*)

Mandatory. This callback allows the consumer to get a textual description of the last error.

This callback must ONLY be called if the last operation on the ArrowArrayStream returned an error. It must NOT be called on a released ArrowArrayStream.

Return value: a pointer to a null-terminated character string (UTF-8 encoded). NULL can also be returned if no detailed description is available.

The returned pointer is only guaranteed to be valid until the next call of one of the stream’s callbacks. The character string it points to should be copied to consumer-managed storage if it is intended to survive longer.
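A consumer that needs the message to outlive the next callback invocation might copy it right away, along the lines of the sketch below. The helper copy_last_error is a hypothetical name, not part of the interface, and stub_get_last_error is a stand-in producer that exists only to show a buffer being overwritten between calls.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct ArrowSchema;  // defined by the C data interface
struct ArrowArray;   // defined by the C data interface

#ifndef ARROW_C_STREAM_INTERFACE
#define ARROW_C_STREAM_INTERFACE
struct ArrowArrayStream {
  int (*get_schema)(struct ArrowArrayStream*, struct ArrowSchema* out);
  int (*get_next)(struct ArrowArrayStream*, struct ArrowArray* out);
  const char* (*get_last_error)(struct ArrowArrayStream*);
  void (*release)(struct ArrowArrayStream*);
  void* private_data;
};
#endif  // ARROW_C_STREAM_INTERFACE

// Hypothetical consumer helper: returns a malloc'ed copy of the last error
// description, falling back to strerror(errcode) when the producer returns
// NULL. The caller owns the result and must free() it. Returns NULL only if
// the allocation itself fails.
static char* copy_last_error(struct ArrowArrayStream* stream, int errcode) {
  const char* desc = stream->get_last_error(stream);
  if (desc == NULL) {
    desc = strerror(errcode);
  }
  size_t size = strlen(desc) + 1;
  char* copy = (char*)malloc(size);
  if (copy != NULL) {
    memcpy(copy, desc, size);
  }
  return copy;
}

// Stand-in producer callback, for illustration only: it reuses one static
// buffer, so every call overwrites the previously returned message.
static char stub_error_buffer[32];
static int stub_error_count = 0;

static const char* stub_get_last_error(struct ArrowArrayStream* stream) {
  (void)stream;
  stub_error_count++;
  snprintf(stub_error_buffer, sizeof(stub_error_buffer), "stub error #%d",
           stub_error_count);
  return stub_error_buffer;
}
```

With a producer like the stub, holding on to the raw pointer would silently expose a later message, whereas the copy stays unchanged until the consumer frees it.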

void (*ArrowArrayStream.release)(struct ArrowArrayStream*)

Mandatory. A pointer to a producer-provided release callback.

void *ArrowArrayStream.private_data

Optional. An opaque pointer to producer-provided private data.

Consumers MUST NOT process this member. The lifetime of this member is managed by the producer, in particular through the release callback.
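The relationship between private_data and the release callback could look like the following producer-side sketch. All sketch_* names and the SketchState type are hypothetical. The get_schema and get_next callbacks are placeholders that only report an error, and marking the stream as released by setting release to NULL is assumed here by analogy with the C data interface convention.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

struct ArrowSchema;  // defined by the C data interface
struct ArrowArray;   // defined by the C data interface

#ifndef ARROW_C_STREAM_INTERFACE
#define ARROW_C_STREAM_INTERFACE
struct ArrowArrayStream {
  int (*get_schema)(struct ArrowArrayStream*, struct ArrowSchema* out);
  int (*get_next)(struct ArrowArrayStream*, struct ArrowArray* out);
  const char* (*get_last_error)(struct ArrowArrayStream*);
  void (*release)(struct ArrowArrayStream*);
  void* private_data;
};
#endif  // ARROW_C_STREAM_INTERFACE

// Hypothetical producer state. Consumers only ever see it as an opaque pointer.
struct SketchState {
  char last_error[64];
};

// Placeholder callbacks: a real producer would fill `out` instead of failing.
static int sketch_get_schema(struct ArrowArrayStream* stream,
                             struct ArrowSchema* out) {
  struct SketchState* state = (struct SketchState*)stream->private_data;
  (void)out;
  snprintf(state->last_error, sizeof(state->last_error),
           "get_schema: not implemented");
  return ENOSYS;
}

static int sketch_get_next(struct ArrowArrayStream* stream,
                           struct ArrowArray* out) {
  struct SketchState* state = (struct SketchState*)stream->private_data;
  (void)out;
  snprintf(state->last_error, sizeof(state->last_error),
           "get_next: not implemented");
  return ENOSYS;
}

// The error text lives in producer-owned state reached through private_data.
static const char* sketch_get_last_error(struct ArrowArrayStream* stream) {
  struct SketchState* state = (struct SketchState*)stream->private_data;
  return state->last_error;
}

// Frees the producer-owned state, then marks the stream as released.
static void sketch_release(struct ArrowArrayStream* stream) {
  free(stream->private_data);
  stream->private_data = NULL;
  stream->release = NULL;
}

// Fills `out` with the callbacks above. Returns 0 on success, ENOMEM otherwise.
static int sketch_stream_init(struct ArrowArrayStream* out) {
  struct SketchState* state = (struct SketchState*)calloc(1, sizeof(*state));
  if (state == NULL) {
    return ENOMEM;
  }
  out->get_schema = sketch_get_schema;
  out->get_next = sketch_get_next;
  out->get_last_error = sketch_get_last_error;
  out->release = sketch_release;
  out->private_data = state;
  return 0;
}
```

The consumer never dereferences private_data: it only passes the stream pointer back to the callbacks, and the single call to release is what frees the producer's state.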

Error codes

The get_schema and get_next callbacks may return an error in the form of a non-zero integer code. Such error codes should be interpreted like errno numbers (as defined by the local platform). Note that the symbolic forms of these constants are stable from platform to platform, but their numeric values are platform-specific.

In particular, it is recommended to recognize the following values:

  • EINVAL: for a parameter or input validation error

  • ENOMEM: for a memory allocation failure (out of memory)

  • EIO: for a generic input/output error
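Because the numeric values differ between platforms, consumer code is best written against the symbolic constants from errno.h. The sketch below shows one way to do that; describe_stream_error is a hypothetical helper name, and the fallback to strerror for other codes is a design choice of this example rather than something the interface requires.

```c
#include <errno.h>
#include <string.h>

// Hypothetical consumer helper: maps a code returned by get_schema or
// get_next to a short description, comparing against symbolic errno
// constants rather than hard-coded numbers.
static const char* describe_stream_error(int errcode) {
  switch (errcode) {
    case 0:
      return "success";
    case EINVAL:
      return "parameter or input validation error";
    case ENOMEM:
      return "memory allocation failure";
    case EIO:
      return "generic input/output error";
    default:
      // Any other errno-like code: defer to the platform's description.
      return strerror(errcode);
  }
}
```

When the stream's get_last_error callback returns a non-NULL string, that producer-specific message is usually more informative than this generic mapping.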

Result lifetimes

The data returned by the get_schema and get_next callbacks must be released independently. Their lifetimes are not tied to that of the ArrowArrayStream.

Stream lifetime

The lifetime of the C stream is managed through a release callback, used in a similar way as in the C data interface.

Thread safety

The stream source is not assumed to be thread-safe. Consumers wanting to call get_next from several threads should ensure those calls are serialized.
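One possible way to serialize calls is to guard the stream with a mutex, as in the sketch below. It assumes POSIX threads are available; SerializedStream and serialized_get_next are hypothetical names, and stub_get_next is a stand-in producer that only counts calls.

```c
#include <pthread.h>
#include <stddef.h>

struct ArrowSchema;  // defined by the C data interface
struct ArrowArray;   // defined by the C data interface

#ifndef ARROW_C_STREAM_INTERFACE
#define ARROW_C_STREAM_INTERFACE
struct ArrowArrayStream {
  int (*get_schema)(struct ArrowArrayStream*, struct ArrowSchema* out);
  int (*get_next)(struct ArrowArrayStream*, struct ArrowArray* out);
  const char* (*get_last_error)(struct ArrowArrayStream*);
  void (*release)(struct ArrowArrayStream*);
  void* private_data;
};
#endif  // ARROW_C_STREAM_INTERFACE

// Hypothetical consumer-side wrapper pairing a stream with a mutex.
// The mutex must be initialized (for example with pthread_mutex_init)
// before the first call.
struct SerializedStream {
  struct ArrowArrayStream* stream;
  pthread_mutex_t lock;
};

// Calls get_next while holding the lock, so that at most one thread is
// inside the producer's callback at any time.
static int serialized_get_next(struct SerializedStream* s,
                               struct ArrowArray* out) {
  pthread_mutex_lock(&s->lock);
  int errcode = s->stream->get_next(s->stream, out);
  pthread_mutex_unlock(&s->lock);
  return errcode;
}

// Stand-in producer callback, for illustration only: counts its invocations.
static int stub_calls = 0;

static int stub_get_next(struct ArrowArrayStream* stream,
                         struct ArrowArray* out) {
  (void)stream;
  (void)out;
  stub_calls++;
  return 0;
}
```

Since the string returned by get_last_error is only valid until the next callback invocation, a multi-threaded consumer would also want to fetch and copy the error description while still holding the lock.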

C consumer example

Let’s say a particular database provides the following C API to execute a SQL query and return the result set as an Arrow C stream:

void MyDB_Query(const char* query, struct ArrowArrayStream* result_set);

Then a consumer could use the following code to iterate over the results:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static void handle_error(int errcode, struct ArrowArrayStream* stream) {
   // Print stream error
   const char* errdesc = stream->get_last_error(stream);
   if (errdesc != NULL) {
      fputs(errdesc, stderr);
   } else {
      fputs(strerror(errcode), stderr);
   }
   // Release stream and abort
   stream->release(stream);
   exit(1);
}

void run_query() {
   struct ArrowArrayStream stream;
   struct ArrowSchema schema;
   struct ArrowArray chunk;
   int errcode;

   MyDB_Query("SELECT * FROM my_table", &stream);

   // Query result set schema
   errcode = stream.get_schema(&stream, &schema);
   if (errcode != 0) {
      handle_error(errcode, &stream);
   }

   int64_t num_rows = 0;

   // Iterate over results: loop until error or end of stream
   while ((errcode = stream.get_next(&stream, &chunk)) == 0 &&
          chunk.release != NULL) {
      // Do something with chunk...
      fprintf(stderr, "Result chunk: got %lld rows\n", (long long)chunk.length);
      num_rows += chunk.length;

      // Release chunk
      chunk.release(&chunk);
   }

   // Was it an error?
   if (errcode != 0) {
      handle_error(errcode, &stream);
   }

   fprintf(stderr, "Result stream ended: total %lld rows\n", (long long)num_rows);

   // Release schema and stream
   schema.release(&schema);
   stream.release(&stream);
}