Apache Arrow defines two formats for serializing data for interprocess communication (IPC): a "stream" format and a "file" format, known as Feather. RecordBatchStreamReader and RecordBatchFileReader are interfaces for accessing record batches from input sources in those formats, respectively.

For guidance on how to use these classes, see the examples section.

Factory

The RecordBatchFileReader$create() and RecordBatchStreamReader$create() factory methods instantiate the object and take a single argument, named according to the class:

Methods

  • $read_next_batch(): Returns a RecordBatch, iterating through the Reader. If there are no further batches in the Reader, it returns NULL.

  • $schema: Returns a Schema (active binding)

  • $batches(): Returns a list of RecordBatches

  • $read_table(): Collects the reader's RecordBatches into a Table

  • $get_batch(i): For RecordBatchFileReader, return a particular batch by an integer index.

  • $num_record_batches(): For RecordBatchFileReader, see how many batches are in the file.

See also

read_ipc_stream() and read_feather() provide a much simpler interface for reading data from these formats and are sufficient for many use cases.

Examples

tf <- tempfile()
on.exit(unlink(tf))

batch <- record_batch(chickwts)

# This opens a connection to the file in Arrow
file_obj <- FileOutputStream$create(tf)
# Pass that to a RecordBatchWriter to write data conforming to a schema
writer <- RecordBatchFileWriter$create(file_obj, batch$schema)
writer$write(batch)
# You may write additional batches to the stream, provided that they have
# the same schema.
# Call "close" on the writer to indicate end-of-file/stream
writer$close()
# Then, close the connection--closing the IPC message does not close the file
file_obj$close()

# Now, we have a file we can read from. Same pattern: open file connection,
# then pass it to a RecordBatchReader
read_file_obj <- ReadableFile$create(tf)
reader <- RecordBatchFileReader$create(read_file_obj)
# RecordBatchFileReader knows how many batches it has (StreamReader does not)
reader$num_record_batches
#> [1] 1
# We could consume the Reader by calling $read_next_batch() until all are,
# consumed, or we can call $read_table() to pull them all into a Table
tab <- reader$read_table()
# Call as.data.frame to turn that Table into an R data.frame
df <- as.data.frame(tab)
# This should be the same data we sent
all.equal(df, chickwts, check.attributes = FALSE)
#> [1] TRUE
# Unlike the Writers, we don't have to close RecordBatchReaders,
# but we do still need to close the file connection
read_file_obj$close()