Apache Arrow defines two formats for serializing data for interprocess communication (IPC):
a "stream" format and a "file" format, known as Feather.
RecordBatchStreamWriter
and RecordBatchFileWriter
are
interfaces for writing record batches to those formats, respectively.
For guidance on how to use these classes, see the examples section.
The RecordBatchFileWriter$create()
and RecordBatchStreamWriter$create()
factory methods instantiate the object and take the following arguments:
sink
An OutputStream
schema
A Schema for the data to be written
use_legacy_format
logical: write data formatted so that Arrow libraries
versions 0.14 and lower can read it. Default is FALSE
. You can also
enable this by setting the environment variable ARROW_PRE_0_15_IPC_FORMAT=1
.
metadata_version
: A string like "V5" or the equivalent integer indicating
the Arrow IPC MetadataVersion. Default (NULL) will use the latest version,
unless the environment variable ARROW_PRE_1_0_METADATA_VERSION=1
, in
which case it will be V4.
$write(x)
: Write a RecordBatch, Table, or data.frame
, dispatching
to the methods below appropriately
$write_batch(batch)
: Write a RecordBatch
to stream
$write_table(table)
: Write a Table
to stream
$close()
: close stream. Note that this indicates end-of-file or
end-of-stream--it does not close the connection to the sink
. That needs
to be closed separately.
write_ipc_stream()
and write_feather()
provide a much simpler
interface for writing data to these formats and are sufficient for many use
cases. write_to_raw()
is a version that serializes data to a buffer.
tf <- tempfile() on.exit(unlink(tf)) batch <- record_batch(chickwts) # This opens a connection to the file in Arrow file_obj <- FileOutputStream$create(tf) # Pass that to a RecordBatchWriter to write data conforming to a schema writer <- RecordBatchFileWriter$create(file_obj, batch$schema) writer$write(batch) # You may write additional batches to the stream, provided that they have # the same schema. # Call "close" on the writer to indicate end-of-file/stream writer$close() # Then, close the connection--closing the IPC message does not close the file file_obj$close() # Now, we have a file we can read from. Same pattern: open file connection, # then pass it to a RecordBatchReader read_file_obj <- ReadableFile$create(tf) reader <- RecordBatchFileReader$create(read_file_obj) # RecordBatchFileReader knows how many batches it has (StreamReader does not) reader$num_record_batches #> [1] 1 # We could consume the Reader by calling $read_next_batch() until all are, # consumed, or we can call $read_table() to pull them all into a Table tab <- reader$read_table() # Call as.data.frame to turn that Table into an R data.frame df <- as.data.frame(tab) # This should be the same data we sent all.equal(df, chickwts, check.attributes = FALSE) #> [1] TRUE # Unlike the Writers, we don't have to close RecordBatchReaders, # but we do still need to close the file connection read_file_obj$close()