Apache Arrow defines two formats for serializing data for interprocess communication (IPC):
a "stream" format and a "file" format, known as Feather.
RecordBatchStreamWriter and RecordBatchFileWriter are
interfaces for writing record batches to those formats, respectively.
For guidance on how to use these classes, see the examples section.
Factory
The RecordBatchFileWriter$create() and RecordBatchStreamWriter$create()
factory methods instantiate the object and take the following arguments:
sink: An OutputStream
schema: A Schema for the data to be written
use_legacy_format: logical. Write data formatted so that Arrow libraries versions 0.14 and lower can read it. Default is FALSE. You can also enable this by setting the environment variable ARROW_PRE_0_15_IPC_FORMAT=1.
metadata_version: A string like "V5" or the equivalent integer indicating the Arrow IPC MetadataVersion. Default (NULL) will use the latest version, unless the environment variable ARROW_PRE_1_0_METADATA_VERSION=1 is set, in which case it will be V4.
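As a minimal sketch of how these arguments fit together, the following creates a stream writer over an in-memory sink and requests metadata version V4 explicitly. The data frame and variable names are illustrative assumptions, and BufferOutputStream is used here only so that the example does not touch disk; a FileOutputStream would work the same way.

```r
library(arrow)

# Illustrative data; any data.frame would do
batch <- record_batch(data.frame(x = 1:3, y = c("a", "b", "c")))

# An in-memory OutputStream to act as the sink
sink <- BufferOutputStream$create()

# Ask for metadata version V4 instead of the default (latest version)
writer <- RecordBatchStreamWriter$create(
  sink,
  batch$schema,
  metadata_version = "V4"
)
writer$write_batch(batch)
writer$close()

# finish() returns the bytes written to the sink as an Arrow Buffer
buf <- sink$finish()

# Read the stream back to check that the data survived the round trip
roundtrip <- RecordBatchStreamReader$create(buf)$read_table()
```

Leaving metadata_version as NULL would let the writer pick the latest version, as described above.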
Methods
$write(x): Write a RecordBatch, Table, or data.frame, dispatching to the methods below appropriately
$write_batch(batch): Write a RecordBatch to stream
$write_table(table): Write a Table to stream
$close(): Close stream. Note that this indicates end-of-file or end-of-stream; it does not close the connection to the sink. That needs to be closed separately.
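The methods above can be sketched in a single session. This example writes the same illustrative data three times (as a RecordBatch, as a Table, and as a data.frame) to one stream; the data and variable names are assumptions made for the illustration, and all three inputs share one schema.

```r
library(arrow)

df <- data.frame(id = 1:2, label = c("a", "b"))
batch <- record_batch(df)
tab <- Table$create(df)

sink <- BufferOutputStream$create()
writer <- RecordBatchStreamWriter$create(sink, batch$schema)

writer$write_batch(batch)  # a RecordBatch
writer$write_table(tab)    # a Table
writer$write(df)           # a data.frame, dispatched by $write()

# Signal end-of-stream; the sink itself is still open after this call
writer$close()

# For this in-memory sink, finish() closes it and returns the bytes written
buf <- sink$finish()

result <- RecordBatchStreamReader$create(buf)$read_table()
result$num_rows
```

With a FileOutputStream as the sink, the last step would instead be a call to its own $close() method after writer$close().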
See also
write_ipc_stream() and write_feather() provide a much simpler
interface for writing data to these formats and are sufficient for many use
cases. write_to_raw() is a version that serializes data to a buffer.
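For comparison, here is a short sketch of the simpler helpers. The data frame, file extensions, and variable names are illustrative assumptions; each helper handles opening and closing the sink itself.

```r
library(arrow)

df <- data.frame(x = 1:3)

# Stream format, written to a temporary file
stream_path <- tempfile(fileext = ".arrows")
write_ipc_stream(df, stream_path)

# File (Feather) format, written to a temporary file
file_path <- tempfile(fileext = ".arrow")
write_feather(df, file_path)

# Stream format, serialized to a raw vector in memory
bytes <- write_to_raw(df)

# Read each one back
from_stream <- read_ipc_stream(stream_path)
from_file <- read_feather(file_path)
from_raw <- read_ipc_stream(bytes)

unlink(c(stream_path, file_path))
```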
Examples
tf <- tempfile()
on.exit(unlink(tf))
batch <- record_batch(chickwts)
# This opens a connection to the file in Arrow
file_obj <- FileOutputStream$create(tf)
# Pass that to a RecordBatchWriter to write data conforming to a schema
writer <- RecordBatchFileWriter$create(file_obj, batch$schema)
writer$write(batch)
# You may write additional batches to the stream, provided that they have
# the same schema.
# Call "close" on the writer to indicate end-of-file/stream
writer$close()
# Then, close the connection--closing the IPC message does not close the file
file_obj$close()
# Now, we have a file we can read from. Same pattern: open file connection,
# then pass it to a RecordBatchReader
read_file_obj <- ReadableFile$create(tf)
reader <- RecordBatchFileReader$create(read_file_obj)
# RecordBatchFileReader knows how many batches it has (StreamReader does not)
reader$num_record_batches
#> [1] 1
# We could consume the Reader by calling $read_next_batch() until all are
# consumed, or we can call $read_table() to pull them all into a Table
tab <- reader$read_table()
# Call as.data.frame to turn that Table into an R data.frame
df <- as.data.frame(tab)
# This should be the same data we sent
all.equal(df, chickwts, check.attributes = FALSE)
#> [1] TRUE
# Unlike the Writers, we don't have to close RecordBatchReaders,
# but we do still need to close the file connection
read_file_obj$close()
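The same pattern applies to the stream format. The sketch below writes two batches with RecordBatchStreamWriter and then consumes them one at a time with $read_next_batch(), which returns NULL once the reader is exhausted. The counters n_batches and n_rows are hypothetical names introduced only for this illustration.

```r
library(arrow)

tf <- tempfile()
batch <- record_batch(chickwts)

# Write the same batch twice to a stream-format file
file_obj <- FileOutputStream$create(tf)
writer <- RecordBatchStreamWriter$create(file_obj, batch$schema)
writer$write(batch)
writer$write(batch)
writer$close()
file_obj$close()

# A stream reader does not know the batch count up front,
# so iterate until read_next_batch() returns NULL
read_file_obj <- ReadableFile$create(tf)
reader <- RecordBatchStreamReader$create(read_file_obj)
n_batches <- 0L
n_rows <- 0L
while (!is.null(next_batch <- reader$read_next_batch())) {
  n_batches <- n_batches + 1L
  n_rows <- n_rows + next_batch$num_rows
}
read_file_obj$close()
unlink(tf)
```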