As an alternative to calling collect() on a Dataset query, you can use this function to access the stream of RecordBatches in the Dataset. This lets you do more complex operations in R that operate on chunks of data without having to hold the entire Dataset in memory at once. You can include map_batches() in a dplyr pipeline and apply additional dplyr methods to the stream of data in Arrow after it.
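A minimal sketch of batch-wise processing in a dplyr pipeline. The dataset path, the use of mtcars, and the derived column name kpl are illustrative assumptions, not part of this function's API:

```r
library(arrow)
library(dplyr)

# Write a small example dataset so we can open it as a Dataset
path <- tempfile()
write_dataset(mtcars, path, format = "parquet")
ds <- open_dataset(path)

# Process the Dataset one RecordBatch at a time. FUN must return a
# RecordBatch or something coercible to one via as_record_batch().
result <- ds %>%
  filter(cyl > 4) %>%
  map_batches(function(batch) {
    batch %>%
      as.data.frame() %>%
      mutate(kpl = mpg * 0.425144) %>%  # hypothetical derived column
      as_record_batch()
  }) %>%
  collect()
```

Because map_batches() returns an arrow_dplyr_query, you can continue the pipeline with further dplyr verbs before the final collect().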
map_batches(X, FUN, ..., .schema = NULL, .lazy = FALSE, .data.frame = NULL)
A Dataset or arrow_dplyr_query object, as returned by the dplyr methods on Dataset.
A function or purrr-style lambda expression to apply to each batch. It must return a RecordBatch or something coercible to one via as_record_batch().
Additional arguments passed to FUN
An optional schema(). If NULL, the schema will be inferred from the first batch.
Use TRUE to evaluate FUN lazily as batches are read from the result; use FALSE to evaluate FUN on all batches before returning the reader.
Deprecated argument, ignored.
An arrow_dplyr_query.
Note that, unlike the core dplyr methods that are implemented in the Arrow query engine, map_batches() is not lazy: it starts evaluating on the data when you call it, even if you send its result to another pipeline function. This function is experimental and not recommended for production use. It is also single-threaded and runs in R, not C++, so it won't be as fast as core Arrow methods.
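A sketch illustrating the eager evaluation described above (assuming ds is a Dataset opened with open_dataset(); the message text is illustrative):

```r
# FUN runs as soon as map_batches() is called, not when the result is
# collected: the messages print immediately on this line.
q <- ds %>%
  map_batches(function(batch) {
    message("processing a batch of ", batch$num_rows, " rows")
    batch  # pass the batch through unchanged
  })

# q is an arrow_dplyr_query; further dplyr verbs or collect() can
# follow, but the batch-level work has already been done.
```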