As an alternative to calling collect() on a Dataset query, you can use this function to access the stream of RecordBatches in the Dataset. This lets you perform more complex operations in R on chunks of data without holding the entire Dataset in memory at once. You can include map_batches() in a dplyr pipeline and apply additional dplyr methods to the stream of data in Arrow after it.
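For example, here is a minimal sketch of such a pipeline. The dataset path is hypothetical; substitute your own. Each batch is reduced to a one-row RecordBatch of per-batch metadata, and the result is then summarized with dplyr verbs before collecting.

library(arrow)
library(dplyr)

# Hypothetical multi-file dataset; replace with a real path.
ds <- open_dataset("path/to/dataset")

# Reduce each batch to a single row recording its size,
# then continue working on the stream with dplyr.
ds %>%
  map_batches(~ record_batch(batch_rows = .$num_rows)) %>%
  summarize(total_rows = sum(batch_rows)) %>%
  collect()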

map_batches(X, FUN, ..., .schema = NULL, .lazy = FALSE, .data.frame = NULL)

Arguments

X

A Dataset or arrow_dplyr_query object, as returned by the dplyr methods on Dataset.

FUN

A function or purrr-style lambda expression to apply to each batch. It must return a RecordBatch or something coercible to one via as_record_batch().

...

Additional arguments passed to FUN, as shown in the sketch after this list.

.schema

An optional schema(). If NULL, the schema will be inferred from the first batch.

.lazy

Use TRUE to evaluate FUN lazily as batches are read from the result; use FALSE to evaluate FUN on all batches before returning the reader.

.data.frame

Deprecated argument, ignored.
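As a sketch of how FUN and ... fit together: the helper function, dataset path, and "value" column below are hypothetical, but extra named arguments supplied after FUN are forwarded to it on each batch.

library(arrow)
library(dplyr)

ds <- open_dataset("path/to/dataset")

# Hypothetical helper: rescale one column of a batch in R,
# then coerce the result back to a RecordBatch.
scale_column <- function(batch, col, factor) {
  df <- as.data.frame(batch)
  df[[col]] <- df[[col]] * factor
  as_record_batch(df)
}

# `col` and `factor` flow through `...` to scale_column().
ds %>%
  map_batches(scale_column, col = "value", factor = 2) %>%
  collect()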

Value

An arrow_dplyr_query.

Details

Note that, unlike the core dplyr methods that are implemented in the Arrow query engine, map_batches() is not lazy: it starts evaluating the data when you call it, even if you send its result to another pipeline function.

This is experimental and not recommended for production use. It is also single-threaded and runs in R, not C++, so it won't be as fast as the core Arrow methods.