As an alternative to calling collect() on a Dataset query, you can use this function to access the stream of RecordBatches in the Dataset. This lets you aggregate on each chunk and pull the intermediate results into a data.frame for further aggregation, even if you couldn't fit the whole Dataset result in memory.

map_batches(X, FUN, ..., .data.frame = TRUE)

Arguments

X

A Dataset or arrow_dplyr_query object, as returned by the dplyr methods on Dataset.

FUN

A function or purrr-style lambda expression to apply to each batch

...

Additional arguments passed to FUN

.data.frame

logical: collect the resulting chunks into a single data.frame? Default TRUE

Details

This is experimental and not recommended for production use.