A Scanner iterates over a Dataset's fragments and returns data according to given row filtering and column projection. A ScannerBuilder can help create one.
Scanner$create() wraps the ScannerBuilder interface to make a Scanner. It takes the following arguments:
dataset: A Dataset or arrow_dplyr_query object, as returned by the dplyr methods on Dataset.
projection: A character vector of column names to select columns or a named list of expressions
filter: An Expression to filter the scanned rows by, or TRUE (default) to keep all rows.
use_threads: logical: should scanning use multithreading? Default TRUE
use_async: logical: should the async scanner (performs better on high-latency/highly parallel filesystems like S3) be used? Default FALSE
...: Additional arguments, currently ignored
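The arguments above can be exercised with a small local dataset. This is a minimal sketch rather than a canonical recipe: it assumes the arrow package is installed with Parquet support, uses write_dataset() and open_dataset() to build a throwaway Dataset from the built-in mtcars data, and builds the filter with Expression$field_ref(), none of which are described in this section.

```r
library(arrow)

# Assumption: a throwaway Parquet dataset in a temporary directory.
tf <- tempfile()
dir.create(tf)
write_dataset(mtcars, tf)
ds <- open_dataset(tf)

# Select two columns and keep only rows where hp exceeds 100.
scanner <- Scanner$create(
  ds,
  projection = c("mpg", "hp"),
  filter = Expression$field_ref("hp") > 100,
  use_threads = TRUE
)

# Evaluate the scan and convert the resulting Table to a data frame.
df <- as.data.frame(scanner$ToTable())

unlink(tf, recursive = TRUE)
```

Leaving filter at its default of TRUE would return every row, and omitting projection would return every column.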
ScannerBuilder has the following methods:
$Project(cols): Indicate that the scan should only return columns given by cols, a character vector of column names
$Filter(expr): Filter rows by an Expression.
$UseThreads(threads): logical: should the scan use multithreading? The method's default input is TRUE, but you must call the method to enable multithreading because the scanner default is FALSE.
$UseAsync(use_async): logical: should the async scanner be used?
$BatchSize(batch_size): integer: Maximum row count of scanned record batches, default is 32K. If scanned record batches are overflowing memory then this method can be called to reduce their size.
$schema: Active binding, returns the Schema of the Dataset
$Finish(): Returns a Scanner
Scanner currently has a single method, $ToTable(), which evaluates the query and returns an Arrow Table.
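The builder methods can also be called directly instead of going through Scanner$create(). The following is a sketch under a few assumptions not stated in this section: the ScannerBuilder is obtained from the Dataset with ds$NewScan(), the dataset is a temporary Parquet copy of mtcars, and the batch size of 1000 is an arbitrary illustrative value.

```r
library(arrow)

# Assumption: a throwaway Parquet dataset in a temporary directory.
tf <- tempfile()
dir.create(tf)
write_dataset(mtcars, tf)
ds <- open_dataset(tf)

# Assumption: NewScan() on the Dataset returns a ScannerBuilder.
builder <- ds$NewScan()

# Configure the scan step by step.
builder$Filter(Expression$field_ref("hp") > 100)  # row filter
builder$Project(c("mpg", "hp"))                   # column projection
builder$UseThreads(TRUE)                          # must be called to enable threads
builder$BatchSize(1000L)                          # cap rows per record batch

# Build the Scanner, then evaluate it into a Table.
scanner <- builder$Finish()
result <- as.data.frame(scanner$ToTable())

unlink(tf, recursive = TRUE)
```

Configuring the builder by hand is mainly useful when the scan options are decided incrementally; for a one-shot scan, Scanner$create() covers the same settings in a single call.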