Arrow Datasets allow you to query against data that has been split across multiple files. This sharding of data may indicate partitioning, which can accelerate queries that only touch some partitions (files).
A Dataset contains one or more Fragments, such as files, of potentially differing type and partitioning.
For Dataset$create(), see open_dataset(), which is an alias for it.
DatasetFactory is used to provide finer control over the creation of Datasets.
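As a minimal sketch of the overview above, assuming the arrow package is installed with dataset and Parquet support, the following writes the built-in mtcars data frame to a temporary directory partitioned by cyl and reopens the files as one Dataset. The directory name and the choice of mtcars are illustrative assumptions, not part of the original documentation.

```r
library(arrow)

# Hypothetical scratch directory used only for this example
path <- file.path(tempdir(), "mtcars_by_cyl")

# Write one Parquet file per value of `cyl`, in Hive-style
# subdirectories such as cyl=4/, cyl=6/, cyl=8/
write_dataset(mtcars, path, format = "parquet", partitioning = "cyl")

# Open every file under `path` as a single Dataset
ds <- open_dataset(path)

# The partition column is recovered from the directory names
ds$schema
```

Because each value of cyl lives in its own file, a query that filters on cyl only needs to read the matching files.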
Factory
DatasetFactory is used to create a Dataset, inspect the Schema of the fragments contained in it, and declare a partitioning. FileSystemDatasetFactory is a subclass of DatasetFactory for discovering files in the local file system, the only currently supported file system.
For the DatasetFactory$create() factory method, see dataset_factory(), an alias for it. A DatasetFactory has:
$Inspect(unify_schemas): If unify_schemas is TRUE, all fragments will be scanned and a unified Schema will be created from them; if FALSE (default), only the first fragment will be inspected for its schema. Use this fast path when you know and trust that all fragments have an identical schema.
$Finish(schema, unify_schemas): Returns a Dataset. If schema is provided, it will be used for the Dataset; if omitted, a Schema will be created from inspecting the fragments (files) in the dataset, following unify_schemas as described above.
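The two methods above can be sketched as follows. This is an illustration under assumptions: the arrow package is installed with Parquet support, and the temporary directory and mtcars data are placeholders chosen for the example.

```r
library(arrow)

# Hypothetical scratch directory holding a Hive-partitioned dataset
path <- file.path(tempdir(), "factory_example")
write_dataset(mtcars, path, format = "parquet", partitioning = "cyl")

# dataset_factory() is the alias for DatasetFactory$create()
factory <- dataset_factory(
  path,
  format = "parquet",
  partitioning = hive_partition()  # discover partition fields from directory names
)

# Scan all fragments and build a unified Schema
unified <- factory$Inspect(unify_schemas = TRUE)

# Build the Dataset with that schema; omitting `schema` would
# make the factory infer one from the fragments instead
ds <- factory$Finish(schema = unified)
```

Separating Inspect from Finish gives a chance to review or adjust the schema before the Dataset is created.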
FileSystemDatasetFactory$create() is a lower-level factory method and takes the following arguments:
filesystem: A FileSystem
selector: Either a FileSelector or NULL
paths: Either a character vector of file paths or NULL
format: A FileFormat
partitioning: Either Partitioning, PartitioningFactory, or NULL
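A hedged sketch of the lower-level call might look like the following. It assumes the arrow package with Parquet support, passes a selector rather than paths, and leaves partitioning as NULL. The scratch directory and data are hypothetical.

```r
library(arrow)

# Hypothetical scratch directory with a single unpartitioned Parquet file
path <- file.path(tempdir(), "fs_factory_example")
write_dataset(mtcars, path, format = "parquet")

# Normalize the path so the selector receives a clean absolute directory
base_dir <- normalizePath(path, winslash = "/")

factory <- FileSystemDatasetFactory$create(
  filesystem = LocalFileSystem$create(),
  selector = FileSelector$create(base_dir, recursive = TRUE),
  format = FileFormat$create("parquet")
)

# Infer the schema from the discovered files and build the Dataset
ds <- factory$Finish()
```

In most cases dataset_factory() or open_dataset() builds these objects for you; calling the factory directly is mainly useful when you already hold a FileSystem and FileSelector.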
Methods
A Dataset has the following methods:
$NewScan(): Returns a ScannerBuilder for building a query
$WithSchema(): Returns a new Dataset with the specified schema. This method currently supports only adding, removing, or reordering fields in the schema: you cannot alter or cast the field types.
$schema: Active binding that returns the Schema of the Dataset; you may also replace the dataset's schema by using ds$schema <- new_schema.
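To illustrate $NewScan(), here is a small sketch that builds a scan selecting two columns and materializes the result as a Table. It assumes the arrow package with Parquet support; the ScannerBuilder methods $Project() and $Finish() and the Scanner method $ToTable() come from the arrow package but are not described in this section, and the scratch directory is hypothetical.

```r
library(arrow)

# Hypothetical scratch directory holding a partitioned dataset
path <- file.path(tempdir(), "scan_example")
write_dataset(mtcars, path, format = "parquet", partitioning = "cyl")
ds <- open_dataset(path)

# Start a query against the Dataset
builder <- ds$NewScan()

# Keep only two columns
builder$Project(c("mpg", "cyl"))

# Build the Scanner and read the selected data into an Arrow Table
scanner <- builder$Finish()
tab <- scanner$ToTable()
```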
FileSystemDataset has the following methods:
$files: Active binding, returns the files of the FileSystemDataset
$format: Active binding, returns the FileFormat of the FileSystemDataset
UnionDataset has the following methods:
$children: Active binding, returns all child Datasets.
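The active bindings of FileSystemDataset and UnionDataset can be sketched as below. This assumes the arrow package with Parquet support, and that passing a list of Datasets to open_dataset() yields a UnionDataset. The two scratch directories are hypothetical.

```r
library(arrow)

# Two hypothetical scratch directories, each holding one Parquet file
path1 <- file.path(tempdir(), "union_part1")
path2 <- file.path(tempdir(), "union_part2")
write_dataset(mtcars[1:16, ], path1, format = "parquet")
write_dataset(mtcars[17:32, ], path2, format = "parquet")

ds1 <- open_dataset(path1)
ds2 <- open_dataset(path2)

# FileSystemDataset bindings
ds1$files   # character vector of file paths backing the dataset
ds1$format  # the FileFormat object (Parquet here)

# Combine both datasets; the result exposes its parts via $children
union_ds <- open_dataset(list(ds1, ds2))
union_ds$children
```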
See also
open_dataset() for a simple interface to creating a Dataset