Arrow Datasets allow you to query against data that has been split across multiple files. This sharding of data may indicate partitioning, which can accelerate queries that only touch some partitions (files). Call open_dataset() to point to a directory of data files and return a Dataset, then use dplyr methods to query it.
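For example, a minimal sketch of that workflow (the directory name "my_data" and the column names are hypothetical; any directory of Parquet files with those columns would do):

library(arrow)
library(dplyr)

# Point at a directory of data files; nothing is read into memory yet
ds <- open_dataset("my_data")

# dplyr verbs build a lazy query; collect() materializes it as a data frame
ds %>%
  filter(cyl == 4) %>%
  select(mpg, cyl) %>%
  collect()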
open_dataset(
  sources,
  schema = NULL,
  partitioning = hive_partition(),
  unify_schemas = NULL,
  format = c("parquet", "arrow", "ipc", "feather", "csv", "tsv", "text"),
  ...
)
Argument | Description
---|---
sources | One of: a string path or URI to a directory containing data files; a FileSystem that references a directory containing data files (such as what is returned by s3_bucket()); a string path or URI to a single file; a character vector of paths or URIs to individual data files; a list of Dataset objects as created by this function; or a list of DatasetFactory objects as created by dataset_factory(). When sources is a vector of file URIs, they must all use the same protocol, point to files in the same file system, and have the same format.
schema | Schema for the Dataset. If NULL (the default), the schema will be inferred from the data sources.
partitioning | When sources is a directory path/URI, one of: a Schema, in which case the file paths relative to sources will be parsed and path segments matched with the schema fields (for example, schema(year = int16(), month = int8()) would create partitions for file paths like "2019/01/file.parquet"); a character vector that defines the field names corresponding to those path segments (the types will be autodetected); a HivePartitioning or HivePartitioningFactory, as returned by hive_partition(), which parses explicit or autodetected fields from Hive-style path segments; or NULL for no partitioning. The default is to autodetect Hive-style partitions. When sources is not a directory path/URI, partitioning is ignored.
unify_schemas | logical: should all data fragments (files, Datasets) be scanned in order to create a unified schema from them? If FALSE, only the first fragment is inspected for its schema; use this fast path when you know and trust that all fragments have an identical schema. The default is FALSE when creating a dataset from a directory path/URI or vector of file paths/URIs (because there may be many files and scanning may be slow), but TRUE when sources is a list of Datasets (because there should be few Datasets in the list and their Schemas are already in memory).
format | A FileFormat object, or a string identifier of the format of the files in sources. This argument is ignored when sources is a list of Dataset objects. Currently supported values are "parquet"; "ipc"/"arrow"/"feather", aliases for each other (only Feather version 2 files are supported); "csv"/"text", aliases for the same format; and "tsv", equivalent to passing format = "csv", delimiter = "\t". Default is "parquet", unless a delimiter is also specified, in which case it is assumed to be "text".
... | additional arguments passed to dataset_factory() when sources is a directory path/URI or vector of file paths/URIs, otherwise ignored. These may include format-specific options.
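As a hedged sketch of how format and ... interact when opening delimited text files (the directory "csv_dir" and the delimiter are assumptions for the example):

# Non-Parquet formats must be named explicitly; format-specific options
# (here, the parse delimiter) are forwarded through ...
ds <- open_dataset("csv_dir", format = "csv", delimiter = ";")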
A Dataset R6 object. Use dplyr methods on it to query the data, or call $NewScan() to construct a query directly.
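A minimal sketch of that lower-level scan path (directory name hypothetical; the dplyr interface shown in the examples below is usually more convenient):

ds <- open_dataset("my_data")
# Build a scanner over all rows and read them into an Arrow Table
tab <- ds$NewScan()$Finish()$ToTable()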
# Set up directory for examples
tf <- tempfile()
dir.create(tf)
on.exit(unlink(tf))
data <- dplyr::group_by(mtcars, cyl)
write_dataset(data, tf)

# You can specify a directory containing the files for your dataset and
# open_dataset will scan all files in your directory.
open_dataset(tf)
#> FileSystemDataset with 3 Parquet files
#> mpg: double
#> disp: double
#> hp: double
#> drat: double
#> wt: double
#> qsec: double
#> vs: double
#> am: double
#> gear: double
#> carb: double
#> cyl: int32

# You can also supply a vector of paths
open_dataset(c(file.path(tf, "cyl=4/part-0.parquet"),
               file.path(tf, "cyl=8/part-0.parquet")))
#> FileSystemDataset with 2 Parquet files
#> mpg: double
#> disp: double
#> hp: double
#> drat: double
#> wt: double
#> qsec: double
#> vs: double
#> am: double
#> gear: double
#> carb: double

## You must specify the file format if using a format other than parquet.
tf2 <- tempfile()
dir.create(tf2)
on.exit(unlink(tf2))
write_dataset(data, tf2, format = "ipc")

# This line will result in an error when you try to work with the data
if (FALSE) {
  open_dataset(tf2)
}

# This line will work
open_dataset(tf2, format = "ipc")
#> FileSystemDataset with 3 Feather files
#> mpg: double
#> disp: double
#> hp: double
#> drat: double
#> wt: double
#> qsec: double
#> vs: double
#> am: double
#> gear: double
#> carb: double
#> cyl: int32

## You can specify file partitioning to include it as a field in your dataset

# Create a temporary directory and write example dataset
tf3 <- tempfile()
dir.create(tf3)
on.exit(unlink(tf3))
write_dataset(airquality, tf3, partitioning = c("Month", "Day"), hive_style = FALSE)

# View files - you can see the partitioning means that files have been written
# to folders based on Month/Day values
tf3_files <- list.files(tf3, recursive = TRUE)

# With no partitioning specified, dataset contains all files but doesn't include
# directory names as field names
open_dataset(tf3)
#> FileSystemDataset with 153 Parquet files
#> Ozone: int32
#> Solar.R: int32
#> Wind: double
#> Temp: int32
#>
#> See $metadata for additional Schema metadata

# Now that partitioning has been specified, your dataset contains columns for Month and Day
open_dataset(tf3, partitioning = c("Month", "Day"))
#> FileSystemDataset with 153 Parquet files
#> Ozone: int32
#> Solar.R: int32
#> Wind: double
#> Temp: int32
#> Month: int32
#> Day: int32
#>
#> See $metadata for additional Schema metadata

# If you want to specify the data types for your fields, you can pass in a Schema
open_dataset(tf3, partitioning = schema(Month = int8(), Day = int8()))
#> FileSystemDataset with 153 Parquet files
#> Ozone: int32
#> Solar.R: int32
#> Wind: double
#> Temp: int32
#> Month: int8
#> Day: int8
#>
#> See $metadata for additional Schema metadata
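Building on tf3 above, a short sketch of how filtering on a partition column lets a query touch only some partitions (files); the dplyr verbs are lazy until collect():

library(dplyr)
open_dataset(tf3, partitioning = c("Month", "Day")) %>%
  filter(Month == 5) %>%  # only files whose Month path segment is 5 need to be read
  collect()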