This release includes support for version 2 of the Feather file format. Feather v2 features full support for all Arrow data types, fixes the 2GB per-column limitation for large amounts of string data, and allows files to be compressed using either `lz4` or `zstd`. `write_feather()` can write either version 2 or version 1 Feather files, and `read_feather()` automatically detects which file version it is reading.
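A quick sketch of the round trip (the data frame and file name here are illustrative):

```r
library(arrow)

df <- data.frame(x = 1:3, y = c("a", "b", "c"))

# write_feather() writes Feather v2 by default; pass version = 1
# if an older reader needs the legacy format
write_feather(df, "example.feather")

# read_feather() inspects the file and handles either version
df2 <- read_feather("example.feather")
```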
Related to this change, several functions around reading and writing data have been reworked. `read_ipc_stream()` and `write_ipc_stream()` have been added to facilitate writing data to the Arrow IPC stream format, which is slightly different from the IPC file format (Feather v2 *is* the IPC file format).
Behavior has been standardized: all `read_<format>()` functions return an R `data.frame` (the default) or a `Table` if the argument `as_data_frame = FALSE`; all `write_<format>()` functions return the data object, invisibly. To facilitate some workflows, a special `write_to_raw()` function has been added to wrap `write_ipc_stream()` and return the `raw` vector containing the buffer that was written.
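A minimal sketch of the standardized behavior (the file name and columns are hypothetical):

```r
library(arrow)

df <- data.frame(x = 1:3)
write_feather(df, "example.feather")

# Readers return a data.frame by default, or an Arrow Table
# when as_data_frame = FALSE
tab <- read_feather("example.feather", as_data_frame = FALSE)

# Writers return their input invisibly, so they compose in a pipeline;
# write_to_raw() serializes to the IPC stream format and returns the buffer
buf <- write_to_raw(df)
is.raw(buf)  # TRUE
```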
The 0.17 Apache Arrow release includes a C data interface that allows exchanging Arrow data in-process at the C level without copying and without libraries having a build or runtime dependency on each other. This enables us to use `reticulate` to share data between R and Python (`pyarrow`) efficiently. See `vignette("python", package = "arrow")` for details.
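For example, an Arrow Table can be handed to Python through `reticulate` without copying the underlying buffers (a sketch; assumes `pyarrow` is installed in the active Python environment):

```r
library(arrow)
library(reticulate)

tab <- Table$create(x = 1:3, y = c("a", "b", "c"))

# The conversion uses the C data interface, so no serialization
# or data copy is needed to expose the Table to Python
py_tab <- r_to_py(tab)
```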
* Datasets have a `dim()` method, which sums rows across all files (#6635, @boshek)
* Dataset filtering now treats `NA` as `FALSE`, consistent with `dplyr::filter()`
* `vignette("dataset", package = "arrow")` now has correct, executable code
* See `vignette("install", package = "arrow")` for details and more options.
* `unify_schemas()` to create a `Schema` containing the union of fields in multiple schemas
* `read_feather()` and other reader functions close any file connections they open
* Arrow objects now work correctly when the `R.oo` package is also loaded
* `FileStats` is renamed to `FileInfo`, and the original spelling has been deprecated
* `install_arrow()` now installs the latest release of `arrow`, including Linux dependencies, either for CRAN releases or for development builds (if `nightly = TRUE`)
* Package installation on Linux no longer downloads C++ dependencies unless the `NOT_CRAN` environment variable is set
* `write_parquet()` and the other `write_*()` functions now return their input, similar to the `write_*` functions in the `readr` package (#6387, @boshek)
* Can now infer the type of an R `list` and create a `ListArray` when all list elements are the same type (#6275, @michaelchirico)
This release includes a `dplyr` interface to Arrow Datasets, which lets you work efficiently with large, multi-file datasets as a single entity. Explore a directory of data files with `open_dataset()` and then use `dplyr` methods to `select()`, `filter()`, etc. Work will be done where possible in Arrow memory; when necessary, data is pulled into R for further computation. The `dplyr` methods are conditionally loaded if you have `dplyr` available; it is not a hard dependency. See `vignette("dataset", package = "arrow")` for details.
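A sketch of the workflow (the directory layout and column names are hypothetical):

```r
library(arrow)
library(dplyr)

# Point at a directory of (possibly partitioned) data files
ds <- open_dataset("path/to/dataset/")

ds %>%
  filter(total > 100) %>%   # evaluated in Arrow memory where possible
  select(id, total) %>%
  collect()                 # pulls the result into an R data.frame
```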
A source package installation (as from CRAN) will now handle its C++ dependencies automatically. For common Linux distributions and versions, installation will retrieve a prebuilt static C++ library for inclusion in the package; where this binary is not available, the package executes a bundled script that should build the Arrow C++ library with no system dependencies beyond what R requires. See `vignette("install", package = "arrow")` for details.
* Tables, RecordBatches, Arrays, and ChunkedArrays now have `[` methods that support natural row extraction operations. These use the C++ `Take` methods for efficient access, depending on the type of selection vector.
* An `array_expression` class has also been added, enabling, among other things, the ability to filter a Table with some function of its Arrays, such as `arrow_table[arrow_table$var1 > 5, ]`, without having to pull everything into R first.
* Non-character dictionary values are now coerced to character (as `factor` levels are required to be) instead of raising an error
* `arrow::array()` and `arrow::table()` have been removed in favor of `Array$create()` and `Table$create()`, eliminating the package startup message about masking `base` functions. For more information, see the new `vignette("arrow")`.
* The `as_tibble` argument in the `read_*()` functions has been renamed to `as_data_frame`
* The `arrow::Column` class has been removed, as it was removed from the C++ library
* `Table` and `RecordBatch` objects have S3 methods that enable you to work with them more like `data.frame`s: extract columns, subset, and so on. See `?Table` and `?RecordBatch` for examples.
* `read_csv_arrow()` supports more parsing options, including `col_names`, `na`, `quoted_na`, and `skip`
* `read_feather()` can ingest data from a `raw` vector
* Types can be created from their printed names (e.g. `double()`), and time types can be created with human-friendly resolution strings ("ms", "s", etc.). (ARROW-6338, ARROW-6364)
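For instance, a schema can be written using the same names the types print with (a small sketch; the field names are illustrative):

```r
library(arrow)

sch <- schema(
  a = double(),        # the printed name "double" is also the constructor
  b = time32("ms"),    # human-friendly resolution string
  c = timestamp("s")
)
```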
Initial CRAN release of the `arrow` package. Key features include: