NEWS.md
* `write_dataset()` to Feather or Parquet files with partitioning. See the end of `vignette("dataset", package = "arrow")` for discussion and examples.
* `head()`, `tail()`, and take (`[`) methods. `head()` is optimized but the others may not be performant.
* `collect()` gains an `as_data_frame` argument, default `TRUE` but when `FALSE` allows you to evaluate the accumulated `select` and `filter` query but keep the result in Arrow, not an R `data.frame`
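
As a rough illustration of partitioned writing and `head()` on a Dataset, the sketch below writes a small Parquet dataset to a temporary directory and reads the first rows back. The data frame, the `year` partition column, and the temporary path are invented for this example; the dataset vignette remains the authoritative reference.

```r
library(arrow)

# Hypothetical sample data; `year` serves as the partition column.
df <- data.frame(
  year = c(2019L, 2019L, 2020L, 2020L),
  value = c(1.5, 2.5, 3.5, 4.5)
)

# Write one subdirectory of Parquet files per distinct value of `year`.
out_dir <- tempfile("dataset_")
write_dataset(df, out_dir, format = "parquet", partitioning = "year")
partition_dirs <- list.files(out_dir)

# Re-open the directory as a single Dataset and look at the first rows.
ds <- open_dataset(out_dir)
first_rows <- head(ds, 2)
```

With the default settings, `partition_dirs` should list one `year=<value>` directory per partition, and `first_rows` stays in Arrow memory until it is converted with `as.data.frame()`.
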
* `read_csv_arrow()` supports specifying column types, both with a `Schema` and with the compact string representation for types used in the `readr` package. It also has gained a `timestamp_parsers` argument that lets you express a set of `strptime` parse strings that will be tried to convert columns designated as `Timestamp` type.
* `libcurl` and `openssl`, as well as a sufficiently modern compiler. See `vignette("install", package = "arrow")` for details.
* `read_parquet()`, `write_feather()`, et al.), as well as `open_dataset()` and `write_dataset()`, allow you to access resources on S3 (or on file systems that emulate S3) either by providing an `s3://` URI or by providing a `FileSystem$path()`. See `vignette("fs", package = "arrow")` for examples.
* `copy_files()` allows you to recursively copy directories of files from one file system to another, such as from S3 to your local machine.

Flight is a general-purpose client-server framework for high performance transport of large datasets over network interfaces. The `arrow` R package now provides methods for connecting to Flight RPC servers to send and receive data. See `vignette("flight", package = "arrow")` for an overview.
* `==`, `>`, etc.) and boolean (`&`, `|`, `!`) operations, along with `is.na`, `%in%` and `match` (called `match_arrow()`), on Arrow Arrays and ChunkedArrays are now implemented in the C++ library.
* `min()`, `max()`, and `unique()` are implemented for Arrays and ChunkedArrays.
* `dplyr` filter expressions on Arrow Tables and RecordBatches are now evaluated in the C++ library, rather than by pulling data into R and evaluating. This yields significant performance improvements.
* `dim()` (`nrow`) for dplyr queries on Table/RecordBatch is now supported
* `arrow` now depends on `cpp11`, which brings more robust UTF-8 handling and faster compilation
* `Int64` type when all values fit within an R 32-bit integer now correctly inspects all chunks in a ChunkedArray, and this conversion can be disabled (so that `Int64` always yields a `bit64::integer64` vector) by setting `options(arrow.int64_downcast = FALSE)`.
* `ParquetFileReader` has additional methods for accessing individual columns or row groups from the file
* `ParquetFileWriter`; invalid `ArrowObject` pointer from a saved R object; converting deeply nested structs from Arrow to R
* `properties` and `arrow_properties` arguments to `write_parquet()` are deprecated
* `%in%` expression now faithfully returns all relevant rows
* `.` or `_`; files and subdirectories starting with those prefixes are still ignored
* `open_dataset("~/path")` now correctly expands the path
* `version` option to `write_parquet()` is now correctly implemented
* `parquet-cpp` library has been fixed
* `cmake` is more robust, and you can now specify a `/path/to/cmake` by setting the `CMAKE` environment variable
* `vignette("arrow", package = "arrow")` includes tables that explain how R types are converted to Arrow types and vice versa.
* `uint64`, `binary`, `fixed_size_binary`, `large_binary`, `large_utf8`, `large_list`, `list` of `structs`.
* `character` vectors that exceed 2GB are converted to Arrow `large_utf8` type
* `POSIXlt` objects can now be converted to Arrow (`struct`)
* `attributes()` are preserved in Arrow metadata when converting to Arrow RecordBatch and table and are restored when converting from Arrow. This means that custom subclasses, such as `haven::labelled`, are preserved in round trip through Arrow.
* `batch$metadata$new_key <- "new value"`
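
The metadata assignment form shown above can be tried on a small RecordBatch. This is a minimal sketch; the column `x` and the key name are placeholders, and it assumes the key-value metadata is exposed as a named list of strings.

```r
library(arrow)

# Hypothetical one-column batch used only for illustration.
batch <- record_batch(x = 1:3)

# Assign a new key; values are stored as strings in the schema metadata.
batch$metadata$new_key <- "new value"
stored <- batch$metadata$new_key
```
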
* `int64`, `uint32`, and `uint64` are now converted to R `integer` if all values fit in bounds
* `date32` is now converted to R `Date` with `double` underlying storage. Even though the data values themselves are integers, this provides more strict round-trip fidelity
* `factor`, `dictionary` ChunkedArrays that do not have identical dictionaries are properly unified
* `RecordBatch{File,Stream}Writer` will write V5, but you can specify an alternate `metadata_version`. For convenience, if you know the consumer you’re writing to cannot read V5, you can set the environment variable `ARROW_PRE_1_0_METADATA_VERSION=1` to write V4 without changing any other code.
* `ds <- open_dataset("s3://...")`. Note that this currently requires a special C++ library build with additional dependencies; this is not yet available in CRAN releases or in nightly packages.
* `sum()` and `mean()` are implemented for `Array` and `ChunkedArray`
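
A short sketch of these aggregations on an Array and a ChunkedArray follows. The values are arbitrary. Depending on the installed version, the result may come back as a plain R number or as an Arrow Scalar, so the example passes it through `as.vector()`, which should normalize either case.

```r
library(arrow)

a <- Array$create(c(1, 2, 3, NA))
ca <- chunked_array(c(1, 2), c(3, 4))  # two chunks, four values

# as.vector() turns an Arrow Scalar into an R value and leaves an R number unchanged.
total <- as.vector(sum(ca))
avg <- as.vector(mean(a, na.rm = TRUE))
```
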
* `dimnames()` and `as.list()`
* `reticulate`
* `coerce_timestamps` option to `write_parquet()` is now correctly implemented.
* `type` definition if provided by the user
* `read_arrow` and `write_arrow` are now deprecated; use the `read/write_feather()` and `read/write_ipc_stream()` functions depending on whether you’re working with the Arrow IPC file or stream format, respectively.
* `FileStats`, `read_record_batch`, and `read_table` have been removed.
* `jemalloc` included, and Windows packages use `mimalloc`
* `CC` and `CXX` values that R uses
* `dplyr` 1.0
* `reticulate::r_to_py()` conversion now correctly works automatically, without having to call the method yourself

This release includes support for version 2 of the Feather file format. Feather v2 features full support for all Arrow data types, fixes the 2GB per-column limitation for large amounts of string data, and it allows files to be compressed using either `lz4` or `zstd`. `write_feather()` can write either version 2 or version 1 Feather files, and `read_feather()` automatically detects which file version it is reading.
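
The following sketch writes the same data frame as a Feather v2 file and as a v1 file, then reads both back without telling the reader which version to expect. The data and file paths are made up. The `codec_is_available()` guard is an assumption that keeps the example usable when the C++ library was built without lz4.

```r
library(arrow)

df <- data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)

# Use lz4 only if this build of the C++ library supports it.
codec <- if (codec_is_available("lz4")) "lz4" else "uncompressed"

v2_file <- tempfile(fileext = ".feather")
write_feather(df, v2_file, version = 2, compression = codec)

# Version 1 files are written without a compression argument.
v1_file <- tempfile(fileext = ".feather")
write_feather(df, v1_file, version = 1)

# The reader detects the file version on its own.
from_v2 <- read_feather(v2_file)
from_v1 <- read_feather(v1_file)
```
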
Related to this change, several functions around reading and writing data have been reworked. `read_ipc_stream()` and `write_ipc_stream()` have been added to facilitate writing data to the Arrow IPC stream format, which is slightly different from the IPC file format (Feather v2 is the IPC file format).

Behavior has been standardized: all `read_<format>()` functions return an R `data.frame` (default) or a `Table` if the argument `as_data_frame = FALSE`; all `write_<format>()` functions return the data object, invisibly. To facilitate some workflows, a special `write_to_raw()` function is added to wrap `write_ipc_stream()` and return the `raw` vector containing the buffer that was written.

To achieve this standardization, `read_table()`, `read_record_batch()`, `read_arrow()`, and `write_arrow()` have been deprecated.
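
The standardized behavior can be sketched with the IPC stream functions: the writer hands back its input, the reader returns a `data.frame` or a `Table` depending on `as_data_frame`, and `write_to_raw()` produces a buffer that can be read directly. The data frame and the temporary file name are placeholders chosen for this example.

```r
library(arrow)

df <- data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)

# write_*() functions return the data object invisibly.
stream_file <- tempfile(fileext = ".arrows")
returned <- write_ipc_stream(df, stream_file)

# read_*() functions return a data.frame by default, or a Table on request.
as_df <- read_ipc_stream(stream_file)
as_table <- read_ipc_stream(stream_file, as_data_frame = FALSE)

# write_to_raw() returns the written buffer as a raw vector,
# which the reader accepts in place of a file path.
buf <- write_to_raw(df)
from_raw <- read_ipc_stream(buf)
```
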
The 0.17 Apache Arrow release includes a C data interface that allows exchanging Arrow data in-process at the C level without copying and without libraries having a build or runtime dependency on each other. This enables us to use `reticulate` to share data between R and Python (`pyarrow`) efficiently.

See `vignette("python", package = "arrow")` for details.
* `dim()` method, which sums rows across all files (#6635, @boshek)
* `UnionDataset` with the `c()` method
* `NA` as `FALSE`, consistent with `dplyr::filter()`
* `vignette("dataset", package = "arrow")` now has correct, executable code
* `NOT_CRAN=true`. See `vignette("install", package = "arrow")` for details and more options.
* `unify_schemas()` to create a `Schema` containing the union of fields in multiple schemas
* `read_feather()` and other reader functions close any file connections they open
* `R.oo` package is also loaded
* `FileStats` is renamed to `FileInfo`, and the original spelling has been deprecated
* `install_arrow()` now installs the latest release of `arrow`, including Linux dependencies, either for CRAN releases or for development builds (if `nightly = TRUE`)
* `LIBARROW_DOWNLOAD` or `NOT_CRAN` environment variable is set
* `write_feather()`, `write_arrow()` and `write_parquet()` now return their input, similar to the `write_*` functions in the `readr` package (#6387, @boshek)
* `list` and create a ListArray when all list elements are the same type (#6275, @michaelchirico)

This release includes a `dplyr` interface to Arrow Datasets, which lets you work efficiently with large, multi-file datasets as a single entity. Explore a directory of data files with `open_dataset()` and then use `dplyr` methods to `select()`, `filter()`, etc. Work will be done where possible in Arrow memory. When necessary, data is pulled into R for further computation. `dplyr` methods are conditionally loaded if you have `dplyr` available; it is not a hard dependency.
See `vignette("dataset", package = "arrow")` for details.
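
As a minimal sketch of the workflow described above, the example below builds a tiny two-file dataset in a temporary directory, opens it as a single entity, and uses `dplyr` verbs before pulling the result into R. The file names, columns, and values are invented for illustration, and the example assumes `dplyr` is installed.

```r
library(arrow)
library(dplyr)

# Two hypothetical Parquet files that together form one dataset.
data_dir <- tempfile("multi_file_")
dir.create(data_dir)
part1 <- data.frame(id = 1:3, grp = c("a", "b", "a"), stringsAsFactors = FALSE)
part2 <- data.frame(id = 4:6, grp = c("b", "a", "b"), stringsAsFactors = FALSE)
write_parquet(part1, file.path(data_dir, "part1.parquet"))
write_parquet(part2, file.path(data_dir, "part2.parquet"))

# Open the directory as one Dataset; select() and filter() are recorded
# and only evaluated when collect() pulls the result into R.
ds <- open_dataset(data_dir)
result <- ds %>%
  select(id, grp) %>%
  filter(grp == "a") %>%
  collect()
```

Row order across files is not something this sketch relies on, so sort the result if a stable order matters.
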
A source package installation (as from CRAN) will now handle its C++ dependencies automatically. For common Linux distributions and versions, installation will retrieve a prebuilt static C++ library for inclusion in the package; where this binary is not available, the package executes a bundled script that should build the Arrow C++ library with no system dependencies beyond what R requires.
See `vignette("install", package = "arrow")` for details.
* `Table`s and `RecordBatch`es also have `dplyr` methods.
* `dplyr`, `[` methods for Tables, RecordBatches, Arrays, and ChunkedArrays now support natural row extraction operations. These use the C++ `Filter`, `Slice`, and `Take` methods for efficient access, depending on the type of selection vector.
* `array_expression` class has also been added, enabling among other things the ability to filter a Table with some function of Arrays, such as `arrow_table[arrow_table$var1 > 5, ]` without having to pull everything into R first.
* `write_parquet()` now supports compression
* `codec_is_available()` returns `TRUE` or `FALSE` depending on whether the Arrow C++ library was built with support for a given compression library (e.g. gzip, lz4, snappy)
* `character` (as R `factor` levels are required to be) instead of raising an error
* `Class$create()` methods. Notably, `arrow::array()` and `arrow::table()` have been removed in favor of `Array$create()` and `Table$create()`, eliminating the package startup message about masking `base` functions. For more information, see the new `vignette("arrow")`.
* `ARROW_PRE_0_15_IPC_FORMAT=1`.
* `as_tibble` argument in the `read_*()` functions has been renamed to `as_data_frame` (ARROW-6337, @jameslamb)
* `arrow::Column` class has been removed, as it was removed from the C++ library
* `Table` and `RecordBatch` objects have S3 methods that enable you to work with them more like `data.frame`s. Extract columns, subset, and so on. See `?Table` and `?RecordBatch` for examples.
* `read_csv_arrow()` supports more parsing options, including `col_names`, `na`, `quoted_na`, and `skip`
* `read_parquet()` and `read_feather()` can ingest data from a `raw` vector (ARROW-6278)
* `~/file.parquet` (ARROW-6323)
* `double()`), and time types can be created with human-friendly resolution strings (“ms”, “s”, etc.). (ARROW-6338, ARROW-6364)

Initial CRAN release of the `arrow` package. Key features include: