A record batch is a collection of equal-length arrays matching a particular Schema. It is a table-like data structure that is semantically a sequence of fields, each a contiguous Arrow Array.

record_batch(..., schema = NULL)

Arguments

...

A data.frame or a named set of Arrays or vectors. If given a mixture of data.frames and vectors, the inputs will be autospliced together (see examples). Alternatively, you can provide a single Arrow IPC InputStream, Message, Buffer, or R raw object containing a Buffer.

schema

A Schema, or NULL (the default) to infer the schema from the data in .... When providing an Arrow IPC buffer, schema is required.
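As a minimal sketch of the two arguments, the following builds one batch with an explicit schema and another by splicing a named vector together with a data.frame. The column names (id, label, score) and values are made up for illustration, and the schema is deliberately chosen to match the types that would be inferred anyway, so no conversion is involved.

```r
library(arrow)

# Explicit schema: field names must line up with the names given in `...`.
sch <- schema(id = int32(), label = utf8())
batch <- record_batch(id = 1:3, label = c("a", "b", "c"), schema = sch)

# Autosplicing: a named vector and a data.frame are combined column-wise.
spliced <- record_batch(id = 1:2, data.frame(score = c(1.5, 2.5)))

names(batch)
names(spliced)
```

Leaving schema as NULL is usually enough for R vectors and data.frames; an explicit Schema is mainly useful when you want to state the Arrow types up front.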

S3 Methods and Usage

Record batches are data-frame-like, and many methods you expect to work on a data.frame are implemented for RecordBatch. This includes [, [[, $, names, dim, nrow, ncol, head, and tail. You can also pull the data from an Arrow record batch into R with as.data.frame(). See the examples.

A caveat about the $ method: because RecordBatch is an R6 object, $ is also used to access the object's methods (see below). Methods take precedence over the batch's columns. So, batch$Slice would return the "Slice" method function even if there were a column in the batch called "Slice".
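The caveat can be illustrated with a small, hypothetical batch that happens to have a column called "Slice". This is only a sketch: the column names and data are invented, and [[ is used as the workaround because it looks up columns by name rather than R6 members.

```r
library(arrow)

# Hypothetical batch whose first column shares its name with an R6 method.
batch <- record_batch(Slice = 1:3, other = c("a", "b", "c"))

# `$` finds the R6 method first, so this is a function, not the column.
slice_member <- batch$Slice
is.function(slice_member)

# `[[` (or $GetColumnByName()) retrieves the column as an Array.
slice_column <- batch[["Slice"]]
class(slice_column)
```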

R6 Methods

In addition to the more R-friendly S3 methods, a RecordBatch object has the following R6 methods that map onto the underlying C++ methods:

  • $Equals(other): Returns TRUE if the other record batch is equal

  • $column(i): Extract an Array by integer position from the batch

  • $column_name(i): Get a column's name by integer position

  • $names(): Get all column names (called by names(batch))

  • $RenameColumns(value): Set all column names (called by names(batch) <- value)

  • $GetColumnByName(name): Extract an Array by string name

  • $RemoveColumn(i): Drops a column from the batch by integer position

  • $SelectColumns(indices): Return a new record batch with a selection of columns, expressed as 0-based integers.

  • $Slice(offset, length = NULL): Create a zero-copy view starting at the indicated integer offset and going for the given length, or to the end of the batch if NULL, the default.

  • $Take(i): Return a RecordBatch with rows at positions given by integers (R vector or Arrow Array) i.

  • $Filter(i, keep_na = TRUE): Return a RecordBatch with rows at positions where logical vector (or Arrow boolean Array) i is TRUE.

  • $serialize(): Returns a raw vector suitable for interprocess communication

  • $cast(target_schema, safe = TRUE, options = cast_options(safe)): Alter the schema of the record batch.
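A few of the R6 methods listed above can be sketched on a toy batch. The data here is invented, and the comments assume that $Slice() offsets are 0-based like the $SelectColumns() indices, since these methods map onto the C++ API.

```r
library(arrow)

batch <- record_batch(x = 1:5, y = c("a", "b", "c", "d", "e"))

# Zero-copy view: skip one row (offset 1), then keep two rows.
middle <- batch$Slice(1, 2)

# Keep rows where the logical vector is TRUE.
odd_rows <- batch$Filter(c(TRUE, FALSE, TRUE, FALSE, TRUE))

# Column selection uses 0-based positions, so 1L is the second column.
only_y <- batch$SelectColumns(1L)

# Look up a column name and compare batches.
first_name <- batch$column_name(0)
same <- batch$Equals(batch)
```

For everyday use, the S3 methods such as [ and head() are usually more convenient because they follow R's 1-based conventions; the R6 methods are closer to the underlying C++ calls.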

There are also some active bindings:

  • $num_columns

  • $num_rows

  • $schema

  • $metadata: Returns the key-value metadata of the Schema as a named list. Modify or replace by assigning in (batch$metadata <- new_metadata). All list elements are coerced to string. See schema() for more information.

  • $columns: Returns a list of Arrays
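The active bindings behave like fields rather than functions, so they are read without parentheses. A brief sketch, using an invented batch and a made-up metadata key ("source"):

```r
library(arrow)

batch <- record_batch(x = 1:3, y = c("a", "b", "c"))

# Read-only style bindings.
n_rows <- batch$num_rows
n_cols <- batch$num_columns
cols <- batch$columns  # list of Arrays, one per column

# Replace the schema's key-value metadata by assignment.
batch$metadata <- list(source = "example")
meta_source <- batch$metadata$source
```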

Examples

batch <- record_batch(name = rownames(mtcars), mtcars)
dim(batch)
#> [1] 32 12
dim(head(batch))
#> [1]  6 12
names(batch)
#>  [1] "name" "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"
#> [11] "gear" "carb"
batch$mpg
#> Array
#> <double>
#> [
#>   21,
#>   21,
#>   22.8,
#>   21.4,
#>   18.7,
#>   18.1,
#>   14.3,
#>   24.4,
#>   22.8,
#>   19.2,
#>   ...
#>   15.2,
#>   13.3,
#>   19.2,
#>   27.3,
#>   26,
#>   30.4,
#>   15.8,
#>   19.7,
#>   15,
#>   21.4
#> ]
batch[["cyl"]]
#> Array
#> <double>
#> [
#>   6,
#>   6,
#>   4,
#>   6,
#>   8,
#>   6,
#>   8,
#>   4,
#>   4,
#>   6,
#>   ...
#>   8,
#>   8,
#>   8,
#>   4,
#>   4,
#>   4,
#>   8,
#>   6,
#>   8,
#>   4
#> ]
as.data.frame(batch[4:8, c("gear", "hp", "wt")])
#> # A tibble: 5 x 3
#>    gear    hp    wt
#>   <dbl> <dbl> <dbl>
#> 1     3   110  3.22
#> 2     3   175  3.44
#> 3     3   105  3.46
#> 4     3   245  3.57
#> 5     4    62  3.19