A Schema is an Arrow object containing Fields, which map names to Arrow data types. Create a Schema when you want to convert an R data.frame to Arrow but don't want to rely on the default mapping of R types to Arrow types, such as when you want to choose a specific numeric precision, or when creating a Dataset and you want to ensure a specific schema rather than inferring it from the various files.

Many Arrow objects, including Table and Dataset, have a $schema method (active binding) that lets you access their schema.
Methods
$ToString(): convert to a string
$field(i): returns the field at index i (0-based)
$GetFieldByName(x): returns the field with name x
$WithMetadata(metadata): returns a new Schema with the key-value metadata set. Note that all list elements in metadata will be coerced to character.
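A minimal sketch of the methods listed above, assuming the arrow package is installed. The field names a and b and the metadata keys source and version are made up for illustration.

```r
library(arrow)

sch <- schema(a = int32(), b = string())

# Render the schema as a single string
cat(sch$ToString(), "\n")

# Fields are indexed from 0, unlike most R objects
first <- sch$field(0)
first$name

# Look a field up by name instead of by position
b_field <- sch$GetFieldByName("b")
b_field$type$ToString()

# $WithMetadata() returns a new Schema; `sch` itself is left unchanged.
# Non-character values such as 2L are coerced to character.
sch_meta <- sch$WithMetadata(list(source = "example", version = 2L))
sch_meta$metadata$version
sch$HasMetadata
```

Because $WithMetadata() returns a new object, keep the result if you want the metadata to persist.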
Active bindings
$names: returns the field names (called in names(Schema))
$num_fields: returns the number of fields (called in length(Schema))
$fields: returns the list of Fields in the Schema, suitable for iterating over
$HasMetadata: logical: does this Schema have extra metadata?
$metadata: returns the key-value metadata as a named list. Modify or replace by assigning in (sch$metadata <- new_metadata). All list elements are coerced to string.
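The active bindings above can be exercised as in the following sketch, assuming the arrow package is installed. The schema contents and the owner metadata key are made up for illustration.

```r
library(arrow)

sch <- schema(a = int32(), b = string())

sch$names        # same result as names(sch)
sch$num_fields   # same result as length(sch)

# Iterate over the Field objects in the schema
for (f in sch$fields) {
  cat(f$name, "is", f$type$ToString(), "\n")
}

# Assigning to $metadata modifies the schema in place
sch$HasMetadata
sch$metadata <- list(owner = "analytics")
sch$HasMetadata
sch$metadata$owner
```

Unlike $WithMetadata(), assigning to $metadata changes the existing Schema object rather than returning a new one.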
R Metadata
When converting a data.frame to an Arrow Table or RecordBatch, attributes from the data.frame are saved alongside tables so that the object can be reconstructed faithfully in R (e.g. with as.data.frame()). This metadata can be at the top level of the data.frame (e.g. attributes(df)), at the column level (e.g. attributes(df$col_a)), or, for list columns only, at the element level (e.g. attributes(df[1, "col_a"])). For example, this allows for storing haven columns in a table and being able to faithfully re-create them when pulled back into R. This metadata is separate from the schema (column names and types), which is compatible with other Arrow clients. The R metadata is only read by R and is ignored by other clients (e.g. Pandas has its own custom metadata). This metadata is stored in $metadata$r.
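As a rough illustration of where the R metadata ends up, the sketch below attaches a custom attribute to a data.frame and inspects the resulting schema metadata. The attribute name note is hypothetical, and how much of the attribute set round-trips may vary between arrow versions.

```r
library(arrow)

df <- data.frame(x = 1:3)
attr(df, "note") <- "custom attribute"  # hypothetical attribute for illustration

tab <- arrow_table(df)

# The R-specific attributes are stored under the "r" key of the schema metadata
names(tab$schema$metadata)

# Converting back to a data.frame is expected to restore the attribute
attr(as.data.frame(tab), "note")
```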
Since Schema metadata keys and values must be strings, this metadata is saved by serializing R's attribute list structure to a string. If the serialized metadata exceeds 100Kb in size, by default it is compressed starting in version 3.0.0. To disable this compression (e.g. for tables that are compatible with Arrow versions before 3.0.0 and include large amounts of metadata), set the option arrow.compress_metadata to FALSE.
Files with compressed metadata are readable by older versions of arrow, but
the metadata is dropped.
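One way to apply the option is to set it temporarily around the conversion, as in this sketch. The data.frame is a placeholder, and the option only matters when the serialized metadata is large enough to be compressed.

```r
library(arrow)

# Turn off metadata compression, remembering the previous setting
old_opts <- options(arrow.compress_metadata = FALSE)

df <- data.frame(x = 1:3)
tab <- arrow_table(df)  # R metadata is serialized during this conversion

# Restore the previous setting
options(old_opts)
```

Setting the option before the conversion matters, since the sketch assumes the metadata is serialized when the Table is created.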
Examples
schema(a = int32(), b = float64())
#> Schema
#> a: int32
#> b: double
schema(
field("b", double()),
field("c", bool(), nullable = FALSE),
field("d", string())
)
#> Schema
#> b: double
#> c: bool not null
#> d: string
df <- data.frame(col1 = 2:4, col2 = c(0.1, 0.3, 0.5))
tab1 <- arrow_table(df)
tab1$schema
#> Schema
#> col1: int32
#> col2: double
#>
#> See $metadata for additional Schema metadata
tab2 <- arrow_table(df, schema = schema(col1 = int8(), col2 = float32()))
tab2$schema
#> Schema
#> col1: int8
#> col2: float
#>
#> See $metadata for additional Schema metadata