Tabular Data
While arrays (aka: ValueVector) represent a one-dimensional sequence of homogeneous values, data often comes in the form of two-dimensional sets of heterogeneous data (such as database tables, CSV files…). Arrow provides several abstractions to handle such data conveniently and efficiently.
Fields
Fields are used to denote the particular columns of tabular data. A field, i.e. an instance of Field, holds together a field name, a data type, and some optional key-value metadata.
// Create a column "document" of string type with metadata
import java.util.HashMap;
import java.util.Map;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.FieldType;
Map<String, String> metadata = new HashMap<>();
metadata.put("A", "Id card");
metadata.put("B", "Passport");
metadata.put("C", "Visa");
Field document = new Field("document", new FieldType(true, new ArrowType.Utf8(), /*dictionary*/ null, metadata), /*children*/ null);
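Once a Field has been built, its parts can be read back through accessor methods. The following is a minimal sketch that rebuilds the "document" field from above and prints its name, nullability, metadata, and children; the class name FieldInspection and the helper buildDocumentField are hypothetical names used only for this illustration.

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.FieldType;

// Hypothetical helper class, used only for this illustration.
final class FieldInspection {

  // Rebuilds the nullable utf8 "document" field with its three metadata entries.
  static Field buildDocumentField() {
    Map<String, String> metadata = new HashMap<>();
    metadata.put("A", "Id card");
    metadata.put("B", "Passport");
    metadata.put("C", "Visa");
    return new Field(
        "document",
        new FieldType(true, new ArrowType.Utf8(), /*dictionary*/ null, metadata),
        /*children*/ null);
  }

  public static void main(String[] args) {
    Field document = buildDocumentField();
    System.out.println(document.getName());               // prints "document"
    System.out.println(document.isNullable());            // prints "true"
    System.out.println(document.getMetadata().get("B"));  // prints "Passport"
    // A utf8 field is not a nested type, so it has no child fields.
    System.out.println(document.getChildren().isEmpty()); // prints "true"
    System.out.println(document.getType());
  }
}
```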
Schemas
A Schema describes the overall structure consisting of any number of columns. It holds a sequence of fields together with some optional schema-wide metadata (in addition to per-field metadata).
// Create a schema describing datasets with two columns:
// an int32 column "A" and a utf8-encoded string column "B"
import java.util.HashMap;
import java.util.Map;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.FieldType;
import org.apache.arrow.vector.types.pojo.Schema;
import static java.util.Arrays.asList;
Map<String, String> metadata = new HashMap<>();
metadata.put("K1", "V1");
metadata.put("K2", "V2");
Field a = new Field("A", FieldType.nullable(new ArrowType.Int(32, true)), null);
Field b = new Field("B", FieldType.nullable(new ArrowType.Utf8()), null);
Schema schema = new Schema(asList(a, b), metadata);
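A Schema can be queried for its fields and metadata, and it can be serialized to JSON and parsed back. The sketch below rebuilds the two-column schema from above and exercises getFields, findField, getCustomMetadata, toJson, and fromJSON; the class name SchemaInspection and its helper methods are hypothetical names chosen for this illustration.

```java
import static java.util.Arrays.asList;

import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.FieldType;
import org.apache.arrow.vector.types.pojo.Schema;

// Hypothetical helper class, used only for this illustration.
final class SchemaInspection {

  // Rebuilds the schema with an int32 column "A" and a utf8 column "B".
  static Schema buildSchema() {
    Map<String, String> metadata = new HashMap<>();
    metadata.put("K1", "V1");
    metadata.put("K2", "V2");
    Field a = new Field("A", FieldType.nullable(new ArrowType.Int(32, true)), null);
    Field b = new Field("B", FieldType.nullable(new ArrowType.Utf8()), null);
    return new Schema(asList(a, b), metadata);
  }

  // Serializes the schema to JSON and parses it back into a new Schema.
  static Schema jsonRoundTrip(Schema schema) {
    try {
      return Schema.fromJSON(schema.toJson());
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    Schema schema = buildSchema();
    // Fields are returned in the order in which they were declared.
    for (Field field : schema.getFields()) {
      System.out.println(field.getName() + " -> " + field.getType());
    }
    // findField looks a column up by name.
    System.out.println(schema.findField("B").getName());
    // Schema-wide metadata is separate from per-field metadata.
    System.out.println(schema.getCustomMetadata().get("K1"));
    Schema restored = jsonRoundTrip(schema);
    System.out.println(restored.getFields().size());
  }
}
```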
VectorSchemaRoot
A VectorSchemaRoot is a container for batches of data. Batches flow through VectorSchemaRoot as part of a pipeline.
Note
VectorSchemaRoot is somewhat analogous to tables or record batches in the other Arrow implementations in that they are all 2D datasets, but it is used differently.
The recommended usage is to create a single VectorSchemaRoot based on a known schema and populate data over and over into that root in a stream of batches, rather than creating a new instance each time (see Flight or ArrowFileWriter as examples). Thus at any one point, a VectorSchemaRoot may have data or may have no data (say it was transferred downstream or not yet populated).
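The reuse pattern described in this note can be sketched as follows. One root is created from a schema, then filled, consumed, and cleared once per batch. The class name ReusedRootExample, the method streamBatches, and the row-counting "consumer" are hypothetical stand-ins for a real downstream step such as a writer.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;
import org.apache.arrow.vector.VarCharVector;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.FieldType;
import org.apache.arrow.vector.types.pojo.Schema;

// Hypothetical helper class, used only for this illustration.
final class ReusedRootExample {

  // Pushes batchCount batches through a single root and returns the total rows seen.
  static int streamBatches(int batchCount, int rowsPerBatch) {
    Schema schema = new Schema(Arrays.asList(
        new Field("A", FieldType.nullable(new ArrowType.Int(32, true)), null),
        new Field("B", FieldType.nullable(new ArrowType.Utf8()), null)));
    int totalRows = 0;
    try (BufferAllocator allocator = new RootAllocator();
         VectorSchemaRoot root = VectorSchemaRoot.create(schema, allocator)) {
      // The vectors belong to the root and stay the same objects across batches.
      IntVector a = (IntVector) root.getVector("A");
      VarCharVector b = (VarCharVector) root.getVector("B");
      for (int batch = 0; batch < batchCount; batch++) {
        root.allocateNew(); // fresh buffers for this batch
        for (int i = 0; i < rowsPerBatch; i++) {
          a.setSafe(i, batch * rowsPerBatch + i);
          b.setSafe(i, ("row" + i).getBytes(StandardCharsets.UTF_8));
        }
        root.setRowCount(rowsPerBatch);
        // Stand-in for a real consumer: here the batch is only counted.
        totalRows += root.getRowCount();
        root.clear(); // release the buffers; the root now holds no data
      }
    }
    return totalRows;
  }

  public static void main(String[] args) {
    System.out.println(streamBatches(3, 4)); // prints "12"
  }
}
```

Between clear() and the next allocateNew() the root holds no data, which is the "may have no data" state mentioned above.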
Here is an example of creating a VectorSchemaRoot:
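import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.BitVector;
import org.apache.arrow.vector.FieldVector;
import org.apache.arrow.vector.VarCharVector;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.types.pojo.Field;
// Assumes no allocator is in scope yet; reuse an existing one if available
BufferAllocator allocator = new RootAllocator();
BitVector bitVector = new BitVector("boolean", allocator);
VarCharVector varCharVector = new VarCharVector("varchar", allocator);
bitVector.allocateNew();
varCharVector.allocateNew();
for (int i = 0; i < 10; i++) {
  bitVector.setSafe(i, i % 2 == 0 ? 0 : 1);
  varCharVector.setSafe(i, ("test" + i).getBytes(StandardCharsets.UTF_8));
}
bitVector.setValueCount(10);
varCharVector.setValueCount(10);
List<Field> fields = Arrays.asList(bitVector.getField(), varCharVector.getField());
List<FieldVector> vectors = Arrays.asList(bitVector, varCharVector);
VectorSchemaRoot vectorSchemaRoot = new VectorSchemaRoot(fields, vectors);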
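Values can be read back from a populated root by looking up its vectors by name. The following sketch builds the same two-column root as above inside a hypothetical helper (buildRoot) and reads one row; the class name RootInspection and the method describeRow are likewise illustrative. Closing the root also closes the vectors it holds, so a try-with-resources block is enough for cleanup.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.BitVector;
import org.apache.arrow.vector.FieldVector;
import org.apache.arrow.vector.VarCharVector;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.types.pojo.Field;

// Hypothetical helper class, used only for this illustration.
final class RootInspection {

  // Builds a 10-row root with a "boolean" bit column and a "varchar" string column.
  static VectorSchemaRoot buildRoot(BufferAllocator allocator) {
    BitVector bitVector = new BitVector("boolean", allocator);
    VarCharVector varCharVector = new VarCharVector("varchar", allocator);
    bitVector.allocateNew();
    varCharVector.allocateNew();
    for (int i = 0; i < 10; i++) {
      bitVector.setSafe(i, i % 2 == 0 ? 0 : 1);
      varCharVector.setSafe(i, ("test" + i).getBytes(StandardCharsets.UTF_8));
    }
    bitVector.setValueCount(10);
    varCharVector.setValueCount(10);
    List<Field> fields = Arrays.asList(bitVector.getField(), varCharVector.getField());
    List<FieldVector> vectors = Arrays.asList(bitVector, varCharVector);
    return new VectorSchemaRoot(fields, vectors);
  }

  // Reads one row back out of the root and formats it as text.
  static String describeRow(int row) {
    try (BufferAllocator allocator = new RootAllocator();
         VectorSchemaRoot root = buildRoot(allocator)) {
      BitVector flags = (BitVector) root.getVector("boolean");
      VarCharVector names = (VarCharVector) root.getVector("varchar");
      return root.getRowCount() + " rows; row " + row + " = ("
          + flags.get(row) + ", "
          + new String(names.get(row), StandardCharsets.UTF_8) + ")";
    }
  }

  public static void main(String[] args) {
    System.out.println(describeRow(3)); // prints "10 rows; row 3 = (1, test3)"
  }
}
```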
Data can be loaded into/unloaded from a VectorSchemaRoot via VectorLoader and VectorUnloader. They handle converting between VectorSchemaRoot and ArrowRecordBatch (a representation of a RecordBatch IPC message). For example:
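// create a VectorSchemaRoot root1 and convert its data into recordBatch
import org.apache.arrow.vector.VectorLoader;
import org.apache.arrow.vector.VectorUnloader;
import org.apache.arrow.vector.ipc.message.ArrowRecordBatch;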
VectorSchemaRoot root1 = new VectorSchemaRoot(fields, vectors);
VectorUnloader unloader = new VectorUnloader(root1);
ArrowRecordBatch recordBatch = unloader.getRecordBatch();
// create a VectorSchemaRoot root2 and load the recordBatch
VectorSchemaRoot root2 = VectorSchemaRoot.create(root1.getSchema(), allocator);
VectorLoader loader = new VectorLoader(root2);
loader.load(recordBatch);
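An ArrowRecordBatch holds references to the underlying buffers and should be closed once it has been loaded. A self-contained sketch of the unload/load round trip with explicit resource management is given below; it assumes a single int32 column named "A", and the class name RoundTripExample and method roundTrip are hypothetical.

```java
import java.util.Arrays;

import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;
import org.apache.arrow.vector.VectorLoader;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.VectorUnloader;
import org.apache.arrow.vector.ipc.message.ArrowRecordBatch;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.FieldType;
import org.apache.arrow.vector.types.pojo.Schema;

// Hypothetical helper class, used only for this illustration.
final class RoundTripExample {

  // Unloads three rows from root1 into a record batch and loads them into root2.
  static String roundTrip() {
    Schema schema = new Schema(Arrays.asList(
        new Field("A", FieldType.nullable(new ArrowType.Int(32, true)), null)));
    try (BufferAllocator allocator = new RootAllocator();
         VectorSchemaRoot root1 = VectorSchemaRoot.create(schema, allocator);
         VectorSchemaRoot root2 = VectorSchemaRoot.create(schema, allocator)) {
      // Populate the source root with the values 10, 20, 30.
      root1.allocateNew();
      IntVector source = (IntVector) root1.getVector("A");
      for (int i = 0; i < 3; i++) {
        source.setSafe(i, (i + 1) * 10);
      }
      root1.setRowCount(3);

      // The batch is closed as soon as it has been loaded into root2.
      VectorUnloader unloader = new VectorUnloader(root1);
      try (ArrowRecordBatch recordBatch = unloader.getRecordBatch()) {
        VectorLoader loader = new VectorLoader(root2);
        loader.load(recordBatch);
      }

      IntVector target = (IntVector) root2.getVector("A");
      return root2.getRowCount() + " rows, last=" + target.get(2);
    }
  }

  public static void main(String[] args) {
    System.out.println(roundTrip()); // prints "3 rows, last=30"
  }
}
```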
A new VectorSchemaRoot can be sliced from an existing root without copying data:
// 0 is the start index (inclusive) and 5 is the length, i.e. the number of rows in the slice.
VectorSchemaRoot newRoot = vectorSchemaRoot.slice(0, 5);
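A slice is a separate VectorSchemaRoot and should be closed independently of the root it was taken from. The sketch below slices five rows starting at index 2 from a ten-row root and reads the first sliced value; the single int32 column "A", the class name SliceExample, and the method sliceSummary are assumptions made for this illustration.

```java
import java.util.Arrays;

import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.FieldType;
import org.apache.arrow.vector.types.pojo.Schema;

// Hypothetical helper class, used only for this illustration.
final class SliceExample {

  // Slices rows [2, 7) out of a 10-row root and summarizes both roots.
  static String sliceSummary() {
    Schema schema = new Schema(Arrays.asList(
        new Field("A", FieldType.nullable(new ArrowType.Int(32, true)), null)));
    try (BufferAllocator allocator = new RootAllocator();
         VectorSchemaRoot root = VectorSchemaRoot.create(schema, allocator)) {
      // Populate the root with the values 0, 100, 200, ..., 900.
      root.allocateNew();
      IntVector a = (IntVector) root.getVector("A");
      for (int i = 0; i < 10; i++) {
        a.setSafe(i, i * 100);
      }
      root.setRowCount(10);

      // Start index 2 (inclusive), length 5: rows 2, 3, 4, 5 and 6.
      try (VectorSchemaRoot slice = root.slice(2, 5)) {
        IntVector sliced = (IntVector) slice.getVector(0);
        return "original=" + root.getRowCount()
            + ", slice=" + slice.getRowCount()
            + ", first=" + sliced.get(0);
      }
    }
  }

  public static void main(String[] args) {
    System.out.println(sliceSummary()); // prints "original=10, slice=5, first=200"
  }
}
```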