Tabular Data

While arrays (also known as ValueVector) represent a one-dimensional sequence of homogeneous values, data often comes as two-dimensional sets of heterogeneous data, such as database tables or CSV files. Arrow provides several abstractions to handle such data conveniently and efficiently.

Fields

Fields describe the individual columns of tabular data. A field, i.e. an instance of Field, holds a field name, a data type, and optional key-value metadata.

// Create a nullable column "document" of string type with metadata
import java.util.HashMap;
import java.util.Map;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.FieldType;

// Optional key-value metadata attached to this field
Map<String, String> metadata = new HashMap<>();
metadata.put("A", "Id card");
metadata.put("B", "Passport");
metadata.put("C", "Visa");
// FieldType arguments: nullable, data type, dictionary encoding, metadata
Field document = new Field("document", new FieldType(true, new ArrowType.Utf8(), /*dictionary*/ null, metadata), /*children*/ null);

Schemas

A Schema describes the overall structure consisting of any number of columns. It holds a sequence of fields together with some optional schema-wide metadata (in addition to per-field metadata).

// Create a schema describing datasets with two columns:
// an int32 column "A" and a utf8-encoded string column "B"
import java.util.HashMap;
import java.util.Map;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.FieldType;
import org.apache.arrow.vector.types.pojo.Schema;
import static java.util.Arrays.asList;

// Schema-wide metadata (separate from any per-field metadata)
Map<String, String> metadata = new HashMap<>();
metadata.put("K1", "V1");
metadata.put("K2", "V2");
// ArrowType.Int(32, true) is a signed 32-bit integer
Field a = new Field("A", FieldType.nullable(new ArrowType.Int(32, true)), null);
Field b = new Field("B", FieldType.nullable(new ArrowType.Utf8()), null);
Schema schema = new Schema(asList(a, b), metadata);

VectorSchemaRoot

A VectorSchemaRoot is a container for batches of data. Batches flow through VectorSchemaRoot as part of a pipeline.

Note

VectorSchemaRoot is somewhat analogous to tables or record batches in the other Arrow implementations, in that all of them are two-dimensional datasets, but it is used differently.

The recommended usage is to create a single VectorSchemaRoot based on a known schema and to populate that root repeatedly with a stream of batches, rather than creating a new instance for each batch (see Flight or ArrowFileWriter as examples). Thus, at any given point, a VectorSchemaRoot may or may not hold data (for example, because the data was transferred downstream or the root has not been populated yet).
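The reuse pattern described above can be sketched roughly as follows. This is a minimal illustration, not code from the Arrow project: the class name ReuseRootSketch, the column name "n", and the run helper are hypothetical, and the step where a real pipeline would hand each batch downstream is reduced to counting rows.

```java
import java.util.Collections;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.FieldType;
import org.apache.arrow.vector.types.pojo.Schema;

// Hypothetical sketch: one root, created once, refilled for every batch.
public class ReuseRootSketch {

  // Fills the same root `batches` times and returns the total number of rows seen.
  static int run(int batches, int rowsPerBatch) {
    // Single nullable int32 column named "n" (name chosen for illustration)
    Schema schema = new Schema(Collections.singletonList(
        new Field("n", FieldType.nullable(new ArrowType.Int(32, true)), null)));
    int totalRows = 0;
    // Resources close in reverse order: the root first, then the allocator
    try (BufferAllocator allocator = new RootAllocator();
         VectorSchemaRoot root = VectorSchemaRoot.create(schema, allocator)) {
      IntVector n = (IntVector) root.getVector("n");
      for (int b = 0; b < batches; b++) {
        root.allocateNew();              // (re)allocate buffers for this batch
        for (int i = 0; i < rowsPerBatch; i++) {
          n.setSafe(i, b * rowsPerBatch + i);
        }
        root.setRowCount(rowsPerBatch);  // the root now holds one batch
        // A real pipeline would write or send the batch here (assumption);
        // this sketch only counts the rows.
        totalRows += root.getRowCount();
        root.clear();                    // release buffers; the vectors stay in the root
      }
    }
    return totalRows;
  }

  public static void main(String[] args) {
    System.out.println(run(3, 4));
  }
}
```

Between iterations the root is empty, which matches the point above that a root may or may not hold data at any given moment.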

Here is an example of creating a VectorSchemaRoot:

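import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.BitVector;
import org.apache.arrow.vector.FieldVector;
import org.apache.arrow.vector.VarCharVector;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.types.pojo.Field;

BufferAllocator allocator = new RootAllocator();
// Populate a boolean column and a string column with 10 values each
BitVector bitVector = new BitVector("boolean", allocator);
VarCharVector varCharVector = new VarCharVector("varchar", allocator);
bitVector.allocateNew();
varCharVector.allocateNew();
for (int i = 0; i < 10; i++) {
  bitVector.setSafe(i, i % 2 == 0 ? 0 : 1);
  varCharVector.setSafe(i, ("test" + i).getBytes(StandardCharsets.UTF_8));
}
bitVector.setValueCount(10);
varCharVector.setValueCount(10);

List<Field> fields = Arrays.asList(bitVector.getField(), varCharVector.getField());
List<FieldVector> vectors = Arrays.asList(bitVector, varCharVector);
VectorSchemaRoot vectorSchemaRoot = new VectorSchemaRoot(fields, vectors);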

Data can be loaded into/unloaded from a VectorSchemaRoot via VectorLoader and VectorUnloader. They handle converting between VectorSchemaRoot and ArrowRecordBatch (a representation of a RecordBatch IPC message). For example:

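import org.apache.arrow.vector.VectorLoader;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.VectorUnloader;
import org.apache.arrow.vector.ipc.message.ArrowRecordBatch;

// create a VectorSchemaRoot root1 and convert its data into recordBatch
VectorSchemaRoot root1 = new VectorSchemaRoot(fields, vectors);
VectorUnloader unloader = new VectorUnloader(root1);
ArrowRecordBatch recordBatch = unloader.getRecordBatch();

// create a VectorSchemaRoot root2 with the same schema and load the recordBatch
VectorSchemaRoot root2 = VectorSchemaRoot.create(root1.getSchema(), allocator);
VectorLoader loader = new VectorLoader(root2);
loader.load(recordBatch);
// release the record batch's own buffer references once it has been loaded
recordBatch.close();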

A new VectorSchemaRoot can be sliced from an existing root without copying data:

// 0 is the start index (inclusive) and 5 is the length, i.e. the number of rows in the slice.
VectorSchemaRoot newRoot = vectorSchemaRoot.slice(0, 5);

Table

A Table is an immutable tabular data structure. It is very similar to VectorSchemaRoot in that it is also built on ValueVectors and schemas. Unlike VectorSchemaRoot, Table is not designed for batch processing. Here is a version of the example above that creates a Table rather than a VectorSchemaRoot:

import org.apache.arrow.vector.table.Table;

// Populate the same two columns as in the VectorSchemaRoot example
BitVector bitVector = new BitVector("boolean", allocator);
VarCharVector varCharVector = new VarCharVector("varchar", allocator);
bitVector.allocateNew();
varCharVector.allocateNew();
for (int i = 0; i < 10; i++) {
  bitVector.setSafe(i, i % 2 == 0 ? 0 : 1);
  varCharVector.setSafe(i, ("test" + i).getBytes(StandardCharsets.UTF_8));
}
bitVector.setValueCount(10);
varCharVector.setValueCount(10);

// Build the Table directly from the populated vectors
List<FieldVector> vectors = Arrays.asList(bitVector, varCharVector);
Table table = new Table(vectors);

See the Table documentation for more information.