.. Licensed to the Apache Software Foundation (ASF) under one
.. or more contributor license agreements.  See the NOTICE file
.. distributed with this work for additional information
.. regarding copyright ownership.  The ASF licenses this file
.. to you under the Apache License, Version 2.0 (the
.. "License"); you may not use this file except in compliance
.. with the License.  You may obtain a copy of the License at

..   http://www.apache.org/licenses/LICENSE-2.0

.. Unless required by applicable law or agreed to in writing,
.. software distributed under the License is distributed on an
.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
.. KIND, either express or implied.  See the License for the
.. specific language governing permissions and limitations
.. under the License.

=================
Quick Start Guide
=================

Arrow Java provides several building blocks. Data types describe the types of values;
ValueVectors are sequences of typed values; fields describe the types of columns in
tabular data; schemas describe a sequence of columns in tabular data; and
VectorSchemaRoot represents tabular data. Arrow also provides readers and
writers for loading data from and persisting data to storage.

Create a ValueVector
********************

**ValueVectors** represent a sequence of values of the same type.
They are also known as "arrays" in the columnar format.

Example: create a vector of 32-bit integers representing ``[1, null, 2]``:

.. code-block:: Java

    import org.apache.arrow.memory.BufferAllocator;
    import org.apache.arrow.memory.RootAllocator;
    import org.apache.arrow.vector.IntVector;

    try (
        BufferAllocator allocator = new RootAllocator();
        IntVector intVector = new IntVector("fixed-size-primitive-layout", allocator);
    ) {
        // Reserve space for three values, then fill each slot.
        intVector.allocateNew(3);
        intVector.set(0, 1);
        intVector.setNull(1);
        intVector.set(2, 2);
        // The value count must be set before the vector is read.
        intVector.setValueCount(3);
        System.out.println("Vector created in memory: " + intVector);
    }

.. code-block:: shell

    Vector created in memory: [1, null, 2]

Example: create a vector of UTF-8 encoded strings representing ``["one", "two", "three"]``:

.. code-block:: Java

    import org.apache.arrow.memory.BufferAllocator;
    import org.apache.arrow.memory.RootAllocator;
    import org.apache.arrow.vector.VarCharVector;

    import java.nio.charset.StandardCharsets;

    try (
        BufferAllocator allocator = new RootAllocator();
        VarCharVector varCharVector = new VarCharVector("variable-size-primitive-layout", allocator);
    ) {
        varCharVector.allocateNew(3);
        // Encode explicitly as UTF-8 instead of relying on the platform default charset.
        varCharVector.set(0, "one".getBytes(StandardCharsets.UTF_8));
        varCharVector.set(1, "two".getBytes(StandardCharsets.UTF_8));
        varCharVector.set(2, "three".getBytes(StandardCharsets.UTF_8));
        varCharVector.setValueCount(3);
        System.out.println("Vector created in memory: " + varCharVector);
    }

.. code-block:: shell

    Vector created in memory: [one, two, three]

Create a Field
**************

**Fields** describe the individual columns of tabular data.
They consist of a name, a data type, a flag indicating whether the column can have null values,
and optional key-value metadata.

Example: create a field named "document" of string type:

.. code-block:: Java

    import org.apache.arrow.vector.types.pojo.ArrowType;
    import org.apache.arrow.vector.types.pojo.Field;
    import org.apache.arrow.vector.types.pojo.FieldType;

    import java.util.HashMap;
    import java.util.Map;

    Map<String, String> metadata = new HashMap<>();
    metadata.put("A", "Id card");
    metadata.put("B", "Passport");
    metadata.put("C", "Visa");
    // FieldType arguments: nullable flag, data type, dictionary encoding, metadata.
    Field document = new Field("document",
            new FieldType(true, new ArrowType.Utf8(), /*dictionary*/ null, metadata),
            /*children*/ null);
    System.out.println("Field created: " + document + ", Metadata: " + document.getMetadata());

.. code-block:: shell

    Field created: document: Utf8, Metadata: {A=Id card, B=Passport, C=Visa}

Create a Schema
***************

**Schemas** hold a sequence of fields together with some optional metadata.

Example: create a schema describing datasets with two columns:
an int32 column "A" and a UTF-8 encoded string column "B":

.. code-block:: Java

    import org.apache.arrow.vector.types.pojo.ArrowType;
    import org.apache.arrow.vector.types.pojo.Field;
    import org.apache.arrow.vector.types.pojo.FieldType;
    import org.apache.arrow.vector.types.pojo.Schema;

    import java.util.HashMap;
    import java.util.Map;

    import static java.util.Arrays.asList;

    Map<String, String> metadata = new HashMap<>();
    metadata.put("K1", "V1");
    metadata.put("K2", "V2");
    // Int(32, true) is a signed 32-bit integer.
    Field a = new Field("A", FieldType.nullable(new ArrowType.Int(32, true)), /*children*/ null);
    Field b = new Field("B", FieldType.nullable(new ArrowType.Utf8()), /*children*/ null);
    Schema schema = new Schema(asList(a, b), metadata);
    System.out.println("Schema created: " + schema);

.. code-block:: shell

    Schema created: Schema<A: Int(32, true), B: Utf8>(metadata: {K1=V1, K2=V2})

Create a VectorSchemaRoot
*************************

A **VectorSchemaRoot** combines ValueVectors with a Schema to represent tabular data.

Example: create a dataset of names (strings) and ages (32-bit signed integers):

.. code-block:: Java

    import org.apache.arrow.memory.BufferAllocator;
    import org.apache.arrow.memory.RootAllocator;
    import org.apache.arrow.vector.IntVector;
    import org.apache.arrow.vector.VarCharVector;
    import org.apache.arrow.vector.VectorSchemaRoot;
    import org.apache.arrow.vector.types.pojo.ArrowType;
    import org.apache.arrow.vector.types.pojo.Field;
    import org.apache.arrow.vector.types.pojo.FieldType;
    import org.apache.arrow.vector.types.pojo.Schema;

    import java.nio.charset.StandardCharsets;

    import static java.util.Arrays.asList;

    Field age = new Field("age",
            FieldType.nullable(new ArrowType.Int(32, true)),
            /*children*/ null);
    Field name = new Field("name",
            FieldType.nullable(new ArrowType.Utf8()),
            /*children*/ null);
    Schema schema = new Schema(asList(age, name), /*metadata*/ null);
    try (
        BufferAllocator allocator = new RootAllocator();
        // The root creates one vector per field in the schema.
        VectorSchemaRoot root = VectorSchemaRoot.create(schema, allocator);
        IntVector ageVector = (IntVector) root.getVector("age");
        VarCharVector nameVector = (VarCharVector) root.getVector("name");
    ) {
        ageVector.allocateNew(3);
        ageVector.set(0, 10);
        ageVector.set(1, 20);
        ageVector.set(2, 30);
        nameVector.allocateNew(3);
        nameVector.set(0, "Dave".getBytes(StandardCharsets.UTF_8));
        nameVector.set(1, "Peter".getBytes(StandardCharsets.UTF_8));
        nameVector.set(2, "Mary".getBytes(StandardCharsets.UTF_8));
        // Sets the value count on every vector in the root.
        root.setRowCount(3);
        System.out.println("VectorSchemaRoot created: \n" + root.contentToTSVString());
    }

.. code-block:: shell

    VectorSchemaRoot created:
    age    name
    10     Dave
    20     Peter
    30     Mary

Interprocess Communication (IPC)
********************************

Arrow data can be written to and read from disk in either a streaming or a
random-access fashion, depending on application requirements.

**Write data to an Arrow file**

Example: write the dataset from the previous example to an Arrow IPC file (random-access):

.. code-block:: Java

    import org.apache.arrow.memory.BufferAllocator;
    import org.apache.arrow.memory.RootAllocator;
    import org.apache.arrow.vector.IntVector;
    import org.apache.arrow.vector.VarCharVector;
    import org.apache.arrow.vector.VectorSchemaRoot;
    import org.apache.arrow.vector.ipc.ArrowFileWriter;
    import org.apache.arrow.vector.types.pojo.ArrowType;
    import org.apache.arrow.vector.types.pojo.Field;
    import org.apache.arrow.vector.types.pojo.FieldType;
    import org.apache.arrow.vector.types.pojo.Schema;

    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;

    import static java.util.Arrays.asList;

    Field age = new Field("age",
            FieldType.nullable(new ArrowType.Int(32, true)),
            /*children*/ null);
    Field name = new Field("name",
            FieldType.nullable(new ArrowType.Utf8()),
            /*children*/ null);
    Schema schema = new Schema(asList(age, name));
    try (
        BufferAllocator allocator = new RootAllocator();
        VectorSchemaRoot root = VectorSchemaRoot.create(schema, allocator);
        IntVector ageVector = (IntVector) root.getVector("age");
        VarCharVector nameVector = (VarCharVector) root.getVector("name");
    ) {
        ageVector.allocateNew(3);
        ageVector.set(0, 10);
        ageVector.set(1, 20);
        ageVector.set(2, 30);
        nameVector.allocateNew(3);
        nameVector.set(0, "Dave".getBytes(StandardCharsets.UTF_8));
        nameVector.set(1, "Peter".getBytes(StandardCharsets.UTF_8));
        nameVector.set(2, "Mary".getBytes(StandardCharsets.UTF_8));
        root.setRowCount(3);
        File file = new File("random_access_file.arrow");
        try (
            FileOutputStream fileOutputStream = new FileOutputStream(file);
            ArrowFileWriter writer = new ArrowFileWriter(root, /*provider*/ null, fileOutputStream.getChannel());
        ) {
            // start() writes the schema, writeBatch() the current contents of root,
            // and end() the file footer.
            writer.start();
            writer.writeBatch();
            writer.end();
            System.out.println("Record batches written: " + writer.getRecordBlocks().size()
                    + ". Number of rows written: " + root.getRowCount());
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

.. code-block:: shell

    Record batches written: 1. Number of rows written: 3

**Read data from an Arrow file**

Example: read the dataset written in the previous example from the Arrow IPC file (random-access):

.. code-block:: Java

    import org.apache.arrow.memory.BufferAllocator;
    import org.apache.arrow.memory.RootAllocator;
    import org.apache.arrow.vector.ipc.ArrowFileReader;
    import org.apache.arrow.vector.ipc.message.ArrowBlock;
    import org.apache.arrow.vector.VectorSchemaRoot;

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;

    try (
        BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
        FileInputStream fileInputStream = new FileInputStream(new File("random_access_file.arrow"));
        ArrowFileReader reader = new ArrowFileReader(fileInputStream.getChannel(), allocator);
    ) {
        System.out.println("Record batches in file: " + reader.getRecordBlocks().size());
        for (ArrowBlock arrowBlock : reader.getRecordBlocks()) {
            // Load one record batch into the reader's VectorSchemaRoot.
            reader.loadRecordBatch(arrowBlock);
            VectorSchemaRoot root = reader.getVectorSchemaRoot();
            System.out.println("VectorSchemaRoot read: \n" + root.contentToTSVString());
        }
    } catch (IOException e) {
        e.printStackTrace();
    }

.. code-block:: shell

    Record batches in file: 1
    VectorSchemaRoot read:
    age    name
    10     Dave
    20     Peter
    30     Mary

More examples are available in the `Arrow Java Cookbook`_.

.. _`Arrow Java Cookbook`: https://arrow.apache.org/cookbook/java
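
The IPC section mentions streaming as well as random access, but only the
random-access file format is shown. As a minimal sketch of the streaming
variant, assuming the same ``age``/``name`` dataset and an in-memory byte
array in place of a file, ``ArrowStreamWriter`` and ``ArrowStreamReader`` can
be combined as follows. The class ``StreamingRoundTrip`` and its ``write`` and
``read`` helpers are illustrative names, not part of Arrow.

.. code-block:: Java

    import org.apache.arrow.memory.BufferAllocator;
    import org.apache.arrow.memory.RootAllocator;
    import org.apache.arrow.vector.IntVector;
    import org.apache.arrow.vector.VarCharVector;
    import org.apache.arrow.vector.VectorSchemaRoot;
    import org.apache.arrow.vector.ipc.ArrowStreamReader;
    import org.apache.arrow.vector.ipc.ArrowStreamWriter;
    import org.apache.arrow.vector.types.pojo.ArrowType;
    import org.apache.arrow.vector.types.pojo.Field;
    import org.apache.arrow.vector.types.pojo.FieldType;
    import org.apache.arrow.vector.types.pojo.Schema;

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;

    import static java.util.Arrays.asList;

    // Hypothetical helper class, used only for this illustration.
    public class StreamingRoundTrip {

        // Writes one record batch in the streaming format and returns the bytes.
        public static byte[] write(BufferAllocator allocator) throws IOException {
            Field age = new Field("age",
                    FieldType.nullable(new ArrowType.Int(32, true)), /*children*/ null);
            Field name = new Field("name",
                    FieldType.nullable(new ArrowType.Utf8()), /*children*/ null);
            Schema schema = new Schema(asList(age, name));
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            try (
                VectorSchemaRoot root = VectorSchemaRoot.create(schema, allocator);
                ArrowStreamWriter writer = new ArrowStreamWriter(root, /*provider*/ null, out);
            ) {
                IntVector ageVector = (IntVector) root.getVector("age");
                VarCharVector nameVector = (VarCharVector) root.getVector("name");
                ageVector.allocateNew(3);
                ageVector.set(0, 10);
                ageVector.set(1, 20);
                ageVector.set(2, 30);
                nameVector.allocateNew(3);
                nameVector.set(0, "Dave".getBytes(StandardCharsets.UTF_8));
                nameVector.set(1, "Peter".getBytes(StandardCharsets.UTF_8));
                nameVector.set(2, "Mary".getBytes(StandardCharsets.UTF_8));
                root.setRowCount(3);
                writer.start();
                writer.writeBatch();
                writer.end();
            }
            return out.toByteArray();
        }

        // Reads the batches back one at a time and returns the total row count.
        public static int read(BufferAllocator allocator, byte[] bytes) throws IOException {
            int rows = 0;
            try (ArrowStreamReader reader =
                    new ArrowStreamReader(new ByteArrayInputStream(bytes), allocator)) {
                // Batches are consumed sequentially; there is no block index to seek with.
                while (reader.loadNextBatch()) {
                    VectorSchemaRoot root = reader.getVectorSchemaRoot();
                    rows += root.getRowCount();
                    System.out.println("VectorSchemaRoot read: \n" + root.contentToTSVString());
                }
            }
            return rows;
        }

        public static void main(String[] args) throws IOException {
            try (BufferAllocator allocator = new RootAllocator()) {
                byte[] bytes = write(allocator);
                System.out.println("Rows read back: " + read(allocator, bytes));
            }
        }
    }

Unlike ``ArrowFileReader``, the stream reader exposes no list of record blocks,
so batches can only be read in order with ``loadNextBatch()``. This suits
sockets and pipes, where the total number of batches is not known up front.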