Building Arrow Java

System Setup

Arrow Java uses the Maven build system.

Building requires:

  • JDK 11+

  • Maven 3+

Note

CI will test all supported JDK LTS versions, plus the latest non-LTS version.
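
Before starting a build, it can help to confirm that the active JDK meets the JDK 11+ requirement. The following is a minimal sketch, not part of the Arrow build itself: the helper names java_major_version and jdk_is_supported are hypothetical, and the parsing assumes version strings shaped like 17.0.9 or the legacy 1.8.0_392.

```shell
# Extract the major version from a Java version string such as
# "17.0.9", "11.0.21" or the legacy "1.8.0_392".
java_major_version() {
  jv="$1"
  case "$jv" in
    1.*) jv="${jv#1.}" ;;   # legacy scheme: 1.8.0_392 -> 8.0_392
  esac
  # Drop everything from the first non-digit character onwards.
  echo "${jv%%[!0-9]*}"
}

# Succeeds when the given version string denotes JDK 11 or newer.
jdk_is_supported() {
  [ "$(java_major_version "$1")" -ge 11 ]
}

# Possible usage with the JDK found on the PATH (java prints to stderr):
#   jdk_is_supported "$(java -version 2>&1 | awk -F '"' '/version/ {print $2; exit}')" \
#     || echo "JDK 11 or newer is required"
```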

Building

All the instructions below assume that you have cloned the Arrow git repository:

$ git clone https://github.com/apache/arrow.git
$ cd arrow
$ git submodule update --init --recursive

Arrow Java modules can be built with any of the following:

  • Maven build tool.

  • Docker Compose.

  • Archery.

Building Java Modules

To build the default modules, change to the java directory of the Arrow repository and run one of the following:

Maven

$ cd arrow/java
$ export JAVA_HOME=<absolute path to your java home>
$ java --version
$ mvn clean install

Docker compose

$ cd arrow/java
$ export JAVA_HOME=<absolute path to your java home>
$ java --version
$ docker compose run java

Archery

$ cd arrow/java
$ export JAVA_HOME=<absolute path to your java home>
$ java --version
$ archery docker run java
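
Each variant above starts by exporting JAVA_HOME as an absolute path. As a rough sanity check before building, one could test that the value really points at a JDK. This is only a sketch: is_valid_java_home is a hypothetical helper, and it assumes a Unix-like layout where the launcher lives at bin/java.

```shell
# Succeeds when $1 is an absolute path that contains an executable bin/java.
is_valid_java_home() {
  case "$1" in
    /*) ;;            # absolute path: keep checking
    *) return 1 ;;    # empty or relative path
  esac
  [ -x "$1/bin/java" ]
}

# Possible usage before running the build:
#   is_valid_java_home "$JAVA_HOME" || echo "JAVA_HOME does not point at a JDK"
```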

Building JNI Libraries (*.dylib / *.so / *.dll)

First, we need to build the C++ shared libraries that the JNI bindings will use. We can build these manually, or we can use Archery to build them in a Docker container (this requires Docker, Docker Compose, and Archery to be installed).

Note

If you are building on Apple Silicon, be sure to use a JDK version that was compiled for that architecture. See, for example, the Azul JDK.

If you are building on Windows OS, see Developing on Windows.

Maven

  • To build only the JNI C Data Interface library (macOS / Linux):

    $ cd arrow/java
    $ export JAVA_HOME=<absolute path to your java home>
    $ java --version
    $ mvn generate-resources -Pgenerate-libs-cdata-all-os -N
    $ ls -latr ../java-dist/lib
    |__ arrow_cdata_jni/
    
  • To build only the JNI C Data Interface library (Windows):

    $ cd arrow/java
    $ mvn generate-resources -Pgenerate-libs-cdata-all-os -N
    $ dir "../java-dist/bin"
    |__ arrow_cdata_jni/
    
  • To build all JNI libraries (macOS / Linux) except the JNI C Data Interface library:

    $ cd arrow/java
    $ export JAVA_HOME=<absolute path to your java home>
    $ java --version
    $ mvn generate-resources -Pgenerate-libs-jni-macos-linux -N
    $ ls -latr ../java-dist/lib
    |__ arrow_dataset_jni/
    |__ arrow_orc_jni/
    |__ gandiva_jni/
    
  • To build all JNI libraries (Windows) except the JNI C Data Interface library:

    $ cd arrow/java
    $ mvn generate-resources -Pgenerate-libs-jni-windows -N
    $ dir "../java-dist/bin"
    |__ arrow_dataset_jni/
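
The Maven profiles above differ only by operating system, so a script that runs on several platforms could pick the profile from the output of uname. The sketch below is an illustration under assumptions: jni_profile_for_os is a hypothetical helper, and the MINGW/MSYS/CYGWIN patterns assume a Unix-like shell on Windows such as Git Bash. The profile names themselves are the ones listed above.

```shell
# Map a kernel name (as printed by `uname -s`) to the Maven profile that
# builds the ORC / Gandiva / Dataset JNI libraries on that platform.
jni_profile_for_os() {
  case "$1" in
    Linux|Darwin)         echo "generate-libs-jni-macos-linux" ;;
    MINGW*|MSYS*|CYGWIN*) echo "generate-libs-jni-windows" ;;
    *)                    return 1 ;;   # unknown platform
  esac
}

# Possible usage from arrow/java:
#   mvn generate-resources -P"$(jni_profile_for_os "$(uname -s)")" -N
```

The C Data Interface library needs no such switch, since the generate-libs-cdata-all-os profile is used on every platform.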
    

CMake

  • To build only the JNI C Data Interface library (macOS / Linux):

    $ cd arrow
    $ mkdir -p java-dist java-cdata
    $ cmake \
        -S java \
        -B java-cdata \
        -DARROW_JAVA_JNI_ENABLE_C=ON \
        -DARROW_JAVA_JNI_ENABLE_DEFAULT=OFF \
        -DBUILD_TESTING=OFF \
        -DCMAKE_BUILD_TYPE=Release \
        -DCMAKE_INSTALL_PREFIX=java-dist
    $ cmake --build java-cdata --target install --config Release
    $ ls -latr java-dist/lib
    |__ arrow_cdata_jni/
    
  • To build only the JNI C Data Interface library (Windows):

    $ cd arrow
    $ mkdir java-dist, java-cdata
    $ cmake ^
        -S java ^
        -B java-cdata ^
        -DARROW_JAVA_JNI_ENABLE_C=ON ^
        -DARROW_JAVA_JNI_ENABLE_DEFAULT=OFF ^
        -DBUILD_TESTING=OFF ^
        -DCMAKE_BUILD_TYPE=Release ^
        -DCMAKE_INSTALL_PREFIX=java-dist
    $ cmake --build java-cdata --target install --config Release
    $ dir "java-dist/bin"
    |__ arrow_cdata_jni/
    
  • To build all JNI libraries (macOS / Linux) except the JNI C Data Interface library:

    $ cd arrow
    $ brew bundle --file=cpp/Brewfile
    # Homebrew Bundle complete! 25 Brewfile dependencies now installed.
    $ brew uninstall aws-sdk-cpp
    #  (We can't use aws-sdk-cpp installed by Homebrew because it has
    #  an issue: https://github.com/aws/aws-sdk-cpp/issues/1809 )
    $ export JAVA_HOME=<absolute path to your java home>
    $ mkdir -p java-dist cpp-jni
    $ cmake \
        -S cpp \
        -B cpp-jni \
        -DARROW_BUILD_SHARED=OFF \
        -DARROW_CSV=ON \
        -DARROW_DATASET=ON \
        -DARROW_DEPENDENCY_SOURCE=BUNDLED \
        -DARROW_DEPENDENCY_USE_SHARED=OFF \
        -DARROW_FILESYSTEM=ON \
        -DARROW_GANDIVA=ON \
        -DARROW_GANDIVA_STATIC_LIBSTDCPP=ON \
        -DARROW_JSON=ON \
        -DARROW_ORC=ON \
        -DARROW_PARQUET=ON \
        -DARROW_S3=ON \
        -DARROW_SUBSTRAIT=ON \
        -DARROW_USE_CCACHE=ON \
        -DCMAKE_BUILD_TYPE=Release \
        -DCMAKE_INSTALL_PREFIX=java-dist \
        -DCMAKE_UNITY_BUILD=ON
    $ cmake --build cpp-jni --target install --config Release
    $ cmake \
        -S java \
        -B java-jni \
        -DARROW_JAVA_JNI_ENABLE_C=OFF \
        -DARROW_JAVA_JNI_ENABLE_DEFAULT=ON \
        -DBUILD_TESTING=OFF \
        -DCMAKE_BUILD_TYPE=Release \
        -DCMAKE_INSTALL_PREFIX=java-dist \
        -DCMAKE_PREFIX_PATH=$PWD/java-dist \
        -DProtobuf_ROOT=$PWD/../cpp-jni/protobuf_ep-install \
        -DProtobuf_USE_STATIC_LIBS=ON
    $ cmake --build java-jni --target install --config Release
    $ ls -latr java-dist/lib/
    |__ arrow_dataset_jni/
    |__ arrow_orc_jni/
    |__ gandiva_jni/
    
  • To build all JNI libraries (Windows) except the JNI C Data Interface library:

    $ cd arrow
    $ mkdir java-dist, cpp-jni
    $ cmake ^
        -S cpp ^
        -B cpp-jni ^
        -DARROW_BUILD_SHARED=OFF ^
        -DARROW_CSV=ON ^
        -DARROW_DATASET=ON ^
        -DARROW_DEPENDENCY_USE_SHARED=OFF ^
        -DARROW_FILESYSTEM=ON ^
        -DARROW_GANDIVA=OFF ^
        -DARROW_JSON=ON ^
        -DARROW_ORC=ON ^
        -DARROW_PARQUET=ON ^
        -DARROW_S3=ON ^
        -DARROW_SUBSTRAIT=ON ^
        -DARROW_USE_CCACHE=ON ^
        -DARROW_WITH_BROTLI=ON ^
        -DARROW_WITH_LZ4=ON ^
        -DARROW_WITH_SNAPPY=ON ^
        -DARROW_WITH_ZLIB=ON ^
        -DARROW_WITH_ZSTD=ON ^
        -DCMAKE_BUILD_TYPE=Release ^
        -DCMAKE_INSTALL_PREFIX=java-dist ^
        -DCMAKE_UNITY_BUILD=ON ^
        -GNinja
    $ cd cpp-jni
    $ ninja install
    $ cd ../
    $ cmake ^
        -S java ^
        -B java-jni ^
        -DARROW_JAVA_JNI_ENABLE_C=OFF ^
        -DARROW_JAVA_JNI_ENABLE_DATASET=ON ^
        -DARROW_JAVA_JNI_ENABLE_DEFAULT=ON ^
        -DARROW_JAVA_JNI_ENABLE_GANDIVA=OFF ^
        -DARROW_JAVA_JNI_ENABLE_ORC=ON ^
        -DBUILD_TESTING=OFF ^
        -DCMAKE_BUILD_TYPE=Release ^
        -DCMAKE_INSTALL_PREFIX=java-dist ^
        -DCMAKE_PREFIX_PATH=%CD%/java-dist
    $ cmake --build java-jni --target install --config Release
    $ dir "java-dist/bin"
    |__ arrow_orc_jni/
    |__ arrow_dataset_jni/
    

Archery

$ cd arrow
$ archery docker run java-jni-manylinux-2014
$ ls -latr java-dist
|__ arrow_cdata_jni/
|__ arrow_dataset_jni/
|__ arrow_orc_jni/
|__ gandiva_jni/
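
Whichever route is used, the build is expected to leave one directory per JNI library in the output folder. Instead of inspecting the ls output by eye, a small check along these lines could report what is missing. This is a sketch only: missing_jni_dirs is a hypothetical helper, and the directory to inspect (java-dist, java-dist/lib or java-dist/bin) depends on the build route, as shown in the listings above.

```shell
# Print the name of every expected JNI directory that is missing under $1.
# Usage: missing_jni_dirs <output dir> <name>...
missing_jni_dirs() {
  dist_dir="$1"
  shift
  for name in "$@"; do
    [ -d "$dist_dir/$name" ] || echo "$name"
  done
}

# Possible usage after a CMake build on macOS / Linux:
#   missing_jni_dirs java-dist/lib arrow_cdata_jni arrow_dataset_jni arrow_orc_jni gandiva_jni
```

Empty output means that every listed directory was found.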

Building Java JNI Modules

  • To compile the JNI bindings for the C Data Interface, use the arrow-c-data Maven profile:

    $ cd arrow/java
    $ mvn -Darrow.c.jni.dist.dir=<absolute path to your arrow folder>/java-dist/lib -Parrow-c-data clean install
    
  • To compile the JNI bindings for ORC / Gandiva / Dataset, use the arrow-jni Maven profile:

    $ cd arrow/java
    $ mvn \
        -Darrow.cpp.build.dir=<absolute path to your arrow folder>/java-dist/lib/ \
        -Darrow.c.jni.dist.dir=<absolute path to your arrow folder>/java-dist/lib/ \
        -Parrow-jni clean install
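
Because both properties must be absolute paths into the same java-dist/lib folder, it can be convenient to derive them from a single argument. The function below is a sketch of that idea: build_arrow_jni_modules is a hypothetical wrapper, and it only assembles the same mvn invocation shown above for the arrow-jni profile.

```shell
# Build the ORC / Gandiva / Dataset bindings against <arrow root>/java-dist/lib.
# $1 must be the absolute path to the Arrow checkout.
build_arrow_jni_modules() {
  arrow_root="${1%/}"                      # tolerate a trailing slash
  dist_lib="$arrow_root/java-dist/lib"
  (
    # Run in a subshell so the caller's working directory is unchanged.
    cd "$arrow_root/java" || exit 1
    mvn \
      -Darrow.cpp.build.dir="$dist_lib" \
      -Darrow.c.jni.dist.dir="$dist_lib" \
      -Parrow-jni clean install
  )
}

# Possible usage:
#   build_arrow_jni_modules "$HOME/src/arrow"
```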
    

Testing

By default, Maven uses the same Java version to both build the code and run the tests.

It is also possible to use a different JDK version for the tests. This requires Maven toolchains to be configured beforehand, and then a specific test property needs to be set.

Configuring Maven toolchains

To use a JDK version for testing, first register it in the Maven toolchains.xml configuration file, usually located under ${HOME}/.m2, by adding the following snippet:

<?xml version="1.0" encoding="UTF-8"?>
<toolchains>

  [...]

  <toolchain>
    <type>jdk</type>
    <provides>
      <version>21</version> <!-- Replace with the corresponding JDK version: 11, 17, ... -->
      <vendor>temurin</vendor> <!-- Replace with the vendor/distribution: temurin, oracle, zulu ... -->
    </provides>
    <configuration>
      <jdkHome>path/to/jdk/home</jdkHome> <!-- Replace with the path to the JDK -->
    </configuration>
  </toolchain>

  [...]

</toolchains>
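
To see quickly whether a given JDK version is already registered, a plain text search of the file is often enough. The helper below is a rough sketch, not a real XML parse: toolchains_has_jdk is a hypothetical name, and it assumes the version element is written on one line as in the snippet above.

```shell
# Succeeds when the toolchains file declares <version>$1</version>.
# Usage: toolchains_has_jdk <version> [path to toolchains.xml]
toolchains_has_jdk() {
  grep -q "<version>$1</version>" "${2:-$HOME/.m2/toolchains.xml}" 2>/dev/null
}

# Possible usage:
#   toolchains_has_jdk 17 || echo "JDK 17 is not registered in toolchains.xml"
```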

Testing with a specific JDK

To run Arrow tests with a specific JDK version, use the arrow.test.jdk-version property.

For example, to run Arrow tests with JDK 17, use the following snippet:

$ cd arrow/java
$ mvn -Darrow.test.jdk-version=17 clean verify
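
To repeat the test run for several JDK versions, the same command can be wrapped in a loop. This is a minimal sketch: verify_with_jdks is a hypothetical helper, it must be run from arrow/java, and every version passed to it is assumed to be registered in toolchains.xml already.

```shell
# Run `mvn clean verify` once per JDK version given as an argument,
# stopping at the first failing run.
verify_with_jdks() {
  for jdk in "$@"; do
    echo "== Testing with JDK $jdk =="
    mvn -Darrow.test.jdk-version="$jdk" clean verify || return 1
  done
}

# Possible usage:
#   verify_with_jdks 11 17 21
```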

IDE Configuration

IntelliJ

To start working on Arrow in IntelliJ: build the project once from the command line using mvn clean install. Then open the java/ subdirectory of the Arrow repository, and update the following settings:

  • In the Files tool window, find the path vector/target/generated-sources, right-click the directory, and select Mark Directory as > Generated Sources Root. There is no need to mark other generated sources directories, as only the vector module generates sources.

  • For JDK 11, due to an IntelliJ bug, you must go into Settings > Build, Execution, Deployment > Compiler > Java Compiler and disable “Use ‘--release’ option for cross-compilation (Java 9 and later)”. Otherwise you will get an error like “package sun.misc does not exist”.

  • You may want to disable error-prone entirely if it gives spurious warnings (disable both error-prone profiles in the Maven tool window and “Reload All Maven Projects”).

  • If using IntelliJ’s Maven integration to build, you may need to change <fork> to false in the pom.xml files due to an IntelliJ bug.

  • To enable debugging JNI-based modules like dataset, activate specific profiles in the Maven tab under “Profiles”. Ensure the profiles arrow-c-data, arrow-jni, generate-libs-cdata-all-os, generate-libs-jni-macos-linux, and jdk11+ are enabled, so that the IDE can build them and enable debugging.

You may not need to update all of these settings if you build/test with the IntelliJ Maven integration instead of with IntelliJ directly.

Common Errors

  • When working with the JNI code: if the C++ build cannot find dependencies, with errors like these:

    Could NOT find Boost (missing: Boost_INCLUDE_DIR system filesystem)
    Could NOT find Lz4 (missing: LZ4_LIB)
    Could NOT find zstd (missing: ZSTD_LIB)
    

    Specify that the dependencies should be downloaded at build time (more details at Dependency Resolution):

    -Dre2_SOURCE=BUNDLED \
    -DBoost_SOURCE=BUNDLED \
    -Dutf8proc_SOURCE=BUNDLED \
    -DSnappy_SOURCE=BUNDLED \
    -DORC_SOURCE=BUNDLED \
    -DZLIB_SOURCE=BUNDLED
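
The flags above all follow the pattern -D<name>_SOURCE=BUNDLED, so they can be generated from a list of dependency names and appended to the cmake configure command. The snippet is a sketch: bundled_flags is a hypothetical helper, and the dependency names are simply the ones listed above.

```shell
# Print one -D<name>_SOURCE=BUNDLED flag per dependency name given.
bundled_flags() {
  for dep in "$@"; do
    printf '%s\n' "-D${dep}_SOURCE=BUNDLED"
  done
}

# Possible usage (unquoted on purpose so each flag becomes its own argument):
#   cmake -S cpp -B cpp-jni $(bundled_flags re2 Boost utf8proc Snappy ORC ZLIB)
```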
    

Installing Nightly Packages

Warning

These packages are not official releases. Use them at your own risk.

Arrow nightly builds are posted on the mailing list at builds@arrow.apache.org. The artifacts are uploaded to GitHub. For example, for 2022/07/30, they can be found at GitHub Nightly.

Installing from Apache Nightlies

  1. Look up the nightly version number for the Arrow libraries used.

    For example, for arrow-memory, visit https://nightlies.apache.org/arrow/java/org/apache/arrow/arrow-memory/ and see what versions are available (e.g. 9.0.0.dev501).

  2. Add the Apache Nightlies repository to your Maven/Gradle project:

    <properties>
       <arrow.version>9.0.0.dev501</arrow.version>
    </properties>
    ...
    <repositories>
       <repository>
             <id>arrow-apache-nightlies</id>
             <url>https://nightlies.apache.org/arrow/java</url>
       </repository>
    </repositories>
    ...
    <dependencies>
       <dependency>
             <groupId>org.apache.arrow</groupId>
             <artifactId>arrow-vector</artifactId>
             <version>${arrow.version}</version>
       </dependency>
    </dependencies>
    ...
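
Before editing a project, one way to confirm that a given nightly version can actually be resolved is to ask Maven to fetch it directly with the dependency:get goal of the maven-dependency-plugin. The wrapper below is a sketch: resolve_arrow_artifact is a hypothetical name, and the version used in the usage comment is only the example from above, which may no longer be published.

```shell
# Ask Maven to download org.apache.arrow:<artifact>:<version> from a repository.
# Usage: resolve_arrow_artifact <repository url> <artifact id> <version>
resolve_arrow_artifact() {
  mvn dependency:get \
    -DremoteRepositories="$1" \
    -Dartifact="org.apache.arrow:$2:$3"
}

# Possible usage:
#   resolve_arrow_artifact https://nightlies.apache.org/arrow/java arrow-vector 9.0.0.dev501
```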
    

Installing Manually

  1. Decide which nightly packages repository to use, for example: https://github.com/ursacomputing/crossbow/releases/tag/nightly-packaging-2022-07-30-0-github-java-jars

  2. Add the packages to your pom.xml. For example, flight-core (which depends on arrow-format, arrow-vector, arrow-memory-core and arrow-memory-netty):

    <properties>
       <maven.compiler.source>8</maven.compiler.source>
       <maven.compiler.target>8</maven.compiler.target>
       <arrow.version>9.0.0.dev501</arrow.version>
    </properties>
    
    <dependencies>
       <dependency>
             <groupId>org.apache.arrow</groupId>
             <artifactId>flight-core</artifactId>
             <version>${arrow.version}</version>
       </dependency>
    </dependencies>
    
  3. Download the necessary pom and jar files to a temporary directory:

    $ mkdir nightly-packaging-2022-07-30-0-github-java-jars
    $ cd nightly-packaging-2022-07-30-0-github-java-jars
    $ wget https://github.com/ursacomputing/crossbow/releases/download/nightly-packaging-2022-07-30-0-github-java-jars/arrow-java-root-9.0.0.dev501.pom
    $ wget https://github.com/ursacomputing/crossbow/releases/download/nightly-packaging-2022-07-30-0-github-java-jars/arrow-format-9.0.0.dev501.pom
    $ wget https://github.com/ursacomputing/crossbow/releases/download/nightly-packaging-2022-07-30-0-github-java-jars/arrow-format-9.0.0.dev501.jar
    $ wget https://github.com/ursacomputing/crossbow/releases/download/nightly-packaging-2022-07-30-0-github-java-jars/arrow-vector-9.0.0.dev501.pom
    $ wget https://github.com/ursacomputing/crossbow/releases/download/nightly-packaging-2022-07-30-0-github-java-jars/arrow-vector-9.0.0.dev501.jar
    $ wget https://github.com/ursacomputing/crossbow/releases/download/nightly-packaging-2022-07-30-0-github-java-jars/arrow-memory-9.0.0.dev501.pom
    $ wget https://github.com/ursacomputing/crossbow/releases/download/nightly-packaging-2022-07-30-0-github-java-jars/arrow-memory-core-9.0.0.dev501.pom
    $ wget https://github.com/ursacomputing/crossbow/releases/download/nightly-packaging-2022-07-30-0-github-java-jars/arrow-memory-netty-9.0.0.dev501.pom
    $ wget https://github.com/ursacomputing/crossbow/releases/download/nightly-packaging-2022-07-30-0-github-java-jars/arrow-memory-core-9.0.0.dev501.jar
    $ wget https://github.com/ursacomputing/crossbow/releases/download/nightly-packaging-2022-07-30-0-github-java-jars/arrow-memory-netty-9.0.0.dev501.jar
    $ wget https://github.com/ursacomputing/crossbow/releases/download/nightly-packaging-2022-07-30-0-github-java-jars/arrow-flight-9.0.0.dev501.pom
    $ wget https://github.com/ursacomputing/crossbow/releases/download/nightly-packaging-2022-07-30-0-github-java-jars/flight-core-9.0.0.dev501.pom
    $ wget https://github.com/ursacomputing/crossbow/releases/download/nightly-packaging-2022-07-30-0-github-java-jars/flight-core-9.0.0.dev501.jar
    $ tree
    .
    ├── arrow-flight-9.0.0.dev501.pom
    ├── arrow-format-9.0.0.dev501.jar
    ├── arrow-format-9.0.0.dev501.pom
    ├── arrow-java-root-9.0.0.dev501.pom
    ├── arrow-memory-9.0.0.dev501.pom
    ├── arrow-memory-core-9.0.0.dev501.jar
    ├── arrow-memory-core-9.0.0.dev501.pom
    ├── arrow-memory-netty-9.0.0.dev501.jar
    ├── arrow-memory-netty-9.0.0.dev501.pom
    ├── arrow-vector-9.0.0.dev501.jar
    ├── arrow-vector-9.0.0.dev501.pom
    ├── flight-core-9.0.0.dev501.jar
    └── flight-core-9.0.0.dev501.pom
    
  4. Install the artifacts to the local Maven repository with mvn install:install-file:

    $ mvn install:install-file -Dfile="$(pwd)/arrow-java-root-9.0.0.dev501.pom" -DgroupId=org.apache.arrow -DartifactId=arrow-java-root -Dversion=9.0.0.dev501 -Dpackaging=pom
    $ mvn install:install-file -Dfile="$(pwd)/arrow-format-9.0.0.dev501.pom" -DgroupId=org.apache.arrow -DartifactId=arrow-format -Dversion=9.0.0.dev501 -Dpackaging=pom
    $ mvn install:install-file -Dfile="$(pwd)/arrow-format-9.0.0.dev501.jar" -DgroupId=org.apache.arrow -DartifactId=arrow-format -Dversion=9.0.0.dev501 -Dpackaging=jar
    $ mvn install:install-file -Dfile="$(pwd)/arrow-vector-9.0.0.dev501.pom" -DgroupId=org.apache.arrow -DartifactId=arrow-vector -Dversion=9.0.0.dev501 -Dpackaging=pom
    $ mvn install:install-file -Dfile="$(pwd)/arrow-vector-9.0.0.dev501.jar" -DgroupId=org.apache.arrow -DartifactId=arrow-vector -Dversion=9.0.0.dev501 -Dpackaging=jar
    $ mvn install:install-file -Dfile="$(pwd)/arrow-memory-9.0.0.dev501.pom" -DgroupId=org.apache.arrow -DartifactId=arrow-memory -Dversion=9.0.0.dev501 -Dpackaging=pom
    $ mvn install:install-file -Dfile="$(pwd)/arrow-memory-core-9.0.0.dev501.pom" -DgroupId=org.apache.arrow -DartifactId=arrow-memory-core -Dversion=9.0.0.dev501 -Dpackaging=pom
    $ mvn install:install-file -Dfile="$(pwd)/arrow-memory-netty-9.0.0.dev501.pom" -DgroupId=org.apache.arrow -DartifactId=arrow-memory-netty -Dversion=9.0.0.dev501 -Dpackaging=pom
    $ mvn install:install-file -Dfile="$(pwd)/arrow-memory-core-9.0.0.dev501.jar" -DgroupId=org.apache.arrow -DartifactId=arrow-memory-core -Dversion=9.0.0.dev501 -Dpackaging=jar
    $ mvn install:install-file -Dfile="$(pwd)/arrow-memory-netty-9.0.0.dev501.jar" -DgroupId=org.apache.arrow -DartifactId=arrow-memory-netty -Dversion=9.0.0.dev501 -Dpackaging=jar
    $ mvn install:install-file -Dfile="$(pwd)/arrow-flight-9.0.0.dev501.pom" -DgroupId=org.apache.arrow -DartifactId=arrow-flight -Dversion=9.0.0.dev501 -Dpackaging=pom
    $ mvn install:install-file -Dfile="$(pwd)/flight-core-9.0.0.dev501.pom" -DgroupId=org.apache.arrow -DartifactId=flight-core -Dversion=9.0.0.dev501 -Dpackaging=pom
    $ mvn install:install-file -Dfile="$(pwd)/flight-core-9.0.0.dev501.jar" -DgroupId=org.apache.arrow -DartifactId=flight-core -Dversion=9.0.0.dev501 -Dpackaging=jar
    
  5. Validate that the packages were installed:

    $ tree ~/.m2/repository/org/apache/arrow
    .
    ├── arrow-flight
    │   └── 9.0.0.dev501
    │       └── arrow-flight-9.0.0.dev501.pom
    ├── arrow-format
    │   └── 9.0.0.dev501
    │       ├── arrow-format-9.0.0.dev501.jar
    │       └── arrow-format-9.0.0.dev501.pom
    ├── arrow-java-root
    │   └── 9.0.0.dev501
    │       └── arrow-java-root-9.0.0.dev501.pom
    ├── arrow-memory
    │   └── 9.0.0.dev501
    │       └── arrow-memory-9.0.0.dev501.pom
    ├── arrow-memory-core
    │   └── 9.0.0.dev501
    │       ├── arrow-memory-core-9.0.0.dev501.jar
    │       └── arrow-memory-core-9.0.0.dev501.pom
    ├── arrow-memory-netty
    │   └── 9.0.0.dev501
    │       ├── arrow-memory-netty-9.0.0.dev501.jar
    │       └── arrow-memory-netty-9.0.0.dev501.pom
    ├── arrow-vector
    │   └── 9.0.0.dev501
    │       ├── _remote.repositories
    │       ├── arrow-vector-9.0.0.dev501.jar
    │       └── arrow-vector-9.0.0.dev501.pom
    └── flight-core
        └── 9.0.0.dev501
            ├── flight-core-9.0.0.dev501.jar
            └── flight-core-9.0.0.dev501.pom
    
  6. Compile your project as usual with mvn clean install.
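
The download and install steps above repeat the same command for every file, so they can be expressed as two small loops. The following is a sketch under assumptions: nightly_urls and install_nightly_files are hypothetical helpers, the URL and version are the 2022-07-30 example values used above, and file names are assumed to follow the <artifactId>-<version>.<pom|jar> pattern.

```shell
base_url="https://github.com/ursacomputing/crossbow/releases/download/nightly-packaging-2022-07-30-0-github-java-jars"
version="9.0.0.dev501"

# Print the download URL of every file needed for flight-core.
nightly_urls() {
  for artifact in arrow-java-root arrow-memory arrow-flight \
      arrow-format arrow-vector arrow-memory-core arrow-memory-netty flight-core; do
    echo "$base_url/$artifact-$version.pom"
  done
  # Only these artifacts have a jar; the others are parent POMs.
  for artifact in arrow-format arrow-vector arrow-memory-core arrow-memory-netty flight-core; do
    echo "$base_url/$artifact-$version.jar"
  done
}

# Install each given file into the local Maven repository, deriving the
# artifactId and packaging from the file name.
# Usage: install_nightly_files <version> <file>...
install_nightly_files() {
  ver="$1"
  shift
  for file in "$@"; do
    name="${file##*/}"              # strip leading directories
    packaging="${name##*.}"         # "pom" or "jar"
    artifact_id="${name%-"$ver".*}" # strip "-<version>.<extension>"
    mvn install:install-file \
      -Dfile="$file" \
      -DgroupId=org.apache.arrow \
      -DartifactId="$artifact_id" \
      -Dversion="$ver" \
      -Dpackaging="$packaging" || return 1
  done
}

# Possible usage, run inside the temporary download directory:
#   nightly_urls | xargs -n 1 wget
#   install_nightly_files "$version" "$(pwd)"/*.pom "$(pwd)"/*.jar
```

In the usage lines the POM files are passed before the jars, mirroring the order of the individual commands above, where each POM is installed before the jar of the same artifact.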

Installing Staging Packages

Warning

These packages are not official releases. Use them at your own risk.

Arrow staging builds are created when a Release Candidate (RC) is being prepared. This allows users to test the RC in their applications before voting on the release.

Installing from Apache Staging

  1. Look up the next version number for the Arrow libraries used.

  2. Add the Apache Staging repository to your Maven/Gradle project:

    <properties>
       <arrow.version>9.0.0</arrow.version>
    </properties>
    ...
    <repositories>
       <repository>
             <id>arrow-apache-staging</id>
             <url>https://repository.apache.org/content/repositories/staging</url>
       </repository>
    </repositories>
    ...
    <dependencies>
       <dependency>
             <groupId>org.apache.arrow</groupId>
             <artifactId>arrow-vector</artifactId>
             <version>${arrow.version}</version>
       </dependency>
    </dependencies>
    ...