Building Arrow Java#

System Setup#

Arrow Java uses the Maven build system.

Building requires:

  • JDK 11+

  • Maven 3+

Note

CI will test all supported JDK LTS versions, plus the latest non-LTS version.
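
Before starting a build, it can help to confirm that the JDK on the PATH meets the minimum version. The following is a minimal sketch rather than part of the Arrow build: the helper name `java_major` is hypothetical, and it assumes the version string follows either the legacy `1.x` scheme or the modern `N.x.y` scheme.

```shell
# Hypothetical helper: extract the major version from a Java version string.
# Handles the legacy scheme ("1.8.0_292" -> 8) and the modern one ("17.0.2" -> 17).
java_major() {
  case "$1" in
    1.*) echo "$1" | cut -d. -f2 ;;
    *)   echo "$1" | cut -d. -f1 | cut -d- -f1 ;;
  esac
}

# Only run the check when a JDK is actually on the PATH.
if command -v java >/dev/null 2>&1; then
  # `java -version` prints to stderr; the version is the first quoted string.
  version=$(java -version 2>&1 | sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1)
  major=$(java_major "$version")
  if [ "${major:-0}" -ge 11 ]; then
    echo "JDK $version is new enough"
  else
    echo "JDK $version is too old; Arrow Java needs JDK 11+"
  fi
fi
```

The Maven version can be checked by hand in the same way with `mvn --version`.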

Building#

All the instructions below assume that you have cloned the Arrow git repository:

$ git clone https://github.com/apache/arrow.git
$ cd arrow
$ git submodule update --init --recursive

Arrow Java modules can be compiled with any of the following:

  • Maven build tool.

  • Docker Compose.

  • Archery.

Building Java Modules#

To build the default modules, go to the project root and execute:

Maven#

$ cd arrow/java
$ export JAVA_HOME=<absolute path to your java home>
$ java --version
$ mvn clean install

Docker compose#

$ cd arrow/java
$ export JAVA_HOME=<absolute path to your java home>
$ java --version
$ docker compose run java

Archery#

$ cd arrow/java
$ export JAVA_HOME=<absolute path to your java home>
$ java --version
$ archery docker run java

Building JNI Libraries (*.dylib / *.so / *.dll)#

First, build the C++ shared libraries that the JNI bindings will use. These can be built manually, or with Archery inside a Docker container (this requires installing Docker, Docker Compose, and Archery).

Note

If you are building on Apple Silicon, be sure to use a JDK version that was compiled for that architecture. See, for example, the Azul JDK.

If you are building on Windows OS, see Developing on Windows.

Maven#

  • To build only the JNI C Data Interface library (macOS / Linux):

    $ cd arrow/java
    $ export JAVA_HOME=<absolute path to your java home>
    $ java --version
    $ mvn generate-resources -Pgenerate-libs-cdata-all-os -N
    $ ls -latr ../java-dist/lib
    |__ arrow_cdata_jni/
    
  • To build only the JNI C Data Interface library (Windows):

    $ cd arrow/java
    $ mvn generate-resources -Pgenerate-libs-cdata-all-os -N
    $ dir "../java-dist/bin"
    |__ arrow_cdata_jni/
    
  • To build all JNI libraries (macOS / Linux) except the JNI C Data Interface library:

    $ cd arrow/java
    $ export JAVA_HOME=<absolute path to your java home>
    $ java --version
    $ mvn generate-resources -Pgenerate-libs-jni-macos-linux -N
    $ ls -latr ../java-dist/lib
    |__ arrow_dataset_jni/
    |__ arrow_orc_jni/
    |__ gandiva_jni/
    
  • To build all JNI libraries (Windows) except the JNI C Data Interface library:

    $ cd arrow/java
    $ mvn generate-resources -Pgenerate-libs-jni-windows -N
    $ dir "../java-dist/bin"
    |__ arrow_dataset_jni/
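
The JNI profile names above differ by platform, so a wrapper script has to pick one. The sketch below shows one possible way to do that from `uname -s`. The helper name `jni_profile_for` is hypothetical, and the `MINGW*`, `MSYS*`, and `CYGWIN*` patterns assume a POSIX-style shell on Windows such as Git Bash. The command is printed rather than run.

```shell
# Hypothetical helper: map a `uname -s` value to the Maven profile named above.
# Returns non-zero when the platform is not recognised.
jni_profile_for() {
  case "$1" in
    Linux|Darwin)         echo "generate-libs-jni-macos-linux" ;;
    MINGW*|MSYS*|CYGWIN*) echo "generate-libs-jni-windows" ;;
    *)                    return 1 ;;
  esac
}

if profile=$(jni_profile_for "$(uname -s)"); then
  # Print the command instead of running it, so the sketch has no side effects.
  echo "mvn generate-resources -P${profile} -N"
else
  echo "no JNI profile known for this platform" >&2
fi
```

The C Data Interface profile, `generate-libs-cdata-all-os`, is the same on every platform and needs no such mapping.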
    

CMake#

  • To build only the JNI C Data Interface library (macOS / Linux):

    $ cd arrow
    $ mkdir -p java-dist java-cdata
    $ cmake \
        -S java \
        -B java-cdata \
        -DARROW_JAVA_JNI_ENABLE_C=ON \
        -DARROW_JAVA_JNI_ENABLE_DEFAULT=OFF \
        -DBUILD_TESTING=OFF \
        -DCMAKE_BUILD_TYPE=Release \
        -DCMAKE_INSTALL_PREFIX=java-dist
    $ cmake --build java-cdata --target install --config Release
    $ ls -latr java-dist/lib
    |__ arrow_cdata_jni/
    
  • To build only the JNI C Data Interface library (Windows):

    $ cd arrow
    $ mkdir java-dist, java-cdata
    $ cmake ^
        -S java ^
        -B java-cdata ^
        -DARROW_JAVA_JNI_ENABLE_C=ON ^
        -DARROW_JAVA_JNI_ENABLE_DEFAULT=OFF ^
        -DBUILD_TESTING=OFF ^
        -DCMAKE_BUILD_TYPE=Release ^
        -DCMAKE_INSTALL_PREFIX=java-dist
    $ cmake --build java-cdata --target install --config Release
    $ dir "java-dist/bin"
    |__ arrow_cdata_jni/
    
  • To build all JNI libraries (macOS / Linux) except the JNI C Data Interface library:

    $ cd arrow
    $ brew bundle --file=cpp/Brewfile
    # Homebrew Bundle complete! 25 Brewfile dependencies now installed.
    $ brew uninstall aws-sdk-cpp
    #  (We can't use aws-sdk-cpp installed by Homebrew because it has
    #  an issue: https://github.com/aws/aws-sdk-cpp/issues/1809 )
    $ export JAVA_HOME=<absolute path to your java home>
    $ mkdir -p java-dist cpp-jni
    $ cmake \
        -S cpp \
        -B cpp-jni \
        -DARROW_BUILD_SHARED=OFF \
        -DARROW_CSV=ON \
        -DARROW_DATASET=ON \
        -DARROW_DEPENDENCY_SOURCE=BUNDLED \
        -DARROW_DEPENDENCY_USE_SHARED=OFF \
        -DARROW_FILESYSTEM=ON \
        -DARROW_GANDIVA=ON \
        -DARROW_GANDIVA_STATIC_LIBSTDCPP=ON \
        -DARROW_JSON=ON \
        -DARROW_ORC=ON \
        -DARROW_PARQUET=ON \
        -DARROW_S3=ON \
        -DARROW_SUBSTRAIT=ON \
        -DARROW_USE_CCACHE=ON \
        -DCMAKE_BUILD_TYPE=Release \
        -DCMAKE_INSTALL_PREFIX=java-dist \
        -DCMAKE_UNITY_BUILD=ON
    $ cmake --build cpp-jni --target install --config Release
    $ cmake \
        -S java \
        -B java-jni \
        -DARROW_JAVA_JNI_ENABLE_C=OFF \
        -DARROW_JAVA_JNI_ENABLE_DEFAULT=ON \
        -DBUILD_TESTING=OFF \
        -DCMAKE_BUILD_TYPE=Release \
        -DCMAKE_INSTALL_PREFIX=java-dist \
        -DCMAKE_PREFIX_PATH=$PWD/java-dist \
        -DProtobuf_ROOT=$PWD/cpp-jni/protobuf_ep-install \
        -DProtobuf_USE_STATIC_LIBS=ON
    $ cmake --build java-jni --target install --config Release
    $ ls -latr java-dist/lib/
    |__ arrow_dataset_jni/
    |__ arrow_orc_jni/
    |__ gandiva_jni/
    
  • To build all JNI libraries (Windows) except the JNI C Data Interface library:

    $ cd arrow
    $ mkdir java-dist, cpp-jni
    $ cmake ^
        -S cpp ^
        -B cpp-jni ^
        -DARROW_BUILD_SHARED=OFF ^
        -DARROW_CSV=ON ^
        -DARROW_DATASET=ON ^
        -DARROW_DEPENDENCY_USE_SHARED=OFF ^
        -DARROW_FILESYSTEM=ON ^
        -DARROW_GANDIVA=OFF ^
        -DARROW_JSON=ON ^
        -DARROW_ORC=ON ^
        -DARROW_PARQUET=ON ^
        -DARROW_S3=ON ^
        -DARROW_SUBSTRAIT=ON ^
        -DARROW_USE_CCACHE=ON ^
        -DARROW_WITH_BROTLI=ON ^
        -DARROW_WITH_LZ4=ON ^
        -DARROW_WITH_SNAPPY=ON ^
        -DARROW_WITH_ZLIB=ON ^
        -DARROW_WITH_ZSTD=ON ^
        -DCMAKE_BUILD_TYPE=Release ^
        -DCMAKE_INSTALL_PREFIX=java-dist ^
        -DCMAKE_UNITY_BUILD=ON ^
        -GNinja
    $ cd cpp-jni
    $ ninja install
    $ cd ../
    $ cmake ^
        -S java ^
        -B java-jni ^
        -DARROW_JAVA_JNI_ENABLE_C=OFF ^
        -DARROW_JAVA_JNI_ENABLE_DATASET=ON ^
        -DARROW_JAVA_JNI_ENABLE_DEFAULT=ON ^
        -DARROW_JAVA_JNI_ENABLE_GANDIVA=OFF ^
        -DARROW_JAVA_JNI_ENABLE_ORC=ON ^
        -DBUILD_TESTING=OFF ^
        -DCMAKE_BUILD_TYPE=Release ^
        -DCMAKE_INSTALL_PREFIX=java-dist ^
        -DCMAKE_PREFIX_PATH=$PWD/java-dist
    $ cmake --build java-jni --target install --config Release
    $ dir "java-dist/bin"
    |__ arrow_orc_jni/
    |__ arrow_dataset_jni/
    

Archery#

$ cd arrow
$ archery docker run java-jni-manylinux-2014
$ ls -latr java-dist
|__ arrow_cdata_jni/
|__ arrow_dataset_jni/
|__ arrow_orc_jni/
|__ gandiva_jni/

Building Java JNI Modules#

  • To compile the JNI bindings for the C Data Interface, use the arrow-c-data Maven profile:

    $ cd arrow/java
    $ mvn -Darrow.c.jni.dist.dir=<absolute path to your arrow folder>/java-dist/lib -Parrow-c-data clean install
    
  • To compile the JNI bindings for ORC / Gandiva / Dataset, use the arrow-jni Maven profile:

    $ cd arrow/java
    $ mvn \
        -Darrow.cpp.build.dir=<absolute path to your arrow folder>/java-dist/lib/ \
        -Darrow.c.jni.dist.dir=<absolute path to your arrow folder>/java-dist/lib/ \
        -Parrow-jni clean install
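
Both Maven invocations above point at a directory of prebuilt JNI libraries, so a missing library directory tends to surface only partway through the build. The following sketch checks for the expected directories first. The helper name `check_jni_dist` and the `ARROW_HOME` variable are hypothetical; the directory names are the ones listed in the macOS / Linux build output earlier (on Windows the libraries land in `java-dist/bin` instead).

```shell
# Hypothetical helper: report which expected JNI library directories are
# missing from a dist directory. Returns non-zero if any is missing.
check_jni_dist() {
  dist_dir=$1
  shift
  missing=0
  for name in "$@"; do
    if [ ! -d "$dist_dir/$name" ]; then
      echo "missing: $dist_dir/$name"
      missing=1
    fi
  done
  return "$missing"
}

# ARROW_HOME is an assumed variable pointing at the cloned arrow folder.
ARROW_HOME=${ARROW_HOME:-$PWD}
check_jni_dist "$ARROW_HOME/java-dist/lib" arrow_dataset_jni arrow_orc_jni gandiva_jni \
  || echo "build the JNI libraries before running mvn -Parrow-jni"
```

For the arrow-c-data profile, the same helper can be called with `arrow_cdata_jni` as the only expected directory.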
    

Testing#

By default, Maven uses the same Java version to both build the code and run the tests.

It is also possible to use a different JDK version for the tests. This requires Maven toolchains to be configured beforehand, and then a specific test property needs to be set.

Configuring Maven toolchains#

To use a JDK for testing, first register it in the Maven toolchains.xml configuration file, usually located under ${HOME}/.m2, by adding the following snippet:

<?xml version="1.0" encoding="UTF-8"?>
<toolchains>

  [...]

  <toolchain>
    <type>jdk</type>
    <provides>
      <version>21</version> <!-- Replace with the corresponding JDK version: 11, 17, ... -->
      <vendor>temurin</vendor> <!-- Replace with the vendor/distribution: temurin, oracle, zulu ... -->
    </provides>
    <configuration>
      <jdkHome>path/to/jdk/home</jdkHome> <!-- Replace with the path to the JDK -->
    </configuration>
  </toolchain>

  [...]

</toolchains>
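
For a machine that has no toolchains.xml yet, the file can be generated from a template. The following is a minimal sketch under that assumption: the helper name `print_toolchains` is hypothetical, the version and vendor passed to it are placeholders, and the script only prints the file so that an existing configuration is never overwritten.

```shell
# Hypothetical helper: print a toolchains.xml that registers a single JDK.
# Arguments: <version> <vendor> <jdk home>
print_toolchains() {
  cat <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<toolchains>
  <toolchain>
    <type>jdk</type>
    <provides>
      <version>$1</version>
      <vendor>$2</vendor>
    </provides>
    <configuration>
      <jdkHome>$3</jdkHome>
    </configuration>
  </toolchain>
</toolchains>
EOF
}

target="${HOME}/.m2/toolchains.xml"
if [ -e "$target" ]; then
  # Do not clobber toolchains that are already registered.
  echo "$target already exists; add the <toolchain> element by hand"
else
  echo "no $target found; a minimal file would look like this:"
  print_toolchains 17 temurin "${JAVA_HOME:-/path/to/jdk/home}"
fi
```

To create the file, redirect the helper's output, for example `print_toolchains 17 temurin "$JAVA_HOME" > ~/.m2/toolchains.xml`.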

Testing with a specific JDK#

To run Arrow tests with a specific JDK version, use the arrow.test.jdk-version property.

For example, to run Arrow tests with JDK 17, use the following snippet:

$ cd arrow/java
$ mvn -Darrow.test.jdk-version=17 clean verify

IDE Configuration#

IntelliJ#

To start working on Arrow in IntelliJ: build the project once from the command line using mvn clean install. Then open the java/ subdirectory of the Arrow repository, and update the following settings:

  • In the Files tool window, find the path vector/target/generated-sources, right click the directory, and select Mark Directory as > Generated Sources Root. There is no need to mark other generated sources directories, as only the vector module generates sources.

  • For JDK 11, due to an IntelliJ bug, you must go into Settings > Build, Execution, Deployment > Compiler > Java Compiler and disable “Use ‘--release’ option for cross-compilation (Java 9 and later)”. Otherwise you will get an error like “package sun.misc does not exist”.

  • You may want to disable error-prone entirely if it gives spurious warnings (disable both error-prone profiles in the Maven tool window and “Reload All Maven Projects”).

  • If using IntelliJ’s Maven integration to build, you may need to change <fork> to false in the pom.xml files due to an IntelliJ bug.

  • To enable debugging JNI-based modules like dataset, activate specific profiles in the Maven tab under “Profiles”. Ensure the profiles arrow-c-data, arrow-jni, generate-libs-cdata-all-os, generate-libs-jni-macos-linux, and jdk11+ are enabled, so that the IDE can build them and enable debugging.

You may not need to update all of these settings if you build/test with the IntelliJ Maven integration instead of with IntelliJ directly.

Common Errors#

  • When working with the JNI code: if the C++ build cannot find dependencies, with errors like these:

    Could NOT find Boost (missing: Boost_INCLUDE_DIR system filesystem)
    Could NOT find Lz4 (missing: LZ4_LIB)
    Could NOT find zstd (missing: ZSTD_LIB)
    

    Specify that the dependencies should be downloaded at build time (more details at Dependency Resolution):

    -Dre2_SOURCE=BUNDLED \
    -DBoost_SOURCE=BUNDLED \
    -Dutf8proc_SOURCE=BUNDLED \
    -DSnappy_SOURCE=BUNDLED \
    -DORC_SOURCE=BUNDLED \
    -DZLIB_SOURCE=BUNDLED
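
These flags are passed to the `cmake` configure step for the C++ build. As a rough sketch, they can be generated from a list of dependency names and appended to the configure command. The helper name `bundled_flags` is hypothetical, the `-S cpp -B cpp-jni` arguments are taken from the CMake instructions earlier, and the command is printed rather than run.

```shell
# Hypothetical helper: turn dependency names into -D<name>_SOURCE=BUNDLED flags.
bundled_flags() {
  for dep in "$@"; do
    printf '%s ' "-D${dep}_SOURCE=BUNDLED"
  done
}

# Dependency names are the ones from the list above.
flags=$(bundled_flags re2 Boost utf8proc Snappy ORC ZLIB)

# Print the resulting configure command instead of running it.
echo "cmake -S cpp -B cpp-jni ${flags}"
```

The remaining options from the full configure command still need to be supplied alongside these flags.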
    

Installing Nightly Packages#

Warning

These packages are not official releases. Use them at your own risk.

Arrow nightly builds are posted on the mailing list at builds@arrow.apache.org. The artifacts are uploaded to GitHub. For example, for 2022/07/30, they can be found at GitHub Nightly.

Installing from Apache Nightlies#

  1. Look up the nightly version number for the Arrow libraries used.

    For example, for arrow-memory, visit https://nightlies.apache.org/arrow/java/org/apache/arrow/arrow-memory/ and see what versions are available (e.g. 9.0.0.dev501).

  2. Add Apache Nightlies Repository to the Maven/Gradle project.

    <properties>
       <arrow.version>9.0.0.dev501</arrow.version>
    </properties>
    ...
    <repositories>
       <repository>
             <id>arrow-apache-nightlies</id>
             <url>https://nightlies.apache.org/arrow/java</url>
       </repository>
    </repositories>
    ...
    <dependencies>
       <dependency>
             <groupId>org.apache.arrow</groupId>
             <artifactId>arrow-vector</artifactId>
             <version>${arrow.version}</version>
       </dependency>
    </dependencies>
    ...
    

Installing Manually#

  1. Decide which nightly packages repository to use, for example: ursacomputing/crossbow

  2. Add packages to your pom.xml, for example: flight-core (it depends on: arrow-format, arrow-vector, arrow-memory-core and arrow-memory-netty).

    <properties>
       <maven.compiler.source>8</maven.compiler.source>
       <maven.compiler.target>8</maven.compiler.target>
       <arrow.version>9.0.0.dev501</arrow.version>
    </properties>
    
    <dependencies>
       <dependency>
             <groupId>org.apache.arrow</groupId>
             <artifactId>flight-core</artifactId>
             <version>${arrow.version}</version>
       </dependency>
    </dependencies>
    
  3. Download the necessary pom and jar files to a temporary directory:

    $ mkdir nightly-packaging-2022-07-30-0-github-java-jars
    $ cd nightly-packaging-2022-07-30-0-github-java-jars
    $ wget https://github.com/ursacomputing/crossbow/releases/download/nightly-packaging-2022-07-30-0-github-java-jars/arrow-java-root-9.0.0.dev501.pom
    $ wget https://github.com/ursacomputing/crossbow/releases/download/nightly-packaging-2022-07-30-0-github-java-jars/arrow-format-9.0.0.dev501.pom
    $ wget https://github.com/ursacomputing/crossbow/releases/download/nightly-packaging-2022-07-30-0-github-java-jars/arrow-format-9.0.0.dev501.jar
    $ wget https://github.com/ursacomputing/crossbow/releases/download/nightly-packaging-2022-07-30-0-github-java-jars/arrow-vector-9.0.0.dev501.pom
    $ wget https://github.com/ursacomputing/crossbow/releases/download/nightly-packaging-2022-07-30-0-github-java-jars/arrow-vector-9.0.0.dev501.jar
    $ wget https://github.com/ursacomputing/crossbow/releases/download/nightly-packaging-2022-07-30-0-github-java-jars/arrow-memory-9.0.0.dev501.pom
    $ wget https://github.com/ursacomputing/crossbow/releases/download/nightly-packaging-2022-07-30-0-github-java-jars/arrow-memory-core-9.0.0.dev501.pom
    $ wget https://github.com/ursacomputing/crossbow/releases/download/nightly-packaging-2022-07-30-0-github-java-jars/arrow-memory-netty-9.0.0.dev501.pom
    $ wget https://github.com/ursacomputing/crossbow/releases/download/nightly-packaging-2022-07-30-0-github-java-jars/arrow-memory-core-9.0.0.dev501.jar
    $ wget https://github.com/ursacomputing/crossbow/releases/download/nightly-packaging-2022-07-30-0-github-java-jars/arrow-memory-netty-9.0.0.dev501.jar
    $ wget https://github.com/ursacomputing/crossbow/releases/download/nightly-packaging-2022-07-30-0-github-java-jars/arrow-flight-9.0.0.dev501.pom
    $ wget https://github.com/ursacomputing/crossbow/releases/download/nightly-packaging-2022-07-30-0-github-java-jars/flight-core-9.0.0.dev501.pom
    $ wget https://github.com/ursacomputing/crossbow/releases/download/nightly-packaging-2022-07-30-0-github-java-jars/flight-core-9.0.0.dev501.jar
    $ tree
    .
    ├── arrow-flight-9.0.0.dev501.pom
    ├── arrow-format-9.0.0.dev501.jar
    ├── arrow-format-9.0.0.dev501.pom
    ├── arrow-java-root-9.0.0.dev501.pom
    ├── arrow-memory-9.0.0.dev501.pom
    ├── arrow-memory-core-9.0.0.dev501.jar
    ├── arrow-memory-core-9.0.0.dev501.pom
    ├── arrow-memory-netty-9.0.0.dev501.jar
    ├── arrow-memory-netty-9.0.0.dev501.pom
    ├── arrow-vector-9.0.0.dev501.jar
    ├── arrow-vector-9.0.0.dev501.pom
    ├── flight-core-9.0.0.dev501.jar
    └── flight-core-9.0.0.dev501.pom
    
  4. Install the artifacts to the local Maven repository with mvn install:install-file:

    $ mvn install:install-file -Dfile="$(pwd)/arrow-java-root-9.0.0.dev501.pom" -DgroupId=org.apache.arrow -DartifactId=arrow-java-root -Dversion=9.0.0.dev501 -Dpackaging=pom
    $ mvn install:install-file -Dfile="$(pwd)/arrow-format-9.0.0.dev501.pom" -DgroupId=org.apache.arrow -DartifactId=arrow-format -Dversion=9.0.0.dev501 -Dpackaging=pom
    $ mvn install:install-file -Dfile="$(pwd)/arrow-format-9.0.0.dev501.jar" -DgroupId=org.apache.arrow -DartifactId=arrow-format -Dversion=9.0.0.dev501 -Dpackaging=jar
    $ mvn install:install-file -Dfile="$(pwd)/arrow-vector-9.0.0.dev501.pom" -DgroupId=org.apache.arrow -DartifactId=arrow-vector -Dversion=9.0.0.dev501 -Dpackaging=pom
    $ mvn install:install-file -Dfile="$(pwd)/arrow-vector-9.0.0.dev501.jar" -DgroupId=org.apache.arrow -DartifactId=arrow-vector -Dversion=9.0.0.dev501 -Dpackaging=jar
    $ mvn install:install-file -Dfile="$(pwd)/arrow-memory-9.0.0.dev501.pom" -DgroupId=org.apache.arrow -DartifactId=arrow-memory -Dversion=9.0.0.dev501 -Dpackaging=pom
    $ mvn install:install-file -Dfile="$(pwd)/arrow-memory-core-9.0.0.dev501.pom" -DgroupId=org.apache.arrow -DartifactId=arrow-memory-core -Dversion=9.0.0.dev501 -Dpackaging=pom
    $ mvn install:install-file -Dfile="$(pwd)/arrow-memory-netty-9.0.0.dev501.pom" -DgroupId=org.apache.arrow -DartifactId=arrow-memory-netty -Dversion=9.0.0.dev501 -Dpackaging=pom
    $ mvn install:install-file -Dfile="$(pwd)/arrow-memory-core-9.0.0.dev501.jar" -DgroupId=org.apache.arrow -DartifactId=arrow-memory-core -Dversion=9.0.0.dev501 -Dpackaging=jar
    $ mvn install:install-file -Dfile="$(pwd)/arrow-memory-netty-9.0.0.dev501.jar" -DgroupId=org.apache.arrow -DartifactId=arrow-memory-netty -Dversion=9.0.0.dev501 -Dpackaging=jar
    $ mvn install:install-file -Dfile="$(pwd)/arrow-flight-9.0.0.dev501.pom" -DgroupId=org.apache.arrow -DartifactId=arrow-flight -Dversion=9.0.0.dev501 -Dpackaging=pom
    $ mvn install:install-file -Dfile="$(pwd)/flight-core-9.0.0.dev501.pom" -DgroupId=org.apache.arrow -DartifactId=flight-core -Dversion=9.0.0.dev501 -Dpackaging=pom
    $ mvn install:install-file -Dfile="$(pwd)/flight-core-9.0.0.dev501.jar" -DgroupId=org.apache.arrow -DartifactId=flight-core -Dversion=9.0.0.dev501 -Dpackaging=jar
    
  5. Validate that the packages were installed:

    $ tree ~/.m2/repository/org/apache/arrow
    .
    ├── arrow-flight
    │   ├── 9.0.0.dev501
    │   │   └── arrow-flight-9.0.0.dev501.pom
    ├── arrow-format
    │   ├── 9.0.0.dev501
    │   │   ├── arrow-format-9.0.0.dev501.jar
    │   │   └── arrow-format-9.0.0.dev501.pom
    ├── arrow-java-root
    │   ├── 9.0.0.dev501
    │   │   └── arrow-java-root-9.0.0.dev501.pom
    ├── arrow-memory
    │   ├── 9.0.0.dev501
    │   │   └── arrow-memory-9.0.0.dev501.pom
    ├── arrow-memory-core
    │   ├── 9.0.0.dev501
    │   │   ├── arrow-memory-core-9.0.0.dev501.jar
    │   │   └── arrow-memory-core-9.0.0.dev501.pom
    ├── arrow-memory-netty
    │   ├── 9.0.0.dev501
    │   │   ├── arrow-memory-netty-9.0.0.dev501.jar
    │   │   └── arrow-memory-netty-9.0.0.dev501.pom
    ├── arrow-vector
    │   ├── 9.0.0.dev501
    │   │   ├── _remote.repositories
    │   │   ├── arrow-vector-9.0.0.dev501.jar
    │   │   └── arrow-vector-9.0.0.dev501.pom
    └── flight-core
       ├── 9.0.0.dev501
       │   ├── flight-core-9.0.0.dev501.jar
       │   └── flight-core-9.0.0.dev501.pom
    
  6. Compile your project as usual with mvn clean install.
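
The install commands in step 4 all follow one pattern, so they can be derived from the downloaded file names. The sketch below is one possible way to do that, assuming every file is named `<artifactId>-<version>.<pom|jar>` as in step 3. The helper name `install_cmd` is hypothetical, and the commands are printed rather than executed.

```shell
# Hypothetical helper: build the install:install-file command for one file.
# Assumes names of the form <artifactId>-<version>.<pom|jar>.
install_cmd() {
  file=$1
  version=$2
  base=${file##*/}                 # strip any directory part
  packaging=${base##*.}            # "pom" or "jar"
  artifact=${base%-"$version".*}   # drop "-<version>.<ext>"
  echo "mvn install:install-file -Dfile=\"$file\" -DgroupId=org.apache.arrow" \
       "-DartifactId=$artifact -Dversion=$version -Dpackaging=$packaging"
}

version=9.0.0.dev501
# Print one command per downloaded artifact in the current directory.
for f in ./*-"$version".pom ./*-"$version".jar; do
  [ -e "$f" ] || continue   # skip glob patterns that matched nothing
  install_cmd "$f" "$version"
done
```

After reviewing the printed commands, they can be executed by piping the output to `sh`.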

Installing Staging Packages#

Warning

These packages are not official releases. Use them at your own risk.

Arrow staging builds are created when a Release Candidate (RC) is being prepared. This allows users to test the RC in their applications before voting on the release.

Installing from Apache Staging#

  1. Look up the next version number for the Arrow libraries used.

  2. Add Apache Staging Repository to the Maven/Gradle project.

    <properties>
       <arrow.version>9.0.0</arrow.version>
    </properties>
    ...
    <repositories>
       <repository>
             <id>arrow-apache-staging</id>
             <url>https://repository.apache.org/content/repositories/staging</url>
       </repository>
    </repositories>
    ...
    <dependencies>
       <dependency>
             <groupId>org.apache.arrow</groupId>
             <artifactId>arrow-vector</artifactId>
             <version>${arrow.version}</version>
       </dependency>
    </dependencies>
    ...