Java Algorithms

Arrow’s Java library provides algorithms for some commonly-used functionalities. The algorithms are provided in the org.apache.arrow.algorithm package of the algorithm module.

Comparing Vector Elements

Comparing vector elements is the basic for many algorithms. Vector elements can be compared in one of the two ways:

1. Equality comparison: there are two possible results for this type of comparisons: equal and unequal. Currently, this type of comparison is supported through the org.apache.arrow.vector.compare.VectorValueEqualizer interface.

2. Ordering comparison: there are three possible results for this type of comparisons: less than, equal to and greater than. This comparison is supported by the abstract class org.apache.arrow.algorithm.sort.VectorValueComparator.

We provide default implementations to compare vector elements. However, users can also define ways for customized comparisons.

Vector Sorting

Given a vector, a sorting algorithm turns it into a sorted one. The sorting criteria must be specified by some ordering comparison operation. The sorting algorithms can be classified into the following categories:

1. In-place sorter: an in-place sorter performs the sorting by manipulating the original vector, without creating any new vector. So it just returns the original vector after the sorting operations. Currently, we have org.apache.arrow.algorithm.sort.FixedWidthInPlaceVectorSorter for in-place sorting in O(nlog(n)) time. As the name suggests, it only supports fixed width vectors.

2. Out-of-place sorter: an out-of-place sorter does not mutate the original vector. Instead, it copies vector elements to a new vector in sorted order, and returns the new vector. We have org.apache.arrow.algorithm.sort.FixedWidthInPlaceVectorSorter.FixedWidthOutOfPlaceVectorSorter and org.apache.arrow.algorithm.sort.FixedWidthInPlaceVectorSorter.VariableWidthOutOfPlaceVectorSorter for fixed width and variable width vectors, respectively. Both algorithms run in O(nlog(n)) time.

3. Index sorter: this sorter does not actually sort the vector. Instead, it returns an integer vector, which correspond to indices of vector elements in sorted order. With the index vector, one can easily construct a sorted vector. In addition, some other tasks can be easily achieved, like finding the k``th smallest value in the vector. Index sorting is supported by ``org.apache.arrow.algorithm.sort.IndexSorter, which runs in O(nlog(n)) time. It is applicable to vectors of any type.

Other Algorithms

Other algorithms include vector deduplication, dictionary encoding, etc., in the algorithm module.