Java Algorithms#
Arrow’s Java library provides algorithms for some commonly-used
functionalities. The algorithms are provided in the org.apache.arrow.algorithm
package of the algorithm
module.
Comparing Vector Elements#
Comparing vector elements is the basic for many algorithms. Vector elements can be compared in one of the two ways:
1. Equality comparison: there are two possible results for this type of comparisons: equal
and unequal
.
Currently, this type of comparison is supported through the org.apache.arrow.vector.compare.VectorValueEqualizer
interface.
2. Ordering comparison: there are three possible results for this type of comparisons: less than
, equal to
and greater than
. This comparison is supported by the abstract class org.apache.arrow.algorithm.sort.VectorValueComparator
.
We provide default implementations to compare vector elements. However, users can also define ways for customized comparisons.
Vector Element Search#
A search algorithm tries to find a particular value in a vector. When successful, a vector index is
returned; otherwise, a -1
is returned. The following search algorithms are provided:
1. Linear search: this algorithm simply traverses the vector from the beginning, until a match is
found, or the end of the vector is reached. So it takes O(n)
time, where n
is the number of elements
in the vector. This algorithm is implemented in org.apache.arrow.algorithm.search.VectorSearcher#linearSearch
.
2. Binary search: this represents a more efficient search algorithm, as it runs in O(log(n))
time.
However, it is only applicable to sorted vectors. To get a sorted vector,
one can use one of our sorting algorithms, which will be discussed in the next section. This algorithm
is implemented in org.apache.arrow.algorithm.search.VectorSearcher#binarySearch
.
3. Parallel search: when the vector is large, it takes a long time to traverse the elements to search
for a value. To make this process faster, one can split the vector into multiple partitions, and perform the
search for each partition in parallel. This is supported by org.apache.arrow.algorithm.search.ParallelSearcher
.
4. Range search: for many scenarios, there can be multiple matching values in the vector.
If the vector is sorted, the matching values reside in a contiguous region in the vector. The
range search algorithm tries to find the upper/lower bound of the region in O(log(n))
time.
An implementation is provided in org.apache.arrow.algorithm.search.VectorRangeSearcher
.
Vector Sorting#
Given a vector, a sorting algorithm turns it into a sorted one. The sorting criteria must be specified by some ordering comparison operation. The sorting algorithms can be classified into the following categories:
1. In-place sorter: an in-place sorter performs the sorting by manipulating the original
vector, without creating any new vector. So it just returns the original vector after the sorting operations.
Currently, we have org.apache.arrow.algorithm.sort.FixedWidthInPlaceVectorSorter
for in-place
sorting in O(nlog(n))
time. As the name suggests, it only supports fixed width vectors.
2. Out-of-place sorter: an out-of-place sorter does not mutate the original vector. Instead,
it copies vector elements to a new vector in sorted order, and returns the new vector.
We have org.apache.arrow.algorithm.sort.FixedWidthInPlaceVectorSorter.FixedWidthOutOfPlaceVectorSorter
and org.apache.arrow.algorithm.sort.FixedWidthInPlaceVectorSorter.VariableWidthOutOfPlaceVectorSorter
for fixed width and variable width vectors, respectively. Both algorithms run in O(nlog(n))
time.
3. Index sorter: this sorter does not actually sort the vector. Instead, it returns an integer
vector, which correspond to indices of vector elements in sorted order. With the index vector, one can
easily construct a sorted vector. In addition, some other tasks can be easily achieved, like finding the k``th
smallest value in the vector. Index sorting is supported by ``org.apache.arrow.algorithm.sort.IndexSorter
,
which runs in O(nlog(n))
time. It is applicable to vectors of any type.
Other Algorithms#
Other algorithms include vector deduplication, dictionary encoding, etc., in the algorithm
module.