Arrays

class arrow::Array

Array base type.

Immutable data array with some logical type and some length.

Any memory is owned by the respective Buffer instance (or its parents).

The base class is only required to have a null bitmap buffer if the null count is greater than 0.

If known, the null count can be provided in the base Array constructor. If the null count is not known, pass -1 to indicate that the null count is to be computed on the first call to null_count().

Subclassed by arrow::BaseListArray< TYPE >, arrow::DictionaryArray, arrow::ExtensionArray, arrow::FixedSizeListArray, arrow::FlatArray, arrow::StructArray, arrow::UnionArray, arrow::BaseListArray< LargeListType >, arrow::BaseListArray< ListType >

Public Functions

bool IsNull(int64_t i) const

Return true if the value at index i is null. Does not bounds check.

bool IsValid(int64_t i) const

Return true if the value at index i is valid (not null).

Does not bounds check.

Result<std::shared_ptr<Scalar>> GetScalar(int64_t i) const

Return a Scalar containing the value of this array at i.

int64_t length() const

Size in the number of elements this array contains.

int64_t offset() const

A relative position into another array’s data, to enable zero-copy slicing.

This value defaults to zero.

int64_t null_count() const

The number of null entries in the array.

If the null count was not known at the time of construction (and was set to a negative value), the null count is computed and cached on the first invocation of this function.
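The lazy computation described above can be sketched with a small standard-library model. This is not Arrow's implementation: `ValidityModel` and its fields are hypothetical names, and the sketch assumes a validity bitmap with one bit per slot, least-significant bit first, where a set bit means "valid".

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for an array's validity state; not an Arrow class.
struct ValidityModel {
  std::vector<uint8_t> bitmap;             // empty means "no nulls"
  int64_t offset = 0;                      // slice offset into the bitmap
  int64_t length = 0;                      // number of slots
  mutable int64_t cached_null_count = -1;  // negative means "not yet computed"

  int64_t null_count() const {
    if (cached_null_count < 0) {
      int64_t nulls = 0;
      if (!bitmap.empty()) {
        // The bitmap must cover every slot of the slice.
        assert(static_cast<std::size_t>((offset + length + 7) / 8) <= bitmap.size());
        for (int64_t i = 0; i < length; ++i) {
          const int64_t bit = offset + i;
          if (((bitmap[bit / 8] >> (bit % 8)) & 1) == 0) ++nulls;
        }
      }
      cached_null_count = nulls;  // later calls return the cached value
    }
    return cached_null_count;
  }
};
```

The first call walks the bitmap once; later calls only read the cached member.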

std::shared_ptr<Buffer> null_bitmap() const

Buffer for the validity (null) bitmap, if any.

Note that Union types never have a null bitmap.

Note that for null_count == 0 or for null type, this will be null. This buffer does not account for any slice offset.

const uint8_t *null_bitmap_data() const

Raw pointer to the null bitmap.

Note that for null_count == 0 or for null type, this will be null. This buffer does not account for any slice offset.
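Because the raw bitmap does not account for the slice offset, a caller reading it directly has to add the offset before testing a bit. A minimal sketch, assuming the least-significant-bit-first layout with a set bit meaning "valid"; `IsValidAt` is a hypothetical helper, not an Arrow function:

```cpp
#include <cassert>
#include <cstdint>

// Simplified model of a validity check; not Arrow's implementation.
// `bitmap` may be null, which is read as "every slot is valid".
// `array_offset` is the slice offset that the raw bitmap does not include.
bool IsValidAt(const uint8_t* bitmap, int64_t array_offset, int64_t i) {
  assert(i >= 0 && array_offset >= 0);
  if (bitmap == nullptr) return true;
  const int64_t bit = array_offset + i;  // position within the unsliced bitmap
  return ((bitmap[bit / 8] >> (bit % 8)) & 1) != 0;
}
```

Like IsNull() and IsValid(), this sketch does not bounds check against the array length.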

bool Equals(const Array &arr, const EqualOptions& = EqualOptions::Defaults()) const

Equality comparison with another array.

std::string Diff(const Array &other) const

Return the formatted unified diff of arrow::Diff between this Array and another Array.

bool ApproxEquals(const std::shared_ptr<Array> &arr, const EqualOptions& = EqualOptions::Defaults()) const

Approximate equality comparison with another array.

epsilon is only used if this is a FloatArray or DoubleArray.

bool RangeEquals(int64_t start_idx, int64_t end_idx, int64_t other_start_idx, const Array &other, const EqualOptions& = EqualOptions::Defaults()) const

Compare if the range of slots specified are equal for the given array and this array.

end_idx is exclusive. This method does not bounds check.

Result<std::shared_ptr<Array>> View(const std::shared_ptr<DataType> &type) const

Construct a zero-copy view of this array with the given type.

This method checks if the types are layout-compatible. Nested types are traversed in depth-first order. Data buffers must have the same item sizes, even though the logical types may be different. An error is returned if the types are not layout-compatible.

std::shared_ptr<Array> Slice(int64_t offset, int64_t length) const

Construct a zero-copy slice of the array with the indicated offset and length.

Return

a new object wrapped in std::shared_ptr<Array>

Parameters
  • [in] offset: the position of the first element in the constructed slice

  • [in] length: the length of the slice. If there are not enough elements in the array, the length will be adjusted accordingly
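The length adjustment described for the `length` parameter can be modelled as a clamp. The sketch below is an assumption about the arithmetic, not Arrow's code; `SliceBounds` and `ClampSlice` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Resulting (offset, length) pair of a slice, relative to the underlying buffers.
struct SliceBounds {
  int64_t offset;
  int64_t length;
};

// Model of a zero-copy slice: only the offset and length change, no data moves.
SliceBounds ClampSlice(int64_t array_offset, int64_t array_length,
                       int64_t slice_offset, int64_t slice_length) {
  assert(slice_offset >= 0 && slice_offset <= array_length);
  // If fewer than slice_length elements remain, shrink the slice to fit.
  const int64_t length = std::min(array_length - slice_offset, slice_length);
  return {array_offset + slice_offset, length};
}
```

For a 10-element array, asking for 5 elements starting at position 7 yields a slice of length 3.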

std::shared_ptr<Array> Slice(int64_t offset) const

Slice from offset until end of the array.

Result<std::shared_ptr<Array>> SliceSafe(int64_t offset, int64_t length) const

Input-checking variant of Array::Slice.

Result<std::shared_ptr<Array>> SliceSafe(int64_t offset) const

Input-checking variant of Array::Slice.

std::string ToString() const

Return

PrettyPrint representation of array suitable for debugging

Status Validate() const

Perform cheap validation checks to determine obvious inconsistencies within the array’s internal data.

This is O(k) where k is the number of descendants.

Return

Status

Status ValidateFull() const

Perform extensive validation checks to determine inconsistencies within the array’s internal data.

This is potentially O(k*n) where k is the number of descendants and n is the array length.

Return

Status

Concrete array subclasses

class arrow::DictionaryArray : public arrow::Array

Array type for dictionary-encoded data with a data-dependent dictionary.

A dictionary array contains an array of non-negative integers (the “dictionary indices”) along with a data type containing a “dictionary” corresponding to the distinct values represented in the data.

For example, the array

[“foo”, “bar”, “foo”, “bar”, “foo”, “bar”]

with dictionary [“bar”, “foo”], would have dictionary array representation

indices: [1, 0, 1, 0, 1, 0]
dictionary: [“bar”, “foo”]
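To make the representation concrete, the following standard-library sketch decodes indices back into values. It models the idea only; `DecodeDictionary` is a hypothetical helper and plain vectors stand in for Arrow arrays.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Replace each dictionary index by the dictionary value it refers to.
std::vector<std::string> DecodeDictionary(const std::vector<int32_t>& indices,
                                          const std::vector<std::string>& dictionary) {
  std::vector<std::string> out;
  out.reserve(indices.size());
  for (int32_t idx : indices) {
    // Indices must be non-negative and smaller than the dictionary size.
    assert(idx >= 0 && static_cast<std::size_t>(idx) < dictionary.size());
    out.push_back(dictionary[idx]);
  }
  return out;
}
```

Applied to the indices and dictionary in the example, this reproduces the original six-element array.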

The indices in principle may be any integer type.

Public Functions

Result<std::shared_ptr<Array>> Transpose(const std::shared_ptr<DataType> &type, const std::shared_ptr<Array> &dictionary, const int32_t *transpose_map, MemoryPool *pool = default_memory_pool()) const

Transpose this DictionaryArray.

This method constructs a new dictionary array with the given dictionary type, transposing indices using the transpose map. The type and the transpose map are typically computed using DictionaryUnifier.

Parameters
  • [in] type: the new type object

  • [in] dictionary: the new dictionary

  • [in] transpose_map: transposition array of this array’s indices into the target array’s indices

  • [in] pool: a pool to allocate the array data from
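The role of the transpose map can be illustrated with plain vectors. This is a simplified model, not Arrow's implementation: `TransposeIndices` is a hypothetical helper, and entry `transpose_map[k]` is assumed to hold the position in the new dictionary of the value stored at position `k` in the old one.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Rewrite indices that refer to an old dictionary so they refer to a new one.
std::vector<int32_t> TransposeIndices(const std::vector<int32_t>& indices,
                                      const std::vector<int32_t>& transpose_map) {
  std::vector<int32_t> out;
  out.reserve(indices.size());
  for (int32_t idx : indices) {
    assert(idx >= 0 && static_cast<std::size_t>(idx) < transpose_map.size());
    out.push_back(transpose_map[idx]);
  }
  return out;
}
```

For an old dictionary ["bar", "foo"] and a new dictionary ["foo", "baz", "bar"], the map is [2, 0], so old indices [1, 0, 1] become [0, 2, 0] and still decode to "foo", "bar", "foo".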

bool CanCompareIndices(const DictionaryArray &other) const

Determine whether dictionary arrays may be compared without unification.

std::shared_ptr<Array> dictionary() const

Return the dictionary for this array, which is stored as a member of the ArrayData internal structure.

int64_t GetValueIndex(int64_t i) const

Return the ith value of indices, cast to int64_t.

Not recommended for use in performance-sensitive code. Does not validate whether the value is null or out-of-bounds.

Public Static Functions

Result<std::shared_ptr<Array>> FromArrays(const std::shared_ptr<DataType> &type, const std::shared_ptr<Array> &indices, const std::shared_ptr<Array> &dictionary)

Construct DictionaryArray from dictionary and indices array and validate.

This function does the validation of the indices and input type. It checks if all indices are non-negative and smaller than the size of the dictionary.

Parameters
  • [in] type: a dictionary type

  • [in] dictionary: the dictionary with same value type as the type object

  • [in] indices: an array of non-negative integers smaller than the size of the dictionary

Non-nested

class FlatArray : public arrow::Array

Base class for non-nested arrays.

Subclassed by arrow::BaseBinaryArray< TYPE >, arrow::NullArray, arrow::PrimitiveArray, arrow::BaseBinaryArray< BinaryType >, arrow::BaseBinaryArray< LargeBinaryType >

class NullArray : public arrow::FlatArray

Degenerate null type Array.

class BinaryArray : public arrow::BaseBinaryArray<BinaryType>

Concrete Array class for variable-size binary data.

Subclassed by arrow::StringArray

class arrow::StringArray : public arrow::BinaryArray

Concrete Array class for variable-size string (utf-8) data.

Public Functions

Status ValidateUTF8() const

Validate that this array contains only valid UTF8 entries.

This check is also implied by ValidateFull().

class arrow::PrimitiveArray : public arrow::FlatArray

Base class for arrays of fixed-size logical types.

Subclassed by arrow::BooleanArray, arrow::DayTimeIntervalArray, arrow::FixedSizeBinaryArray, arrow::NumericArray< TYPE >

Public Functions

std::shared_ptr<Buffer> values() const

Does not account for any slice offset.

class arrow::BooleanArray : public arrow::PrimitiveArray

Concrete Array class for boolean data.

Public Functions

int64_t false_count() const

Return the number of false (0) values among the valid values.

Result is not cached.

int64_t true_count() const

Return the number of true (1) values among the valid values.

Result is not cached.
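The phrase "among the valid values" means null slots count as neither true nor false. A sketch of that counting rule, assuming bit-packed values and validity bitmaps (least-significant bit first); `GetBit`, `BoolCounts` and `CountBooleans` are hypothetical names, not Arrow API:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Read bit i from a least-significant-bit-first bitmap.
inline bool GetBit(const std::vector<uint8_t>& bits, int64_t i) {
  return ((bits[i / 8] >> (i % 8)) & 1) != 0;
}

struct BoolCounts {
  int64_t true_count;
  int64_t false_count;
  int64_t null_count;
};

// Count true and false values, ignoring the value bit of any null slot.
// An empty validity bitmap is read as "every slot is valid".
BoolCounts CountBooleans(const std::vector<uint8_t>& values,
                         const std::vector<uint8_t>& validity, int64_t length) {
  assert(length >= 0);
  BoolCounts counts{0, 0, 0};
  for (int64_t i = 0; i < length; ++i) {
    if (!validity.empty() && !GetBit(validity, i)) {
      ++counts.null_count;
    } else if (GetBit(values, i)) {
      ++counts.true_count;
    } else {
      ++counts.false_count;
    }
  }
  return counts;
}
```

In this model the three counts always sum to the array length.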

class FixedSizeBinaryArray : public arrow::PrimitiveArray

Concrete Array class for fixed-size binary data.

Subclassed by arrow::Decimal128Array, arrow::Decimal256Array

class arrow::Decimal128Array : public arrow::FixedSizeBinaryArray

Concrete Array class for 128-bit decimal data.

Public Functions

Decimal128Array(const std::shared_ptr<ArrayData> &data)

Construct Decimal128Array from ArrayData instance.

template<typename TYPE>
class NumericArray : public arrow::PrimitiveArray

Concrete Array class for numeric data.

Nested

class arrow::UnionArray : public arrow::Array

Base class for SparseUnionArray and DenseUnionArray.

Subclassed by arrow::DenseUnionArray, arrow::SparseUnionArray

Public Functions

std::shared_ptr<Buffer> type_codes() const

Note that this buffer does not account for any slice offset.

int child_id(int64_t i) const

The physical child id containing value at index.

std::shared_ptr<Array> field(int pos) const

Return the given field as an individual array.

For sparse unions, the returned array has its offset, length and null count adjusted.

class arrow::ListArray : public arrow::BaseListArray<ListType>

Concrete Array class for list data.

Subclassed by arrow::MapArray

Public Functions

Result<std::shared_ptr<Array>> Flatten(MemoryPool *memory_pool = default_memory_pool()) const

Return an Array that is a concatenation of the lists in this array.

Note that this differs from values() in that it takes into account this array’s offsets as well as null elements backed by non-empty lists (these are skipped, so copying may be needed).
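The difference from values() can be seen in a small model where a null list slot still covers some child values. This is a sketch with plain vectors, not Arrow's implementation; `FlattenLists` is a hypothetical helper.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Concatenate the child values of every non-null list slot.
// offsets holds n + 1 entries for n list slots; valid[i] is false for a null slot.
std::vector<int> FlattenLists(const std::vector<int32_t>& offsets,
                              const std::vector<int>& values,
                              const std::vector<bool>& valid) {
  assert(offsets.size() == valid.size() + 1);
  std::vector<int> out;
  for (std::size_t i = 0; i < valid.size(); ++i) {
    if (!valid[i]) continue;  // a null slot may still span child values; skip them
    out.insert(out.end(), values.begin() + offsets[i], values.begin() + offsets[i + 1]);
  }
  return out;
}
```

With offsets [0, 2, 3, 5], child values [1, 2, 3, 4, 5] and a null second slot, the flattened result is [1, 2, 4, 5], whereas the raw child values still contain the 3.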

std::shared_ptr<Array> offsets() const

Return list offsets as an Int32Array.

Public Static Functions

Result<std::shared_ptr<ListArray>> FromArrays(const Array &offsets, const Array &values, MemoryPool *pool = default_memory_pool())

Construct ListArray from array of offsets and child value array.

This function does the bare minimum of validation of the offsets and input types, and will allocate a new offsets array if necessary (i.e. if the offsets contain any nulls). If the offsets do not have nulls, they are assumed to be well-formed.

Parameters
  • [in] offsets: Array containing n + 1 offsets encoding length and size. Must be of int32 type

  • [in] values: Array containing list values

  • [in] pool: MemoryPool in case new offsets array needs to be allocated because of null values
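How n + 1 offsets describe n lists can be sketched as follows: list i spans child positions offsets[i] up to, but not including, offsets[i + 1]. `ListLengths` is a hypothetical helper for illustration, not part of Arrow.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Derive the length of each list slot from n + 1 monotonically non-decreasing offsets.
std::vector<int32_t> ListLengths(const std::vector<int32_t>& offsets) {
  assert(!offsets.empty());
  std::vector<int32_t> lengths;
  for (std::size_t i = 0; i + 1 < offsets.size(); ++i) {
    assert(offsets[i] <= offsets[i + 1]);
    lengths.push_back(offsets[i + 1] - offsets[i]);
  }
  return lengths;
}
```

For example, offsets [0, 2, 2, 5] describe three lists of lengths 2, 0 and 3.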

class arrow::StructArray : public arrow::Array

Concrete Array class for struct data.

Public Functions

std::shared_ptr<Array> GetFieldByName(const std::string &name) const

Returns null if name not found.

Result<ArrayVector> Flatten(MemoryPool *pool = default_memory_pool()) const

Flatten this array as a vector of arrays, one for each field.

Parameters
  • [in] pool: The pool to allocate null bitmaps from, if necessary

Public Static Functions

Result<std::shared_ptr<StructArray>> Make(const ArrayVector &children, const std::vector<std::string> &field_names, std::shared_ptr<Buffer> null_bitmap = NULLPTR, int64_t null_count = kUnknownNullCount, int64_t offset = 0)

Return a StructArray from child arrays and field names.

The length and data type are automatically inferred from the arguments. There should be at least one child array.

Result<std::shared_ptr<StructArray>> Make(const ArrayVector &children, const FieldVector &fields, std::shared_ptr<Buffer> null_bitmap = NULLPTR, int64_t null_count = kUnknownNullCount, int64_t offset = 0)

Return a StructArray from child arrays and fields.

The length is automatically inferred from the arguments. There should be at least one child array. This method does not check that field types and child array types are consistent.

Chunked Arrays

class arrow::ChunkedArray

A data structure managing a list of primitive Arrow arrays logically as one large array.

Data chunking is treated throughout this project largely as an implementation detail for performance and memory use optimization. ChunkedArray allows Array objects to be collected and interpreted as a single logical array without requiring an expensive concatenation step.

In some cases, data produced by a function may exceed the capacity of an Array (like BinaryArray or StringArray) and so returning multiple Arrays is the only possibility. In these cases, we recommend returning a ChunkedArray instead of vector of Arrays or some alternative.

When data is processed in parallel, it may not be practical or possible to create large contiguous memory allocations and write output into them. With some data types, like binary and string types, it is not possible at all to produce non-chunked array outputs without requiring a concatenation step at the end of processing.

Application developers may tune chunk sizes based on analysis of performance profiles but many developer-users will not need to be especially concerned with the chunking details.

Preserving the chunk layout/sizes in processing steps is generally not considered to be a contract in APIs. A function may decide to alter the chunking of its result. Similarly, APIs accepting multiple ChunkedArray inputs should not expect the chunk layout to be the same in each input.
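Treating several chunks as one logical array amounts to mapping a logical index to a chunk and a position inside it. The sketch below models that mapping with chunk lengths only; `ChunkLocation` and `ResolveIndex` are hypothetical names, and a linear scan is used for clarity rather than speed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct ChunkLocation {
  std::size_t chunk_index;  // equals the number of chunks if the index is past the end
  int64_t index_in_chunk;
};

// Map an index into the logical array to (chunk, index within that chunk).
ChunkLocation ResolveIndex(const std::vector<int64_t>& chunk_lengths, int64_t logical_index) {
  assert(logical_index >= 0);
  for (std::size_t c = 0; c < chunk_lengths.size(); ++c) {
    if (logical_index < chunk_lengths[c]) return {c, logical_index};
    logical_index -= chunk_lengths[c];  // skip this chunk, including empty ones
  }
  return {chunk_lengths.size(), 0};
}
```

Since the chunk layout is not part of any API contract, code written this way should not assume particular chunk lengths.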

Public Functions

ChunkedArray(ArrayVector chunks)

Construct a chunked array from a vector of arrays.

The vector must be non-empty and all its elements must have the same data type.

ChunkedArray(std::shared_ptr<Array> chunk)

Construct a chunked array from a single Array.

ChunkedArray(ArrayVector chunks, std::shared_ptr<DataType> type)

Construct a chunked array from a vector of arrays and a data type.

As the data type is passed explicitly, the vector may be empty.

int64_t length() const

Return

the total length of the chunked array; computed on construction

int64_t null_count() const

Return

the total number of nulls among all chunks

std::shared_ptr<Array> chunk(int i) const

Return

a particular chunk from the chunked array

std::shared_ptr<ChunkedArray> Slice(int64_t offset, int64_t length) const

Construct a zero-copy slice of the chunked array with the indicated offset and length.

Return

a new object wrapped in std::shared_ptr<ChunkedArray>

Parameters
  • [in] offset: the position of the first element in the constructed slice

  • [in] length: the length of the slice. If there are not enough elements in the chunked array, the length will be adjusted accordingly
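A slice of a chunked array can be pictured as a list of per-chunk slices. The following sketch computes those pieces from chunk lengths alone; it is an assumed model, not Arrow's code, and `ChunkSlice` and `SliceChunks` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One piece of the result: a zero-copy slice of a single chunk.
struct ChunkSlice {
  std::size_t chunk;
  int64_t offset;
  int64_t length;
};

std::vector<ChunkSlice> SliceChunks(const std::vector<int64_t>& chunk_lengths,
                                    int64_t offset, int64_t length) {
  assert(offset >= 0 && length >= 0);
  std::vector<ChunkSlice> out;
  std::size_t c = 0;
  // Skip whole chunks that lie before the first requested element.
  while (c < chunk_lengths.size() && offset >= chunk_lengths[c]) {
    offset -= chunk_lengths[c];
    ++c;
  }
  // Take from each following chunk until the requested length is used up
  // or the chunks run out (in which case the slice is shorter than requested).
  while (c < chunk_lengths.size() && length > 0) {
    const int64_t take = std::min(chunk_lengths[c] - offset, length);
    out.push_back({c, offset, take});
    length -= take;
    offset = 0;
    ++c;
  }
  return out;
}
```

With chunk lengths [3, 4, 2], a slice at offset 2 with length 4 takes one element from the first chunk and three from the second.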

std::shared_ptr<ChunkedArray> Slice(int64_t offset) const

Slice from offset until end of the chunked array.

Result<std::vector<std::shared_ptr<ChunkedArray>>> Flatten(MemoryPool *pool = default_memory_pool()) const

Flatten this chunked array as a vector of chunked arrays, one for each struct field.

Parameters
  • [in] pool: The pool for buffer allocations, if any

Result<std::shared_ptr<ChunkedArray>> View(const std::shared_ptr<DataType> &type) const

Construct a zero-copy view of this chunked array with the given type.

Calls Array::View on each constituent chunk. Always succeeds if there are zero chunks.

bool Equals(const ChunkedArray &other) const

Determine if two chunked arrays are equal.

Two chunked arrays can be equal only if they have equal datatypes. However, they may be equal even if they have different chunkings.
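Equality across different chunkings can be checked without concatenating, by walking both chunk lists with independent cursors. A minimal sketch with vectors of int standing in for chunks of one shared data type; `Chunks` and `LogicallyEqual` are hypothetical names, and nulls are left out of the model.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Chunks = std::vector<std::vector<int>>;

// Compare two chunked sequences element by element, ignoring chunk boundaries.
bool LogicallyEqual(const Chunks& a, const Chunks& b) {
  std::size_t ca = 0, ia = 0, cb = 0, ib = 0;
  while (true) {
    // Advance past exhausted (or empty) chunks on each side.
    while (ca < a.size() && ia == a[ca].size()) { ++ca; ia = 0; }
    while (cb < b.size() && ib == b[cb].size()) { ++cb; ib = 0; }
    const bool a_done = (ca == a.size());
    const bool b_done = (cb == b.size());
    if (a_done || b_done) return a_done && b_done;  // equal only if both end together
    assert(ia < a[ca].size() && ib < b[cb].size());
    if (a[ca][ia] != b[cb][ib]) return false;
    ++ia;
    ++ib;
  }
}
```

Under this model, [[1, 2], [3]] and [[1], [2, 3]] compare equal, while [[1, 2], [3]] and [[1, 2]] do not.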

bool Equals(const std::shared_ptr<ChunkedArray> &other) const

Determine if two chunked arrays are equal.

std::string ToString() const

Return

PrettyPrint representation suitable for debugging

Status Validate() const

Perform cheap validation checks to determine obvious inconsistencies within the chunked array’s internal data.

This is O(k*m) where k is the number of array descendants, and m is the number of chunks.

Return

Status

Status ValidateFull() const

Perform extensive validation checks to determine inconsistencies within the chunked array’s internal data.

This is O(k*n) where k is the number of array descendants, and n is the length in elements.

Return

Status