arrow::array::array

Struct GenericByteViewArray

pub struct GenericByteViewArray<T>
where
    T: ByteViewType + ?Sized,
{
    data_type: DataType,
    views: ScalarBuffer<u128>,
    buffers: Vec<Buffer>,
    phantom: PhantomData<T>,
    nulls: Option<NullBuffer>,
}

Variable-size Binary View Layout: An array of variable length byte views.

This array type is used to store variable length byte data (e.g. Strings, Binary) and has efficient operations such as take, filter, and comparison.

This differs from GenericByteArray, which also stores variable length byte data: here each string is represented by a view holding an offset and a length, so take- and filter-like operations are implemented by manipulating the “views” (u128) without modifying the bytes. Each view also stores an inlined prefix, which speeds up comparisons.
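As a rough illustration of why take and filter are cheap, the following stdlib-only sketch treats an array as a plain slice of u128 views. The function names take_views and filter_views are hypothetical and are not part of the arrow API; the real kernels live in arrow_compute.

```rust
/// Select `indices` from `views`. Only the 16-byte views are copied; any data
/// buffers the views point into are left untouched (hypothetical helper).
fn take_views(views: &[u128], indices: &[usize]) -> Vec<u128> {
    indices.iter().map(|&i| views[i]).collect()
}

/// Keep the views whose `mask` entry is true (hypothetical helper).
fn filter_views(views: &[u128], mask: &[bool]) -> Vec<u128> {
    views
        .iter()
        .zip(mask)
        .filter(|&(_, &keep)| keep)
        .map(|(&view, _)| view)
        .collect()
}
```

In both helpers the cost depends only on the number of views selected, not on the length of the strings they describe.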

§See Also

StringViewArray for storing utf8 encoded string data
BinaryViewArray for storing bytes
ByteView to interpret the u128 layout of the views.

§Layout: “views” and buffers

A GenericByteViewArray stores variable length byte strings. An array of N elements is stored as N fixed length “views” and a variable number of variable length “buffers”.

Each view is a u128 value whose layout is different depending on the length of the string stored at that location:

                        ┌──────┬────────────────────────┐
                        │length│      string value      │
   Strings (len <= 12)  │      │    (padded with 0)     │
                        └──────┴────────────────────────┘
                         0    31                      127

                        ┌───────┬───────┬───────┬───────┐
                        │length │prefix │  buf  │offset │
   Strings (len > 12)   │       │       │ index │       │
                        └───────┴───────┴───────┴───────┘
                         0    31       63      95    127

Strings with length <= 12 (MAX_INLINE_VIEW_LEN) are stored directly in the view. See Self::inline_value to access the inlined value of a short view.
Strings with length > 12: The first four bytes are stored inline in the view and the entire string is stored in one of the buffers. See ByteView to access the fields of these views.

As with other arrays, the optimized kernels in arrow_compute are likely the easiest and fastest way to work with this data. However, it is possible to access the views and buffers directly for more control.

For example

use arrow_array::StringViewArray;
use arrow_data::ByteView;
let array = StringViewArray::from(vec![
  "hello",
  "this string is longer than 12 bytes",
  "this string is also longer than 12 bytes"
]);

// ** Examine the first view (short string) **
assert!(array.is_valid(0)); // Check for nulls
let short_view: u128 = array.views()[0]; // "hello"
// get length of the string
let len = short_view as u32;
assert_eq!(len, 5); // strings of 12 bytes or fewer are stored in the view
// SAFETY: `short_view` is a valid view and `len` is its inline length
let value = unsafe {
  StringViewArray::inline_value(&short_view, len as usize)
};
assert_eq!(value, b"hello");

// ** Examine the third view (long string) **
assert!(array.is_valid(2)); // Check for nulls
let long_view: u128 = array.views()[2]; // "this string is also longer than 12 bytes"
let len = long_view as u32;
assert_eq!(len, 40); // strings longer than 12 bytes are stored in the buffer
let view = ByteView::from(long_view); // use ByteView to access the fields
assert_eq!(view.length, 40);
assert_eq!(view.buffer_index, 0);
assert_eq!(view.offset, 35); // data starts after the first long string
// Views for long strings store a 4 byte prefix
let prefix = view.prefix.to_le_bytes();
assert_eq!(&prefix, b"this");
let value = array.value(2); // get the string value (see `value` implementation for how to access the bytes directly)
assert_eq!(value, "this string is also longer than 12 bytes");
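The byte layout described above can also be modeled without the arrow crates. The sketch below assumes the little-endian field order shown in the layout diagrams; encode_view and decode_long_view are hypothetical helpers written for illustration, not functions exported by arrow.

```rust
/// Longest value stored entirely inside a view (12 bytes).
const MAX_INLINE_VIEW_LEN: usize = 12;

/// Build a view for `data` (hypothetical helper).
/// `buffer_index` and `offset` are only used when `data` is longer than 12 bytes.
fn encode_view(data: &[u8], buffer_index: u32, offset: u32) -> u128 {
    let mut bytes = [0u8; 16];
    // Bytes 0..4: length, little-endian
    bytes[0..4].copy_from_slice(&(data.len() as u32).to_le_bytes());
    if data.len() <= MAX_INLINE_VIEW_LEN {
        // Bytes 4..16: the value itself, padded with zeros
        bytes[4..4 + data.len()].copy_from_slice(data);
    } else {
        // Bytes 4..8: prefix, 8..12: buffer index, 12..16: offset
        bytes[4..8].copy_from_slice(&data[0..4]);
        bytes[8..12].copy_from_slice(&buffer_index.to_le_bytes());
        bytes[12..16].copy_from_slice(&offset.to_le_bytes());
    }
    u128::from_le_bytes(bytes)
}

/// Split a long view into (length, prefix, buffer_index, offset).
fn decode_long_view(view: u128) -> (u32, [u8; 4], u32, u32) {
    let b = view.to_le_bytes();
    (
        u32::from_le_bytes([b[0], b[1], b[2], b[3]]),
        [b[4], b[5], b[6], b[7]],
        u32::from_le_bytes([b[8], b[9], b[10], b[11]]),
        u32::from_le_bytes([b[12], b[13], b[14], b[15]]),
    )
}
```

Because the length occupies the lowest 32 bits, casting a view with `as u32` recovers it for both short and long views.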

Unlike GenericByteArray, there are no constraints on the offsets other than that they must point into a valid buffer. In particular, they can be out of order, non-contiguous, and overlapping.

For example, in the following diagram, the strings “FishWasInTownTodayYay” and “CrumpleFacedFish” are both longer than 12 bytes and thus are stored in a separate buffer, while the string “LavaMonster” is stored inlined in the view. In this case, the same bytes for “Fish” are used to store both strings.

                                                                           ┌───┐
                        ┌──────┬──────┬──────┬──────┐               offset │...│
"FishWasInTownTodayYay" │  21  │ Fish │  0   │ 115  │─ ─              103  │Mr.│
                        └──────┴──────┴──────┴──────┘   │      ┌ ─ ─ ─ ─ ▶ │Cru│
                        ┌──────┬──────┬──────┬──────┐                      │mpl│
"CrumpleFacedFish"      │  16  │ Crum │  0   │ 103  │─ ─│─ ─ ─ ┘           │eFa│
                        └──────┴──────┴──────┴──────┘                      │ced│
                        ┌──────┬────────────────────┐   └ ─ ─ ─ ─ ─ ─ ─ ─ ▶│Fis│
"LavaMonster"           │  11  │   LavaMonster      │                      │hWa│
                        └──────┴────────────────────┘               offset │sIn│
                                                                      115  │Tow│
                                                                           │nTo│
                                                                           │day│
                                 u128 "views"                              │Yay│
                                                                  buffer 0 │...│
                                                                           └───┘
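The overlap in the diagram can be reproduced with a small stdlib-only model. LongView and resolve are hypothetical names used only for this sketch, and the buffer is shortened so that “CrumpleFacedFish” starts at offset 0 rather than 103.

```rust
/// Hypothetical decoded form of a long view (length > 12).
struct LongView {
    length: u32,
    buffer_index: u32,
    offset: u32,
}

/// Look up the bytes a long view refers to.
/// Panics if the view points outside the buffer.
fn resolve<'a>(buffers: &'a [Vec<u8>], view: &LongView) -> &'a [u8] {
    let start = view.offset as usize;
    let end = start + view.length as usize;
    &buffers[view.buffer_index as usize][start..end]
}
```

With a single buffer holding the bytes of CrumpleFacedFishWasInTownTodayYay, a view at offset 0 with length 16 and a view at offset 12 with length 21 both include the same four bytes of “Fish”.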

Fields§

§data_type: DataType
§views: ScalarBuffer<u128>
§buffers: Vec<Buffer>
§phantom: PhantomData<T>
§nulls: Option<NullBuffer>

Implementations§

§

impl<T> GenericByteViewArray<T>
where T: ByteViewType + ?Sized,

pub fn new( views: ScalarBuffer<u128>, buffers: Vec<Buffer>, nulls: Option<NullBuffer>, ) -> GenericByteViewArray<T>

Create a new GenericByteViewArray from the provided parts, panicking on failure

§Panics

Panics if GenericByteViewArray::try_new returns an error

pub fn try_new( views: ScalarBuffer<u128>, buffers: Vec<Buffer>, nulls: Option<NullBuffer>, ) -> Result<GenericByteViewArray<T>, ArrowError>

Create a new GenericByteViewArray from the provided parts, returning an error on failure

§Errors

views.len() != nulls.len()
ByteViewType::validate fails

pub unsafe fn new_unchecked( views: ScalarBuffer<u128>, buffers: Vec<Buffer>, nulls: Option<NullBuffer>, ) -> GenericByteViewArray<T>

Create a new GenericByteViewArray from the provided parts, without validation

§Safety

Safe if Self::try_new would not error

pub fn new_null(len: usize) -> GenericByteViewArray<T>

Create a new GenericByteViewArray of length len where all values are null

pub fn new_scalar( value: impl AsRef<<T as ByteViewType>::Native>, ) -> Scalar<GenericByteViewArray<T>>

Create a new Scalar from value

pub fn from_iter_values<Ptr, I>(iter: I) -> GenericByteViewArray<T>
where Ptr: AsRef<<T as ByteViewType>::Native>, I: IntoIterator<Item = Ptr>,

Creates a GenericByteViewArray based on an iterator of values without nulls

pub fn into_parts(self) -> (ScalarBuffer<u128>, Vec<Buffer>, Option<NullBuffer>)

Deconstruct this array into its constituent parts

pub fn views(&self) -> &ScalarBuffer<u128>

Returns the views buffer

pub fn data_buffers(&self) -> &[Buffer]

Returns the buffers storing string data

pub fn value(&self, i: usize) -> &<T as ByteViewType>::Native

Returns the element at index i

§Panics

Panics if index i is out of bounds.

pub unsafe fn value_unchecked(&self, idx: usize) -> &<T as ByteViewType>::Native

Returns the element at index idx without bounds checking

§Safety

Caller is responsible for ensuring that the index is within the bounds of the array

pub unsafe fn inline_value(view: &u128, len: usize) -> &[u8]

Returns the first len bytes of the inline value of the view.

§Safety

The view must be a valid element from Self::views() that adheres to the view layout.
The len must be the length of the inlined value. It should never be larger than MAX_INLINE_VIEW_LEN.

pub fn iter(&self) -> ArrayIter<&GenericByteViewArray<T>>

Constructs a new iterator for iterating over the values of this array

pub fn bytes_iter(&self) -> impl Iterator<Item = &[u8]>

Returns an iterator over the bytes of this array, including null values

pub fn prefix_bytes_iter( &self, prefix_len: usize, ) -> impl Iterator<Item = &[u8]>

Returns an iterator over the first prefix_len bytes of each array element, including null values.

If prefix_len is larger than the element’s length, the iterator will return an empty slice (&[]).

pub fn suffix_bytes_iter( &self, suffix_len: usize, ) -> impl Iterator<Item = &[u8]>

Returns an iterator over the last suffix_len bytes of each array element, including null values.

Note that for StringViewArray the last bytes may start in the middle of a UTF-8 codepoint, and thus may not be a valid &str.

If suffix_len is larger than the element’s length, the iterator will return an empty slice (&[]).
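The documented behavior of the two iterators for a single element can be sketched on plain byte slices. prefix_bytes and suffix_bytes are hypothetical helpers that mirror the description above; they are not the library's implementation.

```rust
/// First `prefix_len` bytes of `value`, or an empty slice if `value` is shorter.
fn prefix_bytes(value: &[u8], prefix_len: usize) -> &[u8] {
    if prefix_len > value.len() {
        &[]
    } else {
        &value[..prefix_len]
    }
}

/// Last `suffix_len` bytes of `value`, or an empty slice if `value` is shorter.
fn suffix_bytes(value: &[u8], suffix_len: usize) -> &[u8] {
    if suffix_len > value.len() {
        &[]
    } else {
        &value[value.len() - suffix_len..]
    }
}
```

For the 6-byte string "héllo", the last 4 bytes begin with the second byte of "é", so passing them to std::str::from_utf8 returns an error. This is the situation the UTF-8 note above warns about.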

pub fn slice(&self, offset: usize, length: usize) -> GenericByteViewArray<T>

Returns a zero-copy slice of this array with the indicated offset and length.

pub fn gc(&self) -> GenericByteViewArray<T>

Returns a “compacted” version of this array

The original array will not be modified

§Garbage Collection

Before GC:

                                       ┌──────┐
                                       │......│
                                       │......│
┌────────────────────┐       ┌ ─ ─ ─ ▶ │Data1 │   Large buffer
│       View 1       │─ ─ ─ ─          │......│  with data that
├────────────────────┤                 │......│ is not referred
│       View 2       │─ ─ ─ ─ ─ ─ ─ ─▶ │Data2 │ to by View 1 or
└────────────────────┘                 │......│      View 2
                                       │......│
   2 views, refer to                   │......│
  small portions of a                  └──────┘
     large buffer

After GC:

┌────────────────────┐                 ┌─────┐    After gc, only
│       View 1       │─ ─ ─ ─ ─ ─ ─ ─▶ │Data1│     data that is
├────────────────────┤       ┌ ─ ─ ─ ▶ │Data2│    pointed to by
│       View 2       │─ ─ ─ ─          └─────┘     the views is
└────────────────────┘                                 left


        2 views

This method will compact the data buffers by recreating the view array and only include the data that is pointed to by the views.

Note that this method copies the array regardless of whether the original array is already compact. This can be an expensive operation, so use it only when you are sure that the views reference significantly less data than the buffers hold, e.g., after filtering or slicing.

Note: this function does not attempt to canonicalize / deduplicate values. For this feature see GenericByteViewBuilder::with_deduplicate_strings.
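The compaction idea can be sketched with a simplified model in which every view is just an (offset, length) pair into one buffer. Span and compact are hypothetical names; the real method also handles inline views, multiple buffers, and nulls.

```rust
/// Hypothetical (offset, length) reference into a single data buffer.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Span {
    offset: usize,
    len: usize,
}

/// Copy only the referenced bytes into a fresh buffer and rewrite the spans
/// to point into it. Like `gc`, this always copies and does not deduplicate.
fn compact(buffer: &[u8], spans: &[Span]) -> (Vec<u8>, Vec<Span>) {
    let mut new_buffer = Vec::with_capacity(spans.iter().map(|s| s.len).sum());
    let mut new_spans = Vec::with_capacity(spans.len());
    for s in spans {
        let offset = new_buffer.len();
        new_buffer.extend_from_slice(&buffer[s.offset..s.offset + s.len]);
        new_spans.push(Span { offset, len: s.len });
    }
    (new_buffer, new_spans)
}
```

Two spans that cover the same bytes are copied twice in this sketch, which matches the note that no deduplication is attempted.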

pub fn total_buffer_bytes_used(&self) -> usize

Returns the total number of bytes used by all non inlined views in all buffers.

Note this does not account for views that point at the same underlying data in buffers.

For example, if the array has three string views:

View with length = 9 (inlined)
View with length = 32 (non inlined)
View with length = 16 (non inlined)

Then this method would report 48
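The arithmetic in this example can be written as a short sketch over view lengths alone. The free function below is a hypothetical stand-in for the method and assumes that only views longer than 12 bytes occupy buffer space.

```rust
/// Longest value stored entirely inside a view (12 bytes).
const MAX_INLINE_VIEW_LEN: u32 = 12;

/// Sum the lengths of all non-inlined views (hypothetical stand-in).
/// Views that share the same underlying bytes are counted once per view.
fn total_buffer_bytes_used(view_lengths: &[u32]) -> usize {
    view_lengths
        .iter()
        .filter(|&&len| len > MAX_INLINE_VIEW_LEN)
        .map(|&len| len as usize)
        .sum()
}
```

For the lengths 9, 32, and 16 the result is 32 + 16 = 48, since the 9-byte value is inlined.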

pub unsafe fn compare_unchecked( left: &GenericByteViewArray<T>, left_idx: usize, right: &GenericByteViewArray<T>, right_idx: usize, ) -> Ordering

Compare two GenericByteViewArray at index left_idx and right_idx

Comparing two ByteView types is non-trivial. It takes a bit of patience to understand why we don’t just compare two &[u8] directly.

ByteView types give us the following two advantages, and we need to be careful not to lose them:

(1) For strings/bytes no longer than MAX_INLINE_VIEW_LEN bytes, the entire data is inlined in the view, so reading one array element requires only one memory access (StringArray requires two: one for the offset buffer and one for the value buffer).

(2) For strings/bytes longer than MAX_INLINE_VIEW_LEN bytes, we can still be faster than StringArray/ByteArray for certain operations, thanks to the inlined 4 bytes. Consider an equality check: if the first four bytes of the two strings differ, we can return false immediately (with just one memory access).

If we directly compare two &[u8], we materialize the entire string (i.e., make multiple memory accesses), which might be unnecessary.

In many cases (eq, ord), the first 4 bytes are enough to know the answer, e.g., if the inlined 4 bytes differ, we can directly return unequal without looking at the full string.

§Order check flow

(1) If both strings are no longer than MAX_INLINE_VIEW_LEN bytes, we can directly compare the data inlined in the views.
(2) If either string is longer than MAX_INLINE_VIEW_LEN bytes, we first compare the inlined 4 bytes:
(2.1) If the inlined 4 bytes differ, we can return the result immediately.
(2.2) Otherwise, we need to compare the full strings.

§Safety

The left_idx and right_idx must be within range of their respective arrays
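The order check flow can be sketched on a simplified model. Element, element, and compare are hypothetical names: prefix stands for the 4 bytes kept in the view (zero padded), and data stands for the full bytes, which for long strings would require a second memory access.

```rust
use std::cmp::Ordering;

/// Longest value stored entirely inside a view (12 bytes).
const MAX_INLINE_VIEW_LEN: usize = 12;

/// Hypothetical model of one array element.
struct Element<'a> {
    prefix: [u8; 4],
    data: &'a [u8],
}

/// Build the model from raw bytes, zero padding the prefix like a view does.
fn element(data: &[u8]) -> Element<'_> {
    let mut prefix = [0u8; 4];
    let n = data.len().min(4);
    prefix[..n].copy_from_slice(&data[..n]);
    Element { prefix, data }
}

fn compare(left: &Element<'_>, right: &Element<'_>) -> Ordering {
    // (1) Both values are inline: compare the inlined data directly.
    if left.data.len() <= MAX_INLINE_VIEW_LEN && right.data.len() <= MAX_INLINE_VIEW_LEN {
        return left.data.cmp(right.data);
    }
    match left.prefix.cmp(&right.prefix) {
        // (2.2) Prefixes tie: fall back to the full bytes.
        Ordering::Equal => left.data.cmp(right.data),
        // (2.1) Prefixes differ: the answer is known without the buffers.
        other => other,
    }
}
```

Zero padding keeps step (2.1) consistent with lexicographic order: a value shorter than 4 bytes can only differ from a longer one at a padded position, where 0 is the smallest possible byte.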

pub fn inline_key_fast(raw: u128) -> u128

Builds a 128-bit composite key for an inline value:

High 96 bits: the inline data in big-endian byte order (for correct lexicographical sorting).
Low 32 bits: the length in big-endian byte order, acting as a tiebreaker so shorter strings (or those with fewer meaningful bytes) always numerically sort before longer ones.

This function extracts the length and the 12-byte inline string data from the raw little-endian u128 representation, converts them to big-endian ordering, and packs them into a single u128 value suitable for fast, branchless comparisons.

§Why include length?

A pure 96-bit content comparison can’t distinguish between two values whose inline bytes compare equal—either because one is a true prefix of the other or because zero-padding hides extra bytes. By tucking the 32-bit length into the lower bits, a single u128 compare handles both content and length in one go.

Example: comparing “bar” (3 bytes) vs “bar\0” (4 bytes)

String     Bytes 0–4 (length LE)   Bytes 4–16 (data + padding)
"bar"      03 00 00 00             62 61 72 + 9 × 00
"bar\0"    04 00 00 00             62 61 72 00 + 8 × 00

Both inline parts become 62 61 72 00…00, so they tie on content. The length field then differentiates:

key("bar")   = 0x62617200000000000000000000000003
key("bar\0") = 0x62617200000000000000000000000004
⇒ key("bar") < key("bar\0")
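The key construction described above can be sketched with stdlib byte conversions. inline_key and inline_view are hypothetical helpers that follow the description (12 data bytes in the high 96 bits, big-endian length in the low 32 bits); the library's own code may differ in detail.

```rust
/// Build the composite key from a raw inline view (hypothetical sketch).
fn inline_key(raw: u128) -> u128 {
    let raw_bytes = raw.to_le_bytes(); // [0..4] = length (LE), [4..16] = data
    let length = raw as u32; // low 32 bits hold the length
    let mut buf = [0u8; 16];
    buf[0..12].copy_from_slice(&raw_bytes[4..16]); // data, most significant first
    buf[12..16].copy_from_slice(&length.to_be_bytes()); // length as tiebreaker
    u128::from_be_bytes(buf)
}

/// Build a raw inline view for `data`, which must be at most 12 bytes.
fn inline_view(data: &[u8]) -> u128 {
    assert!(data.len() <= 12);
    let mut bytes = [0u8; 16];
    bytes[0..4].copy_from_slice(&(data.len() as u32).to_le_bytes());
    bytes[4..4 + data.len()].copy_from_slice(data);
    u128::from_le_bytes(bytes)
}
```

Comparing the keys of two inline views with a plain `<` then orders them by content first and by length second.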

§

impl GenericByteViewArray<BinaryViewType>

pub fn to_string_view( self, ) -> Result<GenericByteViewArray<StringViewType>, ArrowError>

Convert the BinaryViewArray to StringViewArray. If any item is not valid UTF-8 data, validation fails and an error is returned.

pub unsafe fn to_string_view_unchecked( self, ) -> GenericByteViewArray<StringViewType>

Convert the BinaryViewArray to StringViewArray

§Safety

Caller is responsible for ensuring that items in array are utf8 data.

§

impl GenericByteViewArray<StringViewType>

pub fn to_binary_view(self) -> GenericByteViewArray<BinaryViewType>

Convert the StringViewArray to BinaryViewArray

pub fn is_ascii(&self) -> bool

Returns true if all data within this array is ASCII

Trait Implementations§

§

impl<T> Array for GenericByteViewArray<T>
where T: ByteViewType + ?Sized,

§

fn as_any(&self) -> &(dyn Any + 'static)

Returns the array as Any so that it can be downcast to a specific implementation.

§

fn to_data(&self) -> ArrayData

Returns the underlying data of this array

§

fn into_data(self) -> ArrayData

Returns the underlying data of this array

§

fn data_type(&self) -> &DataType

Returns a reference to the DataType of this array.

§

fn slice(&self, offset: usize, length: usize) -> Arc<dyn Array>

Returns a zero-copy slice of this array with the indicated offset and length.

§

fn len(&self) -> usize

Returns the length (i.e., number of elements) of this array.

§

fn is_empty(&self) -> bool

Returns whether this array is empty.

§

fn shrink_to_fit(&mut self)

Shrinks the capacity of any exclusively owned buffer as much as possible

§

fn offset(&self) -> usize

Returns the offset into the underlying data used by this array(-slice). Note that the underlying data can be shared by many arrays. This defaults to 0.

§

fn nulls(&self) -> Option<&NullBuffer>

Returns the null buffer of this array if any.

§

fn logical_null_count(&self) -> usize

Returns the total number of logical null values in this array.

§

fn get_buffer_memory_size(&self) -> usize

Returns the total number of bytes of memory pointed to by this array. The buffers store bytes in the Arrow memory format, and include the data as well as the validity map. Note that this does not always correspond to the exact memory usage of an array, since multiple arrays can share the same buffers or slices thereof.

§

fn get_array_memory_size(&self) -> usize

Returns the total number of bytes of memory occupied physically by this array. This value will always be greater than the value returned by get_buffer_memory_size(), since it also includes the overhead of the data structures that contain the pointers to the various buffers.

§

fn logical_nulls(&self) -> Option<NullBuffer>

Returns a potentially computed NullBuffer that represents the logical null values of this array, if any.

§

fn is_null(&self, index: usize) -> bool

Returns whether the element at index is null according to Array::nulls

§

fn is_valid(&self, index: usize) -> bool

Returns whether the element at index is not null, the opposite of Self::is_null.

§

fn null_count(&self) -> usize

Returns the total number of physical null values in this array.

§

fn is_nullable(&self) -> bool

Returns false if the array is guaranteed to not contain any logical nulls

§

impl<'a, T> ArrayAccessor for &'a GenericByteViewArray<T>
where T: ByteViewType + ?Sized,

§

type Item = &'a <T as ByteViewType>::Native

The Arrow type of the element being accessed.

§

fn value( &self, index: usize, ) -> <&'a GenericByteViewArray<T> as ArrayAccessor>::Item

Returns the element at the given index

§

unsafe fn value_unchecked( &self, index: usize, ) -> <&'a GenericByteViewArray<T> as ArrayAccessor>::Item

Returns the element at the given index

§

impl<'a> BinaryArrayType<'a> for &'a GenericByteViewArray<BinaryViewType>

§

fn iter(&self) -> ArrayIter<&'a GenericByteViewArray<BinaryViewType>>

Constructs a new iterator

§

impl<T> Clone for GenericByteViewArray<T>
where T: ByteViewType + ?Sized,

§

fn clone(&self) -> GenericByteViewArray<T>

Returns a duplicate of the value.

const fn clone_from(&mut self, source: &Self)

Performs copy-assignment from source.

§

impl<T> Debug for GenericByteViewArray<T>
where T: ByteViewType + ?Sized,

§

fn fmt(&self, f: &mut Formatter<'_>) -> Result<(), Error>

Formats the value using the given formatter.

§

impl<FROM, V> From<&GenericByteArray<FROM>> for GenericByteViewArray<V>
where FROM: ByteArrayType, <FROM as ByteArrayType>::Offset: OffsetSizeTrait + ToPrimitive, V: ByteViewType<Native = <FROM as ByteArrayType>::Native>,

Efficiently convert a GenericByteArray to a GenericByteViewArray

For example this method can convert a StringArray to a StringViewArray.

If the offsets are all less than u32::MAX, the new GenericByteViewArray is built without copying the underlying string data (views are created directly into the existing buffer)
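The zero-copy idea can be sketched as follows: each pair of adjacent offsets becomes one view that either inlines the value or points back into the existing values buffer. The View enum and views_from_offsets are hypothetical and use a decoded representation instead of packed u128 values; the actual conversion is not shown in this document.

```rust
/// Hypothetical decoded view, used only for this sketch.
#[derive(Debug, PartialEq)]
enum View {
    /// Values of 12 bytes or fewer are copied into the view itself.
    Inline(Vec<u8>),
    /// Longer values keep a 4-byte prefix and point into the existing buffer.
    Ref { length: u32, prefix: [u8; 4], buffer_index: u32, offset: u32 },
}

/// Turn an offsets + values layout into views over the same values buffer.
/// The values buffer becomes buffer 0 and its bytes are not copied for long values.
fn views_from_offsets(offsets: &[i32], values: &[u8]) -> Vec<View> {
    offsets
        .windows(2)
        .map(|w| {
            let (start, end) = (w[0] as usize, w[1] as usize);
            let data = &values[start..end];
            if data.len() <= 12 {
                View::Inline(data.to_vec())
            } else {
                View::Ref {
                    length: data.len() as u32,
                    prefix: [data[0], data[1], data[2], data[3]],
                    buffer_index: 0,
                    offset: start as u32,
                }
            }
        })
        .collect()
}
```

The view offset is a 32-bit field, which is why the condition above requires every offset to fit below u32::MAX before the existing buffer can be reused.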

§