Struct GenericByteViewArray
pub struct GenericByteViewArray<T>
where
T: ByteViewType + ?Sized,
{
data_type: DataType,
views: ScalarBuffer<u128>,
buffers: Vec<Buffer>,
phantom: PhantomData<T>,
nulls: Option<NullBuffer>,
}
Variable-size Binary View Layout: an array of variable-length byte views.
This array type is used to store variable-length byte data (e.g. Strings, Binary) and has efficient operations such as take, filter, and comparison.
This is different from GenericByteArray, which also stores variable-length byte data: here each string is represented by a view that records its length and, for long strings, a buffer index and offset. take and filter-like operations are implemented by manipulating the “views” (u128) without modifying the bytes. Each view also stores an inlined prefix, which speeds up comparisons.
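To illustrate why such operations are cheap, here is a minimal std-only sketch (not the arrow kernels). The helpers `take_views` and `filter_views` are hypothetical names introduced for illustration; they copy only the 16-byte views and never read or write the data buffers.

```rust
/// Hypothetical `take`: gathers views by index; the data buffers are untouched.
fn take_views(views: &[u128], indices: &[usize]) -> Vec<u128> {
    indices.iter().map(|&i| views[i]).collect()
}

/// Hypothetical `filter`: keeps the views whose mask entry is `true`.
fn filter_views(views: &[u128], mask: &[bool]) -> Vec<u128> {
    views
        .iter()
        .zip(mask)
        .filter(|(_, keep)| **keep)
        .map(|(view, _)| *view)
        .collect()
}

fn main() {
    // Stand-in values; real views follow the layout described in this page.
    let views = [10u128, 20, 30];
    assert_eq!(take_views(&views, &[2, 0, 2]), vec![30, 10, 30]);
    assert_eq!(filter_views(&views, &[true, false, true]), vec![10, 30]);
}
```

Because the output views still refer to the original buffers, the result can share those buffers with the input instead of copying string data.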
§See Also
- StringViewArray for storing UTF-8 encoded string data
- BinaryViewArray for storing bytes
- ByteView to interpret the u128 layout of the views
§Layout: “views” and buffers
A GenericByteViewArray stores variable length byte strings. An array of N elements is stored as N fixed length “views” and a variable number of variable length “buffers”.
Each view is a u128 value whose layout is different depending on the length of the string stored at that location:
                        ┌──────┬────────────────────────┐
                        │length│      string value      │
   Strings (len <= 12)  │      │    (padded with 0)     │
                        └──────┴────────────────────────┘
                         0    31                      127

                        ┌───────┬───────┬───────┬───────┐
                        │length │prefix │  buf  │offset │
   Strings (len > 12)   │       │       │ index │       │
                        └───────┴───────┴───────┴───────┘
                         0    31       63      95    127
- Strings with length <= 12 (MAX_INLINE_VIEW_LEN) are stored directly in the view. See Self::inline_value to access the inlined value of a short view.
- Strings with length > 12: the first four bytes are stored inline in the view and the entire string is stored in one of the buffers. See ByteView to access the fields of these views.
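The two layouts can be sketched with the standard library alone. `make_view` below is a hypothetical helper, not part of arrow; it assumes the little-endian field order shown in the diagram (length, then either the inline value or prefix, buffer index, and offset).

```rust
/// Maximum number of bytes stored directly in a view (12, per the layout above).
const MAX_INLINE_VIEW_LEN: usize = 12;

/// Hypothetical helper: packs `data` into a little-endian `u128` view.
/// `buffer_index` and `offset` are only used for values longer than 12 bytes.
fn make_view(data: &[u8], buffer_index: u32, offset: u32) -> u128 {
    let mut bytes = [0u8; 16];
    // Bits 0..32: the length.
    bytes[0..4].copy_from_slice(&(data.len() as u32).to_le_bytes());
    if data.len() <= MAX_INLINE_VIEW_LEN {
        // Short value: the bytes themselves, zero padded to 12 bytes.
        bytes[4..4 + data.len()].copy_from_slice(data);
    } else {
        // Long value: 4-byte prefix, then buffer index, then offset.
        bytes[4..8].copy_from_slice(&data[..4]);
        bytes[8..12].copy_from_slice(&buffer_index.to_le_bytes());
        bytes[12..16].copy_from_slice(&offset.to_le_bytes());
    }
    u128::from_le_bytes(bytes)
}

fn main() {
    let short = make_view(b"hello", 0, 0);
    assert_eq!(short as u32, 5); // the low 32 bits hold the length
    assert_eq!(&short.to_le_bytes()[4..9], b"hello");

    let long = make_view(b"this string is longer than 12 bytes", 1, 7);
    assert_eq!(long as u32, 35);
    assert_eq!(&long.to_le_bytes()[4..8], b"this"); // inlined prefix
    assert_eq!((long >> 64) as u32, 1); // buffer index
    assert_eq!((long >> 96) as u32, 7); // offset
}
```

In real code, ByteView is the intended way to read and build these fields; the sketch only makes the bit positions concrete.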
As with other arrays, the optimized kernels in arrow_compute are likely the easiest and fastest way to work with this data. However, it is possible to access the views and buffers directly for more control.
For example:
use arrow_array::{Array, StringViewArray};
use arrow_data::ByteView;
let array = StringViewArray::from(vec![
"hello",
"this string is longer than 12 bytes",
"this string is also longer than 12 bytes"
]);
// ** Examine the first view (short string) **
assert!(array.is_valid(0)); // Check for nulls
let short_view: u128 = array.views()[0]; // "hello"
// get length of the string
let len = short_view as u32;
assert_eq!(len, 5); // strings of 12 bytes or fewer are stored inline in the view
// SAFETY: `short_view` is a valid view and `len` is the length of its inlined value
let value = unsafe {
StringViewArray::inline_value(&short_view, len as usize)
};
assert_eq!(value, b"hello");
// ** Examine the third view (long string) **
assert!(array.is_valid(2)); // Check for nulls
let long_view: u128 = array.views()[2]; // "this string is also longer than 12 bytes"
let len = long_view as u32;
assert_eq!(len, 40); // strings longer than 12 bytes are stored in the buffer
let view = ByteView::from(long_view); // use ByteView to access the fields
assert_eq!(view.length, 40);
assert_eq!(view.buffer_index, 0);
assert_eq!(view.offset, 35); // data starts after the first long string
// Views for long strings store a 4 byte prefix
let prefix = view.prefix.to_le_bytes();
assert_eq!(&prefix, b"this");
let value = array.value(2); // get the string value (see `value` implementation for how to access the bytes directly)
assert_eq!(value, "this string is also longer than 12 bytes");
Unlike GenericByteArray, there are no constraints on the offsets other than that they must point into a valid buffer. In particular, they can be out of order, non-contiguous, and overlapping.
For example, in the following diagram, the strings “FishWasInTownTodayYay” and “CrumpleFacedFish” are both longer than 12 bytes and thus are stored in a separate buffer, while the string “LavaMonster” is stored inline in the view. In this case, the same bytes for “Fish” are used to store both strings.
                                                                            ┌───┐
                        ┌──────┬──────┬──────┬──────┐               offset  │...│
"FishWasInTownTodayYay" │  21  │ Fish │  0   │ 115  │─ ─              103   │Mr.│
                        └──────┴──────┴──────┴──────┘   │      ┌ ─ ─ ─ ─ ─ ▶│Cru│
                        ┌──────┬──────┬──────┬──────┐                       │mpl│
"CrumpleFacedFish"      │  16  │ Crum │  0   │ 103  │─ ─│─ ─ ─ ┘            │eFa│
                        └──────┴──────┴──────┴──────┘                       │ced│
                        ┌──────┬────────────────────┐   └ ─ ─ ─ ─ ─ ─ ─ ─ ─▶│Fis│
"LavaMonster"           │  11  │    LavaMonster     │                       │hWa│
                        └──────┴────────────────────┘               offset  │sIn│
                                                                      115   │Tow│
                                                                            │nTo│
                                                                            │day│
                                u128 "views"                                │Yay│
                                                                   buffer 0 │...│
                                                                            └───┘
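The overlap in the diagram can be reproduced with a small std-only sketch. `LongView`, `resolve`, and `example_buffers` are hypothetical names, and the 100 filler bytes in front of the shared text are an assumption chosen so that the offsets 103 and 115 from the diagram line up.

```rust
/// Fields of a long view, mirroring the layout described above.
struct LongView {
    length: u32,
    prefix: [u8; 4],
    buffer_index: u32,
    offset: u32,
}

/// Hypothetical helper: resolves a long view to the bytes it points at.
fn resolve<'a>(view: &LongView, buffers: &'a [Vec<u8>]) -> &'a [u8] {
    let start = view.offset as usize;
    let end = start + view.length as usize;
    &buffers[view.buffer_index as usize][start..end]
}

/// Builds buffer 0 from the diagram: filler up to offset 100, then the shared text.
fn example_buffers() -> Vec<Vec<u8>> {
    let mut buffer = vec![b'.'; 100];
    buffer.extend_from_slice(b"Mr.CrumpleFacedFishWasInTownTodayYay");
    vec![buffer]
}

fn main() {
    let buffers = example_buffers();
    let fish = LongView { length: 21, prefix: *b"Fish", buffer_index: 0, offset: 115 };
    let crumple = LongView { length: 16, prefix: *b"Crum", buffer_index: 0, offset: 103 };

    assert_eq!(resolve(&fish, &buffers), b"FishWasInTownTodayYay");
    assert_eq!(resolve(&crumple, &buffers), b"CrumpleFacedFish");
    // Bytes 115..119 ("Fish") are shared: they end one value and start the other.
    assert_eq!(&resolve(&crumple, &buffers)[12..], &fish.prefix);
}
```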
Fields§
§data_type: DataType
§views: ScalarBuffer<u128>
§buffers: Vec<Buffer>
§phantom: PhantomData<T>
§nulls: Option<NullBuffer>
Implementations§
§impl<T> GenericByteViewArray<T>
where
T: ByteViewType + ?Sized,
pub fn new(
views: ScalarBuffer<u128>,
buffers: Vec<Buffer>,
nulls: Option<NullBuffer>,
) -> GenericByteViewArray<T>
Create a new GenericByteViewArray from the provided parts, panicking on failure
§Panics
Panics if GenericByteViewArray::try_new returns an error
pub fn try_new(
views: ScalarBuffer<u128>,
buffers: Vec<Buffer>,
nulls: Option<NullBuffer>,
) -> Result<GenericByteViewArray<T>, ArrowError>
Create a new GenericByteViewArray from the provided parts, returning an error on failure
§Errors
- views.len() != nulls.len()
- ByteViewType::validate fails
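A rough std-only sketch of these two error cases follows. `check_parts` is a hypothetical stand-in, not the arrow implementation: the null buffer is modelled as one flag per view, and the second check assumes only what the layout section states (a long view must point into a valid buffer). The real ByteViewType::validate may check more, for example UTF-8 validity for string views.

```rust
const MAX_INLINE_VIEW_LEN: u32 = 12;

/// Hypothetical stand-in for the validation behind `try_new`.
/// `nulls` is modelled as one validity flag per view.
fn check_parts(
    views: &[u128],
    buffers: &[Vec<u8>],
    nulls: Option<&[bool]>,
) -> Result<(), String> {
    // Error case 1: the null buffer and the views disagree on length.
    if let Some(nulls) = nulls {
        if nulls.len() != views.len() {
            return Err(format!("{} views but {} null flags", views.len(), nulls.len()));
        }
    }
    // Error case 2 (assumed): a long view must point into a valid buffer.
    for (i, view) in views.iter().enumerate() {
        let len = *view as u32;
        if len <= MAX_INLINE_VIEW_LEN {
            continue; // inline views reference no buffer
        }
        let buffer_index = (*view >> 64) as u32 as usize;
        let offset = (*view >> 96) as u32 as usize;
        let buffer = buffers
            .get(buffer_index)
            .ok_or_else(|| format!("view {i}: buffer {buffer_index} does not exist"))?;
        if offset + len as usize > buffer.len() {
            return Err(format!("view {i}: range runs past the end of its buffer"));
        }
    }
    Ok(())
}

fn main() {
    // A view of length 20 at offset 0 of buffer 0 (the prefix is ignored here).
    let long_view = 20u128;
    assert!(check_parts(&[long_view], &[vec![0u8; 20]], None).is_ok());
    assert!(check_parts(&[long_view], &[vec![0u8; 19]], None).is_err());
    assert!(check_parts(&[long_view], &[], None).is_err());
    assert!(check_parts(&[5], &[], Some(&[true, false][..])).is_err());
}
```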
pub unsafe fn new_unchecked(
views: ScalarBuffer<u128>,
buffers: Vec<Buffer>,
nulls: Option<NullBuffer>,
) -> GenericByteViewArray<T>
Create a new GenericByteViewArray from the provided parts, without validation
§Safety
Safe if Self::try_new would not error
pub fn new_null(len: usize) -> GenericByteViewArray<T>
Create a new GenericByteViewArray of length len where all values are null
pub fn new_scalar(
value: impl AsRef<<T as ByteViewType>::Native>,
) -> Scalar<GenericByteViewArray<T>>
Create a new Scalar from value
pub fn from_iter_values<Ptr, I>(iter: I) -> GenericByteViewArray<T>
Creates a GenericByteViewArray based on an iterator of values without nulls
pub fn into_parts(self) -> (ScalarBuffer<u128>, Vec<Buffer>, Option<NullBuffer>)
Deconstruct this array into its constituent parts
pub fn views(&self) -> &ScalarBuffer<u128>
Returns the views buffer
pub fn data_buffers(&self) -> &[Buffer]
Returns the buffers storing string data
pub fn value(&self, i: usize) -> &<T as ByteViewType>::Native
pub unsafe fn value_unchecked(&self, idx: usize) -> &<T as ByteViewType>::Native
Returns the element at index idx without bounds checking
§Safety
Caller is responsible for ensuring that the index is within the bounds of the array
pub unsafe fn inline_value(view: &u128, len: usize) -> &[u8]
Returns the first len bytes of the inline value of the view.
§Safety
- The view must be a valid element from Self::views() that adheres to the view layout.
- The len must be the length of the inlined value. It should never be larger than MAX_INLINE_VIEW_LEN.
pub fn iter(&self) -> ArrayIter<&GenericByteViewArray<T>>
Constructs a new iterator for iterating over the values of this array
pub fn bytes_iter(&self) -> impl Iterator<Item = &[u8]>
Returns an iterator over the bytes of this array, including null values
pub fn prefix_bytes_iter(
&self,
prefix_len: usize,
) -> impl Iterator<Item = &[u8]>
Returns an iterator over the first prefix_len bytes of each array element, including null values.
If prefix_len is larger than the element’s length, the iterator will return an empty slice (&[]).
pub fn suffix_bytes_iter(
&self,
suffix_len: usize,
) -> impl Iterator<Item = &[u8]>
Returns an iterator over the last suffix_len bytes of each array element, including null values.
Note that for StringViewArray the last bytes may start in the middle of a UTF-8 codepoint, and thus may not be a valid &str.
If suffix_len is larger than the element’s length, the iterator will return an empty slice (&[]).
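The per-element behaviour of the two iterators can be sketched on a single byte slice. `prefix_bytes` and `suffix_bytes` are hypothetical std-only helpers that follow the rules stated above; they are not the arrow implementation.

```rust
/// Returns the first `prefix_len` bytes, or an empty slice if `value` is shorter.
fn prefix_bytes(value: &[u8], prefix_len: usize) -> &[u8] {
    value.get(..prefix_len).unwrap_or(&[])
}

/// Returns the last `suffix_len` bytes, or an empty slice if `value` is shorter.
fn suffix_bytes(value: &[u8], suffix_len: usize) -> &[u8] {
    match value.len().checked_sub(suffix_len) {
        Some(start) => &value[start..],
        None => &[],
    }
}

fn main() {
    // "h\u{e9}llo" is 6 bytes long: U+00E9 encodes as the two bytes 0xC3 0xA9.
    let value = "h\u{e9}llo".as_bytes();

    assert_eq!(prefix_bytes(value, 1), b"h");
    assert!(prefix_bytes(value, 7).is_empty()); // longer than the element

    // The last 4 bytes start in the middle of the two-byte codepoint...
    let suffix = suffix_bytes(value, 4);
    assert_eq!(suffix, &[0xA9, b'l', b'l', b'o']);
    // ...so they are not valid UTF-8 on their own.
    assert!(std::str::from_utf8(suffix).is_err());
    assert!(suffix_bytes(value, 7).is_empty());
}
```

This is why the suffix iterator yields &[u8] even for string arrays: converting each item to &str would have to be fallible.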
pub fn slice(&self, offset: usize, length: usize) -> GenericByteViewArray<T>
Returns a zero-copy slice of this array with the indicated offset and length.
pub fn gc(&self) -> GenericByteViewArray<T>
Returns a “compacted” version of this array.
The original array will not be modified.
§Garbage Collection
Before GC:
                                       ┌──────┐
                                       │......│
                                       │......│
┌────────────────────┐       ┌ ─ ─ ─ ▶ │Data1 │   Large buffer
│       View 1       │─ ─ ─ ─          │......│  with data that
├────────────────────┤                 │......│ is not referred
│       View 2       │─ ─ ─ ─ ─ ─ ─ ─▶ │Data2 │ to by View 1 or
└────────────────────┘                 │......│      View 2
                                       │......│
   2 views, refer to                   │......│
  small portions of a                  └──────┘
     large buffer

After GC:
┌────────────────────┐                 ┌─────┐    After gc, only
│       View 1       │─ ─ ─ ─ ─ ─ ─ ─▶ │Data1│     data that is
├────────────────────┤       ┌ ─ ─ ─ ▶ │Data2│    pointed to by
│       View 2       │─ ─ ─ ─          └─────┘     the views is
└────────────────────┘                                 left

       2 views
This method compacts the data buffers by recreating the view array and including only the data that is pointed to by the views.
Note that it copies the array regardless of whether the original array is already compact. This can be an expensive operation, so use it only when you are sure that the view array is significantly smaller than when it was originally created, e.g., after filtering or slicing.
Note: this function does not attempt to canonicalize / deduplicate values. For this feature see GenericByteViewBuilder::with_deduplicate_strings.
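The idea behind compaction can be sketched without arrow. `Span` and `compact` are hypothetical, simplified names: a view is reduced to an (offset, length) pair into a single buffer, and inline views are left out because they reference no buffer.

```rust
/// Simplified long view: a range in one shared data buffer.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Span {
    offset: usize,
    len: usize,
}

/// Copies only the referenced bytes into a fresh buffer and rewrites the offsets.
/// Like `gc`, this always copies and does not deduplicate repeated ranges.
fn compact(views: &[Span], buffer: &[u8]) -> (Vec<Span>, Vec<u8>) {
    let mut new_buffer = Vec::with_capacity(views.iter().map(|v| v.len).sum());
    let new_views = views
        .iter()
        .map(|v| {
            let offset = new_buffer.len();
            new_buffer.extend_from_slice(&buffer[v.offset..v.offset + v.len]);
            Span { offset, len: v.len }
        })
        .collect();
    (new_views, new_buffer)
}

fn main() {
    // A large buffer of which only two small ranges are still referenced.
    let mut buffer = vec![b'.'; 1000];
    buffer[300..313].copy_from_slice(b"Data1 payload");
    buffer[600..613].copy_from_slice(b"Data2 payload");
    let views = [Span { offset: 300, len: 13 }, Span { offset: 600, len: 13 }];

    let (new_views, new_buffer) = compact(&views, &buffer);
    assert_eq!(new_buffer.as_slice(), b"Data1 payloadData2 payload");
    assert_eq!(
        new_views,
        vec![Span { offset: 0, len: 13 }, Span { offset: 13, len: 13 }]
    );
}
```

The sketch shrinks 1000 bytes to 26, which is the situation where the copy pays off; on an already compact array the same work would be done for no gain.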
pub fn total_buffer_bytes_used(&self) -> usize
Returns the total number of bytes used by all non-inlined views in all buffers.
Note that this does not account for views that point at the same underlying data in buffers.
For example, if the array has three string views:
- View with length = 9 (inlined)
- View with length = 32 (non-inlined)
- View with length = 16 (non-inlined)
Then this method would report 48.
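The arithmetic in this example can be written out as a short std-only sketch. The free function below is a hypothetical model that works on raw views; it assumes the length sits in the low 32 bits of each view, as described in the layout section.

```rust
const MAX_INLINE_VIEW_LEN: u32 = 12;

/// Sums the lengths of all views that are too long to be inlined.
fn total_buffer_bytes_used(views: &[u128]) -> usize {
    views
        .iter()
        .map(|view| *view as u32) // low 32 bits: the length
        .filter(|len| *len > MAX_INLINE_VIEW_LEN)
        .map(|len| len as usize)
        .sum()
}

fn main() {
    // Views carrying only a length; the other fields are left as zero.
    let views = [9u128, 32, 16];
    assert_eq!(total_buffer_bytes_used(&views), 32 + 16);
}
```

Two views that point at the same 32 bytes would count as 64, which is the double counting the note above refers to.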
pub unsafe fn compare_unchecked(
left: &GenericByteViewArray<T>,
left_idx: usize,
right: &GenericByteViewArray<T>,
right_idx: usize,
) -> Ordering
Compare two GenericByteViewArray at index left_idx and right_idx
Comparing two ByteView types is non-trivial. It takes a bit of patience to understand why we don’t just compare two &[u8] directly.
ByteView types give us the following two advantages, and we need to be careful not to lose them:
(1) For strings/bytes of at most MAX_INLINE_VIEW_LEN bytes, the entire data is inlined in the view. This means that reading one array element requires only one memory access (StringArray requires two: one for the offset buffer, the other for the value buffer).
(2) For strings/bytes longer than MAX_INLINE_VIEW_LEN bytes, we can still be faster than StringArray/ByteArray for certain operations, thanks to the inlined 4 bytes. Consider an equality check: if the first four bytes of the two strings are different, we can return false immediately (with just one memory access).
If we directly compare two &[u8], we materialize the entire string (i.e., make multiple memory accesses), which might be unnecessary.
- Most of the time (eq, ord), we only need to look at the first 4 bytes to know the answer, e.g., if the inlined 4 bytes are different, we can directly return unequal without looking at the full string.
§Order check flow
(1) If both strings are at most MAX_INLINE_VIEW_LEN bytes long, we can directly compare the data inlined in the views.
(2) If either string is longer than MAX_INLINE_VIEW_LEN bytes, we may need to compare the full strings:
(2.1) if the inlined 4 bytes are different, we can return the result immediately.
(2.2) otherwise, we need to compare the full strings.
§Safety
left_idx and right_idx must be within the range of their respective arrays.
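The order check flow can be sketched with the standard library. `View` and `compare` are hypothetical, simplified names: the sketch keeps a zero-padded 4-byte prefix next to a plain slice, whereas a real view stores a buffer index and offset and would only touch the buffer in step (2.2).

```rust
use std::cmp::Ordering;

const MAX_INLINE_VIEW_LEN: usize = 12;

/// Simplified view: the first four bytes (zero padded) plus the full value.
struct View<'a> {
    prefix: [u8; 4],
    data: &'a [u8],
}

impl<'a> View<'a> {
    fn new(data: &'a [u8]) -> Self {
        let mut prefix = [0u8; 4];
        let n = data.len().min(4);
        prefix[..n].copy_from_slice(&data[..n]);
        View { prefix, data }
    }
}

fn compare(left: &View<'_>, right: &View<'_>) -> Ordering {
    // (1) Both values are inline: compare the inlined data directly.
    if left.data.len() <= MAX_INLINE_VIEW_LEN && right.data.len() <= MAX_INLINE_VIEW_LEN {
        return left.data.cmp(right.data);
    }
    // (2.1) Different prefixes decide the order without reading any buffer.
    if left.prefix != right.prefix {
        return left.prefix.cmp(&right.prefix);
    }
    // (2.2) Equal prefixes: fall back to comparing the full values.
    left.data.cmp(right.data)
}

fn main() {
    let a = View::new(b"this string is longer than 12 bytes");
    let b = View::new(b"that string is longer than 12 bytes");
    let c = View::new(b"this string is also longer than 12 bytes");
    assert_eq!(compare(&a, &b), Ordering::Greater); // decided by "this" vs "that"
    assert_eq!(compare(&c, &a), Ordering::Less); // same prefix, full comparison
    assert_eq!(compare(&View::new(b"apple"), &View::new(b"apricot")), Ordering::Less);
}
```

Step (2.1) agrees with a full lexicographic comparison because the padding byte is zero, the smallest possible value, so a shorter value can never compare greater on its padded positions.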
pub fn inline_key_fast(raw: u128) -> u128
Builds a 128-bit composite key for an inline value:
- High 96 bits: the inline data in big-endian byte order (for correct lexicographical sorting).
- Low 32 bits: the length in big-endian byte order, acting as a tiebreaker so shorter strings (or those with fewer meaningful bytes) always numerically sort before longer ones.
This function extracts the length and the 12-byte inline string data from the raw little-endian u128 representation, converts them to big-endian ordering, and packs them into a single u128 value suitable for fast, branchless comparisons.
§Why include length?
A pure 96-bit content comparison can’t distinguish between two values whose inline bytes compare equal, either because one is a true prefix of the other or because zero-padding hides extra bytes. By tucking the 32-bit length into the lower bits, a single u128 compare handles both content and length in one go.
Example: comparing “bar” (3 bytes) vs “bar\0” (4 bytes)
| String | Bytes 0–3 (length LE) | Bytes 4–15 (data + padding) |
|---|---|---|
| "bar" | 03 00 00 00 | 62 61 72 + 9 × 00 |
| "bar\0" | 04 00 00 00 | 62 61 72 00 + 8 × 00 |
Both inline parts become 62 61 72 00…00, so they tie on content. The length field then differentiates:
key("bar") = 0x62617200000000000000000000000003
key("bar\0") = 0x62617200000000000000000000000004
⇒ key("bar") < key("bar\0")
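The key construction described above can be sketched with the standard library. `inline_key` follows the stated bit layout (data in the high 96 bits, length in the low 32 bits), and `inline_view` is a hypothetical helper that builds a short view for the test; neither is presented as the exact arrow source.

```rust
/// Builds the composite key: inline data in the high 96 bits, length in the low 32.
fn inline_key(raw: u128) -> u128 {
    let raw_bytes = raw.to_le_bytes(); // [0..4] = length (LE), [4..16] = inline data
    let length = raw as u32;
    let mut buf = [0u8; 16];
    buf[0..12].copy_from_slice(&raw_bytes[4..16]); // data keeps its byte order
    buf[12..16].copy_from_slice(&length.to_be_bytes()); // length as tiebreaker
    u128::from_be_bytes(buf) // buf[0] becomes the most significant byte
}

/// Hypothetical helper: builds the view of a value of at most 12 bytes.
fn inline_view(data: &[u8]) -> u128 {
    assert!(data.len() <= 12, "only short values are stored inline");
    let mut bytes = [0u8; 16];
    bytes[0..4].copy_from_slice(&(data.len() as u32).to_le_bytes());
    bytes[4..4 + data.len()].copy_from_slice(data);
    u128::from_le_bytes(bytes)
}

fn main() {
    let bar = inline_key(inline_view(b"bar"));
    let bar_nul = inline_key(inline_view(b"bar\0"));
    assert_eq!(bar, 0x62617200_00000000_00000000_00000003);
    assert_eq!(bar_nul, 0x62617200_00000000_00000000_00000004);
    assert!(bar < bar_nul); // content ties, length decides
    assert!(bar < inline_key(inline_view(b"baz"))); // content decides
}
```

Comparing the raw views directly would not work: in the little-endian layout the length occupies the least significant bits and the last data byte the most significant ones, so numeric order would not match byte order.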
§impl GenericByteViewArray<BinaryViewType>
pub fn to_string_view(
self,
) -> Result<GenericByteViewArray<StringViewType>, ArrowError>
Convert the BinaryViewArray to StringViewArray
If any item is not valid UTF-8, validation fails and an error is returned.
pub unsafe fn to_string_view_unchecked(
self,
) -> GenericByteViewArray<StringViewType>
Convert the BinaryViewArray to StringViewArray without validation
§Safety
Caller is responsible for ensuring that the items in the array are valid UTF-8.
§impl GenericByteViewArray<StringViewType>
pub fn to_binary_view(self) -> GenericByteViewArray<BinaryViewType>
Convert the StringViewArray to BinaryViewArray
Trait Implementations§
§impl<T> Array for GenericByteViewArray<T>
where
T: ByteViewType + ?Sized,
§fn slice(&self, offset: usize, length: usize) -> Arc<dyn Array>
§fn shrink_to_fit(&mut self)
§fn offset(&self) -> usize
§fn nulls(&self) -> Option<&NullBuffer>
§fn logical_null_count(&self) -> usize
§fn get_buffer_memory_size(&self) -> usize
§fn get_array_memory_size(&self) -> usize
Returns the total number of bytes of memory occupied physically by this array. This value will always be greater than returned by get_buffer_memory_size() and includes the overhead of the data structures that contain the pointers to the various buffers.
§fn logical_nulls(&self) -> Option<NullBuffer>
Returns a potentially computed NullBuffer that represents the logical null values of this array, if any.
§fn null_count(&self) -> usize
§fn is_nullable(&self) -> bool
Returns false if the array is guaranteed to not contain any logical nulls
§impl<'a, T> ArrayAccessor for &'a GenericByteViewArray<T>
where
T: ByteViewType + ?Sized,
§type Item = &'a <T as ByteViewType>::Native
§fn value(
&self,
index: usize,
) -> <&'a GenericByteViewArray<T> as ArrayAccessor>::Item
Returns the element at index i
§unsafe fn value_unchecked(
&self,
index: usize,
) -> <&'a GenericByteViewArray<T> as ArrayAccessor>::Item
Returns the element at index i
§impl<'a> BinaryArrayType<'a> for &'a GenericByteViewArray<BinaryViewType>
§fn iter(&self) -> ArrayIter<&'a GenericByteViewArray<BinaryViewType>>
§impl<T> Clone for GenericByteViewArray<T>
where
T: ByteViewType + ?Sized,
§fn clone(&self) -> GenericByteViewArray<T>
§const fn clone_from(&mut self, source: &Self)
Performs copy-assignment from source.
§impl<T> Debug for GenericByteViewArray<T>
where
T: ByteViewType + ?Sized,
§impl<FROM, V> From<&GenericByteArray<FROM>> for GenericByteViewArray<V>
where
FROM: ByteArrayType,
<FROM as ByteArrayType>::Offset: OffsetSizeTrait + ToPrimitive,
V: ByteViewType<Native = <FROM as ByteArrayType>::Native>,
Efficiently convert a GenericByteArray to a GenericByteViewArray
For example, this method can convert a StringArray to a StringViewArray.
If the offsets are all less than u32::MAX, the new GenericByteViewArray is built without copying the underlying string data (views are created directly into the existing buffer).
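A std-only sketch of the zero-copy idea follows. `views_from_offsets` and `make_view` are hypothetical helpers; the sketch assumes i32 offsets (as in StringArray) and that the existing values buffer is reused as buffer 0, so only the views are newly allocated.

```rust
const MAX_INLINE_VIEW_LEN: usize = 12;

/// Hypothetical helper: packs a value into a little-endian `u128` view.
fn make_view(data: &[u8], buffer_index: u32, offset: u32) -> u128 {
    let mut bytes = [0u8; 16];
    bytes[0..4].copy_from_slice(&(data.len() as u32).to_le_bytes());
    if data.len() <= MAX_INLINE_VIEW_LEN {
        bytes[4..4 + data.len()].copy_from_slice(data); // inline value
    } else {
        bytes[4..8].copy_from_slice(&data[..4]); // prefix
        bytes[8..12].copy_from_slice(&buffer_index.to_le_bytes());
        bytes[12..16].copy_from_slice(&offset.to_le_bytes());
    }
    u128::from_le_bytes(bytes)
}

/// Hypothetical conversion: one view per consecutive (start, end) offset pair.
fn views_from_offsets(offsets: &[i32], values: &[u8]) -> Vec<u128> {
    offsets
        .windows(2)
        .map(|w| {
            let (start, end) = (w[0] as usize, w[1] as usize);
            // Long values keep pointing at `values`, reused as buffer 0.
            make_view(&values[start..end], 0, start as u32)
        })
        .collect()
}

fn main() {
    // Offset-based layout: one contiguous values buffer plus N + 1 offsets.
    let values = b"hellothis string is longer than 12 bytes";
    let offsets = [0, 5, 40];

    let views = views_from_offsets(&offsets, values);
    assert_eq!(views[0] as u32, 5); // "hello" is inlined
    assert_eq!(views[1] as u32, 35); // the long string stays in the buffer
    assert_eq!((views[1] >> 96) as u32, 5); // and is found at its old offset
}
```

The u32::MAX condition follows from the view layout: the offset field is only 32 bits wide, so a larger offset cannot be stored in a view that points into the original buffer.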