Apache Arrow (C++)
A columnar in-memory analytics layer designed to accelerate big data.
Public Member Functions | Protected Attributes | List of all members
arrow::RleDecoder Class Reference

Utility classes to do run length encoding (RLE) for fixed bit width values. More...

#include <arrow/util/rle-encoding.h>

Public Member Functions

 RleDecoder (const uint8_t *buffer, int buffer_len, int bit_width)
 Create a decoder object. More...
 
 RleDecoder ()
 
void Reset (const uint8_t *buffer, int buffer_len, int bit_width)
 
template<typename T >
bool Get (T *val)
 Gets the next value. Returns false if there are no more. More...
 
template<typename T >
int GetBatch (T *values, int batch_size)
 Gets a batch of values. Returns the number of decoded elements. More...
 
template<typename T >
int GetBatchWithDict (const T *dictionary, T *values, int batch_size)
 Like GetBatch but the values are then decoded using the provided dictionary. More...
 
template<typename T >
int GetBatchWithDictSpaced (const T *dictionary, T *values, int batch_size, int null_count, const uint8_t *valid_bits, int64_t valid_bits_offset)
 Like GetBatchWithDict but add spacing for null entries. More...
 

Protected Attributes

BitReader bit_reader_
 
int bit_width_
 Number of bits needed to encode the value. Must be between 0 and 64. More...
 
uint64_t current_value_
 
uint32_t repeat_count_
 
uint32_t literal_count_
 

Detailed Description

Utility classes to do run length encoding (RLE) for fixed bit width values.

If runs are sufficiently long, RLE is used, otherwise, the values are just bit-packed (literal encoding). For both types of runs, there is a byte-aligned indicator which encodes the length of the run and the type of the run. This encoding has the benefit that when there aren't any long enough runs, values are always decoded at fixed (can be precomputed) bit offsets OR both the value and the run length are byte aligned. This allows for very efficient decoding implementations. The encoding is: encoded-block := run* run := literal-run | repeated-run literal-run := literal-indicator < literal bytes > repeated-run := repeated-indicator < repeated value. padded to byte boundary > literal-indicator := varint_encode( number_of_groups << 1 | 1) repeated-indicator := varint_encode( number_of_repetitions << 1 ) Each run is preceded by a varint. The varint's least significant bit is used to indicate whether the run is a literal run or a repeated run. The rest of the varint is used to determine the length of the run (eg how many times the value repeats). In the case of literal runs, the run length is always a multiple of 8 (i.e. encode in groups of 8), so that no matter the bit-width of the value, the sequence will end on a byte boundary without padding. Given that we know it is a multiple of 8, we store the number of 8-groups rather than the actual number of encoded ints. (This means that the total number of encoded values can not be determined from the encoded data, since the number of values in the last group may not be a multiple of 8). For the last group of literal runs, we pad the group to 8 with zeros. This allows for 8 at a time decoding on the read side without the need for additional checks. There is a break-even point when it is more storage efficient to do run length encoding. For 1 bit-width values, that point is 8 values. They require 2 bytes for both the repeated encoding or the literal encoding. This value can always be computed based on the bit-width.

TODO: think about how to use this for strings. The bit packing isn't quite the same. Examples with bit-width 1 (eg encoding booleans):

100 1s followed by 100 0s: <varint(100 << 1)> <1, padded to 1 byte>  <varint(100 << 1)> <0, padded to 1 byte>

Constructor & Destructor Documentation

◆ RleDecoder() [1/2]

arrow::RleDecoder::RleDecoder ( const uint8_t *  buffer,
int  buffer_len,
int  bit_width 
)
inline

Create a decoder object.

buffer/buffer_len is the decoded data. bit_width is the width of each value (before encoding).

◆ RleDecoder() [2/2]

arrow::RleDecoder::RleDecoder ( )
inline

Member Function Documentation

◆ Get()

template<typename T >
bool arrow::RleDecoder::Get ( T *  val)
inline

Gets the next value. Returns false if there are no more.

◆ GetBatch()

template<typename T >
int arrow::RleDecoder::GetBatch ( T *  values,
int  batch_size 
)
inline

Gets a batch of values. Returns the number of decoded elements.

◆ GetBatchWithDict()

template<typename T >
int arrow::RleDecoder::GetBatchWithDict ( const T *  dictionary,
T *  values,
int  batch_size 
)
inline

Like GetBatch but the values are then decoded using the provided dictionary.

◆ GetBatchWithDictSpaced()

template<typename T >
int arrow::RleDecoder::GetBatchWithDictSpaced ( const T *  dictionary,
T *  values,
int  batch_size,
int  null_count,
const uint8_t *  valid_bits,
int64_t  valid_bits_offset 
)
inline

Like GetBatchWithDict but add spacing for null entries.

◆ Reset()

void arrow::RleDecoder::Reset ( const uint8_t *  buffer,
int  buffer_len,
int  bit_width 
)
inline

Member Data Documentation

◆ bit_reader_

BitReader arrow::RleDecoder::bit_reader_
protected

◆ bit_width_

int arrow::RleDecoder::bit_width_
protected

Number of bits needed to encode the value. Must be between 0 and 64.

◆ current_value_

uint64_t arrow::RleDecoder::current_value_
protected

◆ literal_count_

uint32_t arrow::RleDecoder::literal_count_
protected

◆ repeat_count_

uint32_t arrow::RleDecoder::repeat_count_
protected

The documentation for this class was generated from the following file: