parquet::column::writer

Struct GenericColumnWriter

Source
pub struct GenericColumnWriter<'a, E: ColumnValueEncoder> {
Show 18 fields descr: ColumnDescPtr, props: WriterPropertiesPtr, statistics_enabled: EnabledStatistics, page_writer: Box<dyn PageWriter + 'a>, codec: Compression, compressor: Option<Box<dyn Codec>>, encoder: E, page_metrics: PageMetrics, column_metrics: ColumnMetrics<E::T>, encodings: BTreeSet<Encoding>, def_levels_sink: Vec<i16>, rep_levels_sink: Vec<i16>, data_pages: VecDeque<CompressedPage>, column_index_builder: ColumnIndexBuilder, offset_index_builder: Option<OffsetIndexBuilder>, data_page_boundary_ascending: bool, data_page_boundary_descending: bool, last_non_null_data_page_min_max: Option<(E::T, E::T)>,
}
Expand description

Generic column writer for a primitive column.

Fields§

§descr: ColumnDescPtr§props: WriterPropertiesPtr§statistics_enabled: EnabledStatistics§page_writer: Box<dyn PageWriter + 'a>§codec: Compression§compressor: Option<Box<dyn Codec>>§encoder: E§page_metrics: PageMetrics§column_metrics: ColumnMetrics<E::T>§encodings: BTreeSet<Encoding>

The order of encodings within the generated metadata does not impact its meaning, but we use a BTreeSet so that the output is deterministic

§def_levels_sink: Vec<i16>§rep_levels_sink: Vec<i16>§data_pages: VecDeque<CompressedPage>§column_index_builder: ColumnIndexBuilder§offset_index_builder: Option<OffsetIndexBuilder>§data_page_boundary_ascending: bool§data_page_boundary_descending: bool§last_non_null_data_page_min_max: Option<(E::T, E::T)>

(min, max)

Implementations§

Source§

impl<'a, E: ColumnValueEncoder> GenericColumnWriter<'a, E>

Source

pub fn new( descr: ColumnDescPtr, props: WriterPropertiesPtr, page_writer: Box<dyn PageWriter + 'a>, ) -> Self

Returns a new instance of GenericColumnWriter.

Source

pub(crate) fn write_batch_internal( &mut self, values: &E::Values, value_indices: Option<&[usize]>, def_levels: Option<&[i16]>, rep_levels: Option<&[i16]>, min: Option<&E::T>, max: Option<&E::T>, distinct_count: Option<u64>, ) -> Result<usize>

Source

pub fn write_batch( &mut self, values: &E::Values, def_levels: Option<&[i16]>, rep_levels: Option<&[i16]>, ) -> Result<usize>

Writes batch of values, definition levels and repetition levels. Returns number of values processed (written).

If definition and repetition levels are provided, we write fully those levels and select how many values to write (this number will be returned), since number of actual written values may be smaller than provided values.

If only values are provided, then all values are written and the length of of the values buffer is returned.

Definition and/or repetition levels can be omitted, if values are non-nullable and/or non-repeated.

Source

pub fn write_batch_with_statistics( &mut self, values: &E::Values, def_levels: Option<&[i16]>, rep_levels: Option<&[i16]>, min: Option<&E::T>, max: Option<&E::T>, distinct_count: Option<u64>, ) -> Result<usize>

Writer may optionally provide pre-calculated statistics for use when computing chunk-level statistics

NB: WriterProperties::statistics_enabled must be set to EnabledStatistics::Chunk for these statistics to take effect. If EnabledStatistics::None they will be ignored, and if EnabledStatistics::Page the chunk statistics will instead be computed from the computed page statistics

Source

pub(crate) fn memory_size(&self) -> usize

Returns the estimated total memory usage.

Unlike Self::get_estimated_total_bytes this is an estimate of the current memory usage and not the final anticipated encoded size.

Source

pub fn get_total_bytes_written(&self) -> u64

Returns total number of bytes written by this column writer so far. This value is also returned when column writer is closed.

Note: this value does not include any buffered data that has not yet been flushed to a page.

Source

pub(crate) fn get_estimated_total_bytes(&self) -> u64

Returns the estimated total encoded bytes for this column writer.

Unlike Self::get_total_bytes_written this includes an estimate of any data that has not yet been flushed to a page, based on it’s anticipated encoded size.

Source

pub fn get_total_rows_written(&self) -> u64

Returns total number of rows written by this column writer so far. This value is also returned when column writer is closed.

Source

pub fn get_descriptor(&self) -> &ColumnDescPtr

Returns a reference to a ColumnDescPtr

Source

pub fn close(self) -> Result<ColumnCloseResult>

Finalizes writes and closes the column writer. Returns total bytes written, total rows written and column chunk metadata.

Source

fn write_mini_batch( &mut self, values: &E::Values, values_offset: usize, value_indices: Option<&[usize]>, num_levels: usize, def_levels: Option<&[i16]>, rep_levels: Option<&[i16]>, ) -> Result<usize>

Writes mini batch of values, definition and repetition levels. This allows fine-grained processing of values and maintaining a reasonable page size.

Source

fn should_dict_fallback(&self) -> bool

Returns true if we need to fall back to non-dictionary encoding.

We can only fall back if dictionary encoder is set and we have exceeded dictionary size.

Source

fn should_add_data_page(&self) -> bool

Returns true if there is enough data for a data page, false otherwise.

Source

fn dict_fallback(&mut self) -> Result<()>

Performs dictionary fallback. Prepares and writes dictionary and all data pages into page writer.

Source

fn update_column_offset_index( &mut self, page_statistics: Option<&ValueStatistics<E::T>>, page_variable_length_bytes: Option<i64>, )

Update the column index and offset index when adding the data page

Source

fn can_truncate_value(&self) -> bool

Determine if we should allow truncating min/max values for this column’s statistics

Source

fn is_utf8(&self) -> bool

Returns true if this column’s logical type is a UTF-8 string.

Source

fn truncate_min_value( &self, truncation_length: Option<usize>, data: &[u8], ) -> (Vec<u8>, bool)

Truncates a binary statistic to at most truncation_length bytes.

If truncation is not possible, returns data.

The bool in the returned tuple indicates whether truncation occurred or not.

UTF-8 Note: If the column type indicates UTF-8, and data contains valid UTF-8, then the result will also remain valid UTF-8, but may be less tnan truncation_length bytes to avoid splitting on non-character boundaries.

Source

fn truncate_max_value( &self, truncation_length: Option<usize>, data: &[u8], ) -> (Vec<u8>, bool)

Truncates a binary statistic to at most truncation_length bytes, and then increment the final byte(s) to yield a valid upper bound. This may result in a result of less than truncation_length bytes if the last byte(s) overflows.

If truncation is not possible, returns data.

The bool in the returned tuple indicates whether truncation occurred or not.

UTF-8 Note: If the column type indicates UTF-8, and data contains valid UTF-8, then the result will also remain valid UTF-8 (but again may be less than truncation_length bytes). If data does not contain valid UTF-8, then truncation will occur as if the column is non-string binary.

Source

fn add_data_page(&mut self) -> Result<()>

Adds data page. Data page is either buffered in case of dictionary encoding or written directly.

Source

fn flush_data_pages(&mut self) -> Result<()>

Finalises any outstanding data pages and flushes buffered data pages from dictionary encoding into underlying sink.

Source

fn build_column_metadata(&mut self) -> Result<ColumnChunkMetaData>

Assembles column chunk metadata.

Source

fn encode_levels_v1( &self, encoding: Encoding, levels: &[i16], max_level: i16, ) -> Vec<u8>

Encodes definition or repetition levels for Data Page v1.

Source

fn encode_levels_v2(&self, levels: &[i16], max_level: i16) -> Vec<u8>

Encodes definition or repetition levels for Data Page v2. Encoding is always RLE.

Source

fn write_data_page(&mut self, page: CompressedPage) -> Result<()>

Writes compressed data page into underlying sink and updates global metrics.

Source

fn write_dictionary_page(&mut self) -> Result<()>

Writes dictionary page into underlying sink.

Source

fn update_metrics_for_page(&mut self, page_spec: PageWriteSpec)

Updates column writer metrics with each page metadata.

Auto Trait Implementations§

§

impl<'a, E> Freeze for GenericColumnWriter<'a, E>
where E: Freeze, <E as ColumnValueEncoder>::T: Freeze,

§

impl<'a, E> !RefUnwindSafe for GenericColumnWriter<'a, E>

§

impl<'a, E> Send for GenericColumnWriter<'a, E>
where E: Send,

§

impl<'a, E> !Sync for GenericColumnWriter<'a, E>

§

impl<'a, E> Unpin for GenericColumnWriter<'a, E>
where E: Unpin, <E as ColumnValueEncoder>::T: Unpin,

§

impl<'a, E> !UnwindSafe for GenericColumnWriter<'a, E>

Blanket Implementations§

Source§

impl<T> Any for T
where T: 'static + ?Sized,

Source§

fn type_id(&self) -> TypeId

Gets the TypeId of self. Read more
Source§

impl<T> Borrow<T> for T
where T: ?Sized,

Source§

fn borrow(&self) -> &T

Immutably borrows from an owned value. Read more
Source§

impl<T> BorrowMut<T> for T
where T: ?Sized,

Source§

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value. Read more
Source§

impl<T> From<T> for T

Source§

fn from(t: T) -> T

Returns the argument unchanged.

Source§

impl<T, U> Into<U> for T
where U: From<T>,

Source§

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

Source§

impl<T> IntoEither for T

Source§

fn into_either(self, into_left: bool) -> Either<Self, Self>

Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise. Read more
Source§

fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
where F: FnOnce(&Self) -> bool,

Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise. Read more
Source§

impl<T, U> TryFrom<U> for T
where U: Into<T>,

Source§

type Error = Infallible

The type returned in the event of a conversion error.
Source§

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.
Source§

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

Source§

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.
Source§

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.
§

impl<T> ErasedDestructor for T
where T: 'static,

§

impl<T> MaybeSendSync for T