Skip to main content

ParquetPushDecoderBuilder

Type Alias ParquetPushDecoderBuilder 

Source
pub type ParquetPushDecoderBuilder = ArrowReaderBuilder<PushDecoderInput>;
Expand description

A builder for ParquetPushDecoder.

To create a new decoder, use ParquetPushDecoderBuilder::try_new_decoder.

You can decode the metadata from a Parquet file using either ParquetMetadataReader or ParquetMetaDataPushDecoder.

Note the “input” type is u64 which represents the length of the Parquet file being decoded. This is needed to initialize the internal buffers that track what data has been provided to the decoder.

§Example

// The file length and metadata are required to create the decoder
let mut decoder =
    ParquetPushDecoderBuilder::try_new_decoder(parquet_metadata)
      .unwrap()
      // Optionally configure the decoder, e.g. batch size
      .with_batch_size(1024)
      // Build the decoder
      .build()
      .unwrap();

    // In a loop, ask the decoder what it needs next, and provide it with the required data
    loop {
        match decoder.try_decode().unwrap() {
            DecodeResult::NeedsData(ranges) => {
                // The decoder needs more data. Fetch the data for the given ranges
                let data = ranges.iter().map(|r| get_range(r)).collect::<Vec<_>>();
                // Push the data to the decoder
                decoder.push_ranges(ranges, data).unwrap();
                // After pushing the data, we can try to decode again on the next iteration
            }
            DecodeResult::Data(batch) => {
                // Successfully decoded a batch of data
                assert!(batch.num_rows() > 0);
            }
            DecodeResult::Finished => {
                // The decoder has finished decoding exit the loop
                break;
            }
        }
    }

§Adaptive scans

The scan strategy is not fixed once build is called: it can be changed while decoding, at row-group boundaries.

The important API for this is ParquetPushDecoder::try_next_reader. Unlike try_decode, which barrels straight through row-group boundaries, try_next_reader returns once per row group — leaving a clean window between row groups. At any such boundary, ParquetPushDecoder::into_builder hands back a ParquetPushDecoderBuilder for the row groups not yet decoded. Change any option on it (projection, row filter, row selection policy, …) and build a fresh decoder that resumes from the next row group. This is how a query engine promotes or demotes filters — for example turning a row filter on or off — based on the selectivity observed in the row groups decoded so far.

let mut decoder = ParquetPushDecoderBuilder::try_new_decoder(parquet_metadata)
    .unwrap()
    .build()
    .unwrap();

// Drive the decoder one row group at a time with `try_next_reader`.
loop {
    match decoder.try_next_reader().unwrap() {
        DecodeResult::NeedsData(ranges) => {
            // Fetch and hand over the bytes the decoder asked for.
            let data = ranges.iter().map(|r| get_range(r)).collect();
            decoder.push_ranges(ranges, data).unwrap();
        }
        DecodeResult::Data(reader) => {
            // Decode this row group's batches.
            for batch in reader {
                assert!(batch.unwrap().num_rows() > 0);
            }
            // We are now at a row-group boundary. Based on whatever stats
            // were gathered, optionally change strategy for the row groups
            // still to come: drop or promote a row filter, narrow or widen
            // the projection, etc.
            if decoder.is_at_row_group_boundary() && decoder.row_groups_remaining() > 0 {
                let builder = decoder.into_builder().unwrap();
                // e.g. column "b" turned out not to be needed.
                let projection = ProjectionMask::columns(builder.parquet_schema(), ["a"]);
                decoder = builder.with_projection(projection).build().unwrap();
            }
        }
        DecodeResult::Finished => break,
    }
}

Aliased Type§

pub struct ParquetPushDecoderBuilder {
Show 14 fields pub(crate) input: PushDecoderInput, pub(crate) metadata: Arc<ParquetMetaData>, pub(crate) schema: Arc<Schema>, pub(crate) fields: Option<Arc<ParquetField>>, pub(crate) batch_size: usize, pub(crate) row_groups: Option<Vec<usize>>, pub(crate) projection: ProjectionMask, pub(crate) filter: Option<RowFilter>, pub(crate) selection: Option<RowSelection>, pub(crate) row_selection_policy: RowSelectionPolicy, pub(crate) limit: Option<usize>, pub(crate) offset: Option<usize>, pub(crate) metrics: ArrowReaderMetrics, pub(crate) max_predicate_cache_size: usize,
}

Fields§

§input: PushDecoderInput

The “input” to read parquet data from.

Note in the case of the ParquetPushDecoderBuilder there is no underlying reader; the input is instead PushDecoderInput, the buffer that caller-pushed bytes accumulate in.

§metadata: Arc<ParquetMetaData>§schema: Arc<Schema>§fields: Option<Arc<ParquetField>>§batch_size: usize§row_groups: Option<Vec<usize>>§projection: ProjectionMask§filter: Option<RowFilter>§selection: Option<RowSelection>§row_selection_policy: RowSelectionPolicy§limit: Option<usize>§offset: Option<usize>§metrics: ArrowReaderMetrics§max_predicate_cache_size: usize

Implementations§

Source§

impl ParquetPushDecoderBuilder

Methods for building a ParquetDecoder. See the base ArrowReaderBuilder for more options that can be configured.

Source

pub fn try_new_decoder( parquet_metadata: Arc<ParquetMetaData>, ) -> Result<Self, ParquetError>

Create a new ParquetDecoderBuilder for configuring a Parquet decoder for the given file.

See ParquetMetadataDecoder for a builder that can read the metadata from a Parquet file.

See example on ParquetPushDecoderBuilder

Source

pub fn try_new_decoder_with_options( parquet_metadata: Arc<ParquetMetaData>, arrow_reader_options: ArrowReaderOptions, ) -> Result<Self, ParquetError>

Create a new ParquetDecoderBuilder for configuring a Parquet decoder for the given file with the given reader options.

This is similar to Self::try_new_decoder but allows configuring options such as Arrow schema

Source

pub fn new_with_metadata(arrow_reader_metadata: ArrowReaderMetadata) -> Self

Create a new ParquetDecoderBuilder given ArrowReaderMetadata.

See ArrowReaderMetadata::try_new for how to create the metadata from the Parquet metadata and reader options.

Source

pub fn with_buffers(self, buffers: PushBuffers) -> Self

Provide a preexisting PushBuffers for the built decoder to read from, so bytes already fetched are not requested again.

Source

pub fn build(self) -> Result<ParquetPushDecoder, ParquetError>

Create a ParquetPushDecoder with the configured options