pub type ParquetPushDecoderBuilder = ArrowReaderBuilder<PushDecoderInput>;
A builder for ParquetPushDecoder.
To create a new decoder, use ParquetPushDecoderBuilder::try_new_decoder.
You can decode the metadata from a Parquet file using either
ParquetMetadataReader or ParquetMetaDataPushDecoder.
Note the “input” type is PushDecoderInput rather than an underlying reader.
It holds the internal buffers that track what data has been provided to the
decoder.
§Example
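The example below calls get_range, which is not defined on this page. A minimal sketch of such a helper, assuming the whole file is already held in memory as a byte slice: the explicit `file` parameter and the `fetch_ranges` wrapper are illustrative assumptions, and `Vec<u8>` is used only to keep the sketch dependency-free (the buffer type the decoder accepts is not shown here).

```rust
use std::ops::Range;

/// Hypothetical helper: copy one requested byte range out of a file that is
/// already fully loaded in memory. Panics if the range is out of bounds.
fn get_range(file: &[u8], range: &Range<u64>) -> Vec<u8> {
    let start = range.start as usize;
    let end = range.end as usize;
    file[start..end].to_vec()
}

/// Hypothetical wrapper: fetch every requested range, preserving the order
/// of `ranges` so the data lines up with the ranges one to one.
fn fetch_ranges(file: &[u8], ranges: &[Range<u64>]) -> Vec<Vec<u8>> {
    ranges.iter().map(|r| get_range(file, r)).collect()
}
```

In a real deployment the same shape would typically wrap a file read or an object-store range request instead of a slice copy.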
// The Parquet metadata is required to create the decoder
let mut decoder =
ParquetPushDecoderBuilder::try_new_decoder(parquet_metadata)
.unwrap()
// Optionally configure the decoder, e.g. batch size
.with_batch_size(1024)
// Build the decoder
.build()
.unwrap();
// In a loop, ask the decoder what it needs next, and provide it with the required data
loop {
match decoder.try_decode().unwrap() {
DecodeResult::NeedsData(ranges) => {
// The decoder needs more data. Fetch the data for the given ranges
let data = ranges.iter().map(|r| get_range(r)).collect::<Vec<_>>();
// Push the data to the decoder
decoder.push_ranges(ranges, data).unwrap();
// After pushing the data, we can try to decode again on the next iteration
}
DecodeResult::Data(batch) => {
// Successfully decoded a batch of data
assert!(batch.num_rows() > 0);
}
DecodeResult::Finished => {
// The decoder has finished decoding; exit the loop
break;
}
}
}
§Adaptive scans
The scan strategy is not fixed once build is called: it
can be changed while decoding, at row-group boundaries.
The important API for this is ParquetPushDecoder::try_next_reader.
Unlike try_decode, which barrels
straight through row-group boundaries, try_next_reader returns once per
row group — leaving a clean window between row groups. At any such
boundary, ParquetPushDecoder::into_builder hands back a
ParquetPushDecoderBuilder for the row groups not yet decoded. Change any
option on it (projection, row filter, row selection policy, …) and
build a fresh decoder that resumes from the next row
group. This is how a query engine promotes or demotes filters — for
example turning a row filter on or off — based on the selectivity observed
in the row groups decoded so far.
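The promote/demote decision described above is left to the caller. One way it could be sketched, assuming the engine counts rows entering and surviving the filter per row group: `FilterStats`, `keep_row_filter`, and the threshold are hypothetical names for illustration, not part of the parquet crate.

```rust
/// Running counts of rows seen by a filter in the row groups decoded so far.
#[derive(Debug, Default, Clone, Copy)]
struct FilterStats {
    rows_in: u64,
    rows_kept: u64,
}

impl FilterStats {
    /// Add the counts observed for one finished row group.
    fn record(&mut self, rows_in: u64, rows_kept: u64) {
        self.rows_in += rows_in;
        self.rows_kept += rows_kept;
    }

    /// Fraction of rows that passed the filter; `None` until rows were seen.
    fn selectivity(&self) -> Option<f64> {
        if self.rows_in == 0 {
            None
        } else {
            Some(self.rows_kept as f64 / self.rows_in as f64)
        }
    }
}

/// Hypothetical policy: keep the row filter only while it discards enough
/// rows, i.e. while the observed selectivity is at most `max_selectivity`.
fn keep_row_filter(stats: &FilterStats, max_selectivity: f64) -> bool {
    match stats.selectivity() {
        Some(s) => s <= max_selectivity,
        // No evidence yet: leave the current strategy unchanged.
        None => true,
    }
}
```

At a row-group boundary, the result of such a check would decide whether the rebuilt decoder is configured with or without the row filter.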
let mut decoder = ParquetPushDecoderBuilder::try_new_decoder(parquet_metadata)
.unwrap()
.build()
.unwrap();
// Drive the decoder one row group at a time with `try_next_reader`.
loop {
match decoder.try_next_reader().unwrap() {
DecodeResult::NeedsData(ranges) => {
// Fetch and hand over the bytes the decoder asked for.
let data = ranges.iter().map(|r| get_range(r)).collect();
decoder.push_ranges(ranges, data).unwrap();
}
DecodeResult::Data(reader) => {
// Decode this row group's batches.
for batch in reader {
assert!(batch.unwrap().num_rows() > 0);
}
// We are now at a row-group boundary. Based on whatever stats
// were gathered, optionally change strategy for the row groups
// still to come: drop or promote a row filter, narrow or widen
// the projection, etc.
if decoder.is_at_row_group_boundary() && decoder.row_groups_remaining() > 0 {
let builder = decoder.into_builder().unwrap();
// e.g. column "b" turned out not to be needed.
let projection = ProjectionMask::columns(builder.parquet_schema(), ["a"]);
decoder = builder.with_projection(projection).build().unwrap();
}
}
DecodeResult::Finished => break,
}
}
Aliased Type§
pub struct ParquetPushDecoderBuilder {
pub(crate) input: PushDecoderInput,
pub(crate) metadata: Arc<ParquetMetaData>,
pub(crate) schema: Arc<Schema>,
pub(crate) fields: Option<Arc<ParquetField>>,
pub(crate) batch_size: usize,
pub(crate) row_groups: Option<Vec<usize>>,
pub(crate) projection: ProjectionMask,
pub(crate) filter: Option<RowFilter>,
pub(crate) selection: Option<RowSelection>,
pub(crate) row_selection_policy: RowSelectionPolicy,
pub(crate) limit: Option<usize>,
pub(crate) offset: Option<usize>,
pub(crate) metrics: ArrowReaderMetrics,
pub(crate) max_predicate_cache_size: usize,
}
Fields§
§input: PushDecoderInput
The “input” to read parquet data from.
Note in the case of the ParquetPushDecoderBuilder there is no
underlying reader; the input is instead PushDecoderInput, the buffer that
caller-pushed bytes accumulate in.
§metadata: Arc<ParquetMetaData>
§schema: Arc<Schema>
§fields: Option<Arc<ParquetField>>
§batch_size: usize
§row_groups: Option<Vec<usize>>
§projection: ProjectionMask
§filter: Option<RowFilter>
§selection: Option<RowSelection>
§row_selection_policy: RowSelectionPolicy
§limit: Option<usize>
§offset: Option<usize>
§metrics: ArrowReaderMetrics
§max_predicate_cache_size: usize
Implementations§
impl ParquetPushDecoderBuilder
Methods for building a ParquetPushDecoder. See the base ArrowReaderBuilder for
more options that can be configured.
pub fn try_new_decoder(
parquet_metadata: Arc<ParquetMetaData>,
) -> Result<Self, ParquetError>
Create a new ParquetPushDecoderBuilder for configuring a Parquet decoder for the given file.
See ParquetMetaDataPushDecoder for a decoder that can read the metadata from a Parquet file.
See the example on ParquetPushDecoderBuilder.
pub fn try_new_decoder_with_options(
parquet_metadata: Arc<ParquetMetaData>,
arrow_reader_options: ArrowReaderOptions,
) -> Result<Self, ParquetError>
Create a new ParquetPushDecoderBuilder for configuring a Parquet decoder for the given file
with the given reader options.
This is similar to Self::try_new_decoder but allows configuring
options such as the Arrow schema.
pub fn new_with_metadata(arrow_reader_metadata: ArrowReaderMetadata) -> Self
Create a new ParquetPushDecoderBuilder given ArrowReaderMetadata.
See ArrowReaderMetadata::try_new for how to create the metadata from
the Parquet metadata and reader options.
pub fn with_buffers(self, buffers: PushBuffers) -> Self
Provide a preexisting PushBuffers for the built decoder to read
from, so bytes already fetched are not requested again.
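How PushBuffers tracks its contents is not described on this page. As a rough illustration of the idea only (skipping requests for bytes that are already held), here is a toy model; `FetchedRanges` and its methods are hypothetical stand-ins and do not reflect the real PushBuffers API or implementation.

```rust
use std::ops::Range;

/// Toy stand-in for a set of already-fetched byte ranges.
/// This is NOT the real `PushBuffers`; it only models the bookkeeping idea.
#[derive(Default)]
struct FetchedRanges {
    ranges: Vec<Range<u64>>,
}

impl FetchedRanges {
    /// Remember that the bytes in `range` have been provided.
    fn push(&mut self, range: Range<u64>) {
        self.ranges.push(range);
    }

    /// True if a single stored range fully covers `wanted`.
    /// (A simplification: adjacent stored ranges are not merged.)
    fn covers(&self, wanted: &Range<u64>) -> bool {
        self.ranges
            .iter()
            .any(|r| r.start <= wanted.start && wanted.end <= r.end)
    }

    /// The subset of `wanted` that still has to be fetched.
    fn missing(&self, wanted: &[Range<u64>]) -> Vec<Range<u64>> {
        wanted.iter().filter(|w| !self.covers(w)).cloned().collect()
    }
}
```

Carrying such state from one decoder to the next is what lets a rebuilt decoder avoid asking for the same bytes twice.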
pub fn build(self) -> Result<ParquetPushDecoder, ParquetError>
Create a ParquetPushDecoder with the configured options.