ParquetMetaDataPushDecoder

Struct ParquetMetaDataPushDecoder 

pub struct ParquetMetaDataPushDecoder {
    state: DecodeState,
    column_index_policy: PageIndexPolicy,
    offset_index_policy: PageIndexPolicy,
    buffers: PushBuffers,
    metadata_parser: MetadataParser,
}

A push decoder for ParquetMetaData.

This structure implements a push API for decoding Parquet metadata, which decouples IO from the metadata decoding logic (sometimes referred to as Sans-IO).

See ParquetMetaDataReader for a pull-based API that incorporates IO and is simpler to use for basic use cases. This decoder is best for customizing your IO operations to minimize bytes read, prefetch data, or use async IO.

§Example

The most basic usage is to feed the decoder the byte ranges it requests, as shown below. This minimizes the number of bytes read but requires the most IO operations: one to read the footer, one to read the metadata, and possibly more if page indexes are requested.

// The `ParquetMetaDataPushDecoder` needs to know the file length.
let mut decoder = ParquetMetaDataPushDecoder::try_new(file_len).unwrap();
// Try to decode the metadata. If more data is needed, the decoder returns the byte ranges it requires.
loop {
    match decoder.try_decode() {
       Ok(DecodeResult::Data(metadata)) => { return Ok(metadata); } // decode successful
       Ok(DecodeResult::NeedsData(ranges)) => {
          // The decoder needs more data
          //
          // In this example, we call a function that returns the bytes for each given range.
          // In a real application, you would likely read the data from a file or network.
          let data = ranges.iter().map(|range| get_range(range)).collect();
          // Push the data into the decoder and try to decode again on the next iteration.
          decoder.push_ranges(ranges, data).unwrap();
       }
       Ok(DecodeResult::Finished) => { unreachable!("returned metadata in previous match arm") }
       Err(e) => return Err(e),
    }
}
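
The `get_range` function used above is not defined in the example. A minimal sketch of such a helper, assuming the whole file is already held in memory as a byte slice, could look like the following. The two-argument signature and the `Vec<u8>` return type are assumptions made to keep the sketch free of external crates; the example above passes only the range and pushes `Bytes` into the decoder.

```rust
use std::ops::Range;

/// Hypothetical stand-in for the `get_range` helper used in the example.
/// Copies the requested byte range out of a file that is already in memory.
/// Returns `None` if the range is inverted or extends past the end of the data.
fn get_range(file_bytes: &[u8], range: &Range<u64>) -> Option<Vec<u8>> {
    // Convert the u64 file offsets to usize indices, failing on overflow.
    let start = usize::try_from(range.start).ok()?;
    let end = usize::try_from(range.end).ok()?;
    // `get` performs the bounds check instead of panicking.
    file_bytes.get(start..end).map(|slice| slice.to_vec())
}
```

In a real application the returned buffer would be converted into `Bytes` (for example with `Bytes::from`) before being handed to `push_ranges`.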

§Example with “prefetching”

By default, the ParquetMetaDataPushDecoder requests only the exact byte ranges it needs. This minimizes the number of bytes read, but it requires at least two IO operations to read the metadata: one for the footer and one for the metadata itself.

If the file has a “Page Index” (see Self::with_page_index_policy), three IO operations are required to read the metadata, as the page index is not part of the normal metadata footer.
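
The reason for the two reads follows from the Parquet file layout: the file ends with a fixed 8-byte tail (a 4-byte little-endian metadata length followed by the magic `PAR1`), and the position of the metadata is only known once that tail has been read. The sketch below illustrates the arithmetic. The function names are hypothetical and this is not the decoder's internal code; it also ignores files with an encrypted footer, which end in a different magic (`PARE`).

```rust
use std::ops::Range;

/// Size of the fixed-length tail: 4-byte metadata length + 4-byte magic.
const FOOTER_SIZE: u64 = 8;

/// Byte range of the fixed-size footer tail (the first IO operation).
/// Returns `None` if the file is too short to contain a footer.
fn footer_range(file_len: u64) -> Option<Range<u64>> {
    let start = file_len.checked_sub(FOOTER_SIZE)?;
    Some(start..file_len)
}

/// Byte range of the encoded metadata (the second IO operation), computed
/// from the 8 footer bytes. Returns `None` on a bad magic or impossible length.
fn metadata_range(file_len: u64, footer: &[u8; 8]) -> Option<Range<u64>> {
    // The last four bytes must be the Parquet magic.
    if footer[4..] != b"PAR1"[..] {
        return None;
    }
    // The first four bytes hold the metadata length, little-endian.
    let len = u32::from_le_bytes([footer[0], footer[1], footer[2], footer[3]]) as u64;
    // The metadata sits directly before the footer tail.
    let end = file_len.checked_sub(FOOTER_SIZE)?;
    let start = end.checked_sub(len)?;
    Some(start..end)
}
```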

To reduce the number of IO operations in systems with high per-operation overhead (e.g. cloud storage), you can “prefetch” the data and push it into the decoder before calling Self::try_decode. If you do not push enough bytes, the decoder will return the ranges that are still needed.

This approach can also be used when you have the entire file already in memory for other reasons.

let file_len = file_bytes.len() as u64;
// For this example, we "prefetch" all the bytes which we have in memory,
// but in a real application, you would likely read a chunk from the end
// for example 1MB.
let prefetched_bytes = file_bytes.clone();
let mut decoder = ParquetMetaDataPushDecoder::try_new(file_len).unwrap();
// push the prefetched bytes into the decoder
decoder.push_ranges(vec![0..file_len], vec![prefetched_bytes]).unwrap();
// The decoder will now be able to decode the metadata. Note in a real application,
// unless you can guarantee that the pushed data is enough to decode the metadata,
// you still need to call `try_decode` in a loop until it returns `DecodeResult::Data`
// as shown in the previous example
match decoder.try_decode() {
    Ok(DecodeResult::Data(metadata)) => { return Ok(metadata); } // decode successful
    other => { panic!("expected DecodeResult::Data, got: {other:?}") }
}
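
When only a chunk from the end of the file is prefetched (the comments above suggest 1MB), the range to fetch can be computed as in this small sketch. The helper name `suffix_range` is hypothetical; the only assumption is that the prefetch should never start before the beginning of the file.

```rust
use std::ops::Range;

/// Hypothetical helper: the byte range covering the last `prefetch_size`
/// bytes of a file, clamped to the start of the file for small files.
fn suffix_range(file_len: u64, prefetch_size: u64) -> Range<u64> {
    // `saturating_sub` yields 0 when the file is smaller than the prefetch.
    file_len.saturating_sub(prefetch_size)..file_len
}
```

The bytes read for this range would then be pushed together with the range itself, followed by the `try_decode` loop from the first example in case the prefetch did not cover everything.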

§Example using AsyncRead

ParquetMetaDataPushDecoder is designed to work with any data source that can provide byte ranges, including async IO sources. However, it does not implement async IO itself. To use async IO, you simply write an async wrapper around it that reads the required byte ranges and pushes them into the decoder.

use tokio::io::{AsyncRead, AsyncReadExt, AsyncSeek, AsyncSeekExt};
// This function decodes Parquet Metadata from anything that implements
// [`AsyncRead`] and [`AsyncSeek`] such as a tokio::fs::File
async fn decode_metadata(
  file_len: u64,
  mut async_source: impl AsyncRead + AsyncSeek + Unpin
) -> Result<ParquetMetaData, ParquetError> {
  // We need a ParquetMetaDataPushDecoder to decode the metadata.
  let mut decoder = ParquetMetaDataPushDecoder::try_new(file_len).unwrap();
  loop {
    match decoder.try_decode() {
       Ok(DecodeResult::Data(metadata)) => { return Ok(metadata); } // decode successful
       Ok(DecodeResult::NeedsData(ranges)) => {
          // The decoder needs more data
          //
          // In this example we use the AsyncRead and AsyncSeek traits to read the
          // required ranges from the async source.
          let mut data = Vec::with_capacity(ranges.len());
          for range in &ranges {
            let mut buffer = vec![0; (range.end - range.start) as usize];
            async_source.seek(std::io::SeekFrom::Start(range.start)).await?;
            async_source.read_exact(&mut buffer).await?;
            data.push(Bytes::from(buffer));
          }
          // Push the data into the decoder and try to decode again on the next iteration.
          decoder.push_ranges(ranges, data).unwrap();
       }
       Ok(DecodeResult::Finished) => { unreachable!("returned metadata in previous match arm") }
       Err(e) => return Err(e),
    }
  }
}
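
The same range-reading step can be written against blocking IO. The following sketch uses only the standard library's `Read` and `Seek` traits; the function name `read_ranges` is hypothetical, and it returns plain `Vec<u8>` buffers where the example above builds `Bytes`.

```rust
use std::io::{self, Read, Seek, SeekFrom};
use std::ops::Range;

/// Hypothetical blocking counterpart of the read loop above: reads each
/// requested byte range from any `Read + Seek` source, one seek per range.
fn read_ranges<R: Read + Seek>(
    source: &mut R,
    ranges: &[Range<u64>],
) -> io::Result<Vec<Vec<u8>>> {
    let mut data = Vec::with_capacity(ranges.len());
    for range in ranges {
        // Reject inverted ranges instead of underflowing.
        let len = range.end.checked_sub(range.start).ok_or_else(|| {
            io::Error::new(io::ErrorKind::InvalidInput, "inverted range")
        })?;
        let len = usize::try_from(len)
            .map_err(|e| io::Error::new(io::ErrorKind::InvalidInput, e))?;
        let mut buffer = vec![0u8; len];
        source.seek(SeekFrom::Start(range.start))?;
        // `read_exact` fails if the source ends before the range does.
        source.read_exact(&mut buffer)?;
        data.push(buffer);
    }
    Ok(data)
}
```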

Fields§

§state: DecodeState

Decoding state

§column_index_policy: PageIndexPolicy

policy for loading ColumnIndex (part of the PageIndex)

§offset_index_policy: PageIndexPolicy

policy for loading OffsetIndex (part of the PageIndex)

§buffers: PushBuffers

Underlying buffers

§metadata_parser: MetadataParser

Metadata parser, which also holds the decryption configuration for encrypted files

Implementations§

impl ParquetMetaDataPushDecoder

pub fn try_new(file_len: u64) -> Result<Self>

Create a new ParquetMetaDataPushDecoder with the given file length.

By default, this will attempt to read both parts of the page index, the column index and the offset index. See ParquetMetaDataPushDecoder::with_page_index_policy for more detail.

See examples on ParquetMetaDataPushDecoder.

Begin decoding from the given footer tail.

pub fn try_new_with_metadata( file_len: u64, metadata: ParquetMetaData, ) -> Result<Self>

Create a decoder with the given ParquetMetaData already known.

This can be used to parse and populate the page index structures after the metadata has already been decoded.

pub fn with_page_index_policy(self, page_index_policy: PageIndexPolicy) -> Self

Enable or disable reading the page index structures described in “Parquet page index Layout to Support Page Skipping”.

Defaults to PageIndexPolicy::Optional

This requires

  1. The Parquet file to have been written with page indexes
  2. Additional data to be pushed into the decoder (as the page indexes are not part of the thrift footer)
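
As a rough mental model of how such a policy can interact with the first requirement, consider the sketch below. It uses a local enum, `IndexPolicy`, modelled on the idea of skip / optional / required policies; the enum, the `IndexAction` type, and the function are all hypothetical and are not the crate's `PageIndexPolicy` or its internal logic.

```rust
/// Hypothetical policy enum, defined locally for illustration only.
#[derive(Debug, Clone, Copy, PartialEq)]
enum IndexPolicy {
    Skip,
    Optional,
    Required,
}

/// What a reader would do about the page index under a given policy.
#[derive(Debug, PartialEq)]
enum IndexAction {
    DoNotLoad,
    Load,
    Fail,
}

/// Decide how to treat the page index, given the policy and whether the
/// file was written with one.
fn plan_index_load(policy: IndexPolicy, file_has_index: bool) -> IndexAction {
    match (policy, file_has_index) {
        // Skipping never requests the extra byte ranges.
        (IndexPolicy::Skip, _) => IndexAction::DoNotLoad,
        // Otherwise an existing index is loaded.
        (_, true) => IndexAction::Load,
        // A missing index is tolerated when optional...
        (IndexPolicy::Optional, false) => IndexAction::DoNotLoad,
        // ...and treated as an error when required.
        (IndexPolicy::Required, false) => IndexAction::Fail,
    }
}
```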

pub fn with_column_index_policy( self, column_index_policy: PageIndexPolicy, ) -> Self

Set the policy for reading the ColumnIndex (part of the PageIndex)

pub fn with_offset_index_policy( self, offset_index_policy: PageIndexPolicy, ) -> Self

Set the policy for reading the OffsetIndex (part of the PageIndex)

pub(crate) fn with_file_decryption_properties( self, file_decryption_properties: Option<Arc<FileDecryptionProperties>>, ) -> Self

Provide decryption properties for decoding encrypted Parquet files

pub fn push_ranges( &mut self, ranges: Vec<Range<u64>>, buffers: Vec<Bytes>, ) -> Result<()>

Push the data into the decoder’s buffer.

The decoder does not immediately attempt to decode the metadata after pushing data. Instead, it accumulates the pushed data until you call Self::try_decode.

§Determining required data:

To determine what ranges are required to decode the metadata, you can either:

  1. Call Self::try_decode first to get the exact ranges required (see example on Self)

  2. Speculatively push any data that you have available, which may include more than the footer data or requested bytes.

Speculatively pushing data can be used when “prefetching” data. See example on Self
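
To see why speculative pushes can save round trips, it helps to think in terms of range coverage: a later request is already satisfied if some pushed range contains it. The sketch below shows that check in isolation. The function names are hypothetical, and it is a simplification that only considers coverage by a single pushed range; it does not describe how the decoder's internal buffers are implemented.

```rust
use std::ops::Range;

/// True if some already-pushed range fully contains `wanted`.
fn is_covered(pushed: &[Range<u64>], wanted: &Range<u64>) -> bool {
    pushed
        .iter()
        .any(|p| p.start <= wanted.start && wanted.end <= p.end)
}

/// The subset of `wanted` ranges that would still have to be fetched.
fn missing_ranges(pushed: &[Range<u64>], wanted: &[Range<u64>]) -> Vec<Range<u64>> {
    wanted
        .iter()
        .filter(|w| !is_covered(pushed, w))
        .cloned()
        .collect()
}
```

For instance, after prefetching the last 100 bytes of a 1000-byte file, a request for the 8-byte footer needs no further IO, while a request reaching back to byte 800 still does.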

pub fn push_range(&mut self, range: Range<u64>, buffer: Bytes) -> Result<()>

Pushes a single range of data into the decoder’s buffer.

pub fn try_decode(&mut self) -> Result<DecodeResult<ParquetMetaData>>

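Try to decode the metadata from the pushed data. Returns DecodeResult::Data with the decoded metadata when enough data has been pushed, DecodeResult::NeedsData with the byte ranges still required when it has not, or an error if decoding fails.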
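
The push/try-decode contract can be illustrated with a toy model that has no Parquet logic at all. In the sketch below, every type and method is hypothetical: `ToyPushDecoder` "decodes" by simply returning the bytes of one fixed range once they have been pushed, which is enough to show that pushing only stores data and that `try_decode` either asks for ranges or yields a result.

```rust
use std::ops::Range;

/// Toy counterpart of a decode result: either more data is needed or
/// the "decoded" value (here just raw bytes) is available.
#[derive(Debug, PartialEq)]
enum ToyResult {
    NeedsData(Vec<Range<u64>>),
    Data(Vec<u8>),
}

/// Toy push decoder that needs exactly one byte range to finish.
struct ToyPushDecoder {
    needed: Range<u64>,
    buffers: Vec<(Range<u64>, Vec<u8>)>,
}

impl ToyPushDecoder {
    fn new(needed: Range<u64>) -> Self {
        Self { needed, buffers: Vec::new() }
    }

    /// Only stores the data; no decoding happens here.
    fn push_range(&mut self, range: Range<u64>, data: Vec<u8>) {
        self.buffers.push((range, data));
    }

    /// Returns the needed bytes if some pushed buffer covers them,
    /// otherwise reports which range is still missing.
    fn try_decode(&self) -> ToyResult {
        for (range, data) in &self.buffers {
            if range.start <= self.needed.start && self.needed.end <= range.end {
                // Translate file offsets into offsets within this buffer.
                let from = (self.needed.start - range.start) as usize;
                let to = (self.needed.end - range.start) as usize;
                if let Some(bytes) = data.get(from..to) {
                    return ToyResult::Data(bytes.to_vec());
                }
            }
        }
        ToyResult::NeedsData(vec![self.needed.clone()])
    }
}
```

Calling `try_decode` on a fresh `ToyPushDecoder` reports the missing range; after a covering buffer is pushed, the next call returns the data. No IO appears anywhere in the type, which is the point of the push design.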

fn get_bytes(&self, range: &Range<u64>) -> Result<Bytes>

Returns the bytes for the given range from the internal buffer

Trait Implementations§

impl Debug for ParquetMetaDataPushDecoder

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Formats the value using the given formatter.
