pub struct ParquetMetaDataPushDecoder {
state: DecodeState,
column_index_policy: PageIndexPolicy,
offset_index_policy: PageIndexPolicy,
buffers: PushBuffers,
metadata_parser: MetadataParser,
}
A push decoder for ParquetMetaData.
This structure implements a push API for decoding Parquet metadata, which decouples IO from the metadata decoding logic (sometimes referred to as Sans-IO).
See ParquetMetaDataReader for a pull-based API that incorporates IO and
is simpler to use for basic use cases. This decoder is best for customizing
your IO operations to minimize bytes read, prefetch data, or use async IO.
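To make the push (Sans-IO) idea concrete, here is a minimal toy sketch. It is not the parquet API: `ToyPushDecoder` and `ToyResult` are hypothetical names invented for illustration, and the "decoding" is simply returning the last few bytes of a file. The point is that the decoder never performs IO; it only reports which byte range it still needs and consumes whatever the caller pushes.

```rust
use std::ops::Range;

/// Toy result type (hypothetical), loosely mirroring the shape of a push decoder's output.
#[derive(Debug, PartialEq)]
enum ToyResult {
    /// The caller must fetch these ranges and push them.
    NeedsData(Vec<Range<u64>>),
    /// Decoding succeeded.
    Data(Vec<u8>),
}

/// Toy Sans-IO decoder (hypothetical): it wants the last `want` bytes of a file.
struct ToyPushDecoder {
    needed: Range<u64>,
    buffered: Vec<(Range<u64>, Vec<u8>)>,
}

impl ToyPushDecoder {
    fn new(file_len: u64, want: u64) -> Self {
        Self {
            needed: file_len.saturating_sub(want)..file_len,
            buffered: Vec::new(),
        }
    }

    /// Store bytes supplied by the caller. No decoding happens here.
    fn push_range(&mut self, range: Range<u64>, bytes: Vec<u8>) {
        self.buffered.push((range, bytes));
    }

    /// Produce the output if a pushed range covers what is needed,
    /// otherwise report the range the caller has to fetch.
    fn try_decode(&self) -> ToyResult {
        for (range, bytes) in &self.buffered {
            if range.start <= self.needed.start && self.needed.end <= range.end {
                let lo = (self.needed.start - range.start) as usize;
                let hi = (self.needed.end - range.start) as usize;
                // `get` avoids a panic if the buffer is shorter than its range claims.
                if let Some(slice) = bytes.get(lo..hi) {
                    return ToyResult::Data(slice.to_vec());
                }
            }
        }
        ToyResult::NeedsData(vec![self.needed.clone()])
    }
}
```

Because all IO stays with the caller, the same toy decoder could be driven from a file, an in-memory buffer, or an async source without changes.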
§Example
The most basic usage is to feed the decoder the byte ranges it requests, as shown below. This minimizes the number of bytes read but requires the most IO operations: one to read the footer, one to read the metadata, and possibly more if page indexes are requested.
// The `ParquetMetaDataPushDecoder` needs to know the file length.
let mut decoder = ParquetMetaDataPushDecoder::try_new(file_len).unwrap();
// try to decode the metadata. If more data is needed, the decoder will tell you what ranges
loop {
    match decoder.try_decode() {
        Ok(DecodeResult::Data(metadata)) => { return Ok(metadata); } // decode successful
        Ok(DecodeResult::NeedsData(ranges)) => {
            // The decoder needs more data
            //
            // In this example, we call a function that returns the bytes for each given range.
            // In a real application, you would likely read the data from a file or network.
            let data = ranges.iter().map(|range| get_range(range)).collect();
            // Push the data into the decoder and try to decode again on the next iteration.
            decoder.push_ranges(ranges, data).unwrap();
        }
        Ok(DecodeResult::Finished) => { unreachable!("returned metadata in previous match arm") }
        Err(e) => return Err(e),
    }
}
§Example with “prefetching”
By default, the ParquetMetaDataPushDecoder will request only the exact byte
ranges it needs. This minimizes the number of bytes read; however, it
requires at least two IO operations to read the metadata: one to read the
footer and then one to read the metadata.
If the file has a “Page Index” (see Self::with_page_index_policy), three IO operations are required to read the metadata, as the page index is not part of the normal metadata footer.
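The two reads described above follow from the Parquet file layout, in which a file ends with a 4-byte little-endian metadata length followed by the 4-byte magic `PAR1`. The sketch below is not part of the parquet crate: `footer_range` and `metadata_range` are hypothetical helpers, and it assumes an unencrypted footer and ignores the page index entirely.

```rust
use std::ops::Range;

/// Size of the footer tail: 4-byte metadata length + 4-byte magic.
const FOOTER_SIZE: u64 = 8;
/// Magic bytes of an unencrypted Parquet footer.
const MAGIC: &[u8] = b"PAR1";

/// First read: the last 8 bytes of the file.
/// Returns `None` if the file is too short to contain a footer.
fn footer_range(file_len: u64) -> Option<Range<u64>> {
    let start = file_len.checked_sub(FOOTER_SIZE)?;
    Some(start..file_len)
}

/// Second read: the encoded metadata that sits directly before the footer tail.
/// Returns `None` on a bad magic or a length that does not fit in the file.
fn metadata_range(file_len: u64, footer: &[u8; 8]) -> Option<Range<u64>> {
    if &footer[4..8] != MAGIC {
        return None;
    }
    let len = u32::from_le_bytes([footer[0], footer[1], footer[2], footer[3]]) as u64;
    let end = file_len.checked_sub(FOOTER_SIZE)?;
    let start = end.checked_sub(len)?;
    Some(start..end)
}
```

The second range cannot be computed until the first read has completed, which is why at least two IO operations are needed unless the data is prefetched.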
To reduce the number of IO operations in systems with high per-operation
overhead (e.g. cloud storage), you can “prefetch” the data and then push
the data into the decoder before calling Self::try_decode. If you do
not push enough bytes, the decoder will return the ranges that are still
needed.
This approach can also be used when you have the entire file already in memory for other reasons.
let file_len = file_bytes.len() as u64;
// For this example, we "prefetch" all the bytes which we have in memory,
// but in a real application, you would likely read a chunk from the end
// for example 1MB.
let prefetched_bytes = file_bytes.clone();
let mut decoder = ParquetMetaDataPushDecoder::try_new(file_len).unwrap();
// push the prefetched bytes into the decoder
decoder.push_ranges(vec![0..file_len], vec![prefetched_bytes]).unwrap();
// The decoder will now be able to decode the metadata. Note in a real application,
// unless you can guarantee that the pushed data is enough to decode the metadata,
// you still need to call `try_decode` in a loop until it returns `DecodeResult::Data`
// as shown in the previous example
match decoder.try_decode() {
    Ok(DecodeResult::Data(metadata)) => { return Ok(metadata); } // decode successful
    other => { panic!("expected DecodeResult::Data, got: {other:?}") }
}
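When only a chunk from the end of the file is prefetched, as the comments in the example suggest, the pushed range has to describe where that chunk sits in the file. A minimal sketch of that arithmetic follows; `suffix_range` is a hypothetical helper, not part of the parquet crate.

```rust
use std::ops::Range;

/// Byte range covering the last `prefetch_size` bytes of a file,
/// clamped to the whole file when the file is smaller than that.
fn suffix_range(file_len: u64, prefetch_size: u64) -> Range<u64> {
    file_len.saturating_sub(prefetch_size)..file_len
}
```

The bytes read for this range would then be pushed together with the range itself, in place of `0..file_len` in the example above.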
§Example using AsyncRead
ParquetMetaDataPushDecoder is designed to work with any data source that can
provide byte ranges, including async IO sources. However, it does not
implement async IO itself. To use async IO, you simply write an async
wrapper around it that reads the required byte ranges and pushes them into the
decoder.
use tokio::io::{AsyncRead, AsyncReadExt, AsyncSeek, AsyncSeekExt};
// This function decodes Parquet Metadata from anything that implements
// [`AsyncRead`] and [`AsyncSeek`] such as a tokio::fs::File
async fn decode_metadata(
    file_len: u64,
    mut async_source: impl AsyncRead + AsyncSeek + Unpin
) -> Result<ParquetMetaData, ParquetError> {
    // We need a ParquetMetaDataPushDecoder to decode the metadata.
    let mut decoder = ParquetMetaDataPushDecoder::try_new(file_len).unwrap();
    loop {
        match decoder.try_decode() {
            Ok(DecodeResult::Data(metadata)) => { return Ok(metadata); } // decode successful
            Ok(DecodeResult::NeedsData(ranges)) => {
                // The decoder needs more data
                //
                // In this example we use the AsyncRead and AsyncSeek traits to read the
                // required ranges from the async source.
                let mut data = Vec::with_capacity(ranges.len());
                for range in &ranges {
                    let mut buffer = vec![0; (range.end - range.start) as usize];
                    async_source.seek(std::io::SeekFrom::Start(range.start)).await?;
                    async_source.read_exact(&mut buffer).await?;
                    data.push(Bytes::from(buffer));
                }
                // Push the data into the decoder and try to decode again on the next iteration.
                decoder.push_ranges(ranges, data).unwrap();
            }
            Ok(DecodeResult::Finished) => { unreachable!("returned metadata in previous match arm") }
            Err(e) => return Err(e),
        }
    }
}
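The range-reading step is not specific to async IO. As a sketch under the assumption that a blocking source is acceptable, the same seek-and-read loop can be written against the standard library's `Read` and `Seek` traits; `read_ranges` is a hypothetical helper name and is not part of the parquet crate.

```rust
use std::io::{self, Read, Seek, SeekFrom};
use std::ops::Range;

/// Read each requested byte range from a blocking, seekable source.
/// Returns one buffer per range, in the same order as `ranges`.
fn read_ranges<R: Read + Seek>(
    source: &mut R,
    ranges: &[Range<u64>],
) -> io::Result<Vec<Vec<u8>>> {
    let mut data = Vec::with_capacity(ranges.len());
    for range in ranges {
        // An inverted range is treated as empty rather than underflowing.
        let len = range.end.saturating_sub(range.start) as usize;
        let mut buffer = vec![0u8; len];
        source.seek(SeekFrom::Start(range.start))?;
        source.read_exact(&mut buffer)?;
        data.push(buffer);
    }
    Ok(data)
}
```

Each returned buffer can be converted with `Bytes::from`, as in the async example, before being pushed into the decoder.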
Fields§
state: DecodeState
Decoding state
column_index_policy: PageIndexPolicy
policy for loading ColumnIndex (part of the PageIndex)
offset_index_policy: PageIndexPolicy
policy for loading OffsetIndex (part of the PageIndex)
buffers: PushBuffers
Underlying buffers
metadata_parser: MetadataParser
Encryption API
Implementations§
impl ParquetMetaDataPushDecoder
pub fn try_new(file_len: u64) -> Result<Self>
Create a new ParquetMetaDataPushDecoder with the given file length.
By default, this will read the page index structures (column indexes and offset indexes). See
ParquetMetaDataPushDecoder::with_page_index_policy for more detail.
See examples on ParquetMetaDataPushDecoder.
Begin decoding from the given footer tail.
pub fn try_new_with_metadata(
    file_len: u64,
    metadata: ParquetMetaData,
) -> Result<Self>
Create a decoder with the given ParquetMetaData already known.
This can be used to parse and populate the page index structures after the metadata has already been decoded.
pub fn with_page_index_policy(self, page_index_policy: PageIndexPolicy) -> Self
Enable or disable reading the page index structures described in “Parquet page index Layout to Support Page Skipping”.
Defaults to PageIndexPolicy::Optional.
This requires:
- The Parquet file to have been written with page indexes
- Additional data to be pushed into the decoder (as the page indexes are not part of the thrift footer)
pub fn with_column_index_policy(
    self,
    column_index_policy: PageIndexPolicy,
) -> Self
Set the policy for reading the ColumnIndex (part of the PageIndex)
pub fn with_offset_index_policy(
    self,
    offset_index_policy: PageIndexPolicy,
) -> Self
Set the policy for reading the OffsetIndex (part of the PageIndex)
pub(crate) fn with_file_decryption_properties(
    self,
    file_decryption_properties: Option<Arc<FileDecryptionProperties>>,
) -> Self
Provide decryption properties for decoding encrypted Parquet files
pub fn push_ranges(
    &mut self,
    ranges: Vec<Range<u64>>,
    buffers: Vec<Bytes>,
) -> Result<()>
Push the data into the decoder’s buffer.
The decoder does not immediately attempt to decode the metadata
after pushing data. Instead, it accumulates the pushed data until you
call Self::try_decode.
§Determining required data:
To determine what ranges are required to decode the metadata, you can either:
- Call Self::try_decode first to get the exact ranges required (see example on Self)
- Speculatively push any data that you have available, which may include more than the footer data or requested bytes.
Speculatively pushing data can be used when “prefetching” data. See
example on Self.
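Because pushed data may include more than the requested bytes, a caller with high per-request overhead might merge nearby requested ranges into fewer, larger reads before fetching. The sketch below shows one way to do that; `coalesce_ranges` and its `max_gap` parameter are hypothetical and not part of the parquet crate.

```rust
use std::ops::Range;

/// Merge ranges that overlap or are separated by at most `max_gap` bytes.
/// The result is sorted by start offset and covers every input range.
fn coalesce_ranges(mut ranges: Vec<Range<u64>>, max_gap: u64) -> Vec<Range<u64>> {
    ranges.sort_by_key(|r| r.start);
    let mut merged: Vec<Range<u64>> = Vec::with_capacity(ranges.len());
    for range in ranges {
        if let Some(last) = merged.last_mut() {
            // Close enough to the previous range: extend it instead of adding a new one.
            if range.start <= last.end.saturating_add(max_gap) {
                last.end = last.end.max(range.end);
                continue;
            }
        }
        merged.push(range);
    }
    merged
}
```

Each merged range would be fetched once and pushed as a single range and buffer. A larger `max_gap` trades extra bytes read for fewer IO operations.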
pub fn push_range(&mut self, range: Range<u64>, buffer: Bytes) -> Result<()>
Pushes a single range of data into the decoder’s buffer.
pub fn try_decode(&mut self) -> Result<DecodeResult<ParquetMetaData>>
Try to decode the metadata from the pushed data. Returns the decoded metadata if enough data is available, the byte ranges that are still needed if it is not (see the examples on Self), or an error if decoding fails.
Trait Implementations§
Auto Trait Implementations§
impl Freeze for ParquetMetaDataPushDecoder
impl !RefUnwindSafe for ParquetMetaDataPushDecoder
impl Send for ParquetMetaDataPushDecoder
impl Sync for ParquetMetaDataPushDecoder
impl Unpin for ParquetMetaDataPushDecoder
impl !UnwindSafe for ParquetMetaDataPushDecoder
Blanket Implementations§
impl<T> BorrowMut<T> for T
where
    T: ?Sized,
fn borrow_mut(&mut self) -> &mut T
impl<T> IntoEither for T
fn into_either(self, into_left: bool) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.