parquet::file::metadata

Struct ParquetMetaDataReader

pub struct ParquetMetaDataReader {
    metadata: Option<ParquetMetaData>,
    column_index: bool,
    offset_index: bool,
    prefetch_hint: Option<usize>,
    metadata_size: Option<usize>,
}

Reads the ParquetMetaData from a byte stream.

See crate::file::metadata::ParquetMetaDataWriter for a description of the Parquet metadata.

Parquet metadata is not necessarily contiguous in the file: part is stored in the footer (the last bytes of the file), but other portions (such as the PageIndex) can be stored elsewhere.

This reader handles the footer as well as the non-contiguous parts of the metadata, such as the page indexes. It does not read Bloom filters.

§Example

// read parquet metadata including page indexes from a file
let file = open_parquet_file("some_path.parquet");
let mut reader = ParquetMetaDataReader::new()
    .with_page_indexes(true);
reader.try_parse(&file).unwrap();
let metadata = reader.finish().unwrap();
assert!(metadata.column_index().is_some());
assert!(metadata.offset_index().is_some());

Fields§

§metadata: Option<ParquetMetaData>
§column_index: bool
§offset_index: bool
§prefetch_hint: Option<usize>
§metadata_size: Option<usize>

Implementations§

impl ParquetMetaDataReader

pub fn new() -> Self

Create a new ParquetMetaDataReader

pub fn new_with_metadata(metadata: ParquetMetaData) -> Self

Create a new ParquetMetaDataReader populated with a ParquetMetaData struct obtained via other means.

pub fn with_page_indexes(self, val: bool) -> Self

Enable or disable reading the page index structures described in “Parquet page index: Layout to Support Page Skipping”. Equivalent to: self.with_column_indexes(val).with_offset_indexes(val)
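The equivalence can be illustrated with a minimal stand-in builder. `ReaderOptions` below is a hypothetical type written for this sketch and is not part of the parquet crate; only the three method names mirror the real API.

```rust
/// Hypothetical stand-in for the two index flags held by the reader.
#[derive(Debug, Default, Clone, Copy, PartialEq)]
struct ReaderOptions {
    column_index: bool,
    offset_index: bool,
}

impl ReaderOptions {
    /// Toggle reading of the ColumnIndex only.
    fn with_column_indexes(mut self, val: bool) -> Self {
        self.column_index = val;
        self
    }

    /// Toggle reading of the OffsetIndex only.
    fn with_offset_indexes(mut self, val: bool) -> Self {
        self.offset_index = val;
        self
    }

    /// Toggle both flags at once by chaining the two setters above.
    fn with_page_indexes(self, val: bool) -> Self {
        self.with_column_indexes(val).with_offset_indexes(val)
    }
}
```

Because each setter consumes and returns `self`, a later call overrides an earlier one, so `with_page_indexes(true).with_offset_indexes(false)` in this sketch leaves only the column index enabled.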

pub fn with_column_indexes(self, val: bool) -> Self

Enable or disable reading the Parquet ColumnIndex structure.

pub fn with_offset_indexes(self, val: bool) -> Self

Enable or disable reading the Parquet OffsetIndex structure.

pub fn with_prefetch_hint(self, prefetch: Option<usize>) -> Self

Provide a hint as to the number of bytes needed to fully parse the ParquetMetaData. Only used for the asynchronous Self::try_load() method.

By default, the reader first fetches the last 8 bytes of the input file to obtain the size of the footer metadata. A second fetch then retrieves the footer metadata itself. After the footer metadata has been parsed, a third fetch retrieves the bytes needed to decode the page index structures, if they have been requested. Setting prefetch to an estimate of the number of bytes needed to fully decode the ParquetMetaData can reduce the number of fetch requests and therefore latency. Setting prefetch too small will not trigger an error, but will result in extra fetches being performed.
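The effect of the hint on the number of fetches can be modeled roughly as follows. This is a simplified sketch of the strategy described above, not the crate's implementation: the `Layout` type and `count_fetches` function are hypothetical, and the model assumes the page indexes are stored before the footer metadata and that `metadata_len + 8 <= file_size`.

```rust
/// Size in bytes of the fixed Parquet footer (4-byte length + 4-byte magic).
const FOOTER_SIZE: usize = 8;

/// Hypothetical description of where the metadata lives in a file.
struct Layout {
    /// Total size of the file in bytes.
    file_size: usize,
    /// Length of the encoded footer metadata, stored directly before the 8-byte footer.
    metadata_len: usize,
    /// Absolute offset at which the requested page indexes start, if any were requested.
    page_index_start: Option<usize>,
}

/// Count the fetches needed under the documented strategy (simplified model).
fn count_fetches(layout: &Layout, prefetch: Option<usize>) -> usize {
    // The first fetch always covers at least the 8-byte footer,
    // and never more than the whole file.
    let first = prefetch
        .unwrap_or(0)
        .max(FOOTER_SIZE)
        .min(layout.file_size);
    let fetched_from = layout.file_size - first;
    let mut fetches = 1;

    // A second fetch is needed if the first one did not cover the footer metadata.
    let metadata_start = layout.file_size - FOOTER_SIZE - layout.metadata_len;
    if metadata_start < fetched_from {
        fetches += 1;
    }

    // A further fetch is needed if page indexes were requested
    // and the first fetch did not reach back far enough to include them.
    if let Some(start) = layout.page_index_start {
        if start < fetched_from {
            fetches += 1;
        }
    }
    fetches
}
```

In this model a hint that is too small simply falls back to the three-fetch path, which matches the statement that an undersized hint costs extra fetches rather than causing an error.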

pub fn has_metadata(&self) -> bool

Indicates whether this reader has a ParquetMetaData internally.

pub fn finish(&mut self) -> Result<ParquetMetaData>

Return the parsed ParquetMetaData struct, leaving None in its place.
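"Leaving None in its place" is the behavior of `Option::take`. A minimal sketch of that pattern follows, using a hypothetical `Holder` type with a `String` standing in for `ParquetMetaData`. The error returned when nothing has been parsed is an assumption of this sketch, not the crate's actual message.

```rust
/// Hypothetical holder standing in for the reader;
/// `String` stands in for `ParquetMetaData`.
struct Holder {
    metadata: Option<String>,
}

impl Holder {
    /// Mirrors the idea of `has_metadata`: is a value currently stored?
    fn has_metadata(&self) -> bool {
        self.metadata.is_some()
    }

    /// Move the stored value out, leaving `None` behind.
    /// Returning an error when nothing is stored is an assumption of this sketch.
    fn finish(&mut self) -> Result<String, String> {
        self.metadata
            .take()
            .ok_or_else(|| "no metadata has been parsed".to_string())
    }
}
```

One consequence of this pattern is that a second `finish` call on the same holder finds nothing left to return.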

pub fn parse_and_finish<R: ChunkReader>( self, reader: &R, ) -> Result<ParquetMetaData>

Given a ChunkReader, parse and return the ParquetMetaData in a single pass.

If reader is Bytes-based, then the buffer must contain sufficient bytes to complete the request, and must include the Parquet footer. If page indexes are desired, the buffer must contain the entire file, or Self::try_parse_sized() should be used.

This call will consume self.

§Example
// read parquet metadata including page indexes
let file = open_parquet_file("some_path.parquet");
let metadata = ParquetMetaDataReader::new()
    .with_page_indexes(true)
    .parse_and_finish(&file).unwrap();

pub fn try_parse<R: ChunkReader>(&mut self, reader: &R) -> Result<()>

Attempts to parse the footer metadata (and optionally page indexes) given a ChunkReader.

If reader is Bytes-based, then the buffer must contain sufficient bytes to complete the request, and must include the Parquet footer. If page indexes are desired, the buffer must contain the entire file, or Self::try_parse_sized() should be used.

pub fn try_parse_sized<R: ChunkReader>( &mut self, reader: &R, file_size: usize, ) -> Result<()>

Same as Self::try_parse(), but also takes the original file size, for the case where reader is a Bytes struct that does not contain the entire file. The file size is required when page indexes are requested. reader must have access to the Parquet footer.

Using this function also allows for retrying with a larger buffer.

§Errors

This function will return ParquetError::NeedMoreData in the event reader does not provide enough data to fully parse the metadata (see example below). The returned error will be populated with a usize field indicating the number of bytes required from the tail of the file to completely parse the requested metadata.

Other errors returned include ParquetError::General and ParquetError::EOF.
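The "bytes required from the tail of the file" value follows from translating absolute file offsets into offsets within a buffer that holds only the end of the file. The sketch below shows that arithmetic in isolation. `TailError` and `slice_from_tail` are hypothetical names for illustration, and the function assumes `tail.len() <= file_size` and `range.end <= file_size`.

```rust
use std::ops::Range;

/// Hypothetical error mirroring the idea of `ParquetError::NeedMoreData`.
#[derive(Debug, PartialEq)]
enum TailError {
    /// Number of bytes, counted from the end of the file, that the caller must supply.
    NeedMoreData(usize),
}

/// Return the part of `tail` that corresponds to the absolute file `range`,
/// where `tail` holds the last `tail.len()` bytes of a file of `file_size` bytes.
fn slice_from_tail<'a>(
    tail: &'a [u8],
    file_size: usize,
    range: Range<usize>,
) -> Result<&'a [u8], TailError> {
    // Absolute file offset of the first byte held in `tail`.
    let tail_start = file_size - tail.len();
    if range.start < tail_start {
        // The buffer starts too late: report how many trailing bytes are required.
        return Err(TailError::NeedMoreData(file_size - range.start));
    }
    // Shift the absolute range so that it indexes into `tail`.
    Ok(&tail[range.start - tail_start..range.end - tail_start])
}
```

Because the reported value is measured from the end of the file, a caller can retry with the range `file_size - needed..file_size`, which is the shape of the retry in the example that follows.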

§Example
let file = open_parquet_file("some_path.parquet");
let len = file.len() as usize;
// Speculatively read 1 kilobyte from the end of the file
let bytes = get_bytes(&file, len - 1024..len);
let mut reader = ParquetMetaDataReader::new().with_page_indexes(true);
match reader.try_parse_sized(&bytes, len) {
    Ok(_) => (),
    Err(ParquetError::NeedMoreData(needed)) => {
        // Read the needed number of bytes from the end of the file
        let bytes = get_bytes(&file, len - needed..len);
        reader.try_parse_sized(&bytes, len).unwrap();
    }
    _ => panic!("unexpected error")
}
let metadata = reader.finish().unwrap();

Note that the file metadata may be read completely even though too few bytes are available to read the page indexes. Self::has_metadata() can be used to test for this. If the file metadata is present, re-parsing it can be skipped by calling Self::read_page_indexes_sized(), as shown below.

let file = open_parquet_file("some_path.parquet");
let len = file.len() as usize;
// Speculatively read 1 kilobyte from the end of the file
let mut bytes = get_bytes(&file, len - 1024..len);
let mut reader = ParquetMetaDataReader::new().with_page_indexes(true);
// Loop until `bytes` is large enough
loop {
    match reader.try_parse_sized(&bytes, len) {
        Ok(_) => break,
        Err(ParquetError::NeedMoreData(needed)) => {
            // Read the needed number of bytes from the end of the file
            bytes = get_bytes(&file, len - needed..len);
            // If the file metadata was read, only read the page indexes; otherwise continue the loop
            if reader.has_metadata() {
                reader.read_page_indexes_sized(&bytes, len).unwrap();
                break;
            }
        }
        _ => panic!("unexpected error")
    }
}
let metadata = reader.finish().unwrap();

pub fn read_page_indexes<R: ChunkReader>(&mut self, reader: &R) -> Result<()>

Read the page index structures when a ParquetMetaData has already been obtained. See Self::new_with_metadata() and Self::has_metadata().

pub fn read_page_indexes_sized<R: ChunkReader>( &mut self, reader: &R, file_size: usize, ) -> Result<()>

Read the page index structures when a ParquetMetaData has already been obtained. This variant is used when reader cannot access the entire Parquet file (e.g. it is a Bytes struct containing the tail of the file). See Self::new_with_metadata() and Self::has_metadata(). Like Self::try_parse_sized(), this function may return ParquetError::NeedMoreData.

pub async fn load_and_finish<F: MetadataFetch>( self, fetch: F, file_size: usize, ) -> Result<ParquetMetaData>

Given a MetadataFetch, parse and return the ParquetMetaData in a single pass.

This call will consume self.

See Self::with_prefetch_hint for a discussion of how to reduce the number of fetches performed by this function.

pub async fn try_load<F: MetadataFetch>( &mut self, fetch: F, file_size: usize, ) -> Result<()>

Attempts to (asynchronously) parse the footer metadata (and optionally page indexes) given a MetadataFetch.

See Self::with_prefetch_hint for a discussion of how to reduce the number of fetches performed by this function.

pub async fn load_page_index<F: MetadataFetch>( &mut self, fetch: F, ) -> Result<()>

Asynchronously fetch the page index structures when a ParquetMetaData has already been obtained. See Self::new_with_metadata().

async fn load_page_index_with_remainder<F: MetadataFetch>( &mut self, fetch: F, remainder: Option<(usize, Bytes)>, ) -> Result<()>

fn parse_column_index( &mut self, bytes: &Bytes, start_offset: usize, ) -> Result<()>

fn parse_offset_index( &mut self, bytes: &Bytes, start_offset: usize, ) -> Result<()>

fn range_for_page_index(&self) -> Option<Range<usize>>

fn parse_metadata<R: ChunkReader>( &mut self, chunk_reader: &R, ) -> Result<ParquetMetaData>

fn get_prefetch_size(&self) -> usize

Return the number of bytes to read in the initial pass. If a prefetch hint has been provided (see Self::with_prefetch_hint), return that value if it is larger than the size of the Parquet file footer (8 bytes). Otherwise return 8.
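The rule stated above amounts to a clamp from below at the footer size. A minimal sketch follows, written as a free function over an `Option<usize>`. The name `prefetch_size` and the free-function form are choices made for this illustration, not the crate's code.

```rust
/// Size in bytes of the fixed Parquet footer (4-byte length + 4-byte magic).
const FOOTER_SIZE: usize = 8;

/// Number of bytes to request in the first read, given an optional hint.
fn prefetch_size(hint: Option<usize>) -> usize {
    match hint {
        // A hint is only honored when it exceeds the footer size.
        Some(n) if n > FOOTER_SIZE => n,
        // No hint, or a hint too small to be useful: read just the footer.
        _ => FOOTER_SIZE,
    }
}
```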

async fn load_metadata<F: MetadataFetch>( fetch: &mut F, file_size: usize, prefetch: usize, ) -> Result<(ParquetMetaData, Option<(usize, Bytes)>)>

Decodes the Parquet footer, returning the metadata length in bytes.

A Parquet footer is 8 bytes long and has the following layout:

  • 4 bytes for the metadata length
  • 4 bytes for the magic bytes ‘PAR1’
+-----+--------+
| len | 'PAR1' |
+-----+--------+
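Decoding this layout takes only a few lines. The sketch below uses only the standard library and assumes the length is a little-endian unsigned 32-bit integer, as the Parquet format specifies. The function name `footer_metadata_len` and the `String` error type are choices made for this illustration, and only the 'PAR1' magic described above is handled.

```rust
/// Size in bytes of the fixed Parquet footer.
const FOOTER_SIZE: usize = 8;
/// Magic bytes that terminate a Parquet file.
const MAGIC: [u8; 4] = *b"PAR1";

/// Extract the metadata length from the final 8 bytes of a Parquet file.
fn footer_metadata_len(footer: &[u8; FOOTER_SIZE]) -> Result<usize, String> {
    // The last four bytes must be the magic marker.
    if footer[4..] != MAGIC[..] {
        return Err("invalid Parquet footer: magic bytes are not 'PAR1'".to_string());
    }
    // The first four bytes hold the metadata length, little-endian.
    let len = u32::from_le_bytes([footer[0], footer[1], footer[2], footer[3]]);
    Ok(len as usize)
}
```

The returned length tells the caller how many bytes immediately preceding the footer hold the encoded metadata.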

pub fn decode_metadata(buf: &[u8]) -> Result<ParquetMetaData>

Decodes ParquetMetaData from the provided bytes.

Typically this is used to decode the metadata from the end of a Parquet file. The format of buf is the Thrift compact binary protocol, as specified by the Parquet Spec.

fn parse_column_orders( t_column_orders: Option<Vec<TColumnOrder>>, schema_descr: &SchemaDescriptor, ) -> Option<Vec<ColumnOrder>>

Parses column orders from Thrift definition. If no column orders are defined, returns None.

Trait Implementations§

impl Default for ParquetMetaDataReader

fn default() -> ParquetMetaDataReader

Returns the “default value” for a type.
