pub struct ParquetMetaDataReader {
metadata: Option<ParquetMetaData>,
column_index: bool,
offset_index: bool,
prefetch_hint: Option<usize>,
metadata_size: Option<usize>,
}
Reads the ParquetMetaData from a byte stream.
See crate::file::metadata::ParquetMetaDataWriter for a description of the Parquet metadata.
Parquet metadata is not necessarily contiguous in the file: part is stored in the footer (the last bytes of the file), but other portions (such as the PageIndex) can be stored elsewhere.
This reader handles the footer as well as the non-contiguous parts of the metadata, such as the page indexes. It does not read Bloom filters.
§Example
// read parquet metadata including page indexes from a file
let file = open_parquet_file("some_path.parquet");
let mut reader = ParquetMetaDataReader::new()
    .with_page_indexes(true);
reader.try_parse(&file).unwrap();
let metadata = reader.finish().unwrap();
assert!(metadata.column_index().is_some());
assert!(metadata.offset_index().is_some());
Fields§
§metadata: Option<ParquetMetaData>
§column_index: bool
§offset_index: bool
§prefetch_hint: Option<usize>
§metadata_size: Option<usize>
Implementations§
impl ParquetMetaDataReader
pub fn new() -> Self
Create a new ParquetMetaDataReader
pub fn new_with_metadata(metadata: ParquetMetaData) -> Self
Create a new ParquetMetaDataReader populated with a ParquetMetaData struct obtained via other means.
pub fn with_page_indexes(self, val: bool) -> Self
Enable or disable reading the page index structures described in
“Parquet page index: Layout to Support Page Skipping”. Equivalent to:
self.with_column_indexes(val).with_offset_indexes(val)
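The equivalence above can be illustrated with a small stand-alone builder. This is only a sketch: `SketchOptions` is a hypothetical type that mirrors the two boolean fields of the reader, not the crate's own struct.

```rust
/// Hypothetical stand-in holding the two page index switches.
#[derive(Debug, Default, PartialEq)]
struct SketchOptions {
    column_index: bool,
    offset_index: bool,
}

impl SketchOptions {
    /// Enable or disable the column index only.
    fn with_column_indexes(mut self, val: bool) -> Self {
        self.column_index = val;
        self
    }

    /// Enable or disable the offset index only.
    fn with_offset_indexes(mut self, val: bool) -> Self {
        self.offset_index = val;
        self
    }

    /// Convenience method that sets both switches to the same value.
    fn with_page_indexes(self, val: bool) -> Self {
        self.with_column_indexes(val).with_offset_indexes(val)
    }
}
```

Because each method takes and returns `self`, the calls can be chained in any order, and a later call overrides an earlier one.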
pub fn with_column_indexes(self, val: bool) -> Self
Enable or disable reading the Parquet ColumnIndex structure.
pub fn with_offset_indexes(self, val: bool) -> Self
Enable or disable reading the Parquet OffsetIndex structure.
pub fn with_prefetch_hint(self, prefetch: Option<usize>) -> Self
Provide a hint as to the number of bytes needed to fully parse the ParquetMetaData.
Only used by the asynchronous Self::try_load() and Self::load_and_finish() methods.
By default, the reader first fetches the last 8 bytes of the input file to obtain the size of the footer metadata. A second fetch obtains the metadata bytes. After the footer metadata is parsed, a third fetch obtains the bytes needed to decode the page index structures, if they have been requested. To avoid unnecessary fetches, prefetch can be set to an estimate of the number of bytes needed to fully decode the ParquetMetaData, which can reduce both the number of fetch requests and latency. Setting prefetch too small will not trigger an error, but will result in extra fetches.
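To make the effect of the hint concrete, the fetch sequence described above can be modelled with a small function. This is a simplified, hypothetical model, not the crate's implementation: `count_fetches` and its parameters are invented for illustration, and it assumes the page index structures sit in one contiguous range ahead of the footer metadata.

```rust
/// Size of the Parquet footer in bytes.
const FOOTER_SIZE: usize = 8;

/// Hypothetical model of the number of fetches needed to load the metadata.
/// `page_index_start` is the absolute file offset at which the page index
/// structures begin, or `None` if page indexes were not requested.
fn count_fetches(
    file_size: usize,
    metadata_len: usize,
    page_index_start: Option<usize>,
    prefetch_hint: Option<usize>,
) -> usize {
    // First fetch: the tail of the file, at least the 8-byte footer.
    let prefetch = prefetch_hint
        .unwrap_or(FOOTER_SIZE)
        .max(FOOTER_SIZE)
        .min(file_size);
    let prefetched_from = file_size - prefetch;
    let mut fetches = 1;

    // Second fetch: only needed if the tail did not already hold the metadata.
    if prefetch < metadata_len + FOOTER_SIZE {
        fetches += 1;
    }

    // Third fetch: only needed if the page indexes start before the prefetched tail.
    if let Some(start) = page_index_start {
        if start < prefetched_from {
            fetches += 1;
        }
    }
    fetches
}
```

In this model, a 10,000-byte file with 500 bytes of metadata and page indexes starting at offset 9,000 needs three fetches without a hint, two with a 600-byte hint, and one with a 2,000-byte hint. A hint that is too small (for example 100 bytes) still needs three fetches, matching the note that an undersized hint costs extra fetches rather than causing an error.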
pub fn has_metadata(&self) -> bool
Indicates whether this reader has a ParquetMetaData internally.
pub fn finish(&mut self) -> Result<ParquetMetaData>
Return the parsed ParquetMetaData struct, leaving None in its place.
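The "leaving None in its place" behaviour corresponds to the usual `Option::take` pattern. The sketch below uses a hypothetical `SketchReader` with a `String` standing in for the metadata; returning an error when nothing has been parsed is an assumption made for illustration.

```rust
/// Hypothetical stand-in for the reader; `String` plays the role of the metadata.
struct SketchReader {
    metadata: Option<String>,
}

impl SketchReader {
    /// Moves the stored value out of the reader, leaving `None` behind.
    fn finish(&mut self) -> Result<String, String> {
        self.metadata
            .take()
            .ok_or_else(|| "no metadata has been parsed".to_string())
    }
}
```

Under this pattern a second call to `finish` without another parse has nothing left to return.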
pub fn parse_and_finish<R: ChunkReader>(self, reader: &R) -> Result<ParquetMetaData>
Given a ChunkReader, parse and return the ParquetMetaData in a single pass.
If reader is Bytes based, then the buffer must contain sufficient bytes to complete the request, and must include the Parquet footer. If page indexes are desired, the buffer must contain the entire file, or Self::try_parse_sized() should be used.
This call will consume self.
§Example
// read parquet metadata including page indexes
let file = open_parquet_file("some_path.parquet");
let metadata = ParquetMetaDataReader::new()
    .with_page_indexes(true)
    .parse_and_finish(&file).unwrap();
pub fn try_parse<R: ChunkReader>(&mut self, reader: &R) -> Result<()>
Attempts to parse the footer metadata (and optionally page indexes) given a ChunkReader.
If reader is Bytes based, then the buffer must contain sufficient bytes to complete the request, and must include the Parquet footer. If page indexes are desired, the buffer must contain the entire file, or Self::try_parse_sized() should be used.
pub fn try_parse_sized<R: ChunkReader>(&mut self, reader: &R, file_size: usize) -> Result<()>
Same as Self::try_parse(), but also takes the original file size, for the case where reader is a Bytes struct that does not contain the entire file. The file size is necessary when the page indexes are desired. reader must have access to the Parquet footer.
Using this function also allows for retrying with a larger buffer.
§Errors
This function will return ParquetError::NeedMoreData in the event reader does not provide enough data to fully parse the metadata (see example below). The returned error will be populated with a usize field indicating the number of bytes required from the tail of the file to completely parse the requested metadata.
Other errors returned include ParquetError::General and ParquetError::EOF.
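The reason the file size matters can be sketched in isolation: when the buffer holds only the tail of the file, absolute file offsets have to be translated into buffer offsets, and a range that starts before the buffer yields the number of tail bytes required. The names `SketchError` and `range_in_tail` are hypothetical and only model the behaviour described above; they are not part of the crate.

```rust
use std::ops::Range;

/// Hypothetical stand-in for the error described above; the payload is the
/// number of bytes required from the tail of the file.
#[derive(Debug, PartialEq)]
enum SketchError {
    NeedMoreData(usize),
}

/// Maps an absolute byte range of the file to a range within a buffer that
/// holds only the last `buf_len` bytes of a `file_size`-byte file.
/// Assumes `buf_len <= file_size` and `wanted.end <= file_size`.
fn range_in_tail(
    file_size: usize,
    buf_len: usize,
    wanted: Range<usize>,
) -> Result<Range<usize>, SketchError> {
    // Absolute offset of the first byte held in the buffer.
    let buf_start = file_size - buf_len;
    if wanted.start < buf_start {
        // The caller must supply a tail that begins at `wanted.start`.
        return Err(SketchError::NeedMoreData(file_size - wanted.start));
    }
    Ok(wanted.start - buf_start..wanted.end - buf_start)
}
```

With a 10,000-byte file and a 1,024-byte tail, a request for bytes 8,000..8,100 reports that 2,000 tail bytes are needed; retrying with a 2,000-byte tail then succeeds.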
§Example
let file = open_parquet_file("some_path.parquet");
let len = file.len() as usize;
// Speculatively read 1 kilobyte from the end of the file
let bytes = get_bytes(&file, len - 1024..len);
let mut reader = ParquetMetaDataReader::new().with_page_indexes(true);
match reader.try_parse_sized(&bytes, len) {
    Ok(_) => (),
    Err(ParquetError::NeedMoreData(needed)) => {
        // Read the needed number of bytes from the end of the file
        let bytes = get_bytes(&file, len - needed..len);
        reader.try_parse_sized(&bytes, len).unwrap();
    }
    _ => panic!("unexpected error")
}
let metadata = reader.finish().unwrap();
Note that the file metadata may be read completely while there are still insufficient bytes available to read the page indexes. Self::has_metadata() can be used to test for this. If the file metadata is present, re-parsing it can be skipped by using Self::read_page_indexes_sized(), as shown below.
let file = open_parquet_file("some_path.parquet");
let len = file.len() as usize;
// Speculatively read 1 kilobyte from the end of the file
let mut bytes = get_bytes(&file, len - 1024..len);
let mut reader = ParquetMetaDataReader::new().with_page_indexes(true);
// Loop until `bytes` is large enough
loop {
    match reader.try_parse_sized(&bytes, len) {
        Ok(_) => break,
        Err(ParquetError::NeedMoreData(needed)) => {
            // Read the needed number of bytes from the end of the file
            bytes = get_bytes(&file, len - needed..len);
            // If the file metadata was read, only read the page indexes;
            // otherwise continue the loop
            if reader.has_metadata() {
                reader.read_page_indexes_sized(&bytes, len).unwrap();
                break;
            }
        }
        _ => panic!("unexpected error")
    }
}
let metadata = reader.finish().unwrap();
pub fn read_page_indexes<R: ChunkReader>(&mut self, reader: &R) -> Result<()>
Read the page index structures when a ParquetMetaData has already been obtained.
See Self::new_with_metadata() and Self::has_metadata().
pub fn read_page_indexes_sized<R: ChunkReader>(&mut self, reader: &R, file_size: usize) -> Result<()>
Read the page index structures when a ParquetMetaData has already been obtained.
This variant is used when reader cannot access the entire Parquet file (e.g. it is a Bytes struct containing the tail of the file).
See Self::new_with_metadata() and Self::has_metadata(). Like Self::try_parse_sized(), this function may return ParquetError::NeedMoreData.
pub async fn load_and_finish<F: MetadataFetch>(self, fetch: F, file_size: usize) -> Result<ParquetMetaData>
Given a MetadataFetch, parse and return the ParquetMetaData in a single pass.
This call will consume self.
See Self::with_prefetch_hint for a discussion of how to reduce the number of fetches performed by this function.
pub async fn try_load<F: MetadataFetch>(&mut self, fetch: F, file_size: usize) -> Result<()>
Attempts to (asynchronously) parse the footer metadata (and optionally page indexes) given a MetadataFetch.
See Self::with_prefetch_hint for a discussion of how to reduce the number of fetches performed by this function.
pub async fn load_page_index<F: MetadataFetch>(&mut self, fetch: F) -> Result<()>
Asynchronously fetch the page index structures when a ParquetMetaData has already been obtained. See Self::new_with_metadata().
async fn load_page_index_with_remainder<F: MetadataFetch>(&mut self, fetch: F, remainder: Option<(usize, Bytes)>) -> Result<()>
fn parse_column_index(&mut self, bytes: &Bytes, start_offset: usize) -> Result<()>
fn parse_offset_index(&mut self, bytes: &Bytes, start_offset: usize) -> Result<()>
fn range_for_page_index(&self) -> Option<Range<usize>>
fn parse_metadata<R: ChunkReader>(&mut self, chunk_reader: &R) -> Result<ParquetMetaData>
fn get_prefetch_size(&self) -> usize
Return the number of bytes to read in the initial pass. If a prefetch hint has been provided, return that value if it is larger than the size of the Parquet file footer (8 bytes). Otherwise return 8.
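The rule above amounts to a single match expression. The following is a sketch of the described behaviour only; `initial_read_size` is a hypothetical free function, whereas the real method reads the hint from the reader itself.

```rust
/// Size of the Parquet footer in bytes.
const FOOTER_SIZE: usize = 8;

/// Hypothetical helper: number of bytes to request in the first read.
fn initial_read_size(prefetch_hint: Option<usize>) -> usize {
    match prefetch_hint {
        // Use the hint only when it is larger than the footer itself.
        Some(hint) if hint > FOOTER_SIZE => hint,
        // No hint, or a hint too small to cover the footer.
        _ => FOOTER_SIZE,
    }
}
```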
async fn load_metadata<F: MetadataFetch>(fetch: &mut F, file_size: usize, prefetch: usize) -> Result<(ParquetMetaData, Option<(usize, Bytes)>)>
Decodes the Parquet footer, returning the metadata length in bytes.
A parquet footer is 8 bytes long and has the following layout:
- 4 bytes for the metadata length
- 4 bytes for the magic bytes ‘PAR1’
+-----+--------+
| len | 'PAR1' |
+-----+--------+
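The layout above can be decoded in a few lines. This is a minimal sketch rather than the crate's own function: `footer_metadata_len` is a hypothetical name, the error type is a plain `String`, and the length is read as an unsigned little-endian integer, as the Parquet format stores it.

```rust
/// Magic bytes that terminate a Parquet file.
const PARQUET_MAGIC: [u8; 4] = *b"PAR1";
/// Size of the footer: 4 length bytes followed by 4 magic bytes.
const FOOTER_SIZE: usize = 8;

/// Hypothetical helper: returns the metadata length encoded in the
/// trailing 8 bytes of a Parquet file.
fn footer_metadata_len(footer: &[u8; FOOTER_SIZE]) -> Result<usize, String> {
    // The last 4 bytes must be the magic marker.
    if footer[4..] != PARQUET_MAGIC[..] {
        return Err("invalid Parquet file: corrupt footer".to_string());
    }
    // The first 4 bytes hold the metadata length, little-endian.
    let len = u32::from_le_bytes([footer[0], footer[1], footer[2], footer[3]]);
    Ok(len as usize)
}
```

A caller would typically pass the last 8 bytes of the file and then read that many bytes immediately before the footer to obtain the encoded metadata.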
pub fn decode_metadata(buf: &[u8]) -> Result<ParquetMetaData>
Decodes ParquetMetaData from the provided bytes.
Typically this is used to decode the metadata from the end of a Parquet file. The format of buf is the Thrift compact binary protocol, as specified by the Parquet Spec.
fn parse_column_orders(t_column_orders: Option<Vec<TColumnOrder>>, schema_descr: &SchemaDescriptor) -> Result<Option<Vec<ColumnOrder>>>
Parses column orders from the Thrift definition. If no column orders are defined, returns None.
Trait Implementations§
impl Default for ParquetMetaDataReader
fn default() -> ParquetMetaDataReader
Auto Trait Implementations§
impl Freeze for ParquetMetaDataReader
impl RefUnwindSafe for ParquetMetaDataReader
impl Send for ParquetMetaDataReader
impl Sync for ParquetMetaDataReader
impl Unpin for ParquetMetaDataReader
impl UnwindSafe for ParquetMetaDataReader
Blanket Implementations§
impl<T> BorrowMut<T> for T
where
    T: ?Sized,
fn borrow_mut(&mut self) -> &mut T
impl<T> IntoEither for T
fn into_either(self, into_left: bool) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.