pub struct StatisticsConverter<'a> {
parquet_column_index: Option<usize>,
arrow_field: &'a Field,
missing_null_counts_as_zero: bool,
}
Extracts Parquet statistics as Arrow arrays
This is used to convert Parquet statistics to Arrow ArrayRef, with
proper type conversions. This information can be used for pruning Parquet
files, row groups, and data pages based on the statistics embedded in
Parquet metadata.
§Schemas
The converter uses the schema of the Parquet file and the Arrow schema to
convert the underlying statistics value (stored as a parquet value) into the
corresponding Arrow value. For example, Decimals are stored as binary in
parquet files and this structure handles mapping them to the i128
representation used in Arrow.
Note: The Parquet schema and Arrow schema do not have to be identical (for
example, the columns may be in different orders and one or the other schemas
may have additional columns). The function parquet_column is used to
match the column in the Parquet schema to the column in the Arrow schema.
Fields§
§parquet_column_index: Option<usize>
The index of the matched column in the Parquet schema
§arrow_field: &'a Field
The field (with data type) of the column in the Arrow schema
§missing_null_counts_as_zero: bool
Treat missing null_counts as 0 nulls
Implementations§
impl<'a> StatisticsConverter<'a>
pub fn parquet_column_index(&self) -> Option<usize>
Return the index of the column in the Parquet schema, if any
Returns None if the column was present in the Arrow schema, but not
present in the parquet file
pub fn arrow_field(&self) -> &'a Field
Return the [Field] of the column in the Arrow schema
pub fn with_missing_null_counts_as_zero(
    self,
    missing_null_counts_as_zero: bool,
) -> Self
Set the statistics converter to treat missing null counts as missing
By default, the converter will treat missing null counts as though
the null count is known to be 0.
Note that parquet files written by parquet-rs currently do not store null counts even when it is known there are zero nulls, and the reader will return 0 for the null counts in that instance. This behavior may change in a future release.
Both parquet-java and parquet-cpp store null counts as 0 when there are no nulls, and don’t write unknown values to the null count field.
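As a sketch of opting out of this default (using the same hypothetical get_parquet_metadata / get_arrow_schema helpers as the examples below):

```rust
// Given the metadata for a parquet file and the arrow schema
let metadata: ParquetMetaData = get_parquet_metadata();
let arrow_schema: Schema = get_arrow_schema();
let parquet_schema = metadata.file_metadata().schema_descr();
// create a converter that reports missing null counts as nulls
// instead of assuming the null count is 0
let converter = StatisticsConverter::try_new("foo", &arrow_schema, parquet_schema)
    .unwrap()
    .with_missing_null_counts_as_zero(false);
```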
pub fn row_group_row_counts<I>(
    &self,
    metadatas: I,
) -> Result<Option<UInt64Array>>
where
    I: IntoIterator<Item = &'a RowGroupMetaData>,
Returns a UInt64Array with row counts for each row group
§Return Value
The returned array has no nulls, and has one value for each row group. Each value is the number of rows in the row group.
§Example
// Given the metadata for a parquet file and the arrow schema
let metadata: ParquetMetaData = get_parquet_metadata();
let arrow_schema: Schema = get_arrow_schema();
let parquet_schema = metadata.file_metadata().schema_descr();
// create a converter
let converter = StatisticsConverter::try_new("foo", &arrow_schema, parquet_schema)
.unwrap();
// get the row counts for each row group
let row_counts = converter
    .row_group_row_counts(metadata.row_groups().iter())
    .unwrap();
// file had 2 row groups, with 1024 and 23 rows respectively
assert_eq!(row_counts, Some(UInt64Array::from(vec![1024, 23])));
pub fn try_new<'b>(
    column_name: &'b str,
    arrow_schema: &'a Schema,
    parquet_schema: &'a SchemaDescriptor,
) -> Result<Self>
Create a new StatisticsConverter to extract statistics for a column
Note if there is no corresponding column in the parquet file, the returned arrays will be null. This can happen if the column is in the arrow schema but not in the parquet schema due to schema evolution.
See example on Self::row_group_mins for usage
§Errors
- If the column is not found in the arrow schema
pub fn row_group_mins<I>(&self, metadatas: I) -> Result<ArrayRef>
where
    I: IntoIterator<Item = &'a RowGroupMetaData>,
Extract the minimum values from row group statistics in RowGroupMetaData
§Return Value
The returned array contains 1 value for each row group, in the same order as metadatas.
Each value is either:
- the minimum value for the column
- a null value, if the statistics can not be extracted
Note that a null value does NOT mean the min value was actually null; it means the requested statistic is unknown.
§Errors
Reasons for not being able to extract the statistics include:
- the column is not present in the parquet file
- statistics for the column are not present in the row group
- the stored statistic value can not be converted to the requested type
§Example
// Given the metadata for a parquet file and the arrow schema
let metadata: ParquetMetaData = get_parquet_metadata();
let arrow_schema: Schema = get_arrow_schema();
let parquet_schema = metadata.file_metadata().schema_descr();
// create a converter
let converter = StatisticsConverter::try_new("foo", &arrow_schema, parquet_schema)
.unwrap();
// get the minimum value for the column "foo" in the parquet file
let min_values: ArrayRef = converter
.row_group_mins(metadata.row_groups().iter())
.unwrap();
// if "foo" is a Float64 value, the returned array will contain Float64 values
assert_eq!(min_values, Arc::new(Float64Array::from(vec![Some(1.0), Some(2.0)])) as _);
pub fn row_group_maxes<I>(&self, metadatas: I) -> Result<ArrayRef>
where
    I: IntoIterator<Item = &'a RowGroupMetaData>,
Extract the maximum values from row group statistics in RowGroupMetaData
See docs on Self::row_group_mins for details
pub fn row_group_null_counts<I>(&self, metadatas: I) -> Result<UInt64Array>
where
    I: IntoIterator<Item = &'a RowGroupMetaData>,
Extract the null counts from row group statistics in RowGroupMetaData
See docs on Self::row_group_mins for details
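As a sketch, the min, max, and null count statistics can be extracted together to prune row groups against a predicate. This example reuses the hypothetical get_parquet_metadata / get_arrow_schema helpers from the examples above:

```rust
// Given the metadata for a parquet file and the arrow schema
let metadata: ParquetMetaData = get_parquet_metadata();
let arrow_schema: Schema = get_arrow_schema();
let parquet_schema = metadata.file_metadata().schema_descr();
let converter = StatisticsConverter::try_new("foo", &arrow_schema, parquet_schema)
    .unwrap();
// extract the per row group statistics for the column "foo"
let mins: ArrayRef = converter
    .row_group_mins(metadata.row_groups().iter())
    .unwrap();
let maxes: ArrayRef = converter
    .row_group_maxes(metadata.row_groups().iter())
    .unwrap();
let null_counts: UInt64Array = converter
    .row_group_null_counts(metadata.row_groups().iter())
    .unwrap();
// each array has one entry per row group, so a predicate such as
// "foo > 5" can be evaluated against mins/maxes to skip row groups
assert_eq!(mins.len(), maxes.len());
```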
pub fn data_page_mins<I>(
    &self,
    column_page_index: &ParquetColumnIndex,
    column_offset_index: &ParquetOffsetIndex,
    row_group_indices: I,
) -> Result<ArrayRef>
where
    I: IntoIterator<Item = &'a usize>,
Extract the minimum values from Data Page statistics.
In Parquet files, in addition to the Column Chunk level statistics
(stored for each column for each row group) there are also
optional statistics stored for each data page, as part of
the ParquetColumnIndex.
Since a single Column Chunk is stored as one or more pages, page level statistics can prune at a finer granularity.
However, since they are stored in a separate metadata structure (Index) there is different code to extract them as compared to arrow statistics.
§Parameters:
- column_page_index: The parquet column page indices, read from ParquetMetaData column_index
- column_offset_index: The parquet column offset indices, read from ParquetMetaData offset_index
- row_group_indices: The indices of the row groups that are used to extract the column page index and offset index on a per row group per column basis.
§Return Value
The returned array contains 1 value for each NativeIndex in the underlying Indexes, in the same order as they appear in metadatas.
For example, if there are two Indexes in metadatas:
- the first having 3 PageIndex entries
- the second having 2 PageIndex entries
The returned array would have 5 rows.
Each value is either:
- the minimum value for the page
- a null value, if the statistics can not be extracted
Note that a null value does NOT mean the min value was actually null; it means the requested statistic is unknown.
§Errors
Reasons for not being able to extract the statistics include:
- the column is not present in the parquet file
- statistics for the pages are not present in the row group
- the stored statistic value can not be converted to the requested type
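A sketch of extracting page-level minimums, in the style of the earlier examples. The get_parquet_metadata / get_arrow_schema helpers are hypothetical, the column_index / offset_index accessor names on ParquetMetaData are assumptions, and the file must have been read with the page index enabled for these indices to be present:

```rust
// Given the metadata for a parquet file (read with the page index
// enabled) and the arrow schema
let metadata: ParquetMetaData = get_parquet_metadata();
let arrow_schema: Schema = get_arrow_schema();
let parquet_schema = metadata.file_metadata().schema_descr();
let converter = StatisticsConverter::try_new("foo", &arrow_schema, parquet_schema)
    .unwrap();
// the page index and offset index are optional parts of the metadata
let column_page_index = metadata.column_index().expect("no page index");
let column_offset_index = metadata.offset_index().expect("no offset index");
// extract the minimum value of each data page in row groups 0 and 1
let row_group_indices = [0, 1];
let min_values: ArrayRef = converter
    .data_page_mins(column_page_index, column_offset_index, row_group_indices.iter())
    .unwrap();
// min_values has one entry per data page across the selected row groups
```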
pub fn data_page_maxes<I>(
    &self,
    column_page_index: &ParquetColumnIndex,
    column_offset_index: &ParquetOffsetIndex,
    row_group_indices: I,
) -> Result<ArrayRef>
where
    I: IntoIterator<Item = &'a usize>,
Extract the maximum values from Data Page statistics.
See docs on Self::data_page_mins for details.
pub fn data_page_null_counts<I>(
    &self,
    column_page_index: &ParquetColumnIndex,
    column_offset_index: &ParquetOffsetIndex,
    row_group_indices: I,
) -> Result<UInt64Array>
where
    I: IntoIterator<Item = &'a usize>,
Returns a UInt64Array with null counts for each data page.
See docs on Self::data_page_mins for details.
pub fn data_page_row_counts<I>(
    &self,
    column_offset_index: &ParquetOffsetIndex,
    row_group_metadatas: &'a [RowGroupMetaData],
    row_group_indices: I,
) -> Result<Option<UInt64Array>>
where
    I: IntoIterator<Item = &'a usize>,
Returns a UInt64Array with row counts for each data page.
This function iterates over the given row group indexes and computes the row count for each page in the specified column.
§Parameters:
- column_offset_index: The parquet column offset indices, read from ParquetMetaData offset_index
- row_group_metadatas: The metadata slice of the row groups, read from ParquetMetaData row_groups
- row_group_indices: The indices of the row groups that are used to extract the column offset index on a per row group per column basis.
See docs on Self::data_page_mins for details.
fn make_null_array<I, A>(&self, data_type: &DataType, metadatas: I) -> ArrayRef
where
    I: IntoIterator<Item = A>,
Returns a null array of data_type with one element per row group
Trait Implementations§
Auto Trait Implementations§
impl<'a> Freeze for StatisticsConverter<'a>
impl<'a> RefUnwindSafe for StatisticsConverter<'a>
impl<'a> Send for StatisticsConverter<'a>
impl<'a> Sync for StatisticsConverter<'a>
impl<'a> Unpin for StatisticsConverter<'a>
impl<'a> UnwindSafe for StatisticsConverter<'a>
Blanket Implementations§
impl<T> BorrowMut<T> for T
where
    T: ?Sized,
fn borrow_mut(&mut self) -> &mut T
impl<T> IntoEither for T
fn into_either(self, into_left: bool) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.