parquet::arrow::arrow_reader::filter

Struct RowFilter

Source
pub struct RowFilter {
    pub(crate) predicates: Vec<Box<dyn ArrowPredicate>>,
}
Expand description

Filter applied during the parquet read process

RowFilter applies predicates in order, after decoding only the columns required. As predicates eliminate rows, fewer rows from subsequent columns may be required, thus potentially reducing IO and decode.

A RowFilter consists of a list of ArrowPredicates. Only the rows for which all the predicates evaluate to true will be returned. Any RowSelection provided to the reader will be applied prior to the first predicate, and each predicate in turn will then be used to compute a more refined RowSelection used when evaluating the subsequent predicates.

Once all predicates have been evaluated, the final RowSelection is applied to the top-level ProjectionMask to produce the final output [RecordBatch].

This design has a couple of implications:

  • RowFilter can be used to skip entire pages, and thus IO, in addition to CPU decode overheads
  • Columns may be decoded multiple times if they appear in multiple ProjectionMask
  • IO will be deferred until needed by a ProjectionMask

As such there is a trade-off between a single large predicate, or multiple predicates, that will depend on the shape of the data. Whilst multiple smaller predicates may minimise the amount of data scanned/decoded, it may not be faster overall.

For example, if a predicate that needs a single column of data filters out all but 1% of the rows, applying it as one of the early ArrowPredicateFn will likely significantly improve performance.

As a counter example, if a predicate needs several columns of data to evaluate but leaves 99% of the rows, it may be better to not filter the data from parquet and apply the filter after the RecordBatch has been fully decoded.

Additionally, even if a predicate eliminates a moderate number of rows, it may still be faster to filter the data after the RecordBatch has been fully decoded, if the eliminated rows are not contiguous.

Fields§

§predicates: Vec<Box<dyn ArrowPredicate>>

A list of ArrowPredicate

Implementations§

Source§

impl RowFilter

Source

pub fn new(predicates: Vec<Box<dyn ArrowPredicate>>) -> Self

Create a new RowFilter from an array of ArrowPredicate

Auto Trait Implementations§

Blanket Implementations§

Source§

impl<T> Any for T
where T: 'static + ?Sized,

Source§

fn type_id(&self) -> TypeId

Gets the TypeId of self. Read more
Source§

impl<T> Borrow<T> for T
where T: ?Sized,

Source§

fn borrow(&self) -> &T

Immutably borrows from an owned value. Read more
Source§

impl<T> BorrowMut<T> for T
where T: ?Sized,

Source§

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value. Read more
Source§

impl<T> From<T> for T

Source§

fn from(t: T) -> T

Returns the argument unchanged.

Source§

impl<T, U> Into<U> for T
where U: From<T>,

Source§

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

Source§

impl<T> IntoEither for T

Source§

fn into_either(self, into_left: bool) -> Either<Self, Self>

Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise. Read more
Source§

fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
where F: FnOnce(&Self) -> bool,

Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise. Read more
Source§

impl<T, U> TryFrom<U> for T
where U: Into<T>,

Source§

type Error = Infallible

The type returned in the event of a conversion error.
Source§

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.
Source§

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

Source§

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.
Source§

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.
§

impl<T> ErasedDestructor for T
where T: 'static,

§

impl<T> MaybeSendSync for T