parquet::arrow::arrow_reader

Struct ArrowReaderBuilder

pub struct ArrowReaderBuilder<T> {
    pub(crate) input: T,
    pub(crate) metadata: Arc<ParquetMetaData>,
    pub(crate) schema: SchemaRef,
    pub(crate) fields: Option<Arc<ParquetField>>,
    pub(crate) batch_size: usize,
    pub(crate) row_groups: Option<Vec<usize>>,
    pub(crate) projection: ProjectionMask,
    pub(crate) filter: Option<RowFilter>,
    pub(crate) selection: Option<RowSelection>,
    pub(crate) limit: Option<usize>,
    pub(crate) offset: Option<usize>,
}

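Builder for constructing Parquet readers that decode into Arrow.

Most users should use one of the following specializations:

  ParquetRecordBatchReaderBuilder (synchronous API)
  ParquetRecordBatchStreamBuilder (asynchronous API)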

Fields§

input: T
metadata: Arc<ParquetMetaData>
schema: SchemaRef
fields: Option<Arc<ParquetField>>
batch_size: usize
row_groups: Option<Vec<usize>>
projection: ProjectionMask
filter: Option<RowFilter>
selection: Option<RowSelection>
limit: Option<usize>
offset: Option<usize>

Implementations§

impl<T> ArrowReaderBuilder<T>

pub(crate) fn new_builder(input: T, metadata: ArrowReaderMetadata) -> Self

pub fn metadata(&self) -> &Arc<ParquetMetaData>

Returns a reference to the ParquetMetaData for this parquet file

pub fn parquet_schema(&self) -> &SchemaDescriptor

Returns the parquet SchemaDescriptor for this parquet file

pub fn schema(&self) -> &SchemaRef

Returns the arrow SchemaRef for this parquet file

pub fn with_batch_size(self, batch_size: usize) -> Self

Set the size of RecordBatch to produce. Defaults to 1024. If batch_size is larger than the file row count, the file row count is used instead.

pub fn with_row_groups(self, row_groups: Vec<usize>) -> Self

Only read data from the provided row group indexes

This is also called row group filtering

pub fn with_projection(self, mask: ProjectionMask) -> Self

Only read data from the provided column indexes

pub fn with_row_selection(self, selection: RowSelection) -> Self

Provide a RowSelection to filter out rows, and avoid fetching their data into memory.

This feature is used to restrict which rows are decoded within row groups, skipping ranges of rows that are not needed. Such selections could be determined by evaluating predicates against the Parquet page index or some other external information available to a query engine.

§Notes

Row group filtering (see Self::with_row_groups) is applied before the row selection, so rows from skipped row groups should not be included in the RowSelection (see example below).

It is recommended to write the page index to the file and to enable reading it when using this functionality, to allow more efficient skipping over data pages. See ArrowReaderOptions::with_page_index.

§Example

Given a parquet file with 4 row groups, and a row group filter of [0, 2, 3], in order to scan rows 50-100 in row group 2 and rows 200-300 in row group 3:

  Row Group 0, 1000 rows (selected)
  Row Group 1, 1000 rows (skipped)
  Row Group 2, 1000 rows (selected, but want to only scan rows 50-100)
  Row Group 3, 1000 rows (selected, but want to only scan rows 200-300)

You could pass the following RowSelection:

 Select 1000    (scan all rows in row group 0)
 Skip 50        (skip the first 50 rows in row group 2)
 Select 50      (scan rows 50-100 in row group 2)
 Skip 900       (skip the remaining rows in row group 2)
 Skip 200       (skip the first 200 rows in row group 3)
 Select 100     (scan rows 200-300 in row group 3)
 Skip 700       (skip the remaining rows in row group 3)

Note there is no entry for the (entirely) skipped row group 1.

Note you can represent the same selection with fewer entries. Instead of

 Skip 900       (skip the remaining rows in row group 2)
 Skip 200       (skip the first 200 rows in row group 3)

you could use

 Skip 1100      (skip the remaining 900 rows in row group 2 and the first 200 rows in row group 3)
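
The example above can be checked with a small model. The following is a minimal sketch in plain Rust, not the parquet crate's implementation: Run and selected_ranges are hypothetical names introduced here to show how a list of select/skip runs maps onto the row groups that survive row group filtering. With the parquet crate itself, the same runs would be expressed as RowSelector::select and RowSelector::skip values.

```rust
use std::ops::Range;

/// Hypothetical stand-in for one run of a row selection
/// (illustration only, not the parquet crate's type).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Run {
    Select(usize),
    Skip(usize),
}

/// Maps `runs` onto the row groups that survive row group filtering.
/// `row_groups` holds (row group index, row count) pairs, in read order.
/// Returns (row group index, row range within that group) for every selected span.
fn selected_ranges(row_groups: &[(usize, usize)], runs: &[Run]) -> Vec<(usize, Range<usize>)> {
    let mut out = Vec::new();
    let mut groups = row_groups.iter().copied();
    let (mut rg, mut rg_rows) = match groups.next() {
        Some(first) => first,
        None => return out,
    };
    let mut pos = 0; // current row offset within row group `rg`
    for run in runs {
        let (mut remaining, select) = match *run {
            Run::Select(n) => (n, true),
            Run::Skip(n) => (n, false),
        };
        while remaining > 0 {
            if pos == rg_rows {
                // Current row group is exhausted: a run may continue into the next one.
                match groups.next() {
                    Some((next_rg, next_rows)) => {
                        rg = next_rg;
                        rg_rows = next_rows;
                        pos = 0;
                    }
                    None => return out,
                }
                continue;
            }
            let take = remaining.min(rg_rows - pos);
            if select {
                out.push((rg, pos..pos + take));
            }
            pos += take;
            remaining -= take;
        }
    }
    out
}
```

Called with the row groups [(0, 1000), (2, 1000), (3, 1000)] and the seven runs listed above, this model yields rows 0..1000 of row group 0, rows 50..100 of row group 2, and rows 200..300 of row group 3. Replacing the two adjacent skips with a single Skip 1100 gives the same result, because a run is allowed to cross a row group boundary.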

pub fn with_row_filter(self, filter: RowFilter) -> Self

Provide a RowFilter to skip decoding rows

Row filters are applied after row group selection and row selection

It is recommended to enable reading the page index if using this functionality, to allow more efficient skipping over data pages. See ArrowReaderOptions::with_page_index.

pub fn with_limit(self, limit: usize) -> Self

Provide a limit to the number of rows to be read

The limit will be applied after any Self::with_row_selection and Self::with_row_filter, allowing it to limit the final set of rows decoded after any pushed-down predicates.

It is recommended to enable reading the page index if using this functionality, to allow more efficient skipping over data pages. See ArrowReaderOptions::with_page_index.

pub fn with_offset(self, offset: usize) -> Self

Provide an offset to skip over the given number of rows

The offset will be applied after any Self::with_row_selection and Self::with_row_filter, allowing it to skip rows after any pushed-down predicates.

It is recommended to enable reading the page index if using this functionality, to allow more efficient skipping over data pages. See ArrowReaderOptions::with_page_index.
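
The ordering described for with_limit and with_offset can be illustrated with ordinary iterators. This is a sketch under stated assumptions, not the reader's implementation: read_rows is a hypothetical function, rows are modelled as plain integers, and the offset is assumed to be applied before the limit, which the text above does not spell out.

```rust
/// Hypothetical model of the order in which row-level options take effect.
/// `selected` plays the role of a RowSelection and `predicate` of a RowFilter.
fn read_rows(
    rows: &[u64],
    selected: impl Fn(u64) -> bool,
    predicate: impl Fn(u64) -> bool,
    offset: Option<usize>,
    limit: Option<usize>,
) -> Vec<u64> {
    rows.iter()
        .copied()
        .filter(|&r| selected(r)) // 1. row selection
        .filter(|&r| predicate(r)) // 2. row filter (pushed-down predicate)
        .skip(offset.unwrap_or(0)) // 3. offset (assumed to precede the limit)
        .take(limit.unwrap_or(usize::MAX)) // 4. limit
        .collect()
}
```

For rows 0 through 9, a selection keeping rows 2 and above, and a predicate keeping even values, the surviving rows are 2, 4, 6, 8. An offset of 1 and a limit of 2 then return 4 and 6, so both options count rows that passed the predicate rather than raw rows in the file.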

impl<T: ChunkReader + 'static> ArrowReaderBuilder<SyncReader<T>>

pub fn try_new(reader: T) -> Result<Self>

Create a new ParquetRecordBatchReaderBuilder

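// Assumption: `file` is any ChunkReader, for example a std::fs::File opened
// on a Parquet file; "data.parquet" is a placeholder path.
use std::fs::File;
use parquet::arrow::arrow_reader::{ParquetRecordBatchReader, ParquetRecordBatchReaderBuilder};

let file = File::open("data.parquet").unwrap();
let builder = ParquetRecordBatchReaderBuilder::try_new(file).unwrap();

// Inspect metadata (the example file is expected to have a single row group)
assert_eq!(builder.metadata().num_row_groups(), 1);

// Construct a reader that only reads row group 0
let mut reader: ParquetRecordBatchReader = builder.with_row_groups(vec![0]).build().unwrap();

// Read the first batch of data
let _batch = reader.next().unwrap().unwrap();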

pub fn try_new_with_options(reader: T, options: ArrowReaderOptions) -> Result<Self>

Create a new ParquetRecordBatchReaderBuilder with the provided ArrowReaderOptions

pub fn new_with_metadata(input: T, metadata: ArrowReaderMetadata) -> Self

Create a ParquetRecordBatchReaderBuilder from the provided ArrowReaderMetadata

This interface allows:

  1. Loading metadata once and using it to create multiple builders with potentially different settings or run on different threads

  2. Using a cached copy of the metadata rather than re-reading it from the file each time a reader is constructed.

See the docs on ArrowReaderMetadata for more details

§Example
// Assumption: `file` is a cheaply cloneable ChunkReader such as bytes::Bytes
// (std::fs::File has no `clone`; use `try_clone` there instead).
let metadata = ArrowReaderMetadata::load(&file, Default::default()).unwrap();
let mut a = ParquetRecordBatchReaderBuilder::new_with_metadata(file.clone(), metadata.clone()).build().unwrap();
let mut b = ParquetRecordBatchReaderBuilder::new_with_metadata(file, metadata).build().unwrap();

// Both readers can be read independently, for example in parallel
assert_eq!(a.next().unwrap().unwrap(), b.next().unwrap().unwrap());

pub fn build(self) -> Result<ParquetRecordBatchReader>

Build a ParquetRecordBatchReader

Note: this will eagerly evaluate any RowFilter before returning

impl<T: AsyncFileReader + Send + 'static> ArrowReaderBuilder<AsyncReader<T>>

pub async fn new(input: T) -> Result<Self>

Create a new ParquetRecordBatchStreamBuilder with the provided parquet file

§Example
// `next()` on the stream is provided by futures::StreamExt
use futures::StreamExt;

// Open async file containing parquet data (`file` is a std::fs::File)
let file = tokio::fs::File::from_std(file);
// Construct the reader
let mut reader = ParquetRecordBatchStreamBuilder::new(file)
  .await.unwrap().build().unwrap();
// Read a batch
let batch: RecordBatch = reader.next().await.unwrap().unwrap();

pub async fn new_with_options(input: T, options: ArrowReaderOptions) -> Result<Self>

Create a new ParquetRecordBatchStreamBuilder with the provided parquet file and ArrowReaderOptions

pub fn new_with_metadata(input: T, metadata: ArrowReaderMetadata) -> Self

Create a ParquetRecordBatchStreamBuilder from the provided ArrowReaderMetadata

This allows loading metadata once and using it to create multiple builders, potentially with different settings, that can be read in parallel.

§Example of reading from multiple streams in parallel
// open file with parquet data
let mut file = tokio::fs::File::from_std(file);
// load metadata once
let meta = ArrowReaderMetadata::load_async(&mut file, Default::default()).await.unwrap();
// create two readers, a and b, from the same underlying file
// without reading the metadata again
let mut a = ParquetRecordBatchStreamBuilder::new_with_metadata(
    file.try_clone().await.unwrap(),
    meta.clone()
).build().unwrap();
let mut b = ParquetRecordBatchStreamBuilder::new_with_metadata(file, meta).build().unwrap();

// Can read batches from both readers in parallel
assert_eq!(
  a.next().await.unwrap().unwrap(),
  b.next().await.unwrap().unwrap(),
);

pub async fn get_row_group_column_bloom_filter(&mut self, row_group_idx: usize, column_idx: usize) -> Result<Option<Sbbf>>

Read the bloom filter for a column in a row group. Returns None if the column does not have a bloom filter.

This function should be called after other forms of pruning, such as projection and predicate pushdown.
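
How the result is typically used for pruning can be sketched as follows. This is an illustration in plain Rust rather than the parquet crate's API: MembershipProbe, ExactProbe and can_skip_row_group are hypothetical names, and the probe stands in for a bloom filter, which may report false positives but never false negatives.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for a bloom filter lookup (not the parquet crate's Sbbf type).
trait MembershipProbe {
    /// Returns false only when the value is definitely absent.
    fn might_contain(&self, value: &str) -> bool;
}

/// Toy probe backed by an exact set, for illustration only.
/// A real bloom filter could also return true for some absent values.
struct ExactProbe(HashSet<String>);

impl MembershipProbe for ExactProbe {
    fn might_contain(&self, value: &str) -> bool {
        self.0.contains(value)
    }
}

/// Decides whether a row group can be skipped for an equality predicate on `value`.
fn can_skip_row_group(filter: Option<&dyn MembershipProbe>, value: &str) -> bool {
    match filter {
        // No bloom filter for this column: nothing can be ruled out.
        None => false,
        // The filter says "definitely absent": the row group cannot match.
        Some(probe) => !probe.might_contain(value),
    }
}
```

A row group is skipped only when a filter exists and reports the value as absent. A missing filter (None) or a positive answer both mean the row group still has to be read.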

pub fn build(self) -> Result<ParquetRecordBatchStream<T>>

Build a ParquetRecordBatchStream

Auto Trait Implementations§

impl<T> Freeze for ArrowReaderBuilder<T>
where T: Freeze,

impl<T> !RefUnwindSafe for ArrowReaderBuilder<T>

impl<T> Send for ArrowReaderBuilder<T>
where T: Send,

impl<T> !Sync for ArrowReaderBuilder<T>

impl<T> Unpin for ArrowReaderBuilder<T>
where T: Unpin,

impl<T> !UnwindSafe for ArrowReaderBuilder<T>

Blanket Implementations§

impl<T> Any for T
where T: 'static + ?Sized,

fn type_id(&self) -> TypeId

Gets the TypeId of self.

impl<T> Borrow<T> for T
where T: ?Sized,

fn borrow(&self) -> &T

Immutably borrows from an owned value.

impl<T> BorrowMut<T> for T
where T: ?Sized,

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value.

impl<T> From<T> for T

fn from(t: T) -> T

Returns the argument unchanged.

impl<T, U> Into<U> for T
where U: From<T>,

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

impl<T> IntoEither for T

fn into_either(self, into_left: bool) -> Either<Self, Self>

Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.

fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
where F: FnOnce(&Self) -> bool,

Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.

impl<T, U> TryFrom<U> for T
where U: Into<T>,

type Error = Infallible

The type returned in the event of a conversion error.

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.

impl<T> ErasedDestructor for T
where T: 'static,

impl<T> MaybeSendSync for T