pub struct ArrowReaderOptions {
    skip_arrow_metadata: bool,
    supplied_schema: Option<SchemaRef>,
    pub(crate) page_index_policy: PageIndexPolicy,
    metadata_options: ParquetMetaDataOptions,
    pub(crate) file_decryption_properties: Option<Arc<FileDecryptionProperties>>,
    virtual_columns: Vec<FieldRef>,
}
Options that control how ParquetMetaData is read when constructing
an Arrow reader.
To use these options, pass them to one of the following methods:
ParquetRecordBatchReaderBuilder::try_new_with_options
ParquetRecordBatchStreamBuilder::new_with_options
For fine-grained control over metadata loading, use
ArrowReaderMetadata::load to load metadata with these options.
See ArrowReaderBuilder for how to configure how the column data
is then read from the file, including projection and filter pushdown.
Fields§
skip_arrow_metadata: bool
Should the reader strip any user-defined metadata from the Arrow schema.
supplied_schema: Option<SchemaRef>
If provided, used as the schema hint when determining the Arrow schema; otherwise the schema hint is read from the ARROW_SCHEMA_META_KEY.
page_index_policy: PageIndexPolicy
Policy for reading offset and column indexes.
metadata_options: ParquetMetaDataOptions
Options to control reading of Parquet metadata.
file_decryption_properties: Option<Arc<FileDecryptionProperties>>
If encryption is enabled, the file decryption properties can be provided.
virtual_columns: Vec<FieldRef>
Implementations§
impl ArrowReaderOptions
pub fn new() -> Self
Create a new ArrowReaderOptions with the default settings
pub fn with_skip_arrow_metadata(self, skip_arrow_metadata: bool) -> Self
Skip decoding the embedded arrow metadata (defaults to false)
Parquet files generated by some writers may contain an embedded arrow schema and metadata. These may not be correct or compatible with your system; for an example, see ARROW-16184.
pub fn with_schema(self, schema: SchemaRef) -> Self
Provide a schema hint to use when reading the Parquet file.
If provided, this schema takes precedence over any arrow schema embedded
in the metadata (see the arrow documentation for more details).
If the provided schema is not compatible with the data stored in the parquet file schema, an error will be returned when constructing the builder.
This option is only required if you want to explicitly control the
conversion of Parquet types to Arrow types, such as casting a column to
a different type. For example, you could read an Int64 column in
a Parquet file as a TimestampMicrosecondArray.
§Notes
The provided schema must have the same number of columns as the parquet schema and the column names must be the same.
§Example
use std::io::Bytes;
use std::sync::Arc;
use tempfile::tempfile;
use arrow_array::{ArrayRef, Int32Array, RecordBatch};
use arrow_schema::{DataType, Field, Schema, TimeUnit};
use parquet::arrow::arrow_reader::{ArrowReaderOptions, ParquetRecordBatchReaderBuilder};
use parquet::arrow::ArrowWriter;
// Write data - schema is inferred from the data to be Int32
let file = tempfile().unwrap();
let batch = RecordBatch::try_from_iter(vec![
("col_1", Arc::new(Int32Array::from(vec![1, 2, 3])) as ArrayRef),
]).unwrap();
let mut writer = ArrowWriter::try_new(file.try_clone().unwrap(), batch.schema(), None).unwrap();
writer.write(&batch).unwrap();
writer.close().unwrap();
// Read the file back.
// Supply a schema that interprets the Int32 column as a Timestamp.
let supplied_schema = Arc::new(Schema::new(vec![
Field::new("col_1", DataType::Timestamp(TimeUnit::Nanosecond, None), false)
]));
let options = ArrowReaderOptions::new().with_schema(supplied_schema.clone());
let mut builder = ParquetRecordBatchReaderBuilder::try_new_with_options(
file.try_clone().unwrap(),
options
).expect("Error if the schema is not compatible with the parquet file schema.");
// Create the reader and read the data using the supplied schema.
let mut reader = builder.build().unwrap();
let _batch = reader.next().unwrap().unwrap();
pub fn with_page_index(self, page_index: bool) -> Self
Enable reading the PageIndex from the metadata, if present (defaults to false)
Some query engines use the PageIndex to push predicates down to the parquet scan,
potentially eliminating unnecessary IO.
If this is enabled, ParquetMetaData::column_index and
ParquetMetaData::offset_index will be populated if the corresponding
information is present in the file.
pub fn with_page_index_policy(self, policy: PageIndexPolicy) -> Self
Set the PageIndexPolicy to determine how page indexes should be read.
See with_page_index for more details.
pub fn with_parquet_schema(self, schema: Arc<SchemaDescriptor>) -> Self
Provide a Parquet schema to use when decoding the metadata. The schema in the Parquet footer will be skipped.
This can be used to avoid reparsing the schema from the file when it is already known.
pub fn with_file_decryption_properties(
    self,
    file_decryption_properties: Arc<FileDecryptionProperties>,
) -> Self
Provide the file decryption properties to use when reading encrypted parquet files.
If encryption is enabled and the file is encrypted, the file_decryption_properties must be provided.
pub fn with_virtual_columns(
    self,
    virtual_columns: Vec<FieldRef>,
) -> Result<Self>
Include virtual columns in the output.
Virtual columns are columns that are not part of the Parquet schema but are added to the output by the reader, such as row numbers.
§Example
use std::sync::Arc;
use arrow_array::{ArrayRef, Int64Array, RecordBatch};
use arrow_schema::{DataType, Field};
use parquet::arrow::arrow_reader::{ArrowReaderOptions, ParquetRecordBatchReaderBuilder};
// The import path of RowNumber is an assumption; check the parquet::arrow docs for your version.
use parquet::arrow::{ArrowWriter, RowNumber};
use tempfile::tempfile;
// Create a simple record batch with some data
let values = Arc::new(Int64Array::from(vec![1, 2, 3])) as ArrayRef;
let batch = RecordBatch::try_from_iter(vec![("value", values)])?;
// Write the batch to a temporary parquet file
let file = tempfile()?;
let mut writer = ArrowWriter::try_new(
file.try_clone()?,
batch.schema(),
None
)?;
writer.write(&batch)?;
writer.close()?;
// Create a virtual column for row numbers
let row_number_field = Arc::new(Field::new("row_number", DataType::Int64, false)
.with_extension_type(RowNumber));
// Configure options with virtual columns
let options = ArrowReaderOptions::new()
.with_virtual_columns(vec![row_number_field])?;
// Create a reader with the options
let mut reader = ParquetRecordBatchReaderBuilder::try_new_with_options(
file,
options
)?
.build()?;
// Read the batch - it will include both the original column and the virtual row_number column
let result_batch = reader.next().unwrap()?;
assert_eq!(result_batch.num_columns(), 2); // "value" + "row_number"
assert_eq!(result_batch.num_rows(), 3);
pub fn page_index(&self) -> bool
Retrieve the currently set page index behavior.
This can be set via with_page_index.
pub fn metadata_options(&self) -> &ParquetMetaDataOptions
Retrieve the currently set metadata decoding options.
pub fn file_decryption_properties(
    &self,
) -> Option<&Arc<FileDecryptionProperties>>
Retrieve the currently set file decryption properties.
This can be set via with_file_decryption_properties.
Trait Implementations§
impl Clone for ArrowReaderOptions
fn clone(&self) -> ArrowReaderOptions
fn clone_from(&mut self, source: &Self)
Performs copy-assignment from source.
impl Debug for ArrowReaderOptions
impl Default for ArrowReaderOptions
fn default() -> ArrowReaderOptions
impl Freeze for ArrowReaderOptions
impl !RefUnwindSafe for ArrowReaderOptions
impl Send for ArrowReaderOptions
impl Sync for ArrowReaderOptions
impl Unpin for ArrowReaderOptions
impl !UnwindSafe for ArrowReaderOptions
Blanket Implementations§
impl<T> BorrowMut<T> for T
where
    T: ?Sized,
fn borrow_mut(&mut self) -> &mut T
impl<T> CloneToUninit for T
where
    T: Clone,
impl<T> IntoEither for T
fn into_either(self, into_left: bool) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.