pub struct ArrowReaderOptions {
    skip_arrow_metadata: bool,
    supplied_schema: Option<SchemaRef>,
    pub(crate) column_index: PageIndexPolicy,
    pub(crate) offset_index: PageIndexPolicy,
    metadata_options: ParquetMetaDataOptions,
    pub(crate) file_decryption_properties: Option<Arc<FileDecryptionProperties>>,
    virtual_columns: Vec<FieldRef>,
}
Options that control how ParquetMetaData is read when constructing
an Arrow reader.

To use these options, pass them to one of the following methods:

- ParquetRecordBatchReaderBuilder::try_new_with_options
- ParquetRecordBatchStreamBuilder::new_with_options

For fine-grained control over metadata loading, use
ArrowReaderMetadata::load to load metadata with these options.

See ArrowReaderBuilder for how to configure how the column data
is then read from the file, including projection and filter pushdown.
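For orientation, here is a minimal sketch of the end-to-end flow: write a
small file to an in-memory buffer, then open it with
ParquetRecordBatchReaderBuilder::try_new_with_options using default options
(the column name and data are illustrative only):

use std::sync::Arc;
use arrow_array::{ArrayRef, Int32Array, RecordBatch};
use bytes::Bytes;
use parquet::arrow::ArrowWriter;
use parquet::arrow::arrow_reader::{ArrowReaderOptions, ParquetRecordBatchReaderBuilder};

// Write a single-column batch to an in-memory Parquet file.
let mut file = Vec::new();
let batch = RecordBatch::try_from_iter(vec![
    ("x", Arc::new(Int32Array::from(vec![1, 2, 3])) as ArrayRef),
]).unwrap();
let mut writer = ArrowWriter::try_new(&mut file, batch.schema(), None).unwrap();
writer.write(&batch).unwrap();
writer.close().unwrap();
let file = Bytes::from(file);

// Open the file with explicit options and read it back.
let options = ArrowReaderOptions::new();
let reader = ParquetRecordBatchReaderBuilder::try_new_with_options(file, options)
    .unwrap()
    .build()
    .unwrap();
for batch in reader {
    assert_eq!(batch.unwrap().num_rows(), 3);
}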
Fields

skip_arrow_metadata: bool
Should the reader strip any user-defined metadata from the Arrow schema?

supplied_schema: Option<SchemaRef>
If provided, used as the schema hint when determining the Arrow schema;
otherwise the schema hint is read from the ARROW_SCHEMA_META_KEY.

column_index: PageIndexPolicy

offset_index: PageIndexPolicy

metadata_options: ParquetMetaDataOptions
Options to control reading of Parquet metadata.

file_decryption_properties: Option<Arc<FileDecryptionProperties>>
If encryption is enabled, the file decryption properties can be provided.

virtual_columns: Vec<FieldRef>

Implementations
impl ArrowReaderOptions

pub fn new() -> Self

Create a new ArrowReaderOptions with the default settings.
pub fn with_skip_arrow_metadata(self, skip_arrow_metadata: bool) -> Self

Skip decoding the embedded Arrow metadata (defaults to false).

Parquet files generated by some writers may contain an embedded Arrow
schema and metadata that is not correct or compatible with your system;
for example, see ARROW-16184.
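As a minimal sketch, constructing options that ignore any embedded Arrow
schema looks like this; the reader then derives the Arrow schema purely
from the Parquet schema:

use parquet::arrow::arrow_reader::ArrowReaderOptions;

// Ignore the "ARROW:schema" metadata embedded by the writer, if any.
let options = ArrowReaderOptions::new().with_skip_arrow_metadata(true);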
pub fn with_schema(self, schema: SchemaRef) -> Self
Provide a schema hint to use when reading the Parquet file.

If provided, this schema takes precedence over any Arrow schema embedded
in the metadata (see the Arrow documentation for more details).

If the provided schema is not compatible with the schema of the data
stored in the Parquet file, an error will be returned when constructing
the builder.

This option is only required if you want to explicitly control the
conversion of Parquet types to Arrow types, such as casting a column to
a different type: for example, reading an Int64 column in a Parquet file
as a TimestampMicrosecondArray in the Arrow schema.

Notes

The provided schema must have the same number of columns as the Parquet
schema, and the column names must match.

Example
use std::sync::Arc;
use arrow_array::{ArrayRef, Int32Array, RecordBatch};
use arrow_schema::{DataType, Field, Schema, TimeUnit};
use bytes::Bytes;
use parquet::arrow::ArrowWriter;
use parquet::arrow::arrow_reader::{ArrowReaderOptions, ParquetRecordBatchReaderBuilder};

// Write data - schema is inferred from the data to be Int32
let mut file = Vec::new();
let batch = RecordBatch::try_from_iter(vec![
    ("col_1", Arc::new(Int32Array::from(vec![1, 2, 3])) as ArrayRef),
]).unwrap();
let mut writer = ArrowWriter::try_new(&mut file, batch.schema(), None).unwrap();
writer.write(&batch).unwrap();
writer.close().unwrap();
let file = Bytes::from(file);

// Read the file back.
// Supply a schema that interprets the Int32 column as a Timestamp.
let supplied_schema = Arc::new(Schema::new(vec![
    Field::new("col_1", DataType::Timestamp(TimeUnit::Nanosecond, None), false)
]));
let options = ArrowReaderOptions::new().with_schema(supplied_schema.clone());
let mut builder = ParquetRecordBatchReaderBuilder::try_new_with_options(
    file.clone(),
    options
).expect("schema should be compatible with the Parquet file schema");

// Create the reader and read the data using the supplied schema.
let mut reader = builder.build().unwrap();
let _batch = reader.next().unwrap().unwrap();

Example: Preserving Dictionary Encoding
By default, Parquet string columns are read as a StringArray (or
LargeStringArray), even if the underlying Parquet data uses dictionary
encoding. You can preserve the dictionary encoding by specifying a
Dictionary type in the schema hint:
use std::sync::Arc;
use arrow_array::{RecordBatch, StringArray};
use arrow_schema::{DataType, Field, Schema};
use bytes::Bytes;
use parquet::arrow::ArrowWriter;
use parquet::arrow::arrow_reader::{ArrowReaderOptions, ParquetRecordBatchReaderBuilder};

// Write a Parquet file with string data
let mut file = Vec::new();
let schema = Arc::new(Schema::new(vec![
    Field::new("city", DataType::Utf8, false)
]));
let cities = StringArray::from(vec!["Berlin", "Berlin", "Paris", "Berlin", "Paris"]);
let batch = RecordBatch::try_new(schema.clone(), vec![Arc::new(cities)]).unwrap();
let mut writer = ArrowWriter::try_new(&mut file, batch.schema(), None).unwrap();
writer.write(&batch).unwrap();
writer.close().unwrap();
let file = Bytes::from(file);

// Read the file back, requesting dictionary encoding preservation
let dict_schema = Arc::new(Schema::new(vec![
    Field::new("city", DataType::Dictionary(
        Box::new(DataType::Int32),
        Box::new(DataType::Utf8)
    ), false)
]));
let options = ArrowReaderOptions::new().with_schema(dict_schema);
let builder = ParquetRecordBatchReaderBuilder::try_new_with_options(
    file.clone(),
    options
).unwrap();
let mut reader = builder.build().unwrap();
let batch = reader.next().unwrap().unwrap();

// The column is now a DictionaryArray
assert!(matches!(
    batch.column(0).data_type(),
    DataType::Dictionary(_, _)
));

Note: Dictionary encoding preservation works best when:
- The original column was dictionary encoded (the default for string columns)
- There are a small number of distinct values
pub fn with_page_index(self, page_index: bool) -> Self

Deprecated since 57.2.0: Use with_page_index_policy instead.

Enable reading the PageIndex from the metadata, if present (defaults to
false).

The PageIndex can be used by some query engines to push down predicates
to the Parquet scan, potentially eliminating unnecessary IO.

If this is enabled, ParquetMetaData::column_index and
ParquetMetaData::offset_index will be populated if the corresponding
information is present in the file.
pub fn with_page_index_policy(self, policy: PageIndexPolicy) -> Self
Sets the PageIndexPolicy for both the column and offset indexes.
The PageIndex consists of two structures: the ColumnIndex and OffsetIndex.
This method sets the same policy for both. For fine-grained control, use
Self::with_column_index_policy and Self::with_offset_index_policy.
See Self::with_page_index for more details on page indexes.
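A sketch of opting in to page-index reading for both structures, assuming
PageIndexPolicy has an Optional variant (read the index if present) and
noting that its exact import path may differ across parquet versions:

use parquet::arrow::arrow_reader::ArrowReaderOptions;
// The exact import path of `PageIndexPolicy` may vary by version.
use parquet::file::metadata::PageIndexPolicy;

// Read both the ColumnIndex and OffsetIndex if present in the file.
let options = ArrowReaderOptions::new()
    .with_page_index_policy(PageIndexPolicy::Optional);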
pub fn with_column_index_policy(self, policy: PageIndexPolicy) -> Self
Sets the PageIndexPolicy for the Parquet ColumnIndex structure.
The ColumnIndex contains min/max statistics for each page, which can be used
for predicate pushdown and page-level pruning.
pub fn with_offset_index_policy(self, policy: PageIndexPolicy) -> Self
Sets the PageIndexPolicy for the Parquet OffsetIndex structure.
The OffsetIndex contains the locations and sizes of each page, which enables
efficient page-level skipping and random access within column chunks.
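For example, a scan that wants page locations for IO planning but has no
use for page statistics could skip the ColumnIndex while still reading the
OffsetIndex; a sketch under the same PageIndexPolicy assumptions as above:

use parquet::arrow::arrow_reader::ArrowReaderOptions;
use parquet::file::metadata::PageIndexPolicy; // path may vary by version

// Skip min/max statistics, but load page locations if present.
let options = ArrowReaderOptions::new()
    .with_column_index_policy(PageIndexPolicy::Skip)
    .with_offset_index_policy(PageIndexPolicy::Optional);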
pub fn with_parquet_schema(self, schema: Arc<SchemaDescriptor>) -> Self
Provide a Parquet schema to use when decoding the metadata. The schema in the Parquet footer will be skipped.
This can be used to avoid reparsing the schema from the file when it is already known.
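A sketch of reusing a known schema when reopening a file, assuming
`metadata` is a ParquetMetaData loaded from an earlier read of the same
`file` buffer:

use parquet::arrow::arrow_reader::{ArrowReaderOptions, ParquetRecordBatchReaderBuilder};

// Reuse the schema descriptor from previously loaded metadata so the
// footer schema does not have to be parsed again.
let schema_descr = metadata.file_metadata().schema_descr_ptr();
let options = ArrowReaderOptions::new().with_parquet_schema(schema_descr);
let builder = ParquetRecordBatchReaderBuilder::try_new_with_options(file, options).unwrap();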
pub fn with_encoding_stats_as_mask(self, val: bool) -> Self
Set whether to convert the encoding_stats in the Parquet ColumnMetaData to a bitmask
(defaults to false).
See ColumnChunkMetaData::page_encoding_stats_mask for an explanation of why this
might be desirable.
pub fn with_encoding_stats_policy(self, policy: ParquetStatisticsPolicy) -> Self
Sets the decoding policy for encoding_stats in the Parquet ColumnMetaData.
pub fn with_column_stats_policy(self, policy: ParquetStatisticsPolicy) -> Self
Sets the decoding policy for statistics in the Parquet ColumnMetaData.
pub fn with_size_stats_policy(self, policy: ParquetStatisticsPolicy) -> Self
Sets the decoding policy for size_statistics in the Parquet ColumnMetaData.
pub fn with_file_decryption_properties(
    self,
    file_decryption_properties: Arc<FileDecryptionProperties>,
) -> Self
Provide the file decryption properties to use when reading encrypted parquet files.
If encryption is enabled and the file is encrypted, the file_decryption_properties must be provided.
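A sketch of supplying decryption properties, assuming the parquet crate's
`encryption` feature, the builder API in parquet::encryption::decrypt, a
file encrypted with a single footer key, and a `file` buffer as in the
examples above:

use std::sync::Arc;
use parquet::arrow::arrow_reader::{ArrowReaderOptions, ParquetRecordBatchReaderBuilder};
use parquet::encryption::decrypt::FileDecryptionProperties; // assumed path

// 16-byte AES key used as the footer key (illustrative only).
let footer_key = b"0123456789abcdef".to_vec();
let decryption_properties = FileDecryptionProperties::builder(footer_key)
    .build()
    .unwrap();
let options = ArrowReaderOptions::new()
    .with_file_decryption_properties(Arc::new(decryption_properties));
let builder = ParquetRecordBatchReaderBuilder::try_new_with_options(file, options).unwrap();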
pub fn with_virtual_columns(
    self,
    virtual_columns: Vec<FieldRef>,
) -> Result<Self>
Include virtual columns in the output.

Virtual columns are columns that are not part of the Parquet schema but
are added to the output by the reader, such as row numbers and row group
indices.

Example
use std::sync::Arc;
use arrow_array::{ArrayRef, Int64Array, RecordBatch};
use arrow_schema::{DataType, Field};
use bytes::Bytes;
use parquet::arrow::ArrowWriter;
use parquet::arrow::arrow_reader::{ArrowReaderOptions, ParquetRecordBatchReaderBuilder};
// `RowNumber` is the extension type marking the virtual row-number
// column; its exact import path may vary by parquet crate version.
use parquet::arrow::RowNumber;

// Create a simple record batch with some data
let values = Arc::new(Int64Array::from(vec![1, 2, 3])) as ArrayRef;
let batch = RecordBatch::try_from_iter(vec![("value", values)])?;

// Write the batch to an in-memory buffer
let mut file = Vec::new();
let mut writer = ArrowWriter::try_new(&mut file, batch.schema(), None)?;
writer.write(&batch)?;
writer.close()?;
let file = Bytes::from(file);

// Create a virtual column for row numbers
let row_number_field = Arc::new(Field::new("row_number", DataType::Int64, false)
    .with_extension_type(RowNumber));

// Configure options with virtual columns
let options = ArrowReaderOptions::new()
    .with_virtual_columns(vec![row_number_field])?;

// Create a reader with the options
let mut reader = ParquetRecordBatchReaderBuilder::try_new_with_options(
    file,
    options
)?
.build()?;

// Read the batch - it will include both the original column and the
// virtual row_number column
let result_batch = reader.next().unwrap()?;
assert_eq!(result_batch.num_columns(), 2); // "value" + "row_number"
assert_eq!(result_batch.num_rows(), 3);

pub fn page_index(&self) -> bool
Deprecated since 57.2.0: Use column_index_policy or offset_index_policy
instead.

Returns whether page index reading is enabled.
This returns true if both the column index and offset index policies are not PageIndexPolicy::Skip.
This can be set via with_page_index or
with_page_index_policy.
pub fn offset_index_policy(&self) -> PageIndexPolicy
Retrieve the currently set PageIndexPolicy for the offset index.
This can be set via with_offset_index_policy
or with_page_index_policy.
pub fn column_index_policy(&self) -> PageIndexPolicy
Retrieve the currently set PageIndexPolicy for the column index.
This can be set via with_column_index_policy
or with_page_index_policy.
pub fn metadata_options(&self) -> &ParquetMetaDataOptions
Retrieve the currently set metadata decoding options.
pub fn file_decryption_properties(
    &self,
) -> Option<&Arc<FileDecryptionProperties>>
Retrieve the currently set file decryption properties.
This can be set via with_file_decryption_properties.
Trait Implementations

impl Clone for ArrowReaderOptions

fn clone(&self) -> ArrowReaderOptions

fn clone_from(&mut self, source: &Self)

impl Debug for ArrowReaderOptions

impl Default for ArrowReaderOptions

fn default() -> ArrowReaderOptions
Auto Trait Implementations
impl Freeze for ArrowReaderOptions
impl !RefUnwindSafe for ArrowReaderOptions
impl Send for ArrowReaderOptions
impl Sync for ArrowReaderOptions
impl Unpin for ArrowReaderOptions
impl !UnwindSafe for ArrowReaderOptions