API for reading/writing Arrow RecordBatches and Arrays to/from Parquet files.
See the crate-level documentation for more details on other APIs.
§Schema Conversion
These APIs ensure that data in Arrow RecordBatches written to Parquet is read back as RecordBatches with the exact same types and values.
Parquet and Arrow have different type systems, and there is not always a one-to-one mapping between them. For example, data stored as a Parquet BYTE_ARRAY can be read as either an Arrow BinaryViewArray or a BinaryArray.
To recover the original Arrow types, the writers in this module add a “hint” to the metadata under the ARROW_SCHEMA_META_KEY key which records the original Arrow schema. The metadata hint follows the same convention as arrow-cpp based implementations such as pyarrow. The reader looks for the schema hint in the metadata to determine the Arrow types; if it is not present, it infers the Arrow schema from the Parquet schema.
In situations where the embedded Arrow schema is not compatible with the Parquet schema, the Parquet schema takes precedence and no error is raised. See #1663.
You can also control the type conversion process in more detail using:
- ArrowSchemaConverter to control the conversion of Arrow types to Parquet types.
- ArrowReaderOptions::with_schema to explicitly specify your own Arrow schema hint to use when reading Parquet, overriding any metadata that may be present (a sketch follows this list).
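As a minimal sketch of supplying an explicit schema hint when reading: the file name, field name, and type below are illustrative assumptions, and the supplied Arrow schema must be compatible with the file's Parquet schema (here the file is assumed to store "id" as a Parquet INT64).

use std::fs::File;
use std::sync::Arc;

use arrow_schema::{DataType, Field, Schema};
use parquet::arrow::arrow_reader::{ArrowReaderOptions, ParquetRecordBatchReaderBuilder};

let file = File::open("data.parquet").unwrap();
// Hypothetical schema hint: read the "id" column as Int64, ignoring any embedded hint
let schema = Arc::new(Schema::new(vec![Field::new("id", DataType::Int64, false)]));
let options = ArrowReaderOptions::new().with_schema(schema);
// try_new_with_options returns an error if the hint is not compatible with the file
let mut reader = ParquetRecordBatchReaderBuilder::try_new_with_options(file, options)
    .unwrap()
    .build()
    .unwrap();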
§Example: Writing Arrow RecordBatch to Parquet file
use std::sync::Arc;

use arrow_array::{ArrayRef, Int32Array, RecordBatch};
use parquet::arrow::ArrowWriter;
use parquet::basic::Compression;
use parquet::file::properties::WriterProperties;
use tempfile::tempfile;

let ids = Int32Array::from(vec![1, 2, 3, 4]);
let vals = Int32Array::from(vec![5, 6, 7, 8]);
let batch = RecordBatch::try_from_iter(vec![
    ("id", Arc::new(ids) as ArrayRef),
    ("val", Arc::new(vals) as ArrayRef),
]).unwrap();
let file = tempfile().unwrap();
// WriterProperties can be used to set Parquet file options
let props = WriterProperties::builder()
.set_compression(Compression::SNAPPY)
.build();
let mut writer = ArrowWriter::try_new(file, batch.schema(), Some(props)).unwrap();
writer.write(&batch).expect("Writing batch");
// writer must be closed to write footer
writer.close().unwrap();
§Example: Reading Parquet file into Arrow RecordBatch
let file = File::open("data.parquet").unwrap();
let builder = ParquetRecordBatchReaderBuilder::try_new(file).unwrap();
println!("Converted arrow schema is: {}", builder.schema());
let mut reader = builder.build().unwrap();
let record_batch = reader.next().unwrap().unwrap();
println!("Read {} records.", record_batch.num_rows());
§Example: Reading a non-uniformly encrypted Parquet file into an Arrow RecordBatch
Note: This requires the experimental encryption feature to be enabled at compile time.
use std::fs::File;

use parquet::arrow::arrow_reader::{
    ArrowReaderMetadata, ArrowReaderOptions, ParquetRecordBatchReaderBuilder,
};
use parquet::encryption::decrypt::FileDecryptionProperties;

let file = File::open(path).unwrap();
// Define the AES encryption keys required for decrypting the footer metadata
// and column-specific data. If only a footer key is used then it is assumed that the
// file uses uniform encryption and all columns are encrypted with the footer key.
// If any column keys are specified, other columns without a key provided are assumed
// to be unencrypted.
let footer_key = "0123456789012345".as_bytes(); // Keys are 128 bits (16 bytes)
let column_1_key = "1234567890123450".as_bytes();
let column_2_key = "1234567890123451".as_bytes();
let decryption_properties = FileDecryptionProperties::builder(footer_key.to_vec())
.with_column_key("double_field", column_1_key.to_vec())
.with_column_key("float_field", column_2_key.to_vec())
.build()
.unwrap();
let options = ArrowReaderOptions::default()
.with_file_decryption_properties(decryption_properties);
let reader_metadata = ArrowReaderMetadata::load(&file, options.clone()).unwrap();
let file_metadata = reader_metadata.metadata().file_metadata();
assert_eq!(50, file_metadata.num_rows());
let mut reader = ParquetRecordBatchReaderBuilder::try_new_with_options(file, options)
.unwrap()
.build()
.unwrap();
let record_batch = reader.next().unwrap().unwrap();
assert_eq!(50, record_batch.num_rows());
Re-exports§
pub use self::arrow_writer::ArrowWriter;
pub use self::async_reader::ParquetRecordBatchStreamBuilder;
pub use self::async_writer::AsyncArrowWriter;
Modules§
- arrow_reader - Contains reader which reads parquet data into arrow RecordBatch
- arrow_writer - Contains writer which writes arrow data into parquet data.
- async_reader (async) - API for reading Parquet files as RecordBatches (a sketch follows this list)
- async_writer (async) - API for writing RecordBatches to Parquet files
- buffer 🔒 - Logic for reading data into arrow buffers
- decoder 🔒 - Specialized decoders optimised for decoding to arrow format
- record_reader 🔒
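A minimal sketch of the async_reader path, assuming the async feature is enabled and using tokio and futures for the runtime and stream combinators (the file name and function name are illustrative):

use futures::TryStreamExt;
use parquet::arrow::ParquetRecordBatchStreamBuilder;

async fn read_parquet_async() -> Result<(), Box<dyn std::error::Error>> {
    let file = tokio::fs::File::open("data.parquet").await?;
    // Build a Stream of RecordBatches, reading up to 1024 rows per batch
    let mut stream = ParquetRecordBatchStreamBuilder::new(file)
        .await?
        .with_batch_size(1024)
        .build()?;
    while let Some(batch) = stream.try_next().await? {
        println!("Read {} rows", batch.num_rows());
    }
    Ok(())
}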
Structs§
- ArrowSchemaConverter - Converter for Arrow schema to Parquet schema (a sketch appears after the Functions list)
- FieldLevels - Schema information necessary to decode a parquet file as arrow Fields
- ProjectionMask - A ProjectionMask identifies a set of columns within a potentially nested schema to project (a sketch follows this list)
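A minimal sketch of projecting columns with ProjectionMask when building a reader (the file name and leaf index are illustrative assumptions):

use std::fs::File;

use parquet::arrow::ProjectionMask;
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;

let file = File::open("data.parquet").unwrap();
let builder = ParquetRecordBatchReaderBuilder::try_new(file).unwrap();
// Select only the first leaf column; indices refer to leaves of the Parquet
// schema (use ProjectionMask::roots to select top-level columns instead)
let mask = ProjectionMask::leaves(builder.parquet_schema(), [0]);
let mut reader = builder.with_projection(mask).build().unwrap();
let batch = reader.next().unwrap().unwrap();
assert_eq!(batch.num_columns(), 1);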
Constants§
- ARROW_SCHEMA_META_KEY - Schema metadata key used to store serialized Arrow schema
- PARQUET_FIELD_ID_META_KEY - The value of this metadata key, if present on Field::metadata, will be used to populate BasicTypeInfo::id (a sketch follows this list)
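A minimal sketch of attaching a field id to an Arrow field via PARQUET_FIELD_ID_META_KEY (the field name and id value are illustrative):

use std::collections::HashMap;

use arrow_schema::{DataType, Field};
use parquet::arrow::PARQUET_FIELD_ID_META_KEY;

// The writer reads this metadata entry and stores 1 as the field id in the Parquet schema
let field = Field::new("id", DataType::Int32, false).with_metadata(HashMap::from([(
    PARQUET_FIELD_ID_META_KEY.to_string(),
    "1".to_string(),
)]));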
Functions§
- add_encoded_arrow_schema_to_metadata - Mutates writer metadata by storing the encoded Arrow schema hint in ARROW_SCHEMA_META_KEY.
- arrow_to_parquet_schema (Deprecated) - Convert arrow schema to parquet schema
- encode_arrow_schema - Encodes the Arrow schema into the IPC format, and base64 encodes it
- parquet_column - Looks up the parquet column by name
- parquet_to_arrow_field_levels - Convert a parquet SchemaDescriptor to FieldLevels
- parquet_to_arrow_schema - Convert Parquet schema to Arrow schema including optional metadata (a sketch follows this list)
- parquet_to_arrow_schema_by_columns - Convert parquet schema to arrow schema including optional metadata, only preserving some leaf columns.
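A minimal sketch of a schema round trip using ArrowSchemaConverter and parquet_to_arrow_schema, assuming convert accepts an Arrow Schema and returns a Parquet SchemaDescriptor (the field names and types are illustrative):

use arrow_schema::{DataType, Field, Schema};
use parquet::arrow::{ArrowSchemaConverter, parquet_to_arrow_schema};

let arrow_schema = Schema::new(vec![
    Field::new("id", DataType::Int32, false),
    Field::new("val", DataType::Utf8, true),
]);
// Arrow -> Parquet: derive a Parquet SchemaDescriptor from the Arrow schema
let parquet_schema = ArrowSchemaConverter::new().convert(&arrow_schema).unwrap();
// Parquet -> Arrow: no ARROW_SCHEMA_META_KEY hint is passed here, so the
// Arrow types are inferred from the Parquet schema alone
let inferred = parquet_to_arrow_schema(&parquet_schema, None).unwrap();
assert_eq!(inferred.fields().len(), 2);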