Module arrow

API for reading/writing Arrow RecordBatches and Arrays to/from Parquet Files.

See the crate-level documentation for more details on other APIs.

§Schema Conversion

These APIs ensure that data in Arrow RecordBatches written to Parquet is read back as RecordBatches with exactly the same types and values.

Parquet and Arrow have different type systems, and there is not always a one-to-one mapping between the two. For example, data stored as a Parquet BYTE_ARRAY can be read as either an Arrow BinaryViewArray or BinaryArray.

To recover the original Arrow types, the writers in this module add a “hint” to the metadata in the ARROW_SCHEMA_META_KEY key which records the original Arrow schema. The metadata hint follows the same convention as arrow-cpp based implementations such as pyarrow. The reader looks for the schema hint in the metadata to determine Arrow types, and if it is not present, infers the Arrow schema from the Parquet schema.

In situations where the embedded Arrow schema is not compatible with the Parquet schema, the Parquet schema takes precedence and no error is raised. See #1663
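
The hint can be observed directly in the file footer after writing. Below is a minimal sketch, writing to an in-memory Vec<u8> instead of a file and re-opening the buffer via the bytes crate (a parquet dependency), that checks the key/value metadata for ARROW_SCHEMA_META_KEY:

use std::sync::Arc;

use arrow_array::{ArrayRef, Int32Array, RecordBatch};
use bytes::Bytes;
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
use parquet::arrow::{ArrowWriter, ARROW_SCHEMA_META_KEY};

let batch = RecordBatch::try_from_iter(vec![
    ("id", Arc::new(Int32Array::from(vec![1, 2, 3])) as ArrayRef),
]).unwrap();

// Write the batch to an in-memory buffer
let mut buffer = Vec::new();
let mut writer = ArrowWriter::try_new(&mut buffer, batch.schema(), None).unwrap();
writer.write(&batch).unwrap();
writer.close().unwrap();

// The footer key/value metadata now carries the serialized Arrow schema hint
let builder = ParquetRecordBatchReaderBuilder::try_new(Bytes::from(buffer)).unwrap();
let key_value_metadata = builder
    .metadata()
    .file_metadata()
    .key_value_metadata()
    .expect("ArrowWriter stores key/value metadata");
assert!(key_value_metadata.iter().any(|kv| kv.key == ARROW_SCHEMA_META_KEY));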

You can also control the type conversion process in more detail using:

ArrowSchemaConverter, to convert an Arrow schema to a Parquet schema
parquet_to_arrow_schema, parquet_to_arrow_schema_by_columns and parquet_to_arrow_field_levels, to convert a Parquet schema to an Arrow schema
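
For example, a minimal sketch of running the conversion manually in both directions with ArrowSchemaConverter and parquet_to_arrow_schema (the field names here are arbitrary):

use arrow_schema::{DataType, Field, Schema};
use parquet::arrow::{parquet_to_arrow_schema, ArrowSchemaConverter};

let arrow_schema = Schema::new(vec![
    Field::new("id", DataType::Int32, false),
    Field::new("name", DataType::Utf8, true),
]);

// Arrow -> Parquet: build a Parquet SchemaDescriptor from the Arrow schema
let parquet_schema = ArrowSchemaConverter::new()
    .convert(&arrow_schema)
    .unwrap();

// Parquet -> Arrow: derive an Arrow schema back from the Parquet schema
// (no metadata hint is supplied here, so the Arrow types are inferred)
let inferred = parquet_to_arrow_schema(&parquet_schema, None).unwrap();
assert_eq!(inferred.fields().len(), 2);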

§Example: Writing Arrow RecordBatch to Parquet file

use std::sync::Arc;

use arrow_array::{ArrayRef, Int32Array, RecordBatch};
use parquet::arrow::ArrowWriter;
use parquet::basic::Compression;
use parquet::file::properties::WriterProperties;
use tempfile::tempfile;

let ids = Int32Array::from(vec![1, 2, 3, 4]);
let vals = Int32Array::from(vec![5, 6, 7, 8]);
let batch = RecordBatch::try_from_iter(vec![
    ("id", Arc::new(ids) as ArrayRef),
    ("val", Arc::new(vals) as ArrayRef),
]).unwrap();

let file = tempfile().unwrap();

// WriterProperties can be used to set Parquet file options
let props = WriterProperties::builder()
    .set_compression(Compression::SNAPPY)
    .build();

let mut writer = ArrowWriter::try_new(file, batch.schema(), Some(props)).unwrap();

writer.write(&batch).expect("Writing batch");

// writer must be closed to write footer
writer.close().unwrap();

§Example: Reading Parquet file into Arrow RecordBatch

use std::fs::File;

use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;

let file = File::open("data.parquet").unwrap();

let builder = ParquetRecordBatchReaderBuilder::try_new(file).unwrap();
println!("Converted arrow schema is: {}", builder.schema());

let mut reader = builder.build().unwrap();

let record_batch = reader.next().unwrap().unwrap();

println!("Read {} records.", record_batch.num_rows());

§Example: Reading a non-uniformly encrypted Parquet file into an Arrow RecordBatch

Note: This requires the experimental encryption feature to be enabled at compile time.

use std::fs::File;

use parquet::arrow::arrow_reader::{
    ArrowReaderMetadata, ArrowReaderOptions, ParquetRecordBatchReaderBuilder,
};
use parquet::encryption::decrypt::FileDecryptionProperties;

// `path` points to an encrypted Parquet file (its definition is not shown here)
let file = File::open(path).unwrap();

// Define the AES encryption keys required for decrypting the footer metadata
// and column-specific data. If only a footer key is used then it is assumed that the
// file uses uniform encryption and all columns are encrypted with the footer key.
// If any column keys are specified, other columns without a key provided are assumed
// to be unencrypted.
let footer_key = "0123456789012345".as_bytes(); // Keys are 128 bits (16 bytes)
let column_1_key = "1234567890123450".as_bytes();
let column_2_key = "1234567890123451".as_bytes();

let decryption_properties = FileDecryptionProperties::builder(footer_key.to_vec())
    .with_column_key("double_field", column_1_key.to_vec())
    .with_column_key("float_field", column_2_key.to_vec())
    .build()
    .unwrap();

let options = ArrowReaderOptions::default()
    .with_file_decryption_properties(decryption_properties);
let reader_metadata = ArrowReaderMetadata::load(&file, options.clone()).unwrap();
let file_metadata = reader_metadata.metadata().file_metadata();
assert_eq!(50, file_metadata.num_rows());

let mut reader = ParquetRecordBatchReaderBuilder::try_new_with_options(file, options)
    .unwrap()
    .build()
    .unwrap();

let record_batch = reader.next().unwrap().unwrap();
assert_eq!(50, record_batch.num_rows());

Re-exports§

pub use self::arrow_writer::ArrowWriter;
pub use self::async_reader::ParquetRecordBatchStreamBuilder;
pub use self::async_writer::AsyncArrowWriter;

Modules§

arrow_reader
Contains reader which reads parquet data into arrow [RecordBatch]
arrow_writer
Contains writer which writes Arrow data into Parquet files.
async_reader
async API for reading Parquet files as [RecordBatch]es
async_writer
async API for writing [RecordBatch]es to Parquet files
buffer 🔒
Logic for reading data into arrow buffers
decoder 🔒
Specialized decoders optimised for decoding to arrow format
record_reader 🔒

Structs§

ArrowSchemaConverter
Converter for Arrow schema to Parquet schema
FieldLevels
Schema information necessary to decode a parquet file as arrow [Fields]
ProjectionMask
A ProjectionMask identifies a set of columns within a potentially nested schema to project

Constants§

ARROW_SCHEMA_META_KEY
Schema metadata key used to store serialized Arrow schema
PARQUET_FIELD_ID_META_KEY
The value of this metadata key, if present on Field::metadata, will be used to populate BasicTypeInfo::id
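
For instance, a minimal sketch of attaching a Parquet field id to an Arrow field through its metadata (the field name and id here are arbitrary):

use std::collections::HashMap;

use arrow_schema::{DataType, Field};
use parquet::arrow::PARQUET_FIELD_ID_META_KEY;

// During Arrow-to-Parquet schema conversion this metadata value is used to
// populate the field id (BasicTypeInfo::id) in the Parquet schema
let field = Field::new("id", DataType::Int32, false).with_metadata(HashMap::from([(
    PARQUET_FIELD_ID_META_KEY.to_string(),
    "42".to_string(),
)]));
assert_eq!(field.metadata()[PARQUET_FIELD_ID_META_KEY], "42");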

Functions§

add_encoded_arrow_schema_to_metadata
Mutates writer metadata by storing the encoded Arrow schema hint in ARROW_SCHEMA_META_KEY.
arrow_to_parquet_schema (Deprecated)
Convert arrow schema to parquet schema
encode_arrow_schema
Encodes the Arrow schema into the IPC format, and base64 encodes it
parquet_column
Looks up the parquet column by name
parquet_to_arrow_field_levels
Convert a parquet SchemaDescriptor to FieldLevels
parquet_to_arrow_schema
Convert Parquet schema to Arrow schema including optional metadata
parquet_to_arrow_schema_by_columns
Convert parquet schema to arrow schema including optional metadata, only preserving some leaf columns.