Module arrow

API for reading/writing Arrow RecordBatches and Arrays to/from Parquet Files.

See the crate-level documentation for more details on other APIs.

§Schema Conversion

These APIs ensure that data in Arrow RecordBatches written to Parquet is read back as RecordBatches with exactly the same types and values.

Parquet and Arrow have different type systems, and there is not always a one-to-one mapping between the two. For example, data stored as a Parquet BYTE_ARRAY can be read as either an Arrow BinaryViewArray or BinaryArray.

To recover the original Arrow types, the writers in this module add a “hint” to the metadata in the ARROW_SCHEMA_META_KEY key which records the original Arrow schema. The metadata hint follows the same convention as arrow-cpp based implementations such as pyarrow. The reader looks for the schema hint in the metadata to determine Arrow types, and if it is not present, infers the Arrow schema from the Parquet schema.

In situations where the embedded Arrow schema is not compatible with the Parquet schema, the Parquet schema takes precedence and no error is raised. See #1663
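
The hint can be observed directly in the file footer after writing. Below is a minimal sketch, writing to an in-memory Vec<u8> instead of a file and re-opening the buffer via the bytes crate (a parquet dependency), that checks the key/value metadata for ARROW_SCHEMA_META_KEY:

use std::sync::Arc;

use arrow_array::{ArrayRef, Int32Array, RecordBatch};
use bytes::Bytes;
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
use parquet::arrow::{ArrowWriter, ARROW_SCHEMA_META_KEY};

let batch = RecordBatch::try_from_iter(vec![
    ("id", Arc::new(Int32Array::from(vec![1, 2, 3])) as ArrayRef),
]).unwrap();

// Write the batch to an in-memory buffer
let mut buffer = Vec::new();
let mut writer = ArrowWriter::try_new(&mut buffer, batch.schema(), None).unwrap();
writer.write(&batch).unwrap();
writer.close().unwrap();

// The footer key/value metadata now carries the serialized Arrow schema hint
let builder = ParquetRecordBatchReaderBuilder::try_new(Bytes::from(buffer)).unwrap();
let key_value_metadata = builder
    .metadata()
    .file_metadata()
    .key_value_metadata()
    .expect("ArrowWriter stores key/value metadata");
assert!(key_value_metadata.iter().any(|kv| kv.key == ARROW_SCHEMA_META_KEY));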

You can also control the type conversion process in more detail using:

ArrowSchemaConverter, to convert an Arrow schema to a Parquet schema
parquet_to_arrow_schema, parquet_to_arrow_schema_by_columns and parquet_to_arrow_field_levels, to convert a Parquet schema to an Arrow schema
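
For example, a minimal sketch of running the conversion manually in both directions with ArrowSchemaConverter and parquet_to_arrow_schema (the field names here are arbitrary):

use arrow_schema::{DataType, Field, Schema};
use parquet::arrow::{parquet_to_arrow_schema, ArrowSchemaConverter};

let arrow_schema = Schema::new(vec![
    Field::new("id", DataType::Int32, false),
    Field::new("name", DataType::Utf8, true),
]);

// Arrow -> Parquet: build a Parquet SchemaDescriptor from the Arrow schema
let parquet_schema = ArrowSchemaConverter::new()
    .convert(&arrow_schema)
    .unwrap();

// Parquet -> Arrow: derive an Arrow schema back from the Parquet schema
// (no metadata hint is supplied here, so the Arrow types are inferred)
let inferred = parquet_to_arrow_schema(&parquet_schema, None).unwrap();
assert_eq!(inferred.fields().len(), 2);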

§Example: Writing Arrow RecordBatch to Parquet file

use std::sync::Arc;

use arrow_array::{ArrayRef, Int32Array, RecordBatch};
use parquet::arrow::ArrowWriter;
use parquet::basic::Compression;
use parquet::file::properties::WriterProperties;
use tempfile::tempfile;

let ids = Int32Array::from(vec![1, 2, 3, 4]);
let vals = Int32Array::from(vec![5, 6, 7, 8]);
let batch = RecordBatch::try_from_iter(vec![
    ("id", Arc::new(ids) as ArrayRef),
    ("val", Arc::new(vals) as ArrayRef),
]).unwrap();

let file = tempfile().unwrap();

// WriterProperties can be used to set Parquet file options
let props = WriterProperties::builder()
    .set_compression(Compression::SNAPPY)
    .build();

let mut writer = ArrowWriter::try_new(file, batch.schema(), Some(props)).unwrap();

writer.write(&batch).expect("Writing batch");

// writer must be closed to write footer
writer.close().unwrap();

§Example: Reading Parquet file into Arrow RecordBatch

use std::fs::File;

use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;

let file = File::open("data.parquet").unwrap();

let builder = ParquetRecordBatchReaderBuilder::try_new(file).unwrap();
println!("Converted arrow schema is: {}", builder.schema());

let mut reader = builder.build().unwrap();

let record_batch = reader.next().unwrap().unwrap();

println!("Read {} records.", record_batch.num_rows());

§Example: Reading a non-uniformly encrypted Parquet file into an Arrow RecordBatch

Note: This requires the experimental encryption feature to be enabled at compile time.

use std::fs::File;

use parquet::arrow::arrow_reader::{
    ArrowReaderMetadata, ArrowReaderOptions, ParquetRecordBatchReaderBuilder,
};
use parquet::encryption::decrypt::FileDecryptionProperties;

// `path` points to an encrypted Parquet file (its definition is not shown here)
let file = File::open(path).unwrap();

// Define the AES encryption keys required for decrypting the footer metadata
// and column-specific data. If only a footer key is used then it is assumed that the
// file uses uniform encryption and all columns are encrypted with the footer key.
// If any column keys are specified, other columns without a key provided are assumed
// to be unencrypted.
let footer_key = "0123456789012345".as_bytes(); // Keys are 128 bits (16 bytes)
let column_1_key = "1234567890123450".as_bytes();
let column_2_key = "1234567890123451".as_bytes();

let decryption_properties = FileDecryptionProperties::builder(footer_key.to_vec())
    .with_column_key("double_field", column_1_key.to_vec())
    .with_column_key("float_field", column_2_key.to_vec())
    .build()
    .unwrap();

let options = ArrowReaderOptions::default()
    .with_file_decryption_properties(decryption_properties);
let reader_metadata = ArrowReaderMetadata::load(&file, options.clone()).unwrap();
let file_metadata = reader_metadata.metadata().file_metadata();
assert_eq!(50, file_metadata.num_rows());

let mut reader = ParquetRecordBatchReaderBuilder::try_new_with_options(file, options)
    .unwrap()
    .build()
    .unwrap();

let record_batch = reader.next().unwrap().unwrap();
assert_eq!(50, record_batch.num_rows());

Re-exports§

pub use self::arrow_writer::ArrowWriter;
pub use self::async_reader::ParquetRecordBatchStreamBuilder;
pub use self::async_writer::AsyncArrowWriter;

Modules§

arrow_reader
Contains reader which reads parquet data into arrow [RecordBatch]
arrow_writer
Contains writer which writes Arrow data into Parquet files.
async_reader
async API for reading Parquet files as [RecordBatch]es
async_writer
async API for writing [RecordBatch]es to Parquet files
buffer 🔒
Logic for reading data into arrow buffers
decoder 🔒
Specialized decoders optimised for decoding to arrow format
record_reader 🔒

Structs§

ArrowSchemaConverter
Converter for Arrow schema to Parquet schema
FieldLevels
Schema information necessary to decode a parquet file as arrow [Fields]
ProjectionMask
A ProjectionMask identifies a set of columns within a potentially nested schema to project

Constants§

ARROW_SCHEMA_META_KEY
Schema metadata key used to store serialized Arrow schema
PARQUET_FIELD_ID_META_KEY
The value of this metadata key, if present on Field::metadata, will be used to populate BasicTypeInfo::id
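
For instance, a minimal sketch of attaching a Parquet field id to an Arrow field through its metadata (the field name and id here are arbitrary):

use std::collections::HashMap;

use arrow_schema::{DataType, Field};
use parquet::arrow::PARQUET_FIELD_ID_META_KEY;

// During Arrow-to-Parquet schema conversion this metadata value is used to
// populate the field id (BasicTypeInfo::id) in the Parquet schema
let field = Field::new("id", DataType::Int32, false).with_metadata(HashMap::from([(
    PARQUET_FIELD_ID_META_KEY.to_string(),
    "42".to_string(),
)]));
assert_eq!(field.metadata()[PARQUET_FIELD_ID_META_KEY], "42");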

Functions§

add_encoded_arrow_schema_to_metadata
Mutates writer metadata by storing the encoded Arrow schema hint in ARROW_SCHEMA_META_KEY.
arrow_to_parquet_schema (Deprecated)
Convert arrow schema to parquet schema
encode_arrow_schema
Encodes the Arrow schema into the IPC format, and base64 encodes it
parquet_column
Looks up the parquet column by name
parquet_to_arrow_field_levels
Convert a parquet SchemaDescriptor to FieldLevels
parquet_to_arrow_schema
Convert Parquet schema to Arrow schema including optional metadata
parquet_to_arrow_schema_by_columns
Convert parquet schema to arrow schema including optional metadata, only preserving some leaf columns.