Crate parquet

This crate contains the official Native Rust implementation of Apache Parquet, part of the Apache Arrow project. The crate provides a number of APIs to read and write Parquet files, covering a range of use cases.

Please see the parquet crates.io page for feature flags and tips to improve performance.

Format Overview

Parquet is a columnar format, meaning that unlike row-oriented formats such as CSV, values are iterated along columns rather than rows. Parquet is similar in spirit to Arrow, but focuses on storage efficiency whereas Arrow prioritizes compute efficiency.

Parquet files are partitioned for scalability. Each file contains metadata, along with zero or more “row groups”, each row group containing one or more columns. The APIs in this crate reflect this structure.

Data in Parquet files is strongly typed and differentiates between logical and physical types (see schema). In addition, Parquet files may contain other metadata, such as statistics, which can be used to optimize reading (see file::metadata). For more details about the Parquet format itself, see the Parquet spec.

APIs

This crate exposes a number of APIs for different use cases.

Metadata and Schema

The schema module provides APIs to work with Parquet schemas. The file::metadata module provides APIs to work with Parquet metadata.
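
For example, a file's metadata, including row counts and row group sizes, can be inspected without decoding any data. A minimal sketch, assuming data.parquet is a placeholder path to an existing Parquet file:

```rust
use std::fs::File;

use parquet::file::reader::{FileReader, SerializedFileReader};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open("data.parquet")?; // placeholder path
    let reader = SerializedFileReader::new(file)?;

    // ParquetMetaData exposes file-level and per-row-group metadata
    let metadata = reader.metadata();
    println!("total rows: {}", metadata.file_metadata().num_rows());
    for (i, rg) in metadata.row_groups().iter().enumerate() {
        println!(
            "row group {i}: {} rows, {} bytes",
            rg.num_rows(),
            rg.total_byte_size()
        );
    }
    Ok(())
}
```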

Reading and Writing Arrow (arrow feature)

The arrow module supports reading and writing Parquet data to and from Arrow RecordBatches. Using Arrow is simple and performant, and lets workloads leverage the wide range of data transforms provided by the arrow crate and by the ecosystem of Arrow-compatible systems.

Most users will use ArrowWriter for writing and ParquetRecordBatchReaderBuilder for reading.
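
As an illustrative round trip, the following sketch writes a RecordBatch to a file and reads it back (the path, column name, and values are arbitrary; the arrow_array crate is assumed as a dependency):

```rust
use std::fs::File;
use std::sync::Arc;

use arrow_array::{ArrayRef, Int32Array, RecordBatch};
use parquet::arrow::ArrowWriter;
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Build a single-column batch ("id" is an arbitrary column name)
    let ids = Arc::new(Int32Array::from(vec![1, 2, 3])) as ArrayRef;
    let batch = RecordBatch::try_from_iter([("id", ids)])?;

    // Write the batch with default writer properties
    let file = File::create("example.parquet")?; // placeholder path
    let mut writer = ArrowWriter::try_new(file, batch.schema(), None)?;
    writer.write(&batch)?;
    writer.close()?;

    // Read it back as RecordBatches
    let file = File::open("example.parquet")?;
    let reader = ParquetRecordBatchReaderBuilder::try_new(file)?.build()?;
    for batch in reader {
        println!("read {} rows", batch?.num_rows());
    }
    Ok(())
}
```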

Lower-level APIs include ArrowColumnWriter for writing from multiple threads, and RowFilter for applying filters during decode, as sketched below.
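
A sketch under stated assumptions: data.parquet is a placeholder path, its first leaf column is an Int32 column, and the threshold of 10 is arbitrary. The filter is evaluated while decoding, so rows that fail the predicate are skipped:

```rust
use std::fs::File;

use arrow_array::{BooleanArray, cast::AsArray, types::Int32Type};
use parquet::arrow::ProjectionMask;
use parquet::arrow::arrow_reader::{
    ArrowPredicateFn, ParquetRecordBatchReaderBuilder, RowFilter,
};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open("data.parquet")?; // placeholder path
    let builder = ParquetRecordBatchReaderBuilder::try_new(file)?;

    // Decode only the first leaf column while evaluating the predicate
    let mask = ProjectionMask::leaves(builder.parquet_schema(), [0]);
    let predicate = ArrowPredicateFn::new(mask, |batch| {
        // Keep rows whose value is greater than 10 (arbitrary threshold)
        let col = batch.column(0).as_primitive::<Int32Type>();
        Ok(BooleanArray::from_iter(
            col.iter().map(|v| Some(v.is_some_and(|v| v > 10))),
        ))
    });

    let reader = builder
        .with_row_filter(RowFilter::new(vec![Box::new(predicate)]))
        .build()?;
    for batch in reader {
        println!("{} rows passed the filter", batch?.num_rows());
    }
    Ok(())
}
```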

async Reading and Writing Arrow (async feature)

The async_reader and async_writer modules provide async APIs to read and write RecordBatches asynchronously.

Most users will use AsyncArrowWriter for writing and ParquetRecordBatchStreamBuilder for reading. When the object_store feature is enabled, ParquetObjectReader provides efficient integration with object storage services such as S3 via the object_store crate, automatically optimizing IO based on any predicates or projections provided.
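
A minimal async read sketch, assuming the async feature is enabled, the tokio and futures crates are available, and data.parquet is a placeholder path:

```rust
use futures::TryStreamExt;
use parquet::arrow::ParquetRecordBatchStreamBuilder;
use tokio::fs::File;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open("data.parquet").await?; // placeholder path
    let stream = ParquetRecordBatchStreamBuilder::new(file)
        .await?
        .build()?;

    // Collect all batches; real workloads would process the stream incrementally
    let batches: Vec<_> = stream.try_collect().await?;
    println!("read {} batches", batches.len());
    Ok(())
}
```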

Variant Logical Type (variant_experimental feature)

The variant module supports reading and writing Parquet files with the Variant Binary Encoding logical type, which can represent semi-structured data such as JSON efficiently.

Read/Write Parquet Directly

Workloads needing finer-grained control, or wanting to avoid a dependency on arrow, can use the APIs in file directly. These APIs are lower level and harder to use: they expose the underlying Parquet data model directly, and require knowledge of the Parquet format, including the details of Dremel record shredding and Logical Types.
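
A minimal sketch of the low-level write path (the file name, schema, and values are arbitrary):

```rust
use std::fs::File;
use std::sync::Arc;

use parquet::data_type::Int32Type;
use parquet::file::properties::WriterProperties;
use parquet::file::writer::SerializedFileWriter;
use parquet::schema::parser::parse_message_type;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // A one-column schema in Parquet message type syntax (arbitrary example)
    let schema = Arc::new(parse_message_type(
        "message example { required int32 id; }",
    )?);
    let props = Arc::new(WriterProperties::builder().build());
    let file = File::create("low_level.parquet")?; // placeholder path
    let mut writer = SerializedFileWriter::new(file, schema, props)?;

    // One row group with one column chunk; values are written in batches
    let mut row_group = writer.next_row_group()?;
    while let Some(mut col_writer) = row_group.next_column()? {
        col_writer
            .typed::<Int32Type>()
            .write_batch(&[1, 2, 3], None, None)?;
        col_writer.close()?;
    }
    row_group.close()?;
    writer.close()?;
    Ok(())
}
```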

Modules

arrow
API for reading/writing Arrow RecordBatches and Arrays to/from Parquet files.
basic
Contains Rust mappings for the Thrift definition. This module contains only mappings for Thrift enums and unions; Thrift structs are handled elsewhere. Refer to the parquet.thrift file to see the raw definitions.
bloom_filter
Bloom filter implementation specific to Parquet, as described in the spec.
column
Low-level column reader and writer APIs.
data_type
Data types that connect Parquet physical types with their Rust-specific representations.
errors
Common Parquet errors and macros.
file
APIs for reading Parquet data.
format
Automatically generated code from the Parquet Thrift definition.
parquet_macros 🔒
This is a collection of macros used to parse Thrift IDL descriptions of structs, unions, and enums into their corresponding Rust types. These macros will also generate the code necessary to serialize and deserialize to/from the Thrift compact protocol.
parquet_thrift 🔒
Structs used for encoding and decoding Parquet Thrift objects.
record
Contains a record-based API for reading Parquet files.
schema
Parquet schema definitions and methods to print and parse schema.
thrift
Custom Thrift definitions.
utf8
The check_valid_utf8 validation function.
variant
⚠️ Experimental support for reading and writing Variants to/from Parquet files ⚠️

Macros

experimental 🔒
Defines an item with an experimental public API.
thrift_enum
Macro used to generate Rust enums from a Thrift enum definition.
thrift_struct
Macro used to generate Rust structs from a Thrift struct definition.
thrift_union
Macro used to generate Rust enums for Thrift unions where variants are a mix of unit and tuple types.
thrift_union_all_empty
Macro used to generate Rust enums for Thrift unions in which all variants are typed with empty structs.

Enums

DecodeResult
What data is needed to read the next item from a decoder.