// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.

//!
//! This crate contains the official Native Rust implementation of
//! [Apache Parquet](https://parquet.apache.org/), part of
//! the [Apache Arrow](https://arrow.apache.org/) project.
//! The crate provides a number of APIs to read and write Parquet files,
//! covering a range of use cases.
//!
//! Please see the [parquet crates.io](https://crates.io/crates/parquet)
//! page for feature flags and tips to improve performance.
//!
//! # Format Overview
//!
//! Parquet is a columnar format, which means that unlike row formats like [CSV], values are
//! iterated along columns instead of rows. Parquet is similar in spirit to [Arrow], but
//! focuses on storage efficiency whereas Arrow prioritizes compute efficiency.
//!
//! Parquet files are partitioned for scalability. Each file contains metadata,
//! along with zero or more "row groups", each row group containing one or
//! more columns. The APIs in this crate reflect this structure.
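//!
//! A simplified sketch of this layout (see the [Parquet spec] for the full
//! details, including pages, encodings and the footer format):
//!
//! ```text
//! ┌──────────────────────────────┐
//! │ Row Group 0                  │
//! │   Column chunk "a"           │
//! │   Column chunk "b"           │
//! ├──────────────────────────────┤
//! │ Row Group 1                  │
//! │   Column chunk "a"           │
//! │   Column chunk "b"           │
//! ├──────────────────────────────┤
//! │ Footer (schema, metadata)    │
//! └──────────────────────────────┘
//! ```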
//!
//! Data in Parquet files is strongly typed and differentiates between logical
//! and physical types (see [`schema`]). In addition, Parquet files may contain
//! other metadata, such as statistics, which can be used to optimize reading
//! (see [`file::metadata`]).
//! For more details about the Parquet format itself, see the [Parquet spec].
//!
//! [Parquet spec]: https://github.com/apache/parquet-format/blob/master/README.md#file-format
//!
//! # APIs
//!
//! This crate exposes a number of APIs for different use cases.
//!
//! ## Metadata and Schema
//!
//! The [`schema`] module provides APIs to work with Parquet schemas. The
//! [`file::metadata`] module provides APIs to work with Parquet metadata.
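//!
//! For example, the footer metadata and schema can be inspected without
//! decoding any data. A minimal sketch, assuming a hypothetical file named
//! `data.parquet` in the current directory:
//!
//! ```no_run
//! use std::fs::File;
//! use parquet::file::reader::{FileReader, SerializedFileReader};
//!
//! let file = File::open("data.parquet").unwrap();
//! let reader = SerializedFileReader::new(file).unwrap();
//!
//! // the footer metadata describes the row groups in the file
//! let metadata = reader.metadata();
//! println!("row groups: {}", metadata.num_row_groups());
//!
//! // the schema descriptor lists the leaf columns and their physical types
//! for column in metadata.file_metadata().schema_descr().columns() {
//!     println!("{}: {:?}", column.path(), column.physical_type());
//! }
//! ```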
//!
//! ## Reading and Writing Arrow (`arrow` feature)
//!
//! The [`arrow`] module supports reading and writing Parquet data to/from
//! Arrow `RecordBatch`es. Using Arrow is simple and performant, and allows workloads
//! to leverage the wide range of data transforms provided by the [arrow] crate, and by the
//! ecosystem of [Arrow] compatible systems.
//!
//! Most users will use [`ArrowWriter`] for writing and [`ParquetRecordBatchReaderBuilder`] for
//! reading.
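//!
//! For example, a `RecordBatch` can be written to a file and read back as a
//! sequence of `RecordBatch`es. A minimal sketch, assuming the (default)
//! `arrow` feature and a hypothetical `data.parquet` path:
//!
//! ```no_run
//! use std::fs::File;
//! use std::sync::Arc;
//! use arrow_array::{ArrayRef, Int32Array, RecordBatch};
//! use parquet::arrow::ArrowWriter;
//! use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
//!
//! let batch = RecordBatch::try_from_iter([(
//!     "id",
//!     Arc::new(Int32Array::from(vec![1, 2, 3])) as ArrayRef,
//! )])
//! .unwrap();
//!
//! // write the batch to a new Parquet file (None = default writer properties)
//! let file = File::create("data.parquet").unwrap();
//! let mut writer = ArrowWriter::try_new(file, batch.schema(), None).unwrap();
//! writer.write(&batch).unwrap();
//! writer.close().unwrap();
//!
//! // read it back as an iterator of Result<RecordBatch, _>
//! let file = File::open("data.parquet").unwrap();
//! let reader = ParquetRecordBatchReaderBuilder::try_new(file)
//!     .unwrap()
//!     .build()
//!     .unwrap();
//! for batch in reader {
//!     println!("read {} rows", batch.unwrap().num_rows());
//! }
//! ```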
//!
//! Lower level APIs include [`ArrowColumnWriter`] for writing using multiple
//! threads, and [`RowFilter`] to apply filters during decode.
//!
//! [`ArrowWriter`]: arrow::arrow_writer::ArrowWriter
//! [`ParquetRecordBatchReaderBuilder`]: arrow::arrow_reader::ParquetRecordBatchReaderBuilder
//! [`ArrowColumnWriter`]: arrow::arrow_writer::ArrowColumnWriter
//! [`RowFilter`]: arrow::arrow_reader::RowFilter
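//!
//! For example, a [`RowFilter`] can skip decoding rows that fail a predicate.
//! A minimal sketch, assuming a hypothetical `data.parquet` whose first leaf
//! column is an `Int32` column:
//!
//! ```no_run
//! use std::fs::File;
//! use arrow_array::BooleanArray;
//! use arrow_array::cast::AsArray;
//! use arrow_array::types::Int32Type;
//! use parquet::arrow::ProjectionMask;
//! use parquet::arrow::arrow_reader::{
//!     ArrowPredicateFn, ParquetRecordBatchReaderBuilder, RowFilter,
//! };
//!
//! let file = File::open("data.parquet").unwrap();
//! let builder = ParquetRecordBatchReaderBuilder::try_new(file).unwrap();
//!
//! // the predicate only needs to decode the first leaf column
//! let mask = ProjectionMask::leaves(builder.parquet_schema(), [0]);
//! let predicate = ArrowPredicateFn::new(mask, |batch| {
//!     // keep rows where the value is greater than 1 (nulls are dropped)
//!     let values = batch.column(0).as_primitive::<Int32Type>();
//!     Ok(BooleanArray::from_iter(
//!         values.iter().map(|v| Some(v.unwrap_or(0) > 1)),
//!     ))
//! });
//!
//! let reader = builder
//!     .with_row_filter(RowFilter::new(vec![Box::new(predicate)]))
//!     .build()
//!     .unwrap();
//! ```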
//!
//! ## `async` Reading and Writing Arrow (`async` feature)
//!
//! The [`async_reader`] and [`async_writer`] modules provide async APIs to
//! read and write `RecordBatch`es asynchronously.
//!
//! Most users will use [`AsyncArrowWriter`] for writing and [`ParquetRecordBatchStreamBuilder`]
//! for reading. When the `object_store` feature is enabled, [`ParquetObjectReader`]
//! provides efficient integration with object storage services such as S3 via the [object_store]
//! crate, automatically optimizing IO based on any predicates or projections provided.
//!
//! [`async_reader`]: arrow::async_reader
//! [`async_writer`]: arrow::async_writer
//! [`AsyncArrowWriter`]: arrow::async_writer::AsyncArrowWriter
//! [`ParquetRecordBatchStreamBuilder`]: arrow::async_reader::ParquetRecordBatchStreamBuilder
//! [`ParquetObjectReader`]: arrow::async_reader::ParquetObjectReader
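//!
//! For example, `RecordBatch`es can be streamed from a file asynchronously. A
//! minimal sketch, assuming the `async` feature, a `tokio` runtime, and a
//! hypothetical `data.parquet` path:
//!
//! ```no_run
//! use futures::StreamExt;
//! use parquet::arrow::async_reader::ParquetRecordBatchStreamBuilder;
//!
//! async fn read_batches() {
//!     let file = tokio::fs::File::open("data.parquet").await.unwrap();
//!     let mut stream = ParquetRecordBatchStreamBuilder::new(file)
//!         .await
//!         .unwrap()
//!         .build()
//!         .unwrap();
//!     // each await point may fetch more data without blocking a thread
//!     while let Some(batch) = stream.next().await {
//!         println!("read {} rows", batch.unwrap().num_rows());
//!     }
//! }
//! ```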
//!
//! ## Read/Write Parquet Directly
//!
//! Workloads needing finer-grained control, or wanting to avoid a dependency on
//! arrow, can use the APIs in [`mod@file`] directly. These APIs are harder to use
//! as they directly use the underlying Parquet data model, and require knowledge
//! of the Parquet format, including the details of [Dremel] record shredding
//! and [Logical Types].
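//!
//! For example, rows can be read and materialized one at a time with the
//! [`record`] API. A minimal sketch, assuming a hypothetical `data.parquet`
//! path:
//!
//! ```no_run
//! use std::fs::File;
//! use parquet::file::reader::{FileReader, SerializedFileReader};
//!
//! let file = File::open("data.parquet").unwrap();
//! let reader = SerializedFileReader::new(file).unwrap();
//!
//! // iterate over rows, materializing every value (None = no projection)
//! for row in reader.get_row_iter(None).unwrap() {
//!     println!("{}", row.unwrap());
//! }
//! ```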
//!
//! [arrow]: https://docs.rs/arrow/latest/arrow/index.html
//! [Arrow]: https://arrow.apache.org/
//! [CSV]: https://en.wikipedia.org/wiki/Comma-separated_values
//! [Dremel]: https://research.google/pubs/pub36632/
//! [Logical Types]: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md
//! [object_store]: https://docs.rs/object_store/latest/object_store/

#![doc(
    html_logo_url = "https://raw.githubusercontent.com/apache/parquet-format/25f05e73d8cd7f5c83532ce51cb4f4de8ba5f2a2/logo/parquet-logos_1.svg",
    html_favicon_url = "https://raw.githubusercontent.com/apache/parquet-format/25f05e73d8cd7f5c83532ce51cb4f4de8ba5f2a2/logo/parquet-logos_1.svg"
)]
#![cfg_attr(docsrs, feature(doc_auto_cfg))]
#![warn(missing_docs)]
/// Defines an item with an experimental public API
///
/// The module will not be documented, and will only be public if the
/// experimental feature flag is enabled
///
/// Experimental components have no stability guarantees
#[cfg(feature = "experimental")]
macro_rules! experimental {
    ($(#[$meta:meta])* $vis:vis mod $module:ident) => {
        #[doc(hidden)]
        $(#[$meta])*
        pub mod $module;
    }
}

#[cfg(not(feature = "experimental"))]
macro_rules! experimental {
    ($(#[$meta:meta])* $vis:vis mod $module:ident) => {
        $(#[$meta])*
        $vis mod $module;
    }
}

#[macro_use]
pub mod errors;
pub mod basic;

/// Automatically generated code from the Parquet thrift definition.
///
/// This module contains code generated from [parquet.thrift]. See [crate::file] for
/// more information on reading Parquet encoded data.
///
/// [parquet.thrift]: https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift
// see parquet/CONTRIBUTING.md for instructions on regenerating
// Don't run clippy or rustfmt on auto-generated code
#[allow(clippy::all, missing_docs)]
#[rustfmt::skip]
pub mod format;

#[macro_use]
pub mod data_type;

// Exported for external use, such as benchmarks
#[cfg(feature = "experimental")]
#[doc(hidden)]
pub use self::encodings::{decoding, encoding};

experimental!(#[macro_use] mod util);

pub use util::utf8;

#[cfg(feature = "arrow")]
pub mod arrow;
pub mod column;
experimental!(mod compression);
experimental!(mod encodings);
pub mod bloom_filter;

#[cfg(feature = "encryption")]
experimental!(pub mod encryption);

pub mod file;
pub mod record;
pub mod schema;

pub mod thrift;