// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements.  See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership.  The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License.  You may obtain a copy of the License at
//
//   http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied.  See the License for the
// specific language governing permissions and limitations
// under the License.

//!
//! This crate contains the official native Rust implementation of
//! [Apache Parquet](https://parquet.apache.org/), part of
//! the [Apache Arrow](https://arrow.apache.org/) project.
//! The crate provides a number of APIs to read and write Parquet files,
//! covering a range of use cases.
//!
//! Please see the [parquet crates.io](https://crates.io/crates/parquet)
//! page for feature flags and tips to improve performance.
//!
//! # Format Overview
//!
//! Parquet is a columnar format, which means that unlike row formats like [CSV], values are
//! iterated along columns instead of rows. Parquet is similar in spirit to [Arrow], but
//! focuses on storage efficiency whereas Arrow prioritizes compute efficiency.
//!
//! Parquet files are partitioned for scalability. Each file contains metadata,
//! along with zero or more "row groups", each row group containing one or
//! more columns. The APIs in this crate reflect this structure.
//!
//! Data in Parquet files is strongly typed and differentiates between logical
//! and physical types (see [`schema`]). In addition, Parquet files may contain
//! other metadata, such as statistics, which can be used to optimize reading
//! (see [`file::metadata`]).
//! For more details about the Parquet format itself, see the [Parquet spec].
//!
//! [Parquet spec]: https://github.com/apache/parquet-format/blob/master/README.md#file-format
//!
//! # APIs
//!
//! This crate exposes a number of APIs for different use cases.
//!
//! ## Metadata and Schema
//!
//! The [`schema`] module provides APIs to work with Parquet schemas. The
//! [`file::metadata`] module provides APIs to work with Parquet metadata.
//!
//! ## Reading and Writing Arrow (`arrow` feature)
//!
//! The [`arrow`] module supports reading and writing Parquet data to/from
//! Arrow [`RecordBatch`]es. Using Arrow is simple and performant, and allows workloads
//! to leverage the wide range of data transforms provided by the [arrow] crate, and by the
//! ecosystem of [Arrow] compatible systems.
//!
//! Most users will use [`ArrowWriter`] for writing and [`ParquetRecordBatchReaderBuilder`] for
//! reading from synchronous IO sources such as files or in-memory buffers.
//!
//! Lower level APIs include:
//! * [`ParquetPushDecoder`] for fine-grained control over interleaving of IO and CPU
//! * [`ArrowColumnWriter`] for writing using multiple threads
//! * [`RowFilter`] to apply filters during decode
//!
//! ### EXPERIMENTAL: Content-Defined Chunking
//!
//! [`ArrowWriter`] supports content-defined chunking (CDC), which creates data page
//! boundaries based on content rather than fixed sizes. CDC enables efficient
//! deduplication in content-addressable storage (CAS) systems: when the same data
//! appears in successive file versions, it will produce identical byte sequences that
//! CAS backends can deduplicate.
//!
//! Enable CDC via [`WriterProperties`]:
//!
//! ```rust
//! # use parquet::file::properties::{WriterProperties, CdcOptions};
//! let props = WriterProperties::builder()
//!     .set_content_defined_chunking(Some(CdcOptions::default()))
//!     .build();
//! ```
//!
//! See [`CdcOptions`] for chunk size and normalization parameters.
//!
//! [`WriterProperties`]: file::properties::WriterProperties
//! [`CdcOptions`]: file::properties::CdcOptions
//!
//! [`ArrowWriter`]: arrow::arrow_writer::ArrowWriter
//! [`ParquetRecordBatchReaderBuilder`]: arrow::arrow_reader::ParquetRecordBatchReaderBuilder
//! [`ParquetPushDecoder`]: arrow::push_decoder::ParquetPushDecoder
//! [`ArrowColumnWriter`]: arrow::arrow_writer::ArrowColumnWriter
//! [`RowFilter`]: arrow::arrow_reader::RowFilter
//!
//! ## `async` Reading and Writing Arrow (`arrow` feature + `async` feature)
//!
//! The [`async_reader`] and [`async_writer`] modules provide APIs to
//! read and write [`RecordBatch`]es asynchronously.
//!
//! Most users will use [`AsyncArrowWriter`] for writing and [`ParquetRecordBatchStreamBuilder`]
//! for reading. When the `object_store` feature is enabled, [`ParquetObjectReader`]
//! provides efficient integration with object storage services such as S3 via the [object_store]
//! crate, automatically optimizing IO based on any predicates or projections provided.
//!
//! [`async_reader`]: arrow::async_reader
//! [`async_writer`]: arrow::async_writer
//! [`AsyncArrowWriter`]: arrow::async_writer::AsyncArrowWriter
//! [`ParquetRecordBatchStreamBuilder`]: arrow::async_reader::ParquetRecordBatchStreamBuilder
//! [`ParquetObjectReader`]: arrow::async_reader::ParquetObjectReader
//!
//! ## Variant Logical Type (`variant_experimental` feature)
//!
//! The [`variant`] module supports reading and writing Parquet files
//! with the [Variant Binary Encoding] logical type, which can represent
//! semi-structured data such as JSON efficiently.
//!
//! [Variant Binary Encoding]: https://github.com/apache/parquet-format/blob/master/VariantEncoding.md
//!
//! ## Read/Write Parquet Directly
//!
//! Workloads that need finer-grained control, or that must avoid a dependency on arrow,
//! can use the APIs in [`mod@file`] directly. These APIs are harder to use
//! as they directly use the underlying Parquet data model, and require knowledge
//! of the Parquet format, including the details of [Dremel] record shredding
//! and [Logical Types].
//!
//! [arrow]: https://docs.rs/arrow/latest/arrow/index.html
//! [Arrow]: https://arrow.apache.org/
//! [`RecordBatch`]: https://docs.rs/arrow/latest/arrow/array/struct.RecordBatch.html
//! [CSV]: https://en.wikipedia.org/wiki/Comma-separated_values
//! [Dremel]: https://research.google/pubs/pub36632/
//! [Logical Types]: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md
//! [object_store]: https://docs.rs/object_store/latest/object_store/

#![doc(
    html_logo_url = "https://raw.githubusercontent.com/apache/parquet-format/25f05e73d8cd7f5c83532ce51cb4f4de8ba5f2a2/logo/parquet-logos_1.svg",
    html_favicon_url = "https://raw.githubusercontent.com/apache/parquet-format/25f05e73d8cd7f5c83532ce51cb4f4de8ba5f2a2/logo/parquet-logos_1.svg"
)]
#![cfg_attr(docsrs, feature(doc_cfg))]
#![warn(missing_docs)]
/// Defines an item with an experimental public API
///
/// The module will not be documented, and will only be public if the
/// experimental feature flag is enabled
///
/// Experimental components have no stability guarantees
#[cfg(feature = "experimental")]
macro_rules! experimental {
    ($(#[$meta:meta])* $vis:vis mod $module:ident) => {
        #[doc(hidden)]
        $(#[$meta])*
        pub mod $module;
    }
}

#[cfg(not(feature = "experimental"))]
macro_rules! experimental {
    ($(#[$meta:meta])* $vis:vis mod $module:ident) => {
        $(#[$meta])*
        $vis mod $module;
    }
}

#[cfg(all(
    feature = "flate2",
    not(any(feature = "flate2-zlib-rs", feature = "flate2-rust_backened"))
))]
compile_error!(
    "When enabling `flate2` you must enable one of the features: `flate2-zlib-rs` or `flate2-rust_backened`."
);

#[macro_use]
pub mod errors;
pub mod basic;

#[macro_use]
pub mod data_type;

use std::fmt::Debug;
use std::ops::Range;
// Exported for external use, such as benchmarks
#[cfg(feature = "experimental")]
#[doc(hidden)]
pub use self::encodings::{decoding, encoding};

experimental!(#[macro_use] mod util);

pub use util::utf8;

#[cfg(feature = "arrow")]
pub mod arrow;
pub mod column;
experimental!(mod compression);
experimental!(mod encodings);
pub mod bloom_filter;

#[cfg(feature = "encryption")]
experimental!(pub mod encryption);

pub mod file;
pub mod record;
pub mod schema;

mod parquet_macros;
mod parquet_thrift;

/// What data is needed to read the next item from a decoder.
///
/// This is used to communicate between the decoder and the caller
/// to indicate what data is needed next, or what the result of decoding is.
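///
/// # Example
///
/// A minimal sketch of how a caller might react to each variant. The
/// `next_action` helper is hypothetical and not part of this crate; a real
/// caller would likely fetch the requested byte ranges and hand them back to
/// the decoder rather than build a description string.
///
/// ```rust
/// use parquet::DecodeResult;
///
/// // Hypothetical helper (an assumption for illustration only): describes
/// // what the caller should do for a given decode result.
/// fn next_action(result: DecodeResult<String>) -> String {
///     match result {
///         // The decoder cannot proceed until these byte ranges are supplied.
///         DecodeResult::NeedsData(ranges) => {
///             let total: u64 = ranges.iter().map(|r| r.end - r.start).sum();
///             format!("fetch {} bytes in {} ranges", total, ranges.len())
///         }
///         // The decoder produced an item; here the item is just a `String`.
///         DecodeResult::Data(item) => format!("got {}", item),
///         // Nothing more will be produced.
///         DecodeResult::Finished => "done".to_string(),
///     }
/// }
///
/// let needs = DecodeResult::NeedsData(vec![0..4, 100..108]);
/// assert_eq!(next_action(needs), "fetch 12 bytes in 2 ranges");
/// assert_eq!(next_action(DecodeResult::Finished), "done");
/// ```
///
/// Matching exhaustively on all three variants keeps the caller's IO loop
/// explicit: it either supplies more bytes, consumes an item, or stops.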
#[derive(Debug)]
pub enum DecodeResult<T: Debug> {
    /// The ranges of data necessary to proceed
    // TODO: distinguish between minimum needed to make progress and what could be used?
    NeedsData(Vec<Range<u64>>),
    /// The decoder produced an output item
    Data(T),
    /// The decoder finished processing
    Finished,
}

#[cfg(feature = "variant_experimental")]
pub mod variant;
experimental!(pub mod geospatial);