Parquet metadata API
Users should use these structures to interact with Parquet metadata.
- ParquetMetaData: Top level metadata container, read from the Parquet file footer.
- FileMetaData: File level metadata such as schema, row counts and version.
- RowGroupMetaData: Metadata for each Row Group within a File, such as location, number of rows, and column chunks.
- ColumnChunkMetaData: Metadata for each column chunk (primitive leaf) within a Row Group, including encoding and compression information, number of values, statistics, etc.
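To make the nesting concrete, the following is a minimal sketch that models the file, row group, and column chunk levels with hypothetical stand-in types (`ToyFile`, `ToyRowGroup`, `ToyColumnChunk`). These are not the crate's structures and carry only a few illustrative fields. The sample sizes are made up. The sketch shows how file level figures, such as the total row count, can be derived by walking the row groups and their column chunks.

```rust
// Hypothetical stand-ins for illustration only; these are NOT the crate's types.
struct ToyColumnChunk {
    column_path: String,
    compressed_size: u64,
}

struct ToyRowGroup {
    num_rows: u64,
    columns: Vec<ToyColumnChunk>,
}

struct ToyFile {
    row_groups: Vec<ToyRowGroup>,
}

impl ToyFile {
    /// Total rows in the file: the sum of the rows of every row group.
    fn num_rows(&self) -> u64 {
        self.row_groups.iter().map(|rg| rg.num_rows).sum()
    }

    /// Compressed bytes used by one leaf column across all row groups.
    fn compressed_size_of(&self, column_path: &str) -> u64 {
        self.row_groups
            .iter()
            .flat_map(|rg| rg.columns.iter())
            .filter(|c| c.column_path == column_path)
            .map(|c| c.compressed_size)
            .sum()
    }
}

/// Small helper to keep the sample data readable.
fn chunk(column_path: &str, compressed_size: u64) -> ToyColumnChunk {
    ToyColumnChunk { column_path: column_path.to_string(), compressed_size }
}

/// Made-up sample: two row groups, each holding chunks for `id` and `name`.
fn sample_file() -> ToyFile {
    ToyFile {
        row_groups: vec![
            ToyRowGroup { num_rows: 1000, columns: vec![chunk("id", 400), chunk("name", 900)] },
            ToyRowGroup { num_rows: 500, columns: vec![chunk("id", 210), chunk("name", 480)] },
        ],
    }
}

fn main() {
    let file = sample_file();
    println!("rows: {}", file.num_rows());
    println!("compressed bytes for 'id': {}", file.compressed_size_of("id"));
}
```

The real structures expose far more than this (schema, statistics, encodings, offsets), but the traversal order is the same: file, then row groups, then column chunks.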
§APIs for working with Parquet Metadata
The Parquet readers and writers in this crate handle reading metadata from, and writing metadata to, Parquet files. To work with metadata directly, the following APIs are available:
- ParquetMetaDataReader for reading metadata from an I/O source (sync and async)
- ParquetMetaDataPushDecoder for decoding from bytes without I/O
- ParquetMetaDataWriter for writing.
§Examples
Please see external_metadata.rs
§Metadata Encodings and Structures
There are three different encodings of Parquet Metadata in this crate:
- bytes: encoded with the Thrift TCompactProtocol as defined in parquet.thrift
- format: Rust structures automatically generated by the thrift compiler from parquet.thrift. These structures are low level and mirror the thrift definitions.
- file::metadata (this module): Easier to use Rust structures with a more idiomatic API. Note that, confusingly, some but not all of these structures have the same name as the format structures.
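As background for the bytes encoding: the Thrift compact protocol stores integers as zigzag-encoded variable-length integers (varints). The following is a minimal standalone sketch of that integer encoding only. The function names are hypothetical and are not part of this crate, and the sketch does not implement field headers, structs, or any other part of the protocol.

```rust
/// Map a signed integer to an unsigned one so small magnitudes stay small:
/// 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
fn zigzag_encode(v: i64) -> u64 {
    ((v << 1) ^ (v >> 63)) as u64
}

/// Inverse of `zigzag_encode`.
fn zigzag_decode(n: u64) -> i64 {
    ((n >> 1) as i64) ^ -((n & 1) as i64)
}

/// Append `v` as a varint: 7 payload bits per byte, least significant group
/// first, with the high bit set on every byte except the last.
fn write_uvarint(mut v: u64, out: &mut Vec<u8>) {
    while v >= 0x80 {
        out.push(((v & 0x7f) as u8) | 0x80);
        v >>= 7;
    }
    out.push(v as u8);
}

/// Read a varint from the front of `buf`.
/// Returns the value and the number of bytes consumed, or `None` if the
/// input is truncated or longer than the 10 bytes a u64 can need.
fn read_uvarint(buf: &[u8]) -> Option<(u64, usize)> {
    let mut value: u64 = 0;
    for (i, &byte) in buf.iter().enumerate() {
        if i >= 10 {
            return None;
        }
        value |= u64::from(byte & 0x7f) << (7 * i);
        if byte & 0x80 == 0 {
            return Some((value, i + 1));
        }
    }
    None
}

fn main() {
    // Round-trip a few signed values through zigzag + varint.
    for v in [0i64, -1, 1, 150, -150, i64::MAX, i64::MIN] {
        let mut bytes = Vec::new();
        write_uvarint(zigzag_encode(v), &mut bytes);
        let (raw, used) = read_uvarint(&bytes).expect("complete varint");
        println!("{v} -> {bytes:02x?} ({used} bytes) -> {}", zigzag_decode(raw));
    }
}
```

This is why the bytes form is compact but awkward to work with directly, and why the format and file::metadata structures exist as decoded views of it.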
Graphically, this is how the different structures relate to each other:
                         ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐         ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐
                         │ ┌──────────────┐      │         │ ┌───────────────────────┐ │
                         │ │ ColumnIndex  │      │         │ │    ParquetMetaData    │ │
                         │ └──────────────┘      │         │ └───────────────────────┘ │
┌──────────────┐         │ ┌────────────────┐    │         │ ┌───────────────────────┐ │
│   ..0x24..   │ ◀────▶  │ │  OffsetIndex   │    │ ◀────▶  │ │    ParquetMetaData    │ │
└──────────────┘         │ └────────────────┘    │         │ └───────────────────────┘ │
      ...                │         ...           │         │            ...            │
                         │ ┌──────────────────┐  │         │ ┌──────────────────┐      │
bytes                    │ │  FileMetaData*   │  │         │ │  FileMetaData*   │      │
(thrift encoded)         │ └──────────────────┘  │         │ └──────────────────┘      │
                         └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘         └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
                          format::meta structures           file::metadata structures

                         * Same name, different struct

Modules§
- footer_tail 🔒
- memory 🔒: Memory calculations for ParquetMetaData::memory_size
- parser 🔒: Internal metadata parsing routines
- push_decoder 🔒
- reader 🔒
- thrift 🔒: This module is the bridge between a Parquet file’s thrift encoded metadata and this crate’s Parquet metadata API. It contains objects and functions used to serialize/deserialize metadata objects into/from the Thrift compact protocol format as defined by the Parquet specification.
- writer 🔒
Structs§
- ColumnChunkMetaData: Metadata for a column chunk.
- ColumnChunkMetaDataBuilder: Builder for ColumnChunkMetaData
- ColumnIndexBuilder: Builder for Parquet ColumnIndex, part of the Parquet PageIndex
- FileMetaData: File level metadata for a Parquet file.
- FooterTail: Parsed Parquet footer tail (last 8 bytes of a Parquet file)
- KeyValue: A key-value pair for FileMetaData.
- LevelHistogram: Histograms for repetition and definition levels.
- OffsetIndexBuilder: Builder for offset index, part of the Parquet PageIndex.
- PageEncodingStats: PageEncodingStats for a column chunk and data page.
- ParquetMetaData: Parsed metadata for a single Parquet file
- ParquetMetaDataBuilder: A builder for creating / manipulating ParquetMetaData
- ParquetMetaDataPushDecoder: A push decoder for ParquetMetaData.
- ParquetMetaDataReader: Reads ParquetMetaData from a byte stream, with either synchronous or asynchronous I/O.
- ParquetMetaDataWriter: Writes ParquetMetaData to a byte stream
- RowGroupMetaData: Metadata for a row group
- RowGroupMetaDataBuilder: Builder for row group metadata.
- SortingColumn: Sort order within a RowGroup of a leaf column
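The FooterTail entry above refers to the last 8 bytes of a Parquet file. In the Parquet file format, these hold a 4-byte little-endian metadata length followed by a 4-byte magic (`PAR1`, or `PARE` when the footer is encrypted). The following is a minimal standalone sketch of parsing those bytes and locating the metadata. `ToyFooterTail`, `parse_footer_tail`, and `metadata_range` are hypothetical names for illustration and are not the crate's FooterTail API.

```rust
use std::ops::Range;

/// Size of the footer tail: 4-byte length + 4-byte magic.
const FOOTER_TAIL_LEN: u64 = 8;
/// Magic that ends a file with a plaintext footer.
const MAGIC_PLAIN: [u8; 4] = *b"PAR1";
/// Magic that ends a file with an encrypted footer.
const MAGIC_ENCRYPTED: [u8; 4] = *b"PARE";

/// Hypothetical stand-in for illustration; NOT the crate's `FooterTail`.
#[derive(Debug, PartialEq)]
struct ToyFooterTail {
    metadata_len: u32,
    encrypted_footer: bool,
}

/// Parse the last 8 bytes of a file.
fn parse_footer_tail(tail: &[u8; 8]) -> Result<ToyFooterTail, String> {
    let magic = [tail[4], tail[5], tail[6], tail[7]];
    let encrypted_footer = if magic == MAGIC_PLAIN {
        false
    } else if magic == MAGIC_ENCRYPTED {
        true
    } else {
        return Err(format!("unexpected magic bytes: {magic:02x?}"));
    };
    // The length of the thrift encoded metadata, stored little-endian.
    let metadata_len = u32::from_le_bytes([tail[0], tail[1], tail[2], tail[3]]);
    Ok(ToyFooterTail { metadata_len, encrypted_footer })
}

/// Byte range of the encoded metadata, which sits directly before the tail.
/// Returns `None` if the file is too short to contain it.
fn metadata_range(file_len: u64, tail: &ToyFooterTail) -> Option<Range<u64>> {
    let end = file_len.checked_sub(FOOTER_TAIL_LEN)?;
    let start = end.checked_sub(u64::from(tail.metadata_len))?;
    Some(start..end)
}

fn main() {
    // A made-up tail claiming 16 bytes of metadata in a 100-byte file.
    let tail_bytes = [0x10, 0x00, 0x00, 0x00, b'P', b'A', b'R', b'1'];
    match parse_footer_tail(&tail_bytes) {
        Ok(tail) => println!("{tail:?}, metadata at {:?}", metadata_range(100, &tail)),
        Err(e) => eprintln!("not a Parquet footer: {e}"),
    }
}
```

Reading the tail first is what lets a reader fetch the metadata with one further ranged read, without scanning the file.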
Enums§
- PageIndexPolicy: Describes the policy for reading page indexes
Type Aliases§
- FileMetaDataPtr: Reference counted pointer for FileMetaData.
- ParquetColumnIndex: Page level statistics for each column chunk of each row group.
- ParquetOffsetIndex: OffsetIndexMetaData for each data page of each row group of each column
- RowGroupMetaDataPtr: Reference counted pointer for RowGroupMetaData.