Reading and writing Parquet files¶
See also
The Parquet format is a space-efficient columnar storage format for complex data. The Parquet C++ implementation is part of the Apache Arrow project and benefits from tight integration with the Arrow C++ classes and facilities.
Reading Parquet files¶
The arrow::FileReader
class reads data into Arrow Tables and Record
Batches.
The StreamReader
class allows for data to be read using a C++ input
stream approach to read fields column by column and row by row. This approach
is offered for ease of use and type-safety. It is of course also useful when
data must be streamed as files are read and written incrementally.
Please note that the performance of the StreamReader
will not
be as good due to the type checking and the fact that column values
are processed one at a time.
FileReader¶
To read Parquet data into Arrow structures, use arrow::FileReader
.
To construct, it requires a ::arrow::io::RandomAccessFile
instance
representing the input file. To read the whole file at once,
use arrow::FileReader::ReadTable()
:
// #include "arrow/io/api.h"
// #include "arrow/parquet/arrow/reader.h"
arrow::MemoryPool* pool = arrow::default_memory_pool();
std::shared_ptr<arrow::io::RandomAccessFile> input;
ARROW_ASSIGN_OR_RAISE(input, arrow::io::ReadableFile::Open(path_to_file));
// Open Parquet file reader
std::unique_ptr<parquet::arrow::FileReader> arrow_reader;
ARROW_RETURN_NOT_OK(parquet::arrow::OpenFile(input, pool, &arrow_reader));
// Read entire file as a single Arrow table
std::shared_ptr<arrow::Table> table;
ARROW_RETURN_NOT_OK(arrow_reader->ReadTable(&table));
Finer-grained options are available through the
arrow::FileReaderBuilder
helper class, which accepts the ReaderProperties
and ArrowReaderProperties
classes.
For reading as a stream of batches, use the arrow::FileReader::GetRecordBatchReader()
method to retrieve a arrow::RecordBatchReader
. It will use the batch
size set in ArrowReaderProperties
.
// #include "arrow/io/api.h"
// #include "arrow/parquet/arrow/reader.h"
arrow::MemoryPool* pool = arrow::default_memory_pool();
// Configure general Parquet reader settings
auto reader_properties = parquet::ReaderProperties(pool);
reader_properties.set_buffer_size(4096 * 4);
reader_properties.enable_buffered_stream();
// Configure Arrow-specific Parquet reader settings
auto arrow_reader_props = parquet::ArrowReaderProperties();
arrow_reader_props.set_batch_size(128 * 1024); // default 64 * 1024
parquet::arrow::FileReaderBuilder reader_builder;
ARROW_RETURN_NOT_OK(
reader_builder.OpenFile(path_to_file, /*memory_map=*/false, reader_properties));
reader_builder.memory_pool(pool);
reader_builder.properties(arrow_reader_props);
std::unique_ptr<parquet::arrow::FileReader> arrow_reader;
ARROW_ASSIGN_OR_RAISE(arrow_reader, reader_builder.Build());
std::shared_ptr<::arrow::RecordBatchReader> rb_reader;
ARROW_RETURN_NOT_OK(arrow_reader->GetRecordBatchReader(&rb_reader));
for (arrow::Result<std::shared_ptr<arrow::RecordBatch>> maybe_batch : *rb_reader) {
// Operate on each batch...
}
See also
For reading multi-file datasets or pushing down filters to prune row groups, see Tabular Datasets.
Performance and Memory Efficiency¶
For remote filesystems, use read coalescing (pre-buffering) to reduce number of API calls:
auto arrow_reader_props = parquet::ArrowReaderProperties();
reader_properties.set_prebuffer(true);
The defaults are generally tuned towards good performance, but parallel column
decoding is off by default. Enable it in the constructor of ArrowReaderProperties
:
auto arrow_reader_props = parquet::ArrowReaderProperties(/*use_threads=*/true);
If memory efficiency is more important than performance, then:
Do not turn on read coalescing (pre-buffering) in
parquet::ArrowReaderProperties
.Read data in batches using
arrow::FileReader::GetRecordBatchReader()
.Turn on
enable_buffered_stream
inparquet::ReaderProperties
.
In addition, if you know certain columns contain many repeated values, you can
read them as dictionary encoded columns. This is
enabled with the set_read_dictionary
setting on ArrowReaderProperties
.
If the files were written with Arrow C++ and the store_schema
was activated,
then the original Arrow schema will be automatically read and will override this
setting.
StreamReader¶
The StreamReader
allows for Parquet files to be read using
standard C++ input operators which ensures type-safety.
Please note that types must match the schema exactly i.e. if the
schema field is an unsigned 16-bit integer then you must supply a
uint16_t
type.
Exceptions are used to signal errors. A ParquetException
is
thrown in the following circumstances:
Attempt to read field by supplying the incorrect type.
Attempt to read beyond end of row.
Attempt to read beyond end of file.
#include "arrow/io/file.h"
#include "parquet/stream_reader.h"
{
std::shared_ptr<arrow::io::ReadableFile> infile;
PARQUET_ASSIGN_OR_THROW(
infile,
arrow::io::ReadableFile::Open("test.parquet"));
parquet::StreamReader stream{parquet::ParquetFileReader::Open(infile)};
std::string article;
float price;
uint32_t quantity;
while ( !stream.eof() )
{
stream >> article >> price >> quantity >> parquet::EndRow;
// ...
}
}
Writing Parquet files¶
WriteTable¶
The arrow::WriteTable()
function writes an entire
::arrow::Table
to an output file.
// #include "parquet/arrow/writer.h"
// #include "arrow/util/type_fwd.h"
using parquet::ArrowWriterProperties;
using parquet::WriterProperties;
ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::Table> table, GetTable());
// Choose compression
std::shared_ptr<WriterProperties> props =
WriterProperties::Builder().compression(arrow::Compression::SNAPPY)->build();
// Opt to store Arrow schema for easier reads back into Arrow
std::shared_ptr<ArrowWriterProperties> arrow_props =
ArrowWriterProperties::Builder().store_schema()->build();
std::shared_ptr<arrow::io::FileOutputStream> outfile;
ARROW_ASSIGN_OR_RAISE(outfile, arrow::io::FileOutputStream::Open(path_to_file));
ARROW_RETURN_NOT_OK(parquet::arrow::WriteTable(*table.get(),
arrow::default_memory_pool(), outfile,
/*chunk_size=*/3, props, arrow_props));
Note
Column compression is off by default in C++. See below for how to choose a compression codec in the writer properties.
To write out data batch-by-batch, use arrow::FileWriter
.
// #include "parquet/arrow/writer.h"
// #include "arrow/util/type_fwd.h"
using parquet::ArrowWriterProperties;
using parquet::WriterProperties;
// Data is in RBR
std::shared_ptr<arrow::RecordBatchReader> batch_stream;
ARROW_ASSIGN_OR_RAISE(batch_stream, GetRBR());
// Choose compression
std::shared_ptr<WriterProperties> props =
WriterProperties::Builder().compression(arrow::Compression::SNAPPY)->build();
// Opt to store Arrow schema for easier reads back into Arrow
std::shared_ptr<ArrowWriterProperties> arrow_props =
ArrowWriterProperties::Builder().store_schema()->build();
// Create a writer
std::shared_ptr<arrow::io::FileOutputStream> outfile;
ARROW_ASSIGN_OR_RAISE(outfile, arrow::io::FileOutputStream::Open(path_to_file));
std::unique_ptr<parquet::arrow::FileWriter> writer;
ARROW_ASSIGN_OR_RAISE(
writer, parquet::arrow::FileWriter::Open(*batch_stream->schema().get(),
arrow::default_memory_pool(), outfile,
props, arrow_props));
// Write each batch as a row_group
for (arrow::Result<std::shared_ptr<arrow::RecordBatch>> maybe_batch : *batch_stream) {
ARROW_ASSIGN_OR_RAISE(auto batch, maybe_batch);
ARROW_ASSIGN_OR_RAISE(auto table,
arrow::Table::FromRecordBatches(batch->schema(), {batch}));
ARROW_RETURN_NOT_OK(writer->WriteTable(*table.get(), batch->num_rows()));
}
// Write file footer and close
ARROW_RETURN_NOT_OK(writer->Close());
StreamWriter¶
The StreamWriter
allows for Parquet files to be written using
standard C++ output operators, similar to reading with the StreamReader
class. This type-safe approach also ensures that rows are written without
omitting fields and allows for new row groups to be created automatically
(after certain volume of data) or explicitly by using the EndRowGroup
stream modifier.
Exceptions are used to signal errors. A ParquetException
is
thrown in the following circumstances:
Attempt to write a field using an incorrect type.
Attempt to write too many fields in a row.
Attempt to skip a required field.
#include "arrow/io/file.h"
#include "parquet/stream_writer.h"
{
std::shared_ptr<arrow::io::FileOutputStream> outfile;
PARQUET_ASSIGN_OR_THROW(
outfile,
arrow::io::FileOutputStream::Open("test.parquet"));
parquet::WriterProperties::Builder builder;
std::shared_ptr<parquet::schema::GroupNode> schema;
// Set up builder with required compression type etc.
// Define schema.
// ...
parquet::StreamWriter os{
parquet::ParquetFileWriter::Open(outfile, schema, builder.build())};
// Loop over some data structure which provides the required
// fields to be written and write each row.
for (const auto& a : getArticles())
{
os << a.name() << a.price() << a.quantity() << parquet::EndRow;
}
}
Writer properties¶
To configure how Parquet files are written, use the WriterProperties::Builder
:
#include "parquet/arrow/writer.h"
#include "arrow/util/type_fwd.h"
using parquet::WriterProperties;
using parquet::ParquetVersion;
using parquet::ParquetDataPageVersion;
using arrow::Compression;
std::shared_ptr<WriterProperties> props = WriterProperties::Builder()
.max_row_group_length(64 * 1024)
.created_by("My Application")
.version(ParquetVersion::PARQUET_2_6)
.data_page_version(ParquetDataPageVersion::V2)
.compression(Compression::SNAPPY)
.build();
The max_row_group_length
sets an upper bound on the number of rows per row
group that takes precedent over the chunk_size
passed in the write methods.
You can set the version of Parquet to write with version
, which determines
which logical types are available. In addition, you can set the data page version
with data_page_version
. It’s V1 by default; setting to V2 will allow more
optimal compression (skipping compressing pages where there isn’t a space
benefit), but not all readers support this data page version.
Compression is off by default, but to get the most out of Parquet, you should
also choose a compression codec. You can choose one for the whole file or
choose one for individual columns. If you choose a mix, the file-level option
will apply to columns that don’t have a specific compression codec. See
::arrow::Compression
for options.
Column data encodings can likewise be applied at the file-level or at the
column level. By default, the writer will attempt to dictionary encode all
supported columns, unless the dictionary grows too large. This behavior can
be changed at file-level or at the column level with disable_dictionary()
.
When not using dictionary encoding, it will fallback to the encoding set for
the column or the overall file; by default Encoding::PLAIN
, but this can
be changed with encoding()
.
#include "parquet/arrow/writer.h"
#include "arrow/util/type_fwd.h"
using parquet::WriterProperties;
using arrow::Compression;
using parquet::Encoding;
std::shared_ptr<WriterProperties> props = WriterProperties::Builder()
.compression(Compression::SNAPPY) // Fallback
->compression("colA", Compression::ZSTD) // Only applies to column "colA"
->encoding(Encoding::BIT_PACKED) // Fallback
->encoding("colB", Encoding::RLE) // Only applies to column "colB"
->disable_dictionary("colB") // Never dictionary-encode column "colB"
->build();
Statistics are enabled by default for all columns. You can disable statistics for
all columns or specific columns using disable_statistics
on the builder.
There is a max_statistics_size
which limits the maximum number of bytes that
may be used for min and max values, useful for types like strings or binary blobs.
There are also Arrow-specific settings that can be configured with
parquet::ArrowWriterProperties
:
#include "parquet/arrow/writer.h"
using parquet::ArrowWriterProperties;
std::shared_ptr<ArrowWriterProperties> arrow_props = ArrowWriterProperties::Builder()
.enable_deprecated_int96_timestamps() // default False
->store_schema() // default False
->enable_compliant_nested_types() // default False
->build();
These options mostly dictate how Arrow types are converted to Parquet types.
Turning on store_schema
will cause the writer to store the serialized Arrow
schema within the file metadata. Since there is no bijection between Parquet
schemas and Arrow schemas, storing the Arrow schema allows the Arrow reader
to more faithfully recreate the original data. This mapping from Parquet types
back to original Arrow types includes:
Reading timestamps with original timezone information (Parquet does not support time zones);
Reading Arrow types from their storage types (such as Duration from int64 columns);
Reading string and binary columns back into large variants with 64-bit offsets;
Reading back columns as dictionary encoded (whether an Arrow column and the serialized Parquet version are dictionary encoded are independent).
Supported Parquet features¶
The Parquet format has many features, and Parquet C++ supports a subset of them.
Page types¶
Page type |
Notes |
---|---|
DATA_PAGE |
|
DATA_PAGE_V2 |
|
DICTIONARY_PAGE |
Unsupported page type: INDEX_PAGE. When reading a Parquet file, pages of this type are ignored.
Compression¶
Compression codec |
Notes |
---|---|
SNAPPY |
|
GZIP |
|
BROTLI |
|
LZ4 |
(1) |
ZSTD |
(1) On the read side, Parquet C++ is able to decompress both the regular LZ4 block format and the ad-hoc Hadoop LZ4 format used by the reference Parquet implementation. On the write side, Parquet C++ always generates the ad-hoc Hadoop LZ4 format.
Unsupported compression codec: LZO.
Encodings¶
Encoding |
Reading |
Writing |
Notes |
---|---|---|---|
PLAIN |
✓ |
✓ |
|
PLAIN_DICTIONARY |
✓ |
✓ |
|
BIT_PACKED |
✓ |
✓ |
(1) |
RLE |
✓ |
✓ |
(1) |
RLE_DICTIONARY |
✓ |
✓ |
(2) |
BYTE_STREAM_SPLIT |
✓ |
✓ |
|
DELTA_BINARY_PACKED |
✓ |
✓ |
|
DELTA_BYTE_ARRAY |
✓ |
||
DELTA_LENGTH_BYTE_ARRAY |
✓ |
✓ |
(1) Only supported for encoding definition and repetition levels, and boolean values.
(2) On the write path, RLE_DICTIONARY is only enabled if Parquet format version 2.4 or greater is selected in
WriterProperties::version()
.
Types¶
Physical types¶
Physical type |
Mapped Arrow type |
Notes |
---|---|---|
BOOLEAN |
Boolean |
|
INT32 |
Int32 / other |
(1) |
INT64 |
Int64 / other |
(1) |
INT96 |
Timestamp (nanoseconds) |
(2) |
FLOAT |
Float32 |
|
DOUBLE |
Float64 |
|
BYTE_ARRAY |
Binary / other |
(1) (3) |
FIXED_LENGTH_BYTE_ARRAY |
FixedSizeBinary / other |
(1) |
(1) Can be mapped to other Arrow types, depending on the logical type (see below).
(2) On the write side,
ArrowWriterProperties::support_deprecated_int96_timestamps()
must be enabled.(3) On the write side, an Arrow LargeBinary can also mapped to BYTE_ARRAY.
Logical types¶
Specific logical types can override the default Arrow type mapping for a given physical type.
Logical type |
Physical type |
Mapped Arrow type |
Notes |
---|---|---|---|
NULL |
Any |
Null |
(1) |
INT |
INT32 |
Int8 / UInt8 / Int16 / UInt16 / Int32 / UInt32 |
|
INT |
INT64 |
Int64 / UInt64 |
|
DECIMAL |
INT32 / INT64 / BYTE_ARRAY / FIXED_LENGTH_BYTE_ARRAY |
Decimal128 / Decimal256 |
(2) |
DATE |
INT32 |
Date32 |
(3) |
TIME |
INT32 |
Time32 (milliseconds) |
|
TIME |
INT64 |
Time64 (micro- or nanoseconds) |
|
TIMESTAMP |
INT64 |
Timestamp (milli-, micro- or nanoseconds) |
|
STRING |
BYTE_ARRAY |
Utf8 |
(4) |
LIST |
Any |
List |
(5) |
MAP |
Any |
Map |
(6) |
(1) On the write side, the Parquet physical type INT32 is generated.
(2) On the write side, a FIXED_LENGTH_BYTE_ARRAY is always emitted.
(3) On the write side, an Arrow Date64 is also mapped to a Parquet DATE INT32.
(4) On the write side, an Arrow LargeUtf8 is also mapped to a Parquet STRING.
(5) On the write side, an Arrow LargeList or FixedSizedList is also mapped to a Parquet LIST.
(6) On the read side, a key with multiple values does not get deduplicated, in contradiction with the Parquet specification.
Unsupported logical types: JSON, BSON, UUID. If such a type is encountered when reading a Parquet file, the default physical type mapping is used (for example, a Parquet JSON column may be read as Arrow Binary or FixedSizeBinary).
Converted types¶
While converted types are deprecated in the Parquet format (they are superceded by logical types), they are recognized and emitted by the Parquet C++ implementation so as to maximize compatibility with other Parquet implementations.
Special cases¶
An Arrow Extension type is written out as its storage type. It can still be recreated at read time using Parquet metadata (see “Roundtripping Arrow types” below).
An Arrow Dictionary type is written out as its value type. It can still be recreated at read time using Parquet metadata (see “Roundtripping Arrow types” below).
Roundtripping Arrow types¶
While there is no bijection between Arrow types and Parquet types, it is
possible to serialize the Arrow schema as part of the Parquet file metadata.
This is enabled using ArrowWriterProperties::store_schema()
.
On the read path, the serialized schema will be automatically recognized and will recreate the original Arrow data, converting the Parquet data as required (for example, a LargeList will be recreated from the Parquet LIST type).
As an example, when serializing an Arrow LargeList to Parquet:
The data is written out as a Parquet LIST
When read back, the Parquet LIST data is decoded as an Arrow LargeList if
ArrowWriterProperties::store_schema()
was enabled when writing the file; otherwise, it is decoded as an Arrow List.
Serialization details¶
The Arrow schema is serialized as a Arrow IPC schema message,
then base64-encoded and stored under the ARROW:schema
metadata key in
the Parquet file metadata.
Limitations¶
Writing or reading back FixedSizedList data with null entries is not supported.
Encryption¶
Parquet C++ implements all features specified in the encryption specification, except for encryption of column index and bloom filter modules.
More specifically, Parquet C++ supports:
AES_GCM_V1 and AES_GCM_CTR_V1 encryption algorithms.
AAD suffix for Footer, ColumnMetaData, Data Page, Dictionary Page, Data PageHeader, Dictionary PageHeader module types. Other module types (ColumnIndex, OffsetIndex, BloomFilter Header, BloomFilter Bitset) are not supported.
EncryptionWithFooterKey and EncryptionWithColumnKey modes.
Encrypted Footer and Plaintext Footer modes.
Miscellaneous¶
Feature |
Reading |
Writing |
Notes |
---|---|---|---|
Column Index |
✓ |
(1) |
|
Offset Index |
✓ |
(1) |
|
Bloom Filter |
✓ |
✓ |
(2) |
CRC checksums |
✓ |
✓ |
(3) |
(1) Access to the Column and Offset Index structures is provided, but data read APIs do not currently make any use of them.
(2) APIs are provided for creating, serializing and deserializing Bloom Filters, but they are not integrated into data read APIs.
(3) For now, only the checksums of V1 Data Pages and Dictionary Pages are computed.