pyarrow.parquet.ParquetWriter#

class pyarrow.parquet.ParquetWriter(where, schema, filesystem=None, flavor=None, version='2.6', use_dictionary=True, compression='snappy', write_statistics=True, use_deprecated_int96_timestamps=None, compression_level=None, use_byte_stream_split=False, column_encoding=None, writer_engine_version=None, data_page_version='1.0', use_compliant_nested_type=True, encryption_properties=None, write_batch_size=None, dictionary_pagesize_limit=None, store_schema=True, write_page_index=False, write_page_checksum=False, sorting_columns=None, store_decimal_as_integer=False, write_time_adjusted_to_utc=False, max_rows_per_page=None, **options)[source]#

Bases: object

Class for incrementally building a Parquet file for Arrow tables.

Parameters:

wherepath or file-like object

schemapyarrow.Schema

version{“1.0”, “2.4”, “2.6”}, default “2.6”

Determine which Parquet logical types are available for use, whether the reduced set from the Parquet 1.x.x format or the expanded logical types added in later format versions. Files written with version=’2.4’ or ‘2.6’ may not be readable in all Parquet implementations, so version=’1.0’ is likely the choice that maximizes file compatibility. UINT32 and some logical types are only available with version ‘2.4’. Nanosecond timestamps are only available with version ‘2.6’. Other features such as compression algorithms or the new serialized data page format must be enabled separately (see ‘compression’ and ‘data_page_version’).

use_dictionarybool or list, default True

Specify if we should use dictionary encoding in general or only for some columns. When encoding the column, if the dictionary size is too large, the column will fallback to PLAIN encoding. Specially, BOOLEAN type doesn’t support dictionary encoding.

compressionstr or dict, default ‘snappy’

Specify the compression codec, either on a general basis or per-column. Valid values: {‘NONE’, ‘SNAPPY’, ‘GZIP’, ‘BROTLI’, ‘LZ4’, ‘LZ4_RAW’, ‘ZSTD’}. ‘LZ4_RAW’ is accepted as an alias for ‘LZ4’ (both use the LZ4_RAW codec as defined in the Parquet specification).

write_statisticsbool or list, default True

Specify if we should write statistics in general (default is True) or only for some columns.

use_deprecated_int96_timestampsbool, default None

Write timestamps to INT96 Parquet format. Defaults to False unless enabled by flavor argument. This take priority over the coerce_timestamps option.

coerce_timestampsstr, default None

Cast timestamps to a particular resolution. If omitted, defaults are chosen depending on version. For version='1.0' and version='2.4', nanoseconds are cast to microseconds (‘us’), while for version='2.6' (the default), they are written natively without loss of resolution. Seconds are always cast to milliseconds (‘ms’) by default, as Parquet does not have any temporal type with seconds resolution. If the casting results in loss of data, it will raise an exception unless allow_truncated_timestamps=True is given. Valid values: {None, ‘ms’, ‘us’}

allow_truncated_timestampsbool, default False

Allow loss of data when coercing timestamps to a particular resolution. E.g. if microsecond or nanosecond data is lost when coercing to ‘ms’, do not raise an exception. Passing allow_truncated_timestamp=True will NOT result in the truncation exception being ignored unless coerce_timestamps is not None.

data_page_sizeint, default None

Set a target threshold for the approximate encoded size of data pages within a column chunk (in bytes). If None, use the default data page size of 1MByte.

max_rows_per_pageint, default None

Maximum number of rows per page within a column chunk. If None, use the default of 20000. Smaller values reduce memory usage during reads but increase metadata overhead.

flavor{‘spark’}, default None

Sanitize schema or set other compatibility options to work with various target systems.

filesystemFileSystem, default None

If nothing passed, will be inferred from where if path-like, else where is already a file-like object so no filesystem is needed.

compression_levelint or dict, default None

Specify the compression level for a codec, either on a general basis or per-column. If None is passed, arrow selects the compression level for the compression codec in use. The compression level has a different meaning for each codec, so you have to read the documentation of the codec you are using. An exception is thrown if the compression codec does not allow specifying a compression level.

use_byte_stream_splitbool or list, default False

Specify if the byte_stream_split encoding should be used in general or only for some columns. If both dictionary and byte_stream_stream are enabled, then dictionary is preferred. The byte_stream_split encoding is valid for integer, floating-point and fixed-size binary data types (including decimals); it should be combined with a compression codec so as to achieve size reduction.

column_encodingstr or dict, default None

Specify the encoding scheme on a per column basis. Can only be used when use_dictionary is set to False, and cannot be used in combination with use_byte_stream_split. Currently supported values: {‘PLAIN’, ‘BYTE_STREAM_SPLIT’, ‘DELTA_BINARY_PACKED’, ‘DELTA_LENGTH_BYTE_ARRAY’, ‘DELTA_BYTE_ARRAY’}. Certain encodings are only compatible with certain data types. Please refer to the encodings section of Reading and writing Parquet files.

data_page_version{“1.0”, “2.0”}, default “1.0”

The serialized Parquet data page format version to write, defaults to 1.0. This does not impact the file schema logical types and Arrow to Parquet type casting behavior; for that use the “version” option.

use_compliant_nested_typebool, default True

Whether to write compliant Parquet nested type (lists) as defined here, defaults to True. For use_compliant_nested_type=True, this will write into a list with 3-level structure where the middle level, named list, is a repeated group with a single field named element:

<list-repetition> group <name> (LIST) {
    repeated group list {
          <element-repetition> <element-type> element;
    }
}

For use_compliant_nested_type=False, this will also write into a list with 3-level structure, where the name of the single field of the middle level list is taken from the element name for nested columns in Arrow, which defaults to item:

<list-repetition> group <name> (LIST) {
    repeated group list {
        <element-repetition> <element-type> item;
    }
}

encryption_propertiesFileEncryptionProperties, default None

File encryption properties for Parquet Modular Encryption. If None, no encryption will be done. The encryption properties can be created using: CryptoFactory.file_encryption_properties().

write_batch_sizeint, default None

Number of values to write to a page at a time. If None, use the default of 1024. write_batch_size is complementary to data_page_size. If pages are exceeding the data_page_size due to large column values, lowering the batch size can help keep page sizes closer to the intended size.

dictionary_pagesize_limitint, default None

Specify the dictionary page size limit per row group. If None, use the default 1MB.

store_schemabool, default True

By default, the Arrow schema is serialized and stored in the Parquet file metadata (in the “ARROW:schema” key). When reading the file, if this key is available, it will be used to more faithfully recreate the original Arrow data. For example, for tz-aware timestamp columns it will restore the timezone (Parquet only stores the UTC values without timezone), or columns with duration type will be restored from the int64 Parquet column.

write_page_indexbool, default False

Whether to write a page index in general for all columns. Writing statistics to the page index disables the old method of writing statistics to each data page header. The page index makes statistics-based filtering more efficient than the page header, as it gathers all the statistics for a Parquet file in a single place, avoiding scattered I/O. Note that the page index is not yet used on the read size by PyArrow.

write_page_checksumbool, default False

Whether to write page checksums in general for all columns. Page checksums enable detection of data corruption, which might occur during transmission or in the storage.

sorting_columnsSequence of SortingColumn, default None

Specify the sort order of the data being written. The writer does not sort the data nor does it verify that the data is sorted. The sort order is written to the row group metadata, which can then be used by readers.

store_decimal_as_integerbool, default False

Allow decimals with 1 <= precision <= 18 to be stored as integers. In Parquet, DECIMAL can be stored in any of the following physical types:

int32: for 1 <= precision <= 9.
int64: for 10 <= precision <= 18.
fixed_len_byte_array: precision is limited by the array size. Length n can store <= floor(log_10(2^(8*n - 1) - 1)) base-10 digits.
binary: precision is unlimited. The minimum number of bytes to store the unscaled value is used.

By default, this is DISABLED and all decimal types annotate fixed_len_byte_array. When enabled, the writer will use the following physical types to store decimals:

int32: for 1 <= precision <= 9.
int64: for 10 <= precision <= 18.
fixed_len_byte_array: for precision > 18.

As a consequence, decimal columns stored in integer types are more compact.

use_content_defined_chunkingbool or dict, default False

Optimize parquet files for content addressable storage (CAS) systems by writing data pages according to content-defined chunk boundaries. This allows for more efficient deduplication of data across files, hence more efficient network transfers and storage. The chunking is based on a rolling hash algorithm that identifies chunk boundaries based on the actual content of the data.

Note that it is an experimental feature and the API may change in the future.

If set to True, a default configuration is used with min_chunk_size=256 KiB and max_chunk_size=1024 KiB. The chunk size distribution approximates a normal distribution between min_chunk_size and max_chunk_size (sizes are accounted before any Parquet encodings).

A dict can be passed to adjust the chunker parameters with the following keys:

min_chunk_size: minimum chunk size in bytes, default 256 KiB The rolling hash will not be updated until this size is reached for each chunk. Note that all data sent through the hash function is counted towards the chunk size, including definition and repetition levels if present.
max_chunk_size: maximum chunk size in bytes, default is 1024 KiB The chunker will create a new chunk whenever the chunk size exceeds this value. Note that the parquet writer has a related data_pagesize property that controls the maximum size of a parquet data page after encoding. While setting data_page_size to a smaller value than max_chunk_size doesn’t affect the chunking effectiveness, it results in more small parquet data pages.
norm_level: normalization level to center the chunk size around the average size more aggressively, default 0 Increasing the normalization level increases the probability of finding a chunk, improving the deduplication ratio, but also increasing the number of small chunks resulting in many small parquet data pages. The default value provides a good balance between deduplication ratio and fragmentation. Use norm_level=1 or norm_level=2 to reach a higher deduplication ratio at the expense of fragmentation.

write_time_adjusted_to_utcbool, default False

Set the value of isAdjustedTOUTC when writing a TIME column. If True, this tells the Parquet reader that the TIME columns are expressed in reference to midnight in the UTC timezone. If False (the default), the TIME columns are assumed to be expressed in reference to midnight in an unknown, presumably local, timezone.

bloom_filter_optionsdict, default None

Create Bloom filters for the columns specified by the provided dict.

Bloom filters can be configured with two parameters: number of distinct values (NDV), and false-positive probability (FPP).

Bloom filters are most effective for high-cardinality columns. A good default is to set NDV equal to the number of rows. Lower values reduce disk usage but may not be worthwhile for very small NDVs. Increasing NDV (without increasing FPP) increases disk and memory usage.

Lower FPP values require more disk and memory space. For a fixed NDV, the space requirement grows roughly proportional to log(1/FPP). Recommended values are 0.1, 0.05, or 0.01. Very small values are counterproductive as the bitset may exceed the size of the actual data. Set NDV appropriately to minimize space usage.

The keys of the dict are column paths. For each path, the value can be either:

A dictionary, with keys ndv and fpp. The value for ndv must be a positive integer. If the ‘ndv’ key is not present, the default value of 1048576 will be used. The value for fpp must be a float between 0.0 and 1.0. If the fpp key is not present, the default value of 0.05 will be used.
A boolean, with True indicating that a Bloom filter should be produced with the above mentioned default values of ndv=1048576 and fpp=0.05. This is equivalent to passing an empty dict.

writer_engine_versionunused

**optionsdict

If options contains a key metadata_collector then the corresponding value is assumed to be a list (or any object with .append method) that will be filled with the file metadata instance of the written file.

Examples

Generate an example PyArrow Table and RecordBatch:

>>> import pyarrow as pa
>>> table = pa.table({'n_legs': [2, 2, 4, 4, 5, 100],
...                   'animal': ["Flamingo", "Parrot", "Dog", "Horse",
...                              "Brittle stars", "Centipede"]})
>>> batch = pa.record_batch([[2, 2, 4, 4, 5, 100],
...                         ["Flamingo", "Parrot", "Dog", "Horse",
...                          "Brittle stars", "Centipede"]],
...                         names=['n_legs', 'animal'])

create a ParquetWriter object:

>>> import pyarrow.parquet as pq
>>> writer = pq.ParquetWriter('example.parquet', table.schema)

and write the Table into the Parquet file:

>>> writer.write_table(table)
>>> writer.close()

>>> pq.read_table('example.parquet').to_pandas()
   n_legs         animal
     2       Flamingo
     2         Parrot
     4            Dog
     4          Horse
     5  Brittle stars
   100      Centipede

create a ParquetWriter object for the RecordBatch:

>>> writer2 = pq.ParquetWriter('example2.parquet', batch.schema)

and write the RecordBatch into the Parquet file:

>>> writer2.write_batch(batch)
>>> writer2.close()

>>> pq.read_table('example2.parquet').to_pandas()
   n_legs         animal
     2       Flamingo
     2         Parrot
     4            Dog
     4          Horse
     5  Brittle stars
   100      Centipede

__init__(where, schema, filesystem=None, flavor=None, version='2.6', use_dictionary=True, compression='snappy', write_statistics=True, use_deprecated_int96_timestamps=None, compression_level=None, use_byte_stream_split=False, column_encoding=None, writer_engine_version=None, data_page_version='1.0', use_compliant_nested_type=True, encryption_properties=None, write_batch_size=None, dictionary_pagesize_limit=None, store_schema=True, write_page_index=False, write_page_checksum=False, sorting_columns=None, store_decimal_as_integer=False, write_time_adjusted_to_utc=False, max_rows_per_page=None, **options)[source]#

Methods

`__init__`(where, schema[, filesystem, ...])
`add_key_value_metadata`(key_value_metadata)	Add key-value metadata to the file.
`close`()	Close the connection to the Parquet file.
`write`(table_or_batch[, row_group_size])	Write RecordBatch or Table to the Parquet file.
`write_batch`(batch[, row_group_size])	Write RecordBatch to the Parquet file.
`write_table`(table[, row_group_size])	Write Table to the Parquet file.

add_key_value_metadata(key_value_metadata)[source]#

Add key-value metadata to the file. This will overwrite any existing metadata with the same key.

Parameters:

key_value_metadatadict: Keys and values must be string-like / coercible to bytes.

close()[source]#: Close the connection to the Parquet file.

write(table_or_batch, row_group_size=None)[source]#

Write RecordBatch or Table to the Parquet file.

Parameters:

table_or_batch{RecordBatch, Table}
row_group_sizeint, default None: Maximum number of rows in each written row group. If None, the row group size will be the minimum of the number of rows in the Table/RecordBatch and 1024 * 1024.

write_batch(batch, row_group_size=None)[source]#

Write RecordBatch to the Parquet file.

Parameters:

batchRecordBatch
row_group_sizeint, default None: Maximum number of rows in written row group. If None, the row group size will be the minimum of the RecordBatch size (in rows) and 1024 * 1024. If set larger than 64 * 1024 * 1024 then 64 * 1024 * 1024 will be used instead.

write_table(table, row_group_size=None)[source]#

Write Table to the Parquet file.

Parameters:

tableTable
row_group_sizeint, default None: Maximum number of rows in each written row group. If None, the row group size will be the minimum of the Table size (in rows) and 1024 * 1024. If set larger than 64 * 1024 * 1024 then 64 * 1024 * 1024 will be used instead.