pyarrow.parquet.ParquetWriter

class pyarrow.parquet.ParquetWriter(where, schema, filesystem=None, flavor=None, version='2.4', use_dictionary=True, compression='snappy', write_statistics=True, use_deprecated_int96_timestamps=None, compression_level=None, use_byte_stream_split=False, column_encoding=None, writer_engine_version=None, data_page_version='1.0', use_compliant_nested_type=False, encryption_properties=None, write_batch_size=None, dictionary_pagesize_limit=None, **options)

Bases: object

Class for incrementally building a Parquet file for Arrow tables.

Parameters:

where : path or file-like object

schema : pyarrow.Schema

version : {'1.0', '2.4', '2.6'}, default '2.4'

Determine which Parquet logical types are available for use, whether the reduced set from the Parquet 1.x.x format or the expanded logical types added in later format versions. Files written with version='2.4' or '2.6' may not be readable in all Parquet implementations, so version='1.0' is likely the choice that maximizes file compatibility. UINT32 and some logical types are only available with version '2.4'. Nanosecond timestamps are only available with version '2.6'. Other features such as compression algorithms or the new serialized data page format must be enabled separately (see 'compression' and 'data_page_version').

use_dictionary : bool or list

Specify if we should use dictionary encoding in general or only for some columns.

use_deprecated_int96_timestamps : bool, default None

Write timestamps to INT96 Parquet format. Defaults to False unless enabled by the flavor argument. This takes priority over the coerce_timestamps option.

coerce_timestamps : str, default None

Cast timestamps to a particular resolution. If omitted, defaults are chosen depending on version. By default, for version='1.0' and version='2.4' (the default), nanoseconds are cast to microseconds ('us'), while for other version values they are written natively without loss of resolution. Seconds are always cast to milliseconds ('ms') by default, as Parquet does not have any temporal type with seconds resolution. If the casting results in loss of data, an exception is raised unless allow_truncated_timestamps=True is given. Valid values: {None, 'ms', 'us'}

data_page_size : int, default None

Set a target threshold for the approximate encoded size of data pages within a column chunk (in bytes). If None, use the default data page size of 1 MB.

allow_truncated_timestamps : bool, default False

Allow loss of data when coercing timestamps to a particular resolution. For example, if microsecond or nanosecond data is lost when coercing to 'ms', do not raise an exception. Passing allow_truncated_timestamps=True only suppresses the truncation exception when coerce_timestamps is also set (not None).

compression : str or dict

Specify the compression codec, either on a general basis or per-column. Valid values: {‘NONE’, ‘SNAPPY’, ‘GZIP’, ‘BROTLI’, ‘LZ4’, ‘ZSTD’}.

write_statistics : bool or list

Specify if we should write statistics in general (default is True) or only for some columns.

flavor : {'spark'}, default None

Sanitize schema or set other compatibility options to work with various target systems.

filesystem : FileSystem, default None

If nothing is passed, the filesystem is inferred from where when it is path-like. Otherwise where is already a file-like object, so no filesystem is needed.

compression_level : int or dict, default None

Specify the compression level for a codec, either on a general basis or per-column. If None is passed, Arrow selects the compression level for the compression codec in use. The compression level has a different meaning for each codec, so consult the documentation of the codec you are using. An exception is thrown if the compression codec does not allow specifying a compression level.

use_byte_stream_split : bool or list, default False

Specify if the byte_stream_split encoding should be used in general or only for some columns. If both dictionary and byte_stream_split are enabled, then dictionary is preferred. The byte_stream_split encoding is valid only for floating-point data types and should be combined with a compression codec.

column_encoding : str or dict, default None

Specify the encoding scheme on a per column basis. Currently supported values: {‘PLAIN’, ‘BYTE_STREAM_SPLIT’}. Certain encodings are only compatible with certain data types. Please refer to the encodings section of Reading and writing Parquet files.

data_page_version : {'1.0', '2.0'}, default '1.0'

The serialized Parquet data page format version to write, defaults to 1.0. This does not impact the file schema logical types and Arrow to Parquet type casting behavior; for that use the “version” option.

use_compliant_nested_type : bool, default False

Whether to write compliant Parquet nested types (lists) as defined in the Parquet format specification for nested types, defaults to False. With use_compliant_nested_type=True, lists are written with a 3-level structure where the middle level, named list, is a repeated group with a single field named element:

<list-repetition> group <name> (LIST) {
    repeated group list {
        <element-repetition> <element-type> element;
    }
}

For use_compliant_nested_type=False, this will also write into a list with 3-level structure, where the name of the single field of the middle level list is taken from the element name for nested columns in Arrow, which defaults to item:

<list-repetition> group <name> (LIST) {
    repeated group list {
        <element-repetition> <element-type> item;
    }
}

encryption_properties : FileEncryptionProperties, default None

File encryption properties for Parquet Modular Encryption. If None, no encryption will be done. The encryption properties can be created using: CryptoFactory.file_encryption_properties().

write_batch_size : int, default None

Number of values to write to a page at a time. If None, use the default of 1024. write_batch_size is complementary to data_page_size. If pages are exceeding the data_page_size due to large column values, lowering the batch size can help keep page sizes closer to the intended size.

dictionary_pagesize_limit : int, default None

Specify the dictionary page size limit per row group. If None, use the default of 1 MB.

writer_engine_version : unused

**options : dict

If options contains a key metadata_collector then the corresponding value is assumed to be a list (or any object with .append method) that will be filled with the file metadata instance of the written file.

Examples

Generate an example PyArrow Table and RecordBatch:

>>> import pyarrow as pa
>>> table = pa.table({'n_legs': [2, 2, 4, 4, 5, 100],
...                   'animal': ["Flamingo", "Parrot", "Dog", "Horse",
...                              "Brittle stars", "Centipede"]})
>>> batch = pa.record_batch([[2, 2, 4, 4, 5, 100],
...                         ["Flamingo", "Parrot", "Dog", "Horse",
...                          "Brittle stars", "Centipede"]],
...                         names=['n_legs', 'animal'])

Create a ParquetWriter object:

>>> import pyarrow.parquet as pq
>>> writer = pq.ParquetWriter('example.parquet', table.schema)

Then write the Table into the Parquet file:

>>> writer.write_table(table)
>>> writer.close()

>>> pq.read_table('example.parquet').to_pandas()
   n_legs         animal
0       2       Flamingo
1       2         Parrot
2       4            Dog
3       4          Horse
4       5  Brittle stars
5     100      Centipede

Create a ParquetWriter object for the RecordBatch:

>>> writer2 = pq.ParquetWriter('example2.parquet', batch.schema)

Then write the RecordBatch into the Parquet file:

>>> writer2.write_batch(batch)
>>> writer2.close()

>>> pq.read_table('example2.parquet').to_pandas()
   n_legs         animal
0       2       Flamingo
1       2         Parrot
2       4            Dog
3       4          Horse
4       5  Brittle stars
5     100      Centipede

__init__(where, schema, filesystem=None, flavor=None, version='2.4', use_dictionary=True, compression='snappy', write_statistics=True, use_deprecated_int96_timestamps=None, compression_level=None, use_byte_stream_split=False, column_encoding=None, writer_engine_version=None, data_page_version='1.0', use_compliant_nested_type=False, encryption_properties=None, write_batch_size=None, dictionary_pagesize_limit=None, **options)

Methods

`__init__`(where, schema[, filesystem, ...])
`close`()	Close the connection to the Parquet file.
`write`(table_or_batch[, row_group_size])	Write RecordBatch or Table to the Parquet file.
`write_batch`(batch[, row_group_size])	Write RecordBatch to the Parquet file.
`write_table`(table[, row_group_size])	Write Table to the Parquet file.

close()

Close the connection to the Parquet file.

write(table_or_batch, row_group_size=None)

Write RecordBatch or Table to the Parquet file.

Parameters:

table_or_batch : {RecordBatch, Table}

row_group_size : int, default None

Maximum size of each written row group. If None, the row group size will be the minimum of the input table or batch length and 64 * 1024 * 1024.

write_batch(batch, row_group_size=None)

Write RecordBatch to the Parquet file.

Parameters:

batch : RecordBatch

row_group_size : int, default None

Maximum size of each written row group. If None, the row group size will be the minimum of the RecordBatch size and 64 * 1024 * 1024.

write_table(table, row_group_size=None)

Write Table to the Parquet file.

Parameters:

table : Table

row_group_size : int, default None

Maximum size of each written row group. If None, the row group size will be the minimum of the Table size and 64 * 1024 * 1024.
