pyarrow.parquet.ParquetWriter¶
class pyarrow.parquet.ParquetWriter(where, schema, filesystem=None, flavor=None, version='1.0', use_dictionary=True, compression='snappy', write_statistics=True, use_deprecated_int96_timestamps=None, compression_level=None, use_byte_stream_split=False, writer_engine_version=None, data_page_version='1.0', use_compliant_nested_type=False, **options)[source]¶

Bases: object

Class for incrementally building a Parquet file for Arrow tables.
Parameters
where (path or file-like object) –
schema (arrow Schema) –
version ({"1.0", "2.0"}, default "1.0") – Determine which Parquet logical types are available for use, whether the reduced set from the Parquet 1.x.x format or the expanded logical types added in format version 2.0.0 and after. Note that files written with version='2.0' may not be readable in all Parquet implementations, so version='1.0' is likely the choice that maximizes file compatibility. Some features, such as lossless storage of nanosecond timestamps as INT64 physical storage, are only available with version='2.0'. The Parquet 2.0.0 format version also introduced a new serialized data page format; this can be enabled separately using the data_page_version option.
use_dictionary (bool or list) – Specify if we should use dictionary encoding in general or only for some columns.
use_deprecated_int96_timestamps (bool, default None) – Write timestamps to INT96 Parquet format. Defaults to False unless enabled by the flavor argument. This takes priority over the coerce_timestamps option.
coerce_timestamps (str, default None) – Cast timestamps to a particular resolution. The default depends on version. For version='1.0' (the default), nanoseconds will be cast to microseconds ('us'), and seconds to milliseconds ('ms') by default. For version='2.0', the original resolution is preserved and no casting is done by default. The casting might result in loss of data, in which case allow_truncated_timestamps=True can be used to suppress the raised exception. Valid values: {None, 'ms', 'us'}.
data_page_size (int, default None) – Set a target threshold for the approximate encoded size of data pages within a column chunk (in bytes). If None, use the default data page size of 1 MByte.
allow_truncated_timestamps (bool, default False) – Allow loss of data when coercing timestamps to a particular resolution. E.g. if microsecond or nanosecond data is lost when coercing to 'ms', do not raise an exception. Passing allow_truncated_timestamps=True will NOT result in the truncation exception being ignored unless coerce_timestamps is not None.
compression (str or dict) – Specify the compression codec, either on a general basis or per-column. Valid values: {'NONE', 'SNAPPY', 'GZIP', 'BROTLI', 'LZ4', 'ZSTD'}.
write_statistics (bool or list) – Specify if we should write statistics in general (default is True) or only for some columns.
flavor ({'spark'}, default None) – Sanitize schema or set other compatibility options to work with various target systems.
filesystem (FileSystem, default None) – If nothing passed, will be inferred from where if path-like, else where is already a file-like object so no filesystem is needed.
compression_level (int or dict, default None) – Specify the compression level for a codec, either on a general basis or per-column. If None is passed, arrow selects the compression level for the compression codec in use. The compression level has a different meaning for each codec, so you have to read the documentation of the codec you are using. An exception is thrown if the compression codec does not allow specifying a compression level.
use_byte_stream_split (bool or list, default False) – Specify if the byte_stream_split encoding should be used in general or only for some columns. If both dictionary and byte_stream_split are enabled, then dictionary is preferred. The byte_stream_split encoding is valid only for floating-point data types and should be combined with a compression codec.
data_page_version ({"1.0", "2.0"}, default "1.0") – The serialized Parquet data page format version to write, defaults to 1.0. This does not impact the file schema logical types and Arrow to Parquet type casting behavior; for that use the “version” option.
use_compliant_nested_type (bool, default False) – Whether to write compliant Parquet nested types (lists) as defined here, defaults to False. For use_compliant_nested_type=True, this will write into a list with a 3-level structure where the middle level, named list, is a repeated group with a single field named element:

    <list-repetition> group <name> (LIST) {
        repeated group list {
            <element-repetition> <element-type> element;
        }
    }

For use_compliant_nested_type=False, this will also write into a list with a 3-level structure, where the name of the single field of the middle level list is taken from the element name for nested columns in Arrow, which defaults to item:

    <list-repetition> group <name> (LIST) {
        repeated group list {
            <element-repetition> <element-type> item;
        }
    }
**options (dict) – If options contains a key metadata_collector then the corresponding value is assumed to be a list (or any object with .append method) that will be filled with the file metadata instance of the written file.
__init__(where, schema, filesystem=None, flavor=None, version='1.0', use_dictionary=True, compression='snappy', write_statistics=True, use_deprecated_int96_timestamps=None, compression_level=None, use_byte_stream_split=False, writer_engine_version=None, data_page_version='1.0', use_compliant_nested_type=False, **options)[source]¶

Initialize self. See help(type(self)) for accurate signature.
Methods
__init__(where, schema[, filesystem, …]) – Initialize self.
close()
write_table(table[, row_group_size])