pyarrow.parquet.write_to_dataset

pyarrow.parquet.write_to_dataset(table, root_path, partition_cols=None, partition_filename_cb=None, filesystem=None, use_legacy_dataset=None, **kwargs)[source]

Wrapper around parquet.write_table for writing a Table to Parquet format by partitions. For each combination of partition columns and values, subdirectories are created in the following manner:

root_dir/
  group1=value1
    group2=value1
      <uuid>.parquet
    group2=value2
      <uuid>.parquet
  group1=valueN
    group2=value1
      <uuid>.parquet
    group2=valueN
      <uuid>.parquet
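For example, a partitioned write might look like the following sketch; the column names ("year", "state"), values, and the output path are illustrative placeholders, not part of the API:

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "year": [2020, 2020, 2021, 2021],
    "state": ["CA", "NY", "CA", "NY"],
    "amount": [1.0, 2.0, 3.0, 4.0],
})

# Produces /tmp/sales_dataset/year=.../state=.../<uuid>.parquet,
# mirroring the layout shown above.
pq.write_to_dataset(table, "/tmp/sales_dataset",
                    partition_cols=["year", "state"])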

Parameters
  • table (pyarrow.Table) – The table to write.

  • root_path (str, pathlib.Path) – The root directory of the dataset

  • filesystem (FileSystem, default None) – If nothing passed, paths assumed to be found in the local on-disk filesystem

  • partition_cols (list) – Column names by which to partition the dataset. Columns are partitioned in the order they are given.

  • partition_filename_cb (callable) – A callback function that takes the partition key(s) as an argument and allows you to override the partition filename. If nothing is passed, the filename will consist of a uuid (a usage sketch follows this parameter list).

  • use_legacy_dataset (bool) – Default is True unless a pyarrow.fs filesystem is passed. Set to False to enable the new code path (experimental, using the new Arrow Dataset API). This is more efficient when using partition columns, but does not (yet) support partition_filename_cb and metadata_collector keywords.

  • **kwargs (dict) – Additional kwargs for the write_table function. See the docstring for write_table or ParquetWriter for more information. Using metadata_collector in kwargs allows one to collect the file metadata instances of dataset pieces. The file paths in the ColumnChunkMetaData will be set relative to root_path (see the second sketch following this list).
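A sketch of overriding the generated filenames via partition_filename_cb (legacy code path only, hence use_legacy_dataset=True); the naming scheme, table, and path are assumptions for illustration:

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"year": [2020, 2021], "amount": [1.0, 2.0]})

def name_by_partition(keys):
    # keys is the tuple of partition values for the piece, e.g. (2020,)
    return "part-" + "-".join(str(k) for k in keys) + ".parquet"

pq.write_to_dataset(table, "/tmp/sales_dataset",
                    partition_cols=["year"],
                    partition_filename_cb=name_by_partition,
                    use_legacy_dataset=True)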
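A sketch of collecting per-file metadata through the metadata_collector kwarg; again the table and path are placeholders:

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"year": [2020, 2021], "amount": [1.0, 2.0]})

metadata_collector = []
pq.write_to_dataset(table, "/tmp/sales_dataset",
                    partition_cols=["year"],
                    metadata_collector=metadata_collector)

# The file path relative to root_path is recorded on each column chunk
# of the collected FileMetaData instances.
for md in metadata_collector:
    print(md.row_group(0).column(0).file_path)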