pyarrow.parquet.write_to_dataset

pyarrow.parquet.write_to_dataset(table, root_path, partition_cols=None, partition_filename_cb=None, filesystem=None, use_legacy_dataset=None, **kwargs)[source]

Wrapper around parquet.write_table for writing a Table to Parquet format by partitions. For each combination of partition columns and values, subdirectories are created in the following manner:

root_dir/
  group1=value1
    group2=value1
      <uuid>.parquet
    group2=value2
      <uuid>.parquet
  group1=valueN
    group2=value1
      <uuid>.parquet
    group2=valueN
      <uuid>.parquet
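For example, a partitioned write might look like the following sketch; the column names ("year", "state"), values, and the output path are illustrative placeholders, not part of the API:

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "year": [2020, 2020, 2021, 2021],
    "state": ["CA", "NY", "CA", "NY"],
    "amount": [1.0, 2.0, 3.0, 4.0],
})

# Produces /tmp/sales_dataset/year=.../state=.../<uuid>.parquet,
# mirroring the layout shown above.
pq.write_to_dataset(table, "/tmp/sales_dataset",
                    partition_cols=["year", "state"])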

Parameters
  • table (pyarrow.Table) – The table to write.

  • root_path (str, pathlib.Path) – The root directory of the dataset

  • filesystem (FileSystem, default None) – If nothing passed, paths assumed to be found in the local on-disk filesystem

  • partition_cols (list) – Column names by which to partition the dataset. Columns are partitioned in the order they are given.

  • partition_filename_cb (callable) – A callback function that takes the partition key(s) as an argument and allows you to override the partition filename. If nothing is passed, the filename will consist of a uuid (a usage sketch follows this parameter list).

  • use_legacy_dataset (bool) – Default is True unless a pyarrow.fs filesystem is passed. Set to False to enable the new code path (experimental, using the new Arrow Dataset API). This is more efficient when using partition columns, but does not (yet) support partition_filename_cb and metadata_collector keywords.

  • **kwargs (dict) – Additional kwargs for the write_table function. See the docstring for write_table or ParquetWriter for more information. Using metadata_collector in kwargs allows one to collect the file metadata instances of dataset pieces. The file paths in the ColumnChunkMetaData will be set relative to root_path (see the second sketch following this list).
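A sketch of overriding the generated filenames via partition_filename_cb (legacy code path only, hence use_legacy_dataset=True); the naming scheme, table, and path are assumptions for illustration:

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"year": [2020, 2021], "amount": [1.0, 2.0]})

def name_by_partition(keys):
    # keys is the tuple of partition values for the piece, e.g. (2020,)
    return "part-" + "-".join(str(k) for k in keys) + ".parquet"

pq.write_to_dataset(table, "/tmp/sales_dataset",
                    partition_cols=["year"],
                    partition_filename_cb=name_by_partition,
                    use_legacy_dataset=True)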
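A sketch of collecting per-file metadata through the metadata_collector kwarg; again the table and path are placeholders:

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"year": [2020, 2021], "amount": [1.0, 2.0]})

metadata_collector = []
pq.write_to_dataset(table, "/tmp/sales_dataset",
                    partition_cols=["year"],
                    metadata_collector=metadata_collector)

# The file path relative to root_path is recorded on each column chunk
# of the collected FileMetaData instances.
for md in metadata_collector:
    print(md.row_group(0).column(0).file_path)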