pyarrow.parquet.write_to_dataset
- pyarrow.parquet.write_to_dataset(table, root_path, partition_cols=None, filesystem=None, schema=None, partitioning=None, basename_template=None, use_threads=None, file_visitor=None, existing_data_behavior=None, **kwargs)
- Wrapper around dataset.write_dataset for writing a Table to Parquet format by partitions. For each combination of partition columns and values, a subdirectory is created in the following manner:
- root_dir/
  - group1=value1
    - group2=value1
      - <uuid>.parquet
    - group2=value2
      - <uuid>.parquet
  - group1=valueN
    - group2=value1
      - <uuid>.parquet
    - group2=valueN
      - <uuid>.parquet

 - Parameters:
- table : pyarrow.Table
- root_path : str, pathlib.Path
- The root directory of the dataset. 
- partition_cols : list
- Column names by which to partition the dataset. Columns are partitioned in the order they are given. 
- filesystem : FileSystem, default None
- If nothing is passed, the filesystem is inferred from the path. The path is first looked up in the local on-disk filesystem; if it is not found there, it is parsed as a URI to determine the filesystem.
- schema : Schema, optional
- The schema of the dataset.
- partitioning : Partitioning or list[str], optional
- The partitioning scheme specified with the pyarrow.dataset.partitioning() function or a list of field names. When providing a list of field names, you can use partitioning_flavor to drive which partitioning type should be used.
- basename_template : str, optional
- A template string used to generate basenames of written data files. The token ‘{i}’ will be replaced with an automatically incremented integer. If not specified, it defaults to “guid-{i}.parquet”. 
- use_threads : bool, default True
- Write files in parallel. If enabled, maximum parallelism is used, as determined by the number of available CPU cores.
- file_visitor : function
- If set, this function will be called with a WrittenFile instance for each file created during the call. This object will have both a path attribute and a metadata attribute.
- The path attribute will be a string containing the path to the created file.
- The metadata attribute will be the parquet metadata of the file. This metadata will have the file path attribute set and can be used to build a _metadata file. The metadata attribute will be None if the format is not parquet.
- Example visitor which simply collects the filenames created:
      visited_paths = []

      def file_visitor(written_file):
          visited_paths.append(written_file.path)
- existing_data_behavior : ‘overwrite_or_ignore’ | ‘error’ | ‘delete_matching’
- Controls how the dataset will handle data that already exists in the destination. The default behaviour is ‘overwrite_or_ignore’.
- ‘overwrite_or_ignore’ will ignore any existing data and will overwrite files with the same name as an output file. Other existing files will be ignored. This behavior, in combination with a unique basename_template for each write, will allow for an append workflow.
- ‘error’ will raise an error if any data exists in the destination.
- ‘delete_matching’ is useful when you are writing a partitioned dataset. The first time each partition directory is encountered the entire directory will be deleted. This allows you to overwrite old partitions completely.
- **kwargs : dict
- Used as additional kwargs for the pyarrow.dataset.write_dataset() function for matching kwargs, and the remainder is passed to pyarrow.dataset.ParquetFileFormat.make_write_options(). See the docstring of write_table() and pyarrow.dataset.write_dataset() for the available options. Using metadata_collector in kwargs allows one to collect the file metadata instances of dataset pieces. The file paths in the ColumnChunkMetaData will be set relative to root_path.
 
 - Examples
- Generate an example PyArrow Table:
      >>> import pyarrow as pa
      >>> table = pa.table({'year': [2020, 2022, 2021, 2022, 2019, 2021],
      ...                   'n_legs': [2, 2, 4, 4, 5, 100],
      ...                   'animal': ["Flamingo", "Parrot", "Dog", "Horse",
      ...                              "Brittle stars", "Centipede"]})
- and write it to a partitioned dataset:
      >>> import pyarrow.parquet as pq
      >>> pq.write_to_dataset(table, root_path='dataset_name_3',
      ...                     partition_cols=['year'])
      >>> pq.ParquetDataset('dataset_name_3').files
      ['dataset_name_3/year=2019/...-0.parquet', ...
- Write a single Parquet file into the root folder:
      >>> pq.write_to_dataset(table, root_path='dataset_name_4')
      >>> pq.ParquetDataset('dataset_name_4/').files
      ['dataset_name_4/...-0.parquet']
 
    