Dataset

Warning

The pyarrow.dataset module is experimental (specifically the classes), and a stable API is not yet guaranteed.

Factory functions

dataset(source[, schema, format, …])

Open a dataset.
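
A minimal sketch of this common entry point, assuming a hypothetical local directory data/ containing Parquet files:

    import pyarrow.dataset as ds

    # Discover the files under data/ (hypothetical path) as a dataset.
    dataset = ds.dataset("data/", format="parquet")

    # Materialize everything as a pyarrow.Table.
    table = dataset.to_table()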

parquet_dataset(metadata_path[, schema, …])

Create a FileSystemDataset from a _metadata file created via pyarrow.parquet.write_metadata.
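
A sketch of reading from such a metadata file, assuming data/_metadata was written earlier with pyarrow.parquet.write_metadata:

    import pyarrow.dataset as ds

    # The _metadata file already records each fragment's path and statistics,
    # so no directory discovery is needed.
    dataset = ds.parquet_dataset("data/_metadata")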

partitioning([schema, field_names, flavor, …])

Specify a partitioning scheme.
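
For example, a Hive-flavored scheme over two made-up partition columns, passed on to dataset():

    import pyarrow as pa
    import pyarrow.dataset as ds

    # Directories like .../year=2019/month=11/ map to typed columns.
    part = ds.partitioning(
        pa.schema([("year", pa.int16()), ("month", pa.int8())]),
        flavor="hive",
    )
    dataset = ds.dataset("data/", format="parquet", partitioning=part)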

field(name)

Reference a named column of the dataset.

scalar(value)

Expression representing a scalar value.
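
field() and scalar() both build Expression objects, so they compose with comparison and boolean operators; the column names below are hypothetical:

    import pyarrow.dataset as ds

    # Plain Python values are coerced, so ds.scalar() is often implicit.
    expr = (ds.field("year") >= ds.scalar(2019)) & (ds.field("month") == 1)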

write_dataset(data, base_dir[, …])

Write a dataset to a given format and partitioning.
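
A short sketch that writes an in-memory table out as a Hive-partitioned Parquet dataset (the output path is hypothetical):

    import pyarrow as pa
    import pyarrow.dataset as ds

    table = pa.table({"year": [2019, 2020], "value": [1.0, 2.0]})

    ds.write_dataset(
        table,
        "out/",
        format="parquet",
        partitioning=ds.partitioning(pa.schema([("year", pa.int64())]), flavor="hive"),
    )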

Classes

FileFormat()

ParquetFileFormat([read_options, …])

FileFormat for Parquet files.
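
For instance, reading a made-up column as dictionary-encoded; note that the fields accepted by ParquetReadOptions vary between pyarrow versions:

    import pyarrow.dataset as ds

    fmt = ds.ParquetFileFormat(
        read_options=ds.ParquetReadOptions(dictionary_columns=["category"])
    )
    dataset = ds.dataset("data/", format=fmt)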

IpcFileFormat()

CsvFileFormat(ParseOptions parse_options=None)

FileFormat for CSV files.
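
The parse options are the ones from pyarrow.csv; for example, for pipe-delimited files (path hypothetical):

    import pyarrow.dataset as ds
    from pyarrow import csv

    fmt = ds.CsvFileFormat(parse_options=csv.ParseOptions(delimiter="|"))
    dataset = ds.dataset("data/", format=fmt)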

Partitioning()

PartitioningFactory()

DirectoryPartitioning(Schema schema[, …])

A Partitioning based on a specified Schema.

HivePartitioning(Schema schema[, …])

A Partitioning for “/$key=$value/” nested directories as found in Apache Hive.
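
Both partitioning classes can also be constructed directly instead of going through the partitioning() factory; the schema below is illustrative:

    import pyarrow as pa
    import pyarrow.dataset as ds

    schema = pa.schema([("year", pa.int16()), ("month", pa.int8())])

    # Bare segments: .../2019/11/... -> year=2019, month=11
    dir_part = ds.DirectoryPartitioning(schema)

    # Key=value segments: .../year=2019/month=11/...
    hive_part = ds.HivePartitioning(schema)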

Dataset()

Collection of data fragments and potentially child datasets.

FileSystemDataset(fragments, Schema schema, …)

A Dataset of file fragments.
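
In practice this is the concrete type returned by dataset() for filesystem sources; a sketch with a hypothetical directory:

    import pyarrow.dataset as ds

    dataset = ds.dataset("data/", format="parquet")
    print(type(dataset).__name__)  # FileSystemDataset
    print(dataset.files)           # the paths backing the dataset's fragments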

FileSystemFactoryOptions([…])

Influences the discovery of filesystem paths.

FileSystemDatasetFactory(…)

Create a DatasetFactory from a list of paths with schema inspection.
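
A low-level sketch combining FileSystemFactoryOptions and FileSystemDatasetFactory, assuming Parquet files under a local data/ directory:

    import pyarrow.dataset as ds
    from pyarrow import fs

    filesystem = fs.LocalFileSystem()
    selector = fs.FileSelector("data", recursive=True)
    options = ds.FileSystemFactoryOptions(partition_base_dir="data")

    factory = ds.FileSystemDatasetFactory(
        filesystem, selector, ds.ParquetFileFormat(), options
    )
    dataset = factory.finish()  # inspects file schemas, returns a FileSystemDataset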

UnionDataset(Schema schema, children)

A Dataset wrapping child datasets.
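
Passing a list of datasets to the dataset() factory produces one; the child paths here are made up:

    import pyarrow.dataset as ds

    child_a = ds.dataset("data_2019/", format="parquet")
    child_b = ds.dataset("data_2020/", format="parquet")
    combined = ds.dataset([child_a, child_b])  # a UnionDataset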

Scanner()

A materialized scan operation with context and options bound.
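
Convenience methods such as Dataset.to_table() build and consume a Scanner internally, which is where projection and filtering are applied; the column names below are hypothetical:

    import pyarrow.dataset as ds

    dataset = ds.dataset("data/", format="parquet")

    # Only the requested columns are read, and the filter is pushed
    # down to the scan.
    table = dataset.to_table(
        columns=["year", "value"],
        filter=ds.field("year") >= 2019,
    )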

Expression()

A logical expression to be evaluated against some input.