Datasets

Warning

The pyarrow.dataset module is experimental (specifically the classes), and a stable API is not yet guaranteed.

Factory functions

source(path_or_paths[, filesystem, …])

Open a (multi-file) data source.

dataset(sources[, filesystem, partitioning, …])

Open a (multi-source) dataset.

partitioning([schema, field_names, flavor])

Specify a partitioning scheme.

field(name)

Reference a named column of the dataset.

scalar(value)

Expression representing a scalar value.

Classes

FileFormat

ParquetFileFormat

Partitioning

PartitioningFactory

DirectoryPartitioning

A Partitioning based on a specified Schema.
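
The idea behind a schema-based directory partitioning can be sketched in plain Python: path segments are matched positionally against the schema fields. This is only a conceptual sketch, not pyarrow's implementation; parse_directory_path and the (name, cast) schema representation are hypothetical helpers introduced for illustration.

```python
def parse_directory_path(path, schema):
    """Map path segments, in order, onto (name, cast) schema fields."""
    segments = [segment for segment in path.split("/") if segment]
    return {name: cast(segment)
            for (name, cast), segment in zip(schema, segments)}

# Hypothetical schema: the first directory level is the year,
# the second is the month, both integers.
schema = [("year", int), ("month", int)]
partition_values = parse_directory_path("/2009/11", schema)
```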

HivePartitioning

A Partitioning for “/$key=$value/” nested directories as found in Apache Hive.
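
To make the "/$key=$value/" convention concrete, the following plain-Python sketch extracts partition keys and values from a file path. It is a conceptual illustration rather than pyarrow's own parsing logic; parse_hive_path is a hypothetical helper, and values are left as strings instead of being type-inferred.

```python
from pathlib import PurePosixPath

def parse_hive_path(path):
    """Return {key: value} for every 'key=value' directory segment."""
    pairs = {}
    # Only directory components are inspected, not the file name.
    for segment in PurePosixPath(path).parent.parts:
        key, sep, value = segment.partition("=")
        if sep and key:
            pairs[key] = value
    return pairs

partition_values = parse_hive_path("year=2019/month=11/part-0.parquet")
```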

Source

Basic component of a Dataset which yields zero or more fragments.

FileSystemSource

A Source created from a set of files on a particular filesystem.

FileSystemFactoryOptions

Influences the discovery of filesystem paths.

FileSystemSourceFactory

Create a SourceFactory from a list of paths with schema inspection.

Dataset

Collection of data fragments coming from possibly multiple sources.

ScannerBuilder

Scanner

A materialized scan operation with context and options bound.

Expression