Dataset

Factory functions

dataset(source[, schema, format, ...])

Open a dataset.

parquet_dataset(metadata_path[, schema, ...])

Create a FileSystemDataset from a _metadata file created via pyarrow.parquet.write_metadata.

partitioning([schema, field_names, flavor, ...])

Specify a partitioning scheme.

field(*name_or_index)

Reference a column of the dataset.

scalar(value)

Expression representing a scalar value.

write_dataset(data, base_dir, *[, ...])

Write a dataset to a given format and partitioning.

Classes

FileFormat()

CsvFileFormat(ParseOptions parse_options=None)

FileFormat for CSV files.

CsvFragmentScanOptions(...)

Scan-specific options for CSV fragments.

IpcFileFormat()

ParquetFileFormat([read_options, ...])

FileFormat for Parquet files.

ParquetReadOptions([dictionary_columns, ...])

Parquet format specific options for reading.

ParquetFragmentScanOptions(...[, ...])

Scan-specific options for Parquet fragments.

OrcFileFormat()

Partitioning()

PartitioningFactory()

DirectoryPartitioning(Schema schema[, ...])

A Partitioning based on a specified Schema.

HivePartitioning(Schema schema[, ...])

A Partitioning for "/$key=$value/" nested directories as found in Apache Hive.

FilenamePartitioning(Schema schema[, ...])

A Partitioning based on a specified Schema, with partition segments embedded in file names rather than directory paths.

Dataset()

Collection of data fragments and potentially child datasets.

FileSystemDataset(fragments, Schema schema, ...)

A Dataset of file fragments.

FileSystemFactoryOptions([...])

Influences the discovery of filesystem paths.

FileSystemDatasetFactory(...)

Create a DatasetFactory from a list of paths with schema inspection.

UnionDataset(Schema schema, children)

A Dataset wrapping child datasets.

Fragment()

Fragment of data from a Dataset.

FragmentScanOptions()

Scan options specific to a particular fragment and scan operation.

TaggedRecordBatch(record_batch, fragment)

A combination of a record batch and the fragment it came from.

Scanner()

A materialized scan operation with context and options bound.

Expression()

A logical expression to be evaluated against some input.

InMemoryDataset(source, Schema schema=None)

A Dataset wrapping in-memory data.

WrittenFile(path, metadata, size)

Metadata information about files written as part of a dataset write operation.

Helper functions

get_partition_keys(...)

Extract partition keys (equality constraints between a field and a scalar) from an expression as a dict mapping the field's name to its value.