pyarrow.dataset.Dataset

class pyarrow.dataset.Dataset

Bases: pyarrow.lib._Weakrefable

Collection of data fragments and potentially child datasets.

Arrow Datasets allow you to query against data that has been split across multiple files. This sharding of data may indicate partitioning, which can accelerate queries that only touch some partitions (files).
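
For example, a minimal sketch of opening a directory of Parquet files as a dataset (the path here is hypothetical):

>>> import pyarrow.dataset as ds
>>> dataset = ds.dataset("path/to/dataset", format="parquet")
>>> dataset.schema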

__init__(*args, **kwargs)

Methods

__init__(*args, **kwargs)

count_rows(self, **kwargs)

Count rows matching the scanner filter.

get_fragments(self, Expression filter=None)

Returns an iterator over the fragments in this dataset.

head(self, int num_rows, **kwargs)

Load the first N rows of the dataset.

replace_schema(self, Schema schema)

Return a copy of this Dataset with a different schema.

scanner(self, **kwargs)

Builds a scan operation against the dataset.

take(self, indices, **kwargs)

Select rows of data by index.

to_batches(self, **kwargs)

Read the dataset as materialized record batches.

to_table(self, **kwargs)

Read the dataset to an arrow table.

Attributes

partition_expression

An Expression which evaluates to true for all data viewed by this Dataset.

schema

The common schema of the full Dataset

count_rows(self, **kwargs)

Count rows matching the scanner filter.

See scanner method parameters documentation.

Returns
count : int
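
Example (a sketch; the path and column name are hypothetical):

>>> import pyarrow.dataset as ds
>>> dataset = ds.dataset("path/to/dataset")
>>> dataset.count_rows(filter=ds.field("A") > 0)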
get_fragments(self, Expression filter=None)

Returns an iterator over the fragments in this dataset.

Parameters
filter : Expression, default None

Return fragments matching the optional filter, either using the partition_expression or internal information like Parquet’s statistics.

Returns
fragments : iterator of Fragment
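
Example (a sketch, assuming a file-based dataset partitioned on a hypothetical "year" column):

>>> import pyarrow.dataset as ds
>>> dataset = ds.dataset("path/to/dataset", partitioning=["year"])
>>> fragments = list(dataset.get_fragments(filter=ds.field("year") == 2021))
>>> [fragment.path for fragment in fragments]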
head(self, int num_rows, **kwargs)

Load the first N rows of the dataset.

See scanner method parameters documentation.

Returns
Table
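
Example (a sketch; the path and column names are hypothetical):

>>> import pyarrow.dataset as ds
>>> dataset = ds.dataset("path/to/dataset")
>>> dataset.head(5, columns=["A", "B"])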
partition_expression

An Expression which evaluates to true for all data viewed by this Dataset.

replace_schema(self, Schema schema)

Return a copy of this Dataset with a different schema.

The copy will view the same Fragments. If the new schema is not compatible with the original dataset’s schema then an error will be raised.
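Example (a sketch, assuming the original schema contains an int64 column "A"; the new dataset views the same data through a reduced schema):

>>> import pyarrow as pa
>>> reduced_schema = pa.schema([pa.field("A", pa.int64())])
>>> dataset_view = dataset.replace_schema(reduced_schema)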

scanner(self, **kwargs)

Builds a scan operation against the dataset.

Data is not loaded immediately. Instead, this produces a Scanner, which exposes further operations (e.g. loading all data as a table, counting rows).

Parameters
columns : list of str, default None

The columns to project. This can be a list of column names to include (order and duplicates will be preserved), or a dictionary with {new_column_name: expression} values for more advanced projections. The columns will be passed down to Datasets and corresponding data fragments to avoid loading, copying, and deserializing columns that will not be required further down the compute chain. By default all of the available columns are projected. Raises an exception if any of the referenced column names does not exist in the dataset’s Schema.

filter : Expression, default None

Scan will return only the rows matching the filter. If possible the predicate will be pushed down to exploit the partition information or internal metadata found in the data source, e.g. Parquet statistics. Otherwise filters the loaded RecordBatches before yielding them.

batch_size : int, default 1M

The maximum row count for scanned record batches. If scanned record batches are overflowing memory, this value can be lowered to reduce their size.

use_threads : bool, default True

If enabled, maximum parallelism will be used, as determined by the number of available CPU cores.

use_async : bool, default True

This flag is deprecated and is being kept for this release for backwards compatibility. It will be removed in the next release.

memory_pool : MemoryPool, default None

For memory allocations, if required. If not specified, uses the default pool.

fragment_scan_options : FragmentScanOptions, default None

Options specific to a particular scan and fragment type, which can change between different scans of the same dataset.

Returns
scanner : Scanner

Examples

>>> import pyarrow.dataset as ds
>>> dataset = ds.dataset("path/to/dataset")

Selecting a subset of the columns:

>>> dataset.scanner(columns=["A", "B"]).to_table()

Projecting selected columns using an expression:

>>> dataset.scanner(columns={
...     "A_int": ds.field("A").cast("int64"),
... }).to_table()

Filtering rows while scanning:

>>> dataset.scanner(filter=ds.field("A") > 0).to_table()

schema

The common schema of the full Dataset

take(self, indices, **kwargs)

Select rows of data by index.

See scanner method parameters documentation.

Returns
Table
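
Example (a sketch; the path and column name are hypothetical):

>>> import pyarrow.dataset as ds
>>> dataset = ds.dataset("path/to/dataset")
>>> dataset.take([0, 2, 4], columns=["A"])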
to_batches(self, **kwargs)

Read the dataset as materialized record batches.

See scanner method parameters documentation.

Returns
record_batches : iterator of RecordBatch
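
Example (a sketch; processing the dataset incrementally, one record batch at a time):

>>> import pyarrow.dataset as ds
>>> dataset = ds.dataset("path/to/dataset")
>>> for batch in dataset.to_batches(columns=["A"], batch_size=64_000):
...     pass  # each batch is a RecordBatch of at most batch_size rows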
to_table(self, **kwargs)

Read the dataset to an arrow table.

Note that this method reads all the selected data from the dataset into memory.

See scanner method parameters documentation.

Returns
Table
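
Example (a sketch; the path and column names are hypothetical):

>>> import pyarrow.dataset as ds
>>> dataset = ds.dataset("path/to/dataset")
>>> table = dataset.to_table(columns=["A", "B"], filter=ds.field("A") > 0)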