pyarrow.dataset.Dataset

class pyarrow.dataset.Dataset

Bases: _Weakrefable

Collection of data fragments and potentially child datasets.

Arrow Datasets allow you to query against data that has been split across multiple files. This sharding of data may indicate partitioning, which can accelerate queries that only touch some partitions (files).
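
For example, a minimal sketch of querying a partitioned dataset (the "data/" directory layout and the year partition field are illustrative assumptions, not part of this API):

>>> import pyarrow.dataset as ds
>>> # Hypothetical directory of hive-partitioned Parquet files, e.g.
>>> # data/year=2020/part-0.parquet, data/year=2021/part-0.parquet, ...
>>> dataset = ds.dataset("data/", format="parquet", partitioning="hive")
>>> # A filter on the partition field only needs to touch matching files
>>> table = dataset.to_table(filter=ds.field("year") == 2021)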

__init__(*args, **kwargs)

Methods

__init__(*args, **kwargs)

count_rows(self, **kwargs)

Count rows matching the scanner filter.

get_fragments(self, Expression filter=None)

Returns an iterator over the fragments in this dataset.

head(self, int num_rows, **kwargs)

Load the first N rows of the dataset.

join(self, right_dataset, keys[, ...])

Perform a join between this dataset and another one.

replace_schema(self, Schema schema)

Return a copy of this Dataset with a different schema.

scanner(self, **kwargs)

Build a scan operation against the dataset.

take(self, indices, **kwargs)

Select rows of data by index.

to_batches(self, **kwargs)

Read the dataset as materialized record batches.

to_table(self, **kwargs)

Read the dataset to an Arrow table.

Attributes

partition_expression

An Expression which evaluates to true for all data viewed by this Dataset.

schema

The common schema of the full Dataset.

count_rows(self, **kwargs)

Count rows matching the scanner filter.

Parameters:
**kwargs : dict, optional

See scanner() method for full parameter description.

Returns:
count : int
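
A minimal sketch, assuming an in-memory table as the data source:

>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> dataset = ds.dataset(pa.table({'n_legs': [2, 2, 4, 4, 5, 100]}))
>>> dataset.count_rows()
6
>>> # Scanner keyword arguments such as filter apply here as well
>>> dataset.count_rows(filter=ds.field("n_legs") > 4)
2
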
get_fragments(self, Expression filter=None)

Returns an iterator over the fragments in this dataset.

Parameters:
filter : Expression, default None

Return fragments matching the optional filter, either using the partition_expression or internal information like Parquet’s statistics.

Returns:
fragments : iterator of Fragment
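
A minimal sketch, assuming a single Parquet file written locally (the file name is illustrative):

>>> import pyarrow as pa
>>> import pyarrow.parquet as pq
>>> import pyarrow.dataset as ds
>>> pq.write_table(pa.table({'n_legs': [2, 4]}), "dataset_fragments.parquet")
>>> dataset = ds.dataset("dataset_fragments.parquet")
>>> # Each fragment of a file-based dataset corresponds to one file
>>> len(list(dataset.get_fragments()))
1
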
head(self, int num_rows, **kwargs)

Load the first N rows of the dataset.

Parameters:
num_rows : int

The number of rows to load.

**kwargs : dict, optional

See scanner() method for full parameter description.

Returns:
table : Table
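
A minimal sketch, assuming an in-memory table as the data source:

>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> dataset = ds.dataset(pa.table({'n_legs': [2, 2, 4, 4, 5, 100]}))
>>> dataset.head(3)
pyarrow.Table
n_legs: int64
----
n_legs: [[2,2,4]]
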
join(self, right_dataset, keys, right_keys=None, join_type='left outer', left_suffix=None, right_suffix=None, coalesce_keys=True, use_threads=True)

Perform a join between this dataset and another one.

The result of the join will be a new dataset, on which further operations can be applied.

Parameters:
right_dataset : dataset

The dataset to join to the current one, acting as the right dataset in the join operation.

keys : str or list[str]

The columns from the current dataset that should be used as keys on the left side of the join operation.

right_keys : str or list[str], default None

The columns from the right_dataset that should be used as keys on the right side of the join operation. When None, use the same key names as the left dataset.

join_type : str, default "left outer"

The kind of join that should be performed, one of ("left semi", "right semi", "left anti", "right anti", "inner", "left outer", "right outer", "full outer").

left_suffix : str, default None

Which suffix to add to the left column names. This prevents confusion when the columns in the left and right datasets have colliding names.

right_suffix : str, default None

Which suffix to add to the right column names. This prevents confusion when the columns in the left and right datasets have colliding names.

coalesce_keys : bool, default True

Whether duplicated join keys should be omitted from one of the sides in the join result.

use_threads : bool, default True

Whether to use multithreading or not.

Returns:
InMemoryDataset
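
A minimal sketch of a left outer join between two in-memory datasets (the tables and key names are illustrative assumptions):

>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> left = ds.dataset(pa.table({'id': [1, 2, 3], 'year': [2019, 2020, 2021]}))
>>> right = ds.dataset(pa.table({'id': [3, 4], 'n_legs': [5, 100]}))
>>> # Default join_type is "left outer", so every row of the left dataset is kept
>>> joined = left.join(right, keys="id")
>>> joined.to_table().num_rows
3
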
partition_expression

An Expression which evaluates to true for all data viewed by this Dataset.

replace_schema(self, Schema schema)

Return a copy of this Dataset with a different schema.

The copy will view the same Fragments. If the new schema is not compatible with the original dataset’s schema then an error will be raised.

Parameters:
schema : Schema

The new dataset schema.
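
A minimal sketch of narrowing a dataset to a subset of its columns (the in-memory table is an illustrative assumption):

>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> dataset = ds.dataset(pa.table({'n_legs': [2, 4, 5],
...                                'animal': ["Flamingo", "Dog", "Brittle stars"]}))
>>> # The new schema must be a compatible projection of the original one
>>> narrowed = dataset.replace_schema(pa.schema([("n_legs", pa.int64())]))
>>> narrowed.schema
n_legs: int64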

scanner(self, **kwargs)

Build a scan operation against the dataset.

Data is not loaded immediately. Instead, this produces a Scanner, which exposes further operations (e.g. loading all data as a table, counting rows).

See the Scanner.from_dataset method for further information.

Parameters:
**kwargs : dict, optional

Arguments for Scanner.from_dataset.

Returns:
scanner : Scanner

Examples

>>> import pyarrow as pa
>>> table = pa.table({'year': [2020, 2022, 2021, 2022, 2019, 2021],
...                   'n_legs': [2, 2, 4, 4, 5, 100],
...                   'animal': ["Flamingo", "Parrot", "Dog", "Horse",
...                              "Brittle stars", "Centipede"]})
>>> 
>>> import pyarrow.parquet as pq
>>> pq.write_table(table, "dataset_scanner.parquet")
>>> import pyarrow.dataset as ds
>>> dataset = ds.dataset("dataset_scanner.parquet")

Selecting a subset of the columns:

>>> dataset.scanner(columns=["year", "n_legs"]).to_table()
pyarrow.Table
year: int64
n_legs: int64
----
year: [[2020,2022,2021,2022,2019,2021]]
n_legs: [[2,2,4,4,5,100]]

Projecting selected columns using an expression:

>>> dataset.scanner(columns={
...     "n_legs_uint": ds.field("n_legs").cast("uint8"),
... }).to_table()
pyarrow.Table
n_legs_uint: uint8
----
n_legs_uint: [[2,2,4,4,5,100]]

Filtering rows while scanning:

>>> dataset.scanner(filter=ds.field("year") > 2020).to_table()
pyarrow.Table
year: int64
n_legs: int64
animal: string
----
year: [[2022,2021,2022,2021]]
n_legs: [[2,4,4,100]]
animal: [["Parrot","Dog","Horse","Centipede"]]

schema

The common schema of the full Dataset.

take(self, indices, **kwargs)

Select rows of data by index.

Parameters:
indices : Array or array-like

The indices of the rows to select in the dataset.

**kwargs : dict, optional

See scanner() method for full parameter description.

Returns:
table : Table
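
A minimal sketch, assuming an in-memory table as the data source:

>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> dataset = ds.dataset(pa.table({'n_legs': [2, 2, 4, 4, 5, 100]}))
>>> dataset.take([1, 5])
pyarrow.Table
n_legs: int64
----
n_legs: [[2,100]]
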
to_batches(self, **kwargs)

Read the dataset as materialized record batches.

Parameters:
**kwargs : dict, optional

Arguments for Scanner.from_dataset.

Returns:
record_batches : iterator of RecordBatch
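
A minimal sketch, assuming an in-memory table as the data source; the batches stream one at a time rather than materializing the whole dataset at once:

>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> dataset = ds.dataset(pa.table({'n_legs': [2, 2, 4, 4, 5, 100]}))
>>> # The exact batching is an implementation detail; only the total is stable
>>> sum(batch.num_rows for batch in dataset.to_batches())
6
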
to_table(self, **kwargs)

Read the dataset to an Arrow table.

Note that this method reads all the selected data from the dataset into memory.

Parameters:
**kwargs : dict, optional

Arguments for Scanner.from_dataset.

Returns:
table : Table
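
A minimal sketch, assuming an in-memory table as the data source:

>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> dataset = ds.dataset(pa.table({'n_legs': [2, 2, 4, 4, 5, 100]}))
>>> # Scanner keyword arguments such as filter apply here as well
>>> dataset.to_table(filter=ds.field("n_legs") < 5)
pyarrow.Table
n_legs: int64
----
n_legs: [[2,2,4,4]]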