pyarrow.dataset.UnionDataset

class pyarrow.dataset.UnionDataset(Schema schema, children)

Bases: Dataset

A Dataset wrapping child datasets.

Children’s schemas must agree with the provided schema.

Parameters:
schemaSchema

A known schema to conform to.

childrenlist of Dataset

One or more input children

__init__(*args, **kwargs)

Methods

__init__(*args, **kwargs)

count_rows(self, **kwargs)

Count rows matching the scanner filter.

filter(self, expression)

Apply a row filter to the dataset.

get_fragments(self, Expression filter=None)

Returns an iterator over the fragments in this dataset.

head(self, int num_rows, **kwargs)

Load the first N rows of the dataset.

join(self, right_dataset, keys[, ...])

Perform a join between this dataset and another one.

replace_schema(self, Schema schema)

Return a copy of this Dataset with a different schema.

scanner(self, **kwargs)

Build a scan operation against the dataset.

sort_by(self, sorting, **kwargs)

Sort the Dataset by one or multiple columns.

take(self, indices, **kwargs)

Select rows of data by index.

to_batches(self, **kwargs)

Read the dataset as materialized record batches.

to_table(self, **kwargs)

Read the dataset to an Arrow table.

Attributes

children

partition_expression

An Expression which evaluates to true for all data viewed by this Dataset.

schema

The common schema of the full Dataset

children
count_rows(self, **kwargs)

Count rows matching the scanner filter.

Parameters:
**kwargsdict, optional

See scanner() method for full parameter description.

Returns:
countint
filter(self, expression)

Apply a row filter to the dataset.

Parameters:
expressionExpression

The filter that should be applied to the dataset.

Returns:
Dataset
get_fragments(self, Expression filter=None)

Returns an iterator over the fragments in this dataset.

Parameters:
filterExpression, default None

Return fragments matching the optional filter, either using the partition_expression or internal information like Parquet’s statistics.

Returns:
fragmentsiterator of Fragment
head(self, int num_rows, **kwargs)

Load the first N rows of the dataset.

Parameters:
num_rowsint

The number of rows to load.

**kwargsdict, optional

See scanner() method for full parameter description.

Returns:
tableTable
join(self, right_dataset, keys, right_keys=None, join_type='left outer', left_suffix=None, right_suffix=None, coalesce_keys=True, use_threads=True)

Perform a join between this dataset and another one.

Result of the join will be a new dataset, where further operations can be applied.

Parameters:
right_datasetdataset

The dataset to join to the current one, acting as the right dataset in the join operation.

keysstr or list[str]

The columns from current dataset that should be used as keys of the join operation left side.

right_keysstr or list[str], default None

The columns from the right_dataset that should be used as keys on the join operation right side. When None use the same key names as the left dataset.

join_typestr, default “left outer”

The kind of join that should be performed, one of (“left semi”, “right semi”, “left anti”, “right anti”, “inner”, “left outer”, “right outer”, “full outer”)

left_suffixstr, default None

Which suffix to add to right column names. This prevents confusion when the columns in left and right datasets have colliding names.

right_suffixstr, default None

Which suffic to add to the left column names. This prevents confusion when the columns in left and right datasets have colliding names.

coalesce_keysbool, default True

If the duplicated keys should be omitted from one of the sides in the join result.

use_threadsbool, default True

Whenever to use multithreading or not.

Returns:
InMemoryDataset
partition_expression

An Expression which evaluates to true for all data viewed by this Dataset.

replace_schema(self, Schema schema)

Return a copy of this Dataset with a different schema.

The copy will view the same Fragments. If the new schema is not compatible with the original dataset’s schema then an error will be raised.

Parameters:
schemaSchema

The new dataset schema.

scanner(self, **kwargs)

Build a scan operation against the dataset.

Data is not loaded immediately. Instead, this produces a Scanner, which exposes further operations (e.g. loading all data as a table, counting rows).

See the Scanner.from_dataset() method for further information.

Parameters:
**kwargsdict, optional

Arguments for Scanner.from_dataset.

Returns:
scannerScanner

Examples

>>> import pyarrow as pa
>>> table = pa.table({'year': [2020, 2022, 2021, 2022, 2019, 2021],
...                   'n_legs': [2, 2, 4, 4, 5, 100],
...                   'animal': ["Flamingo", "Parrot", "Dog", "Horse",
...                              "Brittle stars", "Centipede"]})
>>>
>>> import pyarrow.parquet as pq
>>> pq.write_table(table, "dataset_scanner.parquet")
>>> import pyarrow.dataset as ds
>>> dataset = ds.dataset("dataset_scanner.parquet")

Selecting a subset of the columns:

>>> dataset.scanner(columns=["year", "n_legs"]).to_table()
pyarrow.Table
year: int64
n_legs: int64
----
year: [[2020,2022,2021,2022,2019,2021]]
n_legs: [[2,2,4,4,5,100]]

Projecting selected columns using an expression:

>>> dataset.scanner(columns={
...     "n_legs_uint": ds.field("n_legs").cast("uint8"),
... }).to_table()
pyarrow.Table
n_legs_uint: uint8
----
n_legs_uint: [[2,2,4,4,5,100]]

Filtering rows while scanning:

>>> dataset.scanner(filter=ds.field("year") > 2020).to_table()
pyarrow.Table
year: int64
n_legs: int64
animal: string
----
year: [[2022,2021,2022,2021]]
n_legs: [[2,4,4,100]]
animal: [["Parrot","Dog","Horse","Centipede"]]
schema

The common schema of the full Dataset

sort_by(self, sorting, **kwargs)

Sort the Dataset by one or multiple columns.

Parameters:
sortingstr or list[tuple(name, order)]

Name of the column to use to sort (ascending), or a list of multiple sorting conditions where each entry is a tuple with column name and sorting order (“ascending” or “descending”)

**kwargsdict, optional

Additional sorting options. As allowed by SortOptions

Returns:
InMemoryDataset

A new dataset sorted according to the sort keys.

take(self, indices, **kwargs)

Select rows of data by index.

Parameters:
indicesArray or array-like

indices of rows to select in the dataset.

**kwargsdict, optional

See scanner() method for full parameter description.

Returns:
tableTable
to_batches(self, **kwargs)

Read the dataset as materialized record batches.

Parameters:
**kwargsdict, optional

Arguments for Scanner.from_dataset.

Returns:
record_batchesiterator of RecordBatch
to_table(self, **kwargs)

Read the dataset to an Arrow table.

Note that this method reads all the selected data from the dataset into memory.

Parameters:
**kwargsdict, optional

Arguments for Scanner.from_dataset.

Returns:
tableTable