pyarrow.dataset.UnionDataset¶
- class pyarrow.dataset.UnionDataset(Schema schema, children)¶
Bases: Dataset
A Dataset wrapping child datasets.
Children’s schemas must agree with the provided schema.
- Parameters:
- schema : Schema
The common schema that all children must agree with.
- children : list of Dataset
The child datasets to wrap.
- __init__(*args, **kwargs)¶
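A UnionDataset is typically obtained by passing a list of existing datasets to the pyarrow.dataset.dataset() factory rather than by calling the constructor directly. A minimal sketch (the file names below are placeholders used only for this example):
>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> import pyarrow.parquet as pq
>>> pq.write_table(pa.table({'year': [2020, 2021], 'n_legs': [2, 4]}),
...                "union_part_a.parquet")
>>> pq.write_table(pa.table({'year': [2022, 2023], 'n_legs': [5, 100]}),
...                "union_part_b.parquet")
>>> child_a = ds.dataset("union_part_a.parquet")
>>> child_b = ds.dataset("union_part_b.parquet")
>>> union = ds.dataset([child_a, child_b])  # both children share the same schema
>>> type(union).__name__
'UnionDataset'
>>> union.to_table().num_rows
4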
Methods
- __init__(*args, **kwargs)
- count_rows(self, **kwargs): Count rows matching the scanner filter.
- filter(self, expression): Apply a row filter to the dataset.
- get_fragments(self, Expression filter=None): Returns an iterator over the fragments in this dataset.
- head(self, int num_rows, **kwargs): Load the first N rows of the dataset.
- join(self, right_dataset, keys[, ...]): Perform a join between this dataset and another one.
- replace_schema(self, Schema schema): Return a copy of this Dataset with a different schema.
- scanner(self, **kwargs): Build a scan operation against the dataset.
- sort_by(self, sorting, **kwargs): Sort the Dataset by one or multiple columns.
- take(self, indices, **kwargs): Select rows of data by index.
- to_batches(self, **kwargs): Read the dataset as materialized record batches.
- to_table(self, **kwargs): Read the dataset to an Arrow table.
Attributes
- children
- partition_expression: An Expression which evaluates to true for all data viewed by this Dataset.
- schema: The common schema of the full Dataset
- children¶
- count_rows(self, **kwargs)¶
Count rows matching the scanner filter.
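A short illustrative sketch (the in-memory table is only an example); the keyword arguments are the same as for scanner(), so a filter can be pushed into the count:
>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> dataset = ds.dataset(pa.table({'year': [2019, 2020, 2021, 2022],
...                                'n_legs': [5, 2, 4, 100]}))
>>> dataset.count_rows()
4
>>> dataset.count_rows(filter=ds.field("year") > 2020)
2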
- filter(self, expression)¶
Apply a row filter to the dataset.
- Parameters:
- expression : Expression
The filter that should be applied to the dataset.
- Returns:
- Dataset
The dataset with the filter applied.
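A minimal sketch of chaining filter() with a scan; the result is again a dataset, so the filter only takes effect when the data is actually read (the in-memory table is a placeholder for this example):
>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> dataset = ds.dataset(pa.table({'n_legs': [2, 4, 5, 100],
...                                'animal': ["Parrot", "Dog",
...                                           "Brittle stars", "Centipede"]}))
>>> filtered = dataset.filter(ds.field("n_legs") > 4)
>>> filtered.to_table().column("animal").to_pylist()
['Brittle stars', 'Centipede']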
- get_fragments(self, Expression filter=None)¶
Returns an iterator over the fragments in this dataset.
- Parameters:
- filter : Expression, default None
Return fragments matching the optional filter, either using the partition_expression or internal information like Parquet’s statistics.
- Returns:
- fragments : iterator of Fragment
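A short sketch using a single-file child dataset (the Parquet file name is a placeholder written only for this example); for file-based datasets each fragment typically corresponds to one data file:
>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> import pyarrow.parquet as pq
>>> pq.write_table(pa.table({'year': [2020, 2021], 'n_legs': [2, 4]}),
...                "fragments_example.parquet")
>>> dataset = ds.dataset("fragments_example.parquet")
>>> fragments = list(dataset.get_fragments())
>>> len(fragments)
1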
- head(self, int num_rows, **kwargs)¶
Load the first N rows of the dataset.
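A small sketch; head() forwards its keyword arguments to scanner(), so a column selection or filter can be combined with the row limit (the in-memory table is only an example):
>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> dataset = ds.dataset(pa.table({'year': [2019, 2020, 2021, 2022],
...                                'n_legs': [5, 2, 4, 100]}))
>>> dataset.head(2, columns=["year"]).to_pydict()
{'year': [2019, 2020]}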
- join(self, right_dataset, keys, right_keys=None, join_type='left outer', left_suffix=None, right_suffix=None, coalesce_keys=True, use_threads=True)¶
Perform a join between this dataset and another one.
Result of the join will be a new dataset, where further operations can be applied.
- Parameters:
- right_dataset : dataset
The dataset to join to the current one, acting as the right dataset in the join operation.
- keys : str or list[str]
The columns from the current dataset that should be used as keys of the join operation left side.
- right_keys : str or list[str], default None
The columns from the right_dataset that should be used as keys on the join operation right side. When None, use the same key names as the left dataset.
- join_type : str, default “left outer”
The kind of join that should be performed, one of (“left semi”, “right semi”, “left anti”, “right anti”, “inner”, “left outer”, “right outer”, “full outer”).
- left_suffix : str, default None
Which suffix to add to left column names. This prevents confusion when the columns in left and right datasets have colliding names.
- right_suffix : str, default None
Which suffix to add to right column names. This prevents confusion when the columns in left and right datasets have colliding names.
- coalesce_keys : bool, default True
If the duplicated keys should be omitted from one of the sides in the join result.
- use_threads : bool, default True
Whether to use multithreading or not.
- Returns:
- InMemoryDataset
A new dataset resulting from the join, on which further operations can be applied.
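A minimal sketch of joining two in-memory datasets on a shared key column (the tables are placeholders for this example); with the default “left outer” join, unmatched left rows keep nulls in the right-side columns:
>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> left = ds.dataset(pa.table({'year': [2020, 2021, 2022],
...                             'n_legs': [2, 4, 5]}))
>>> right = ds.dataset(pa.table({'year': [2021, 2022],
...                              'animal': ["Dog", "Brittle stars"]}))
>>> joined = left.join(right, keys="year")
>>> result = joined.to_table().sort_by("year")
>>> result.column_names
['year', 'n_legs', 'animal']
>>> result.column("animal").to_pylist()
[None, 'Dog', 'Brittle stars']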
- partition_expression¶
An Expression which evaluates to true for all data viewed by this Dataset.
- replace_schema(self, Schema schema)¶
Return a copy of this Dataset with a different schema.
The copy will view the same Fragments. If the new schema is not compatible with the original dataset’s schema then an error will be raised.
- Parameters:
- schema : Schema
The new dataset schema.
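A minimal sketch, assuming the replacement schema keeps a subset of the original fields with unchanged types (one form of compatible schema); incompatible changes raise an error instead:
>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> dataset = ds.dataset(pa.table({'year': [2020, 2021], 'n_legs': [2, 4]}))
>>> narrowed = dataset.replace_schema(pa.schema([pa.field("n_legs", pa.int64())]))
>>> narrowed.schema
n_legs: int64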
- scanner(self, **kwargs)¶
Build a scan operation against the dataset.
Data is not loaded immediately. Instead, this produces a Scanner, which exposes further operations (e.g. loading all data as a table, counting rows).
See the Scanner.from_dataset() method for further information.
Examples
>>> import pyarrow as pa
>>> table = pa.table({'year': [2020, 2022, 2021, 2022, 2019, 2021],
...                   'n_legs': [2, 2, 4, 4, 5, 100],
...                   'animal': ["Flamingo", "Parrot", "Dog", "Horse",
...                              "Brittle stars", "Centipede"]})
>>>
>>> import pyarrow.parquet as pq
>>> pq.write_table(table, "dataset_scanner.parquet")
>>> import pyarrow.dataset as ds
>>> dataset = ds.dataset("dataset_scanner.parquet")
Selecting a subset of the columns:
>>> dataset.scanner(columns=["year", "n_legs"]).to_table()
pyarrow.Table
year: int64
n_legs: int64
----
year: [[2020,2022,2021,2022,2019,2021]]
n_legs: [[2,2,4,4,5,100]]
Projecting selected columns using an expression:
>>> dataset.scanner(columns={
...     "n_legs_uint": ds.field("n_legs").cast("uint8"),
... }).to_table()
pyarrow.Table
n_legs_uint: uint8
----
n_legs_uint: [[2,2,4,4,5,100]]
Filtering rows while scanning:
>>> dataset.scanner(filter=ds.field("year") > 2020).to_table()
pyarrow.Table
year: int64
n_legs: int64
animal: string
----
year: [[2022,2021,2022,2021]]
n_legs: [[2,4,4,100]]
animal: [["Parrot","Dog","Horse","Centipede"]]
- schema¶
The common schema of the full Dataset
- sort_by(self, sorting, **kwargs)¶
Sort the Dataset by one or multiple columns.
- Parameters:
- sorting : str or list[tuple(name, order)]
Name of the column to use to sort (ascending), or a list of multiple sorting conditions where each entry is a tuple with column name and sorting order (“ascending” or “descending”).
- **kwargs : dict, optional
Additional sorting options.
- Returns:
InMemoryDataset
A new dataset sorted according to the sort keys.
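A small sketch; a plain column name sorts ascending, while a list of (name, order) tuples allows explicit directions (the in-memory table is only an example):
>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> dataset = ds.dataset(pa.table({'year': [2021, 2019, 2022],
...                                'n_legs': [4, 5, 100]}))
>>> dataset.sort_by([("year", "descending")]).to_table().column("year").to_pylist()
[2022, 2021, 2019]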
- take(self, indices, **kwargs)¶
Select rows of data by index.
- Parameters:
- indices : Array or array-like
Indices of rows to select in the dataset.
- **kwargs : dict, optional
See scanner() method for full parameter description.
- Returns:
- table : Table
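A brief sketch; the indices refer to row positions in the dataset’s scan order and the selected rows are materialized as a Table (the in-memory table is only an example):
>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> dataset = ds.dataset(pa.table({'animal': ["Flamingo", "Parrot", "Dog", "Horse"]}))
>>> dataset.take([0, 3]).to_pydict()
{'animal': ['Flamingo', 'Horse']}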
- to_batches(self, **kwargs)¶
Read the dataset as materialized record batches.
- Parameters:
- **kwargs : dict, optional
Arguments for Scanner.from_dataset.
- Returns:
- record_batches : iterator of RecordBatch
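A short sketch; to_batches() streams the data as record batches instead of materializing a single Table, and accepts the same options as scanner() (for example batch_size, columns, or filter). The in-memory table is only an example:
>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> dataset = ds.dataset(pa.table({'n_legs': [2, 2, 4, 4, 5, 100]}))
>>> total = 0
>>> for batch in dataset.to_batches(batch_size=2):
...     total += batch.num_rows
...
>>> total
6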