pyarrow.dataset.UnionDataset¶
- class pyarrow.dataset.UnionDataset(Schema schema, children)¶
Bases: Dataset
A Dataset wrapping child datasets.
Children’s schemas must agree with the provided schema.
- Parameters:
- schema : Schema
The common schema that all children must agree with.
- children : list of Dataset
The child datasets to wrap.
- __init__(*args, **kwargs)¶
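A UnionDataset is typically obtained by passing a list of existing datasets to the pyarrow.dataset.dataset() factory rather than by calling the constructor directly. A minimal sketch (the file names below are placeholders used only for this example):
>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> import pyarrow.parquet as pq
>>> pq.write_table(pa.table({'year': [2020, 2021], 'n_legs': [2, 4]}),
...                "union_part_a.parquet")
>>> pq.write_table(pa.table({'year': [2022, 2023], 'n_legs': [5, 100]}),
...                "union_part_b.parquet")
>>> child_a = ds.dataset("union_part_a.parquet")
>>> child_b = ds.dataset("union_part_b.parquet")
>>> union = ds.dataset([child_a, child_b])  # both children share the same schema
>>> type(union).__name__
'UnionDataset'
>>> union.to_table().num_rows
4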
Methods
- __init__(*args, **kwargs)
- count_rows(self, **kwargs): Count rows matching the scanner filter.
- filter(self, expression): Apply a row filter to the dataset.
- get_fragments(self, Expression filter=None): Returns an iterator over the fragments in this dataset.
- head(self, int num_rows, **kwargs): Load the first N rows of the dataset.
- join(self, right_dataset, keys[, ...]): Perform a join between this dataset and another one.
- replace_schema(self, Schema schema): Return a copy of this Dataset with a different schema.
- scanner(self, **kwargs): Build a scan operation against the dataset.
- sort_by(self, sorting, **kwargs): Sort the Dataset by one or multiple columns.
- take(self, indices, **kwargs): Select rows of data by index.
- to_batches(self, **kwargs): Read the dataset as materialized record batches.
- to_table(self, **kwargs): Read the dataset to an Arrow table.
Attributes
- children
- partition_expression: An Expression which evaluates to true for all data viewed by this Dataset.
- schema: The common schema of the full Dataset
- children¶
- count_rows(self, **kwargs)¶
Count rows matching the scanner filter.
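A short illustrative sketch (the in-memory table is only an example); the keyword arguments are the same as for scanner(), so a filter can be pushed into the count:
>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> dataset = ds.dataset(pa.table({'year': [2019, 2020, 2021, 2022],
...                                'n_legs': [5, 2, 4, 100]}))
>>> dataset.count_rows()
4
>>> dataset.count_rows(filter=ds.field("year") > 2020)
2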
- filter(self, expression)¶
Apply a row filter to the dataset.
- Parameters:
- expression : Expression
The filter that should be applied to the dataset.
- Returns:
- Dataset
The dataset with the filter applied.
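A minimal sketch of chaining filter() with a scan; the result is again a dataset, so the filter only takes effect when the data is actually read (the in-memory table is a placeholder for this example):
>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> dataset = ds.dataset(pa.table({'n_legs': [2, 4, 5, 100],
...                                'animal': ["Parrot", "Dog",
...                                           "Brittle stars", "Centipede"]}))
>>> filtered = dataset.filter(ds.field("n_legs") > 4)
>>> filtered.to_table().column("animal").to_pylist()
['Brittle stars', 'Centipede']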
- get_fragments(self, Expression filter=None)¶
Returns an iterator over the fragments in this dataset.
- Parameters:
- filter : Expression, default None
Return fragments matching the optional filter, either using the partition_expression or internal information like Parquet’s statistics.
- Returns:
- fragments : iterator of Fragment
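A short sketch using a single-file child dataset (the Parquet file name is a placeholder written only for this example); for file-based datasets each fragment typically corresponds to one data file:
>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> import pyarrow.parquet as pq
>>> pq.write_table(pa.table({'year': [2020, 2021], 'n_legs': [2, 4]}),
...                "fragments_example.parquet")
>>> dataset = ds.dataset("fragments_example.parquet")
>>> fragments = list(dataset.get_fragments())
>>> len(fragments)
1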
- head(self, int num_rows, **kwargs)¶
Load the first N rows of the dataset.
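A small sketch; head() forwards its keyword arguments to scanner(), so a column selection or filter can be combined with the row limit (the in-memory table is only an example):
>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> dataset = ds.dataset(pa.table({'year': [2019, 2020, 2021, 2022],
...                                'n_legs': [5, 2, 4, 100]}))
>>> dataset.head(2, columns=["year"]).to_pydict()
{'year': [2019, 2020]}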
- join(self, right_dataset, keys, right_keys=None, join_type='left outer', left_suffix=None, right_suffix=None, coalesce_keys=True, use_threads=True)¶
Perform a join between this dataset and another one.
Result of the join will be a new dataset, where further operations can be applied.
- Parameters:
- right_dataset : dataset
The dataset to join to the current one, acting as the right dataset in the join operation.
- keys : str or list[str]
The columns from the current dataset that should be used as keys of the join operation left side.
- right_keys : str or list[str], default None
The columns from the right_dataset that should be used as keys on the join operation right side. When None, use the same key names as the left dataset.
- join_type : str, default “left outer”
The kind of join that should be performed, one of (“left semi”, “right semi”, “left anti”, “right anti”, “inner”, “left outer”, “right outer”, “full outer”).
- left_suffix : str, default None
Which suffix to add to left column names. This prevents confusion when the columns in left and right datasets have colliding names.
- right_suffix : str, default None
Which suffix to add to right column names. This prevents confusion when the columns in left and right datasets have colliding names.
- coalesce_keys : bool, default True
If the duplicated keys should be omitted from one of the sides in the join result.
- use_threads : bool, default True
Whether to use multithreading or not.
- Returns:
- InMemoryDataset
A new dataset resulting from the join, on which further operations can be applied.
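A minimal sketch of joining two in-memory datasets on a shared key column (the tables are placeholders for this example); with the default “left outer” join, unmatched left rows keep nulls in the right-side columns:
>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> left = ds.dataset(pa.table({'year': [2020, 2021, 2022],
...                             'n_legs': [2, 4, 5]}))
>>> right = ds.dataset(pa.table({'year': [2021, 2022],
...                              'animal': ["Dog", "Brittle stars"]}))
>>> joined = left.join(right, keys="year")
>>> result = joined.to_table().sort_by("year")
>>> result.column_names
['year', 'n_legs', 'animal']
>>> result.column("animal").to_pylist()
[None, 'Dog', 'Brittle stars']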
- partition_expression¶
An Expression which evaluates to true for all data viewed by this Dataset.
- replace_schema(self, Schema schema)¶
Return a copy of this Dataset with a different schema.
The copy will view the same Fragments. If the new schema is not compatible with the original dataset’s schema then an error will be raised.
- Parameters:
- schema : Schema
The new dataset schema.
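A minimal sketch, assuming the replacement schema keeps a subset of the original fields with unchanged types (one form of compatible schema); incompatible changes raise an error instead:
>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> dataset = ds.dataset(pa.table({'year': [2020, 2021], 'n_legs': [2, 4]}))
>>> narrowed = dataset.replace_schema(pa.schema([pa.field("n_legs", pa.int64())]))
>>> narrowed.schema
n_legs: int64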
- scanner(self, **kwargs)¶
Build a scan operation against the dataset.
Data is not loaded immediately. Instead, this produces a Scanner, which exposes further operations (e.g. loading all data as a table, counting rows).
See the Scanner.from_dataset() method for further information.
Examples
>>> import pyarrow as pa
>>> table = pa.table({'year': [2020, 2022, 2021, 2022, 2019, 2021],
...                   'n_legs': [2, 2, 4, 4, 5, 100],
...                   'animal': ["Flamingo", "Parrot", "Dog", "Horse",
...                              "Brittle stars", "Centipede"]})
>>>
>>> import pyarrow.parquet as pq
>>> pq.write_table(table, "dataset_scanner.parquet")
>>> import pyarrow.dataset as ds
>>> dataset = ds.dataset("dataset_scanner.parquet")
Selecting a subset of the columns:
>>> dataset.scanner(columns=["year", "n_legs"]).to_table()
pyarrow.Table
year: int64
n_legs: int64
----
year: [[2020,2022,2021,2022,2019,2021]]
n_legs: [[2,2,4,4,5,100]]
Projecting selected columns using an expression:
>>> dataset.scanner(columns={
...     "n_legs_uint": ds.field("n_legs").cast("uint8"),
... }).to_table()
pyarrow.Table
n_legs_uint: uint8
----
n_legs_uint: [[2,2,4,4,5,100]]
Filtering rows while scanning:
>>> dataset.scanner(filter=ds.field("year") > 2020).to_table()
pyarrow.Table
year: int64
n_legs: int64
animal: string
----
year: [[2022,2021,2022,2021]]
n_legs: [[2,4,4,100]]
animal: [["Parrot","Dog","Horse","Centipede"]]
- schema¶
The common schema of the full Dataset
- sort_by(self, sorting, **kwargs)¶
Sort the Dataset by one or multiple columns.
- Parameters:
- sorting : str or list[tuple(name, order)]
Name of the column to use to sort (ascending), or a list of multiple sorting conditions where each entry is a tuple with column name and sorting order (“ascending” or “descending”).
- **kwargs : dict, optional
Additional sorting options.
- Returns:
InMemoryDataset
A new dataset sorted according to the sort keys.
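A small sketch; a plain column name sorts ascending, while a list of (name, order) tuples allows explicit directions (the in-memory table is only an example):
>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> dataset = ds.dataset(pa.table({'year': [2021, 2019, 2022],
...                                'n_legs': [4, 5, 100]}))
>>> dataset.sort_by([("year", "descending")]).to_table().column("year").to_pylist()
[2022, 2021, 2019]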
- take(self, indices, **kwargs)¶
Select rows of data by index.
- Parameters:
- indices : Array or array-like
Indices of rows to select in the dataset.
- **kwargs : dict, optional
See scanner() method for full parameter description.
- Returns:
- table : Table
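A brief sketch; the indices refer to row positions in the dataset’s scan order and the selected rows are materialized as a Table (the in-memory table is only an example):
>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> dataset = ds.dataset(pa.table({'animal': ["Flamingo", "Parrot", "Dog", "Horse"]}))
>>> dataset.take([0, 3]).to_pydict()
{'animal': ['Flamingo', 'Horse']}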
- to_batches(self, **kwargs)¶
Read the dataset as materialized record batches.
- Parameters:
- **kwargs : dict, optional
Arguments for Scanner.from_dataset.
- Returns:
- record_batches : iterator of RecordBatch
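A short sketch; to_batches() streams the data as record batches instead of materializing a single Table, and accepts the same options as scanner() (for example batch_size, columns, or filter). The in-memory table is only an example:
>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> dataset = ds.dataset(pa.table({'n_legs': [2, 2, 4, 4, 5, 100]}))
>>> total = 0
>>> for batch in dataset.to_batches(batch_size=2):
...     total += batch.num_rows
...
>>> total
6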