pyarrow.dataset.Dataset
- class pyarrow.dataset.Dataset
Bases: _Weakrefable
Collection of data fragments and potentially child datasets.
Arrow Datasets allow you to query against data that has been split across multiple files. This sharding of data may indicate partitioning, which can accelerate queries that only touch some partitions (files).
- __init__(*args, **kwargs)
Methods
- __init__(*args, **kwargs)
- count_rows(self, **kwargs): Count rows matching the scanner filter.
- get_fragments(self, Expression filter=None): Returns an iterator over the fragments in this dataset.
- head(self, int num_rows, **kwargs): Load the first N rows of the dataset.
- join(self, right_dataset, keys[, ...]): Perform a join between this dataset and another one.
- replace_schema(self, Schema schema): Return a copy of this Dataset with a different schema.
- scanner(self, **kwargs): Build a scan operation against the dataset.
- take(self, indices, **kwargs): Select rows of data by index.
- to_batches(self, **kwargs): Read the dataset as materialized record batches.
- to_table(self, **kwargs): Read the dataset to an Arrow table.
Attributes
- partition_expression: An Expression which evaluates to true for all data viewed by this Dataset.
- schema: The common schema of the full Dataset
- count_rows(self, **kwargs)
Count rows matching the scanner filter.
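A minimal sketch, assuming the dataset_scanner.parquet file written in the scanner() examples below; keyword arguments are forwarded to the scanner, so a filter can be supplied:
>>> import pyarrow.dataset as ds
>>> dataset = ds.dataset("dataset_scanner.parquet")  # file from the scanner() examples
>>> dataset.count_rows()
6
>>> dataset.count_rows(filter=ds.field("year") > 2020)
4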
- get_fragments(self, Expression filter=None)
Returns an iterator over the fragments in this dataset.
- Parameters:
  - filter: Expression, default None
    Return fragments matching the optional filter, either using the partition_expression or internal information like Parquet's statistics.
- Returns:
  - fragments: iterator of Fragment
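A minimal sketch, assuming the single-file dataset_scanner.parquet from the scanner() examples below, which yields exactly one fragment:
>>> import pyarrow.dataset as ds
>>> dataset = ds.dataset("dataset_scanner.parquet")
>>> fragments = list(dataset.get_fragments())  # materialize the iterator
>>> len(fragments)
1
>>> fragments[0].count_rows()
6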
- head(self, int num_rows, **kwargs)
Load the first N rows of the dataset.
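A minimal sketch, assuming the dataset_scanner.parquet file from the scanner() examples below; for a single-file dataset the first rows follow the file's row order:
>>> import pyarrow.dataset as ds
>>> dataset = ds.dataset("dataset_scanner.parquet")
>>> dataset.head(2).column("animal").to_pylist()  # head() returns a Table
['Flamingo', 'Parrot']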
- join(self, right_dataset, keys, right_keys=None, join_type='left outer', left_suffix=None, right_suffix=None, coalesce_keys=True, use_threads=True)
Perform a join between this dataset and another one.
The result of the join is a new dataset, to which further operations can be applied.
- Parameters:
  - right_dataset: dataset
    The dataset to join to the current one, acting as the right dataset in the join operation.
  - keys: str or list[str]
    The columns from the current dataset that should be used as keys of the join operation left side.
  - right_keys: str or list[str], default None
    The columns from the right_dataset that should be used as keys on the join operation right side. When None, use the same key names as the left dataset.
  - join_type: str, default "left outer"
    The kind of join that should be performed, one of ("left semi", "right semi", "left anti", "right anti", "inner", "left outer", "right outer", "full outer").
  - left_suffix: str, default None
    Which suffix to add to left column names. This prevents confusion when the columns in left and right datasets have colliding names.
  - right_suffix: str, default None
    Which suffix to add to right column names. This prevents confusion when the columns in left and right datasets have colliding names.
  - coalesce_keys: bool, default True
    Whether the duplicated join keys should be omitted from one of the sides in the join result.
  - use_threads: bool, default True
    Whether to use multithreading or not.
- Returns:
  - InMemoryDataset
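A minimal sketch joining two in-memory datasets; the tables and column names here are illustrative:
>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> left = ds.dataset(pa.table({'id': [1, 2, 3],
...                             'year': [2020, 2022, 2019]}))
>>> right = ds.dataset(pa.table({'id': [3, 4],
...                              'n_legs': [5, 100]}))
>>> joined = left.join(right, keys='id')  # default join_type is "left outer"
>>> joined.to_table().num_rows  # all three left rows are preserved
3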
- partition_expression
An Expression which evaluates to true for all data viewed by this Dataset.
- replace_schema(self, Schema schema)
Return a copy of this Dataset with a different schema.
The copy will view the same Fragments. If the new schema is not compatible with the original dataset’s schema then an error will be raised.
- Parameters:
  - schema: Schema
    The new dataset schema.
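A minimal sketch, assuming the dataset_scanner.parquet file from the scanner() examples below; attaching metadata leaves the new schema compatible with the original:
>>> import pyarrow.dataset as ds
>>> dataset = ds.dataset("dataset_scanner.parquet")
>>> annotated = dataset.replace_schema(
...     dataset.schema.with_metadata({"origin": "example"}))  # same fields, extra metadata
>>> annotated.schema.metadata
{b'origin': b'example'}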
- scanner(self, **kwargs)
Build a scan operation against the dataset.
Data is not loaded immediately. Instead, this produces a Scanner, which exposes further operations (e.g. loading all data as a table, counting rows).
See the Scanner.from_dataset method for further information.
Examples
>>> import pyarrow as pa
>>> table = pa.table({'year': [2020, 2022, 2021, 2022, 2019, 2021],
...                   'n_legs': [2, 2, 4, 4, 5, 100],
...                   'animal': ["Flamingo", "Parrot", "Dog", "Horse",
...                              "Brittle stars", "Centipede"]})
>>>
>>> import pyarrow.parquet as pq
>>> pq.write_table(table, "dataset_scanner.parquet")
>>> import pyarrow.dataset as ds
>>> dataset = ds.dataset("dataset_scanner.parquet")
Selecting a subset of the columns:
>>> dataset.scanner(columns=["year", "n_legs"]).to_table()
pyarrow.Table
year: int64
n_legs: int64
----
year: [[2020,2022,2021,2022,2019,2021]]
n_legs: [[2,2,4,4,5,100]]
Projecting selected columns using an expression:
>>> dataset.scanner(columns={
...     "n_legs_uint": ds.field("n_legs").cast("uint8"),
... }).to_table()
pyarrow.Table
n_legs_uint: uint8
----
n_legs_uint: [[2,2,4,4,5,100]]
Filtering rows while scanning:
>>> dataset.scanner(filter=ds.field("year") > 2020).to_table()
pyarrow.Table
year: int64
n_legs: int64
animal: string
----
year: [[2022,2021,2022,2021]]
n_legs: [[2,4,4,100]]
animal: [["Parrot","Dog","Horse","Centipede"]]
- schema
The common schema of the full Dataset
- take(self, indices, **kwargs)
Select rows of data by index.
- Parameters:
  - indices: Array or array-like
    Indices of rows to select in the dataset.
  - **kwargs: dict, optional
    See the scanner() method for a full parameter description.
- Returns:
  - table: Table
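A minimal sketch, assuming the dataset_scanner.parquet file from the scanner() examples above; for a single-file dataset, indices follow the file's row order:
>>> import pyarrow.dataset as ds
>>> dataset = ds.dataset("dataset_scanner.parquet")
>>> dataset.take([0, 5]).column("animal").to_pylist()  # first and last rows
['Flamingo', 'Centipede']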
- to_batches(self, **kwargs)
Read the dataset as materialized record batches.
- Parameters:
  - **kwargs: dict, optional
    Arguments for Scanner.from_dataset.
- Returns:
  - record_batches: iterator of RecordBatch
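A minimal sketch, assuming the dataset_scanner.parquet file from the scanner() examples above; scanner keywords such as columns pass through:
>>> import pyarrow.dataset as ds
>>> dataset = ds.dataset("dataset_scanner.parquet")
>>> sum(batch.num_rows for batch in dataset.to_batches(columns=["n_legs"]))
6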