pyarrow.dataset.InMemoryDataset#
- class pyarrow.dataset.InMemoryDataset(source, Schema schema=None)#
Bases: pyarrow._dataset.Dataset
A Dataset wrapping in-memory data.
- Parameters
- source
The data for this dataset. Can be a RecordBatch, Table, list of RecordBatch/Table, iterable of RecordBatch, or a RecordBatchReader. If an iterable is provided, the schema must also be provided.
- schema Schema, optional
Only required if passing an iterable as the source.
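An illustrative sketch of constructing an InMemoryDataset (the toy table below is an assumption, not part of the original reference):
>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> table = pa.table({"year": [2020, 2021, 2022], "n_legs": [2, 4, 100]})  # assumed toy data
>>> dataset = ds.InMemoryDataset(table)
>>> dataset.to_table().num_rows
3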
- __init__(*args, **kwargs)#
Methods
__init__(*args, **kwargs)
count_rows(self, **kwargs)
    Count rows matching the scanner filter.
get_fragments(self, Expression filter=None)
    Returns an iterator over the fragments in this dataset.
head(self, int num_rows, **kwargs)
    Load the first N rows of the dataset.
join(self, right_dataset, keys[, ...])
    Perform a join between this dataset and another one.
replace_schema(self, Schema schema)
    Return a copy of this Dataset with a different schema.
scanner(self, **kwargs)
    Build a scan operation against the dataset.
take(self, indices, **kwargs)
    Select rows of data by index.
to_batches(self, **kwargs)
    Read the dataset as materialized record batches.
to_table(self, **kwargs)
    Read the dataset to an Arrow table.
Attributes
partition_expression
    An Expression which evaluates to true for all data viewed by this Dataset.
schema
    The common schema of the full Dataset
- count_rows(self, **kwargs)#
Count rows matching the scanner filter.
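An illustrative sketch (toy data assumed, not part of the original reference):
>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> dataset = ds.InMemoryDataset(pa.table({"a": [1, 2, 3]}))  # assumed toy data
>>> dataset.count_rows()
3
>>> dataset.count_rows(filter=ds.field("a") > 1)
2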
- get_fragments(self, Expression filter=None)#
Returns an iterator over the fragments in this dataset.
- Parameters
- filter Expression, default None
Return fragments matching the optional filter, either using the partition_expression or internal information like Parquet's statistics.
- Returns
- fragments iterator of Fragment
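An illustrative sketch (toy data assumed, not part of the original reference); the row counts of the fragments sum to the dataset's total:
>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> dataset = ds.InMemoryDataset(pa.table({"a": [1, 2, 3]}))  # assumed toy data
>>> sum(fragment.count_rows() for fragment in dataset.get_fragments())
3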
- head(self, int num_rows, **kwargs)#
Load the first N rows of the dataset.
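An illustrative sketch (toy data assumed, not part of the original reference):
>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> dataset = ds.InMemoryDataset(pa.table({"a": [1, 2, 3, 4, 5]}))  # assumed toy data
>>> dataset.head(2).to_pydict()
{'a': [1, 2]}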
- join(self, right_dataset, keys, right_keys=None, join_type='left outer', left_suffix=None, right_suffix=None, coalesce_keys=True, use_threads=True)#
Perform a join between this dataset and another one.
Result of the join will be a new dataset, where further operations can be applied.
- Parameters
- right_dataset dataset
The dataset to join to the current one, acting as the right dataset in the join operation.
- keys str or list[str]
The columns from the current dataset that should be used as keys of the join operation left side.
- right_keys str or list[str], default None
The columns from the right_dataset that should be used as keys on the join operation right side. When None, use the same key names as the left dataset.
- join_type str, default "left outer"
The kind of join that should be performed, one of ("left semi", "right semi", "left anti", "right anti", "inner", "left outer", "right outer", "full outer").
- left_suffix str, default None
Which suffix to add to left column names. This prevents confusion when the columns in left and right datasets have colliding names.
- right_suffix str, default None
Which suffix to add to the right column names. This prevents confusion when the columns in left and right datasets have colliding names.
- coalesce_keys bool, default True
If the duplicated keys should be omitted from one of the sides in the join result.
- use_threads bool, default True
Whether to use multithreading or not.
- Returns
- InMemoryDataset
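An illustrative sketch of an inner join between two in-memory datasets (toy data assumed, not part of the original reference):
>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> left = ds.InMemoryDataset(pa.table({"id": [1, 2, 3], "year": [2019, 2020, 2021]}))
>>> right = ds.InMemoryDataset(pa.table({"id": [2, 3, 4], "n_legs": [4, 5, 100]}))
>>> joined = left.join(right, keys="id", join_type="inner")  # only ids 2 and 3 match
>>> joined.to_table().num_rows
2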
- partition_expression#
An Expression which evaluates to true for all data viewed by this Dataset.
- replace_schema(self, Schema schema)#
Return a copy of this Dataset with a different schema.
The copy will view the same Fragments. If the new schema is not compatible with the original dataset’s schema then an error will be raised.
- Parameters
- schema Schema
The new dataset schema.
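An illustrative sketch (it assumes a schema differing only in metadata counts as compatible; toy data, not part of the original reference):
>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> dataset = ds.InMemoryDataset(pa.table({"a": [1, 2, 3]}))  # assumed toy data
>>> tagged = dataset.replace_schema(dataset.schema.with_metadata({"origin": "memory"}))
>>> tagged.schema.metadata
{b'origin': b'memory'}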
- scanner(self, **kwargs)#
Build a scan operation against the dataset.
Data is not loaded immediately. Instead, this produces a Scanner, which exposes further operations (e.g. loading all data as a table, counting rows).
See the Scanner.from_dataset method for further information.
Examples
>>> import pyarrow.dataset as ds
>>> dataset = ds.dataset("path/to/dataset")
Selecting a subset of the columns:
>>> dataset.scanner(columns=["A", "B"]).to_table()
Projecting selected columns using an expression:
>>> dataset.scanner(columns={
...     "A_int": ds.field("A").cast("int64"),
... }).to_table()
Filtering rows while scanning:
>>> dataset.scanner(filter=ds.field("A") > 0).to_table()
- schema#
The common schema of the full Dataset
- take(self, indices, **kwargs)#
Select rows of data by index.
- Parameters
- indices Array or array-like
The indices of rows to select in the dataset.
- **kwargs dict, optional
See scanner() method for full parameter description.
- Returns
- table Table
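An illustrative sketch (toy data assumed, not part of the original reference):
>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> dataset = ds.InMemoryDataset(pa.table({"a": ["w", "x", "y", "z"]}))  # assumed toy data
>>> dataset.take(pa.array([0, 3])).to_pydict()
{'a': ['w', 'z']}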
- to_batches(self, **kwargs)#
Read the dataset as materialized record batches.
- Parameters
- **kwargs dict, optional
Arguments for Scanner.from_dataset.
- Returns
- record_batches iterator of RecordBatch
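An illustrative sketch (toy data assumed, not part of the original reference):
>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> dataset = ds.InMemoryDataset(pa.table({"a": [1, 2, 3]}))  # assumed toy data
>>> sum(batch.num_rows for batch in dataset.to_batches())
3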