pyarrow.dataset.InMemoryDataset

class pyarrow.dataset.InMemoryDataset(source, Schema schema=None)

Bases: pyarrow._dataset.Dataset

A Dataset wrapping in-memory data.

Parameters
source : RecordBatch, Table, list of RecordBatch/Table, iterable of RecordBatch, or RecordBatchReader

The data for this dataset. If an iterable is provided, the schema must also be provided.

schema : Schema, optional

Only required if passing an iterable as the source.
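
For illustration, a minimal construction sketch; the column name and values are invented for the example:

>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> # Wrap an in-memory table in a dataset without copying it to disk
>>> table = pa.table({"n_legs": [2, 4, 5]})
>>> dataset = ds.InMemoryDataset(table)
>>> dataset.to_table().num_rows
3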

__init__(*args, **kwargs)

Methods

__init__(*args, **kwargs)

count_rows(self, **kwargs)

Count rows matching the scanner filter.

get_fragments(self, Expression filter=None)

Returns an iterator over the fragments in this dataset.

head(self, int num_rows, **kwargs)

Load the first N rows of the dataset.

join(self, right_dataset, keys[, ...])

Perform a join between this dataset and another one.

replace_schema(self, Schema schema)

Return a copy of this Dataset with a different schema.

scanner(self, **kwargs)

Build a scan operation against the dataset.

take(self, indices, **kwargs)

Select rows of data by index.

to_batches(self, **kwargs)

Read the dataset as materialized record batches.

to_table(self, **kwargs)

Read the dataset to an Arrow table.

Attributes

partition_expression

An Expression which evaluates to true for all data viewed by this Dataset.

schema

The common schema of the full Dataset.

count_rows(self, **kwargs)

Count rows matching the scanner filter.

Parameters
**kwargs : dict, optional

See scanner() method for full parameter description.

Returns
count : int
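
A small sketch, assuming a hypothetical in-memory dataset with an integer column "n_legs":

>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> dataset = ds.InMemoryDataset(pa.table({"n_legs": [2, 4, 5]}))
>>> # Only rows passing the filter are counted
>>> dataset.count_rows(filter=ds.field("n_legs") > 2)
2
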
get_fragments(self, Expression filter=None)

Returns an iterator over the fragments in this dataset.

Parameters
filter : Expression, default None

Return fragments matching the optional filter, either using the partition_expression or internal information like Parquet’s statistics.

Returns
fragments : iterator of Fragment
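
A minimal sketch with a hypothetical in-memory dataset; summing the per-fragment counts recovers the total row count regardless of how the data is split into fragments:

>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> dataset = ds.InMemoryDataset(pa.table({"n_legs": [2, 4, 5]}))
>>> sum(fragment.count_rows() for fragment in dataset.get_fragments())
3
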
head(self, int num_rows, **kwargs)

Load the first N rows of the dataset.

Parameters
num_rows : int

The number of rows to load.

**kwargs : dict, optional

See scanner() method for full parameter description.

Returns
table : Table
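
For example, with a hypothetical three-row in-memory dataset:

>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> dataset = ds.InMemoryDataset(pa.table({"n_legs": [2, 4, 5]}))
>>> # Only the first two rows are loaded
>>> dataset.head(2).num_rows
2
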
join(self, right_dataset, keys, right_keys=None, join_type='left outer', left_suffix=None, right_suffix=None, coalesce_keys=True, use_threads=True)

Perform a join between this dataset and another one.

The result of the join will be a new dataset, on which further operations can be applied.

Parameters
right_dataset : dataset

The dataset to join to the current one, acting as the right dataset in the join operation.

keys : str or list[str]

The columns from the current dataset that should be used as keys on the left side of the join operation.

right_keys : str or list[str], default None

The columns from the right_dataset that should be used as keys on the right side of the join operation. When None, the same key names as the left dataset are used.

join_type : str, default “left outer”

The kind of join that should be performed, one of (“left semi”, “right semi”, “left anti”, “right anti”, “inner”, “left outer”, “right outer”, “full outer”)

left_suffix : str, default None

Which suffix to add to left column names. This prevents confusion when the columns in left and right datasets have colliding names.

right_suffix : str, default None

Which suffix to add to right column names. This prevents confusion when the columns in left and right datasets have colliding names.

coalesce_keys : bool, default True

Whether duplicated keys should be omitted from one side of the join result.

use_threads : bool, default True

Whether to use multithreading or not.

Returns
InMemoryDataset
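
A sketch with two hypothetical datasets keyed on an "id" column; with the default "left outer" join every left row is kept, matched or not:

>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> left = ds.InMemoryDataset(pa.table({"id": [1, 2, 3], "year": [2020, 2021, 2022]}))
>>> right = ds.InMemoryDataset(pa.table({"id": [3, 4], "n_legs": [5, 100]}))
>>> # id=3 matches; ids 1 and 2 get null n_legs; right-only id=4 is dropped
>>> left.join(right, keys="id").to_table().num_rows
3
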
partition_expression

An Expression which evaluates to true for all data viewed by this Dataset.

replace_schema(self, Schema schema)

Return a copy of this Dataset with a different schema.

The copy will view the same Fragments. If the new schema is not compatible with the original dataset’s schema then an error will be raised.

Parameters
schema : Schema

The new dataset schema.
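
A minimal sketch, assuming that projecting to a subset of the original columns counts as a compatible schema:

>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> dataset = ds.InMemoryDataset(pa.table({"A": [1, 2], "B": ["x", "y"]}))
>>> # Keep only column "A"; the copy still views the same data
>>> dataset.replace_schema(pa.schema([pa.field("A", pa.int64())])).schema.names
['A']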

scanner(self, **kwargs)

Build a scan operation against the dataset.

Data is not loaded immediately. Instead, this produces a Scanner, which exposes further operations (e.g. loading all data as a table, counting rows).

See the Scanner.from_dataset method for further information.

Parameters
**kwargs : dict, optional

Arguments for Scanner.from_dataset.

Returns
scanner : Scanner

Examples

>>> import pyarrow.dataset as ds
>>> dataset = ds.dataset("path/to/dataset")

Selecting a subset of the columns:

>>> dataset.scanner(columns=["A", "B"]).to_table()

Projecting selected columns using an expression:

>>> dataset.scanner(columns={
...     "A_int": ds.field("A").cast("int64"),
... }).to_table()

Filtering rows while scanning:

>>> dataset.scanner(filter=ds.field("A") > 0).to_table()

schema

The common schema of the full Dataset.

take(self, indices, **kwargs)

Select rows of data by index.

Parameters
indices : Array or array-like

Indices of rows to select in the dataset.

**kwargs : dict, optional

See scanner() method for full parameter description.

Returns
table : Table
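
For example, selecting the first and third rows of a hypothetical dataset:

>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> dataset = ds.InMemoryDataset(pa.table({"n_legs": [2, 4, 5]}))
>>> # Indices are zero-based positions within the dataset
>>> dataset.take(pa.array([0, 2])).column("n_legs").to_pylist()
[2, 5]
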
to_batches(self, **kwargs)

Read the dataset as materialized record batches.

Parameters
**kwargs : dict, optional

Arguments for Scanner.from_dataset.

Returns
record_batches : iterator of RecordBatch
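
A short sketch; how many batches are produced is an implementation detail, but their row counts sum to the dataset total:

>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> dataset = ds.InMemoryDataset(pa.table({"n_legs": [2, 4, 5]}))
>>> sum(batch.num_rows for batch in dataset.to_batches())
3
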
to_table(self, **kwargs)

Read the dataset to an Arrow table.

Note that this method reads all the selected data from the dataset into memory.

Parameters
**kwargs : dict, optional

Arguments for Scanner.from_dataset.

Returns
table : Table
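
For example, materializing a filtered view of a hypothetical dataset:

>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> dataset = ds.InMemoryDataset(pa.table({"n_legs": [2, 4, 5]}))
>>> # All rows passing the filter are read into memory at once
>>> dataset.to_table(filter=ds.field("n_legs") > 2).num_rows
2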