pyarrow.dataset.Scanner

class pyarrow.dataset.Scanner

Bases: pyarrow.lib._Weakrefable

A materialized scan operation with context and options bound.

A scanner is the class that glues the scan tasks, data fragments and data sources together.

Parameters
  • dataset (Dataset) – Dataset to scan.

  • columns (list of str or dict, default None) – The columns to project. This can be a list of column names to include (order and duplicates will be preserved), or a dictionary with {new_column_name: expression} values for more advanced projections. The columns will be passed down to Datasets and corresponding data fragments to avoid loading, copying, and deserializing columns that will not be required further down the compute chain. By default all of the available columns are projected. Raises an exception if any of the referenced column names does not exist in the dataset’s Schema.

  • filter (Expression, default None) – Scan will return only the rows matching the filter. If possible the predicate will be pushed down to exploit the partition information or internal metadata found in the data source, e.g. Parquet statistics. Otherwise filters the loaded RecordBatches before yielding them.

  • batch_size (int, default 1M) – The maximum row count for scanned record batches. If scanned record batches are overflowing memory then this value can be lowered to reduce their size.

  • use_threads (bool, default True) – If enabled, maximum parallelism will be used, as determined by the number of available CPU cores.

  • memory_pool (MemoryPool, default None) – For memory allocations, if required. If not specified, uses the default pool.
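
A Scanner is normally constructed through from_dataset rather than by instantiating the class directly. A minimal sketch of supplying the parameters above, assuming a hypothetical directory of Parquet files at data/:

    import pyarrow.dataset as ds

    # Hypothetical dataset; any Dataset object works here.
    dataset = ds.dataset("data/", format="parquet")

    scanner = ds.Scanner.from_dataset(
        dataset,
        columns=["a", "b"],          # project only these columns
        filter=ds.field("a") > 0,    # pushed down to the source when possible
        batch_size=64 * 1024,        # cap scanned batches at 64K rows
    )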

__init__(*args, **kwargs)

Initialize self. See help(type(self)) for accurate signature.

Methods

__init__(*args, **kwargs)

Initialize self.

from_dataset

Create a Scanner from a Dataset.

from_fragment

Create a Scanner from a Fragment.

head

Load the first N rows of the dataset.

scan

Returns a stream of ScanTasks.

scan_batches

Consume a Scanner in record batches with corresponding fragments.

take

Select rows of data by index.

to_batches

Consume a Scanner in record batches.

to_table

Convert a Scanner into a Table.

static from_dataset()

Create a Scanner from a Dataset.

static from_fragment()

Create a Scanner from a Fragment.

head()

Load the first N rows of the dataset.

Returns

table (Table instance)
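
A quick way to peek at the data without scanning everything; a minimal sketch, assuming the scanner built above:

    first_rows = scanner.head(10)   # pyarrow.Table with at most 10 rows
    print(first_rows.num_rows)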

scan()

Returns a stream of ScanTasks.

The caller is responsible for dispatching/scheduling the tasks. Tasks should be safe to run concurrently and to outlive the iterator.

Deprecated since version 4.0.0: Use to_batches instead.

Returns

scan_tasks (iterator of ScanTask)

scan_batches()

Consume a Scanner in record batches with corresponding fragments.

Sequentially executes the ScanTasks as the returned generator gets consumed.

Returns

record_batches (iterator of TaggedRecordBatch)
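
Each yielded item pairs a record batch with the fragment it was read from; a minimal sketch, assuming the scanner built above:

    for tagged in scanner.scan_batches():
        batch = tagged.record_batch   # pyarrow.RecordBatch
        fragment = tagged.fragment    # the originating Fragment
        print(fragment, batch.num_rows)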

take()

Select rows of data by index.

Will only consume as many batches of the underlying dataset as needed to gather the requested rows; otherwise this is equivalent to to_table().take(indices).

Returns

table (Table)
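
A minimal sketch, assuming the scanner built above; the indices refer to row positions within the scan result:

    import pyarrow as pa

    subset = scanner.take(pa.array([0, 5, 10]))   # pyarrow.Table with three rows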

to_batches()

Consume a Scanner in record batches.

Sequentially executes the ScanTasks as the returned generator gets consumed.

Returns

record_batches (iterator of RecordBatch)
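
This is the recommended way to stream over results larger than memory; a minimal sketch, assuming the scanner built above:

    total = 0
    for batch in scanner.to_batches():
        total += batch.num_rows   # process each RecordBatch incrementally
    print(total)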

to_table()

Convert a Scanner into a Table.

Use this convenience utility with care. It will serially materialize the scan result in memory before creating the Table.

Returns

table (Table)
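
A minimal sketch, assuming the scanner built above and that the full result fits in memory:

    table = scanner.to_table()   # fully materialized pyarrow.Table
    df = table.to_pandas()       # e.g. hand off to pandas afterwards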