pyarrow.dataset.Scanner

class pyarrow.dataset.Scanner

Bases: _Weakrefable

A materialized scan operation with context and options bound.

A scanner is the class that glues the scan tasks, data fragments and data sources together.

Parameters:
dataset : Dataset

Dataset to scan.

columns : list of str or dict, default None

The columns to project. This can be a list of column names to include (order and duplicates will be preserved), or a dictionary with {new_column_name: expression} values for more advanced projections.

The list of columns or expressions may use the special fields __batch_index (the index of the batch within the fragment), __fragment_index (the index of the fragment within the dataset), __last_in_fragment (whether the batch is last in fragment), and __filename (the name of the source file or a description of the source fragment).

The columns will be passed down to Datasets and corresponding data fragments to avoid loading, copying, and deserializing columns that will not be required further down the compute chain. By default all of the available columns are projected. Raises an exception if any of the referenced column names does not exist in the dataset’s Schema.

filter : Expression, default None

Scan will return only the rows matching the filter. If possible, the predicate will be pushed down to exploit the partition information or internal metadata found in the data source, e.g. Parquet statistics. Otherwise, the loaded RecordBatches are filtered before they are yielded.

batch_size : int, default 128Ki

The maximum row count for scanned record batches. If scanned record batches are overflowing memory, lower this value to reduce their size.

batch_readahead : int, default 16

The number of batches to read ahead in a file. Increasing this number will increase RAM usage but could also improve IO utilization.

fragment_readahead : int, default 4

The number of files to read ahead. Increasing this number will increase RAM usage but could also improve IO utilization.

use_threads : bool, default True

If enabled, maximum parallelism is used, as determined by the number of available CPU cores.

use_async : bool, default True

This flag is deprecated and is being kept for this release for backwards compatibility. It will be removed in the next release.

memory_pool : MemoryPool, default None

For memory allocations, if required. If not specified, uses the default pool.

__init__(*args, **kwargs)

Methods

__init__(*args, **kwargs)

count_rows(self)

Count rows matching the scanner filter.

from_batches(source, Schema schema=None, ...)

Create a Scanner from an iterator of batches.

from_dataset(Dataset dataset, ...[, ...])

Create a Scanner from a Dataset.

from_fragment(Fragment fragment, ...[, ...])

Create a Scanner from a Fragment.

head(self, int num_rows)

Load the first N rows of the dataset.

scan_batches(self)

Consume a Scanner in record batches with corresponding fragments.

take(self, indices)

Select rows of data by index.

to_batches(self)

Consume a Scanner in record batches.

to_reader(self)

Consume this scanner as a RecordBatchReader.

to_table(self)

Convert a Scanner into a Table.

Attributes

dataset_schema

The schema with which batches will be read from fragments.

projected_schema

The materialized schema of the data, accounting for projections.

count_rows(self)

Count rows matching the scanner filter.

Returns:
count : int

dataset_schema

The schema with which batches will be read from fragments.

static from_batches(source, Schema schema=None, bool use_threads=True, use_async=None, MemoryPool memory_pool=None, columns=None, Expression filter=None, int batch_size=_DEFAULT_BATCH_SIZE, FragmentScanOptions fragment_scan_options=None)

Create a Scanner from an iterator of batches.

This creates a scanner which can be used only once. It is intended to support writing a dataset (which takes a scanner) from a source which can be read only once (e.g. a RecordBatchReader or generator).

Parameters:
source : Iterator

An iterator of RecordBatches.

schema : Schema

The schema of the batches.

columns : list of str or dict, default None

The columns to project.

filter : Expression, default None

Scan will return only the rows matching the filter.

batch_size : int, default 128Ki

The maximum row count for scanned record batches.

use_threads : bool, default True

If enabled, maximum parallelism is used, as determined by the number of available CPU cores.

use_async : bool, default True

This flag is deprecated and is being kept for this release for backwards compatibility. It will be removed in the next release.

memory_pool : MemoryPool, default None

For memory allocations, if required. If not specified, uses the default pool.

fragment_scan_options : FragmentScanOptions

The fragment scan options.

static from_dataset(Dataset dataset, bool use_threads=True, use_async=None, MemoryPool memory_pool=None, columns=None, Expression filter=None, int batch_size=_DEFAULT_BATCH_SIZE, int batch_readahead=_DEFAULT_BATCH_READAHEAD, int fragment_readahead=_DEFAULT_FRAGMENT_READAHEAD, FragmentScanOptions fragment_scan_options=None)

Create a Scanner from a Dataset.

Parameters:
dataset : Dataset

Dataset to scan.

columns : list of str or dict, default None

The columns to project. This can be a list of column names to include (order and duplicates will be preserved), or a dictionary with {new_column_name: expression} values for more advanced projections.

The list of columns or expressions may use the special fields __batch_index (the index of the batch within the fragment), __fragment_index (the index of the fragment within the dataset), __last_in_fragment (whether the batch is last in fragment), and __filename (the name of the source file or a description of the source fragment).

The columns will be passed down to Datasets and corresponding data fragments to avoid loading, copying, and deserializing columns that will not be required further down the compute chain. By default all of the available columns are projected. Raises an exception if any of the referenced column names does not exist in the dataset’s Schema.

filter : Expression, default None

Scan will return only the rows matching the filter. If possible, the predicate will be pushed down to exploit the partition information or internal metadata found in the data source, e.g. Parquet statistics. Otherwise, the loaded RecordBatches are filtered before they are yielded.

batch_size : int, default 128Ki

The maximum row count for scanned record batches. If scanned record batches are overflowing memory, lower this value to reduce their size.

batch_readahead : int, default 16

The number of batches to read ahead in a file. This might not work for all file formats. Increasing this number will increase RAM usage but could also improve IO utilization.

fragment_readahead : int, default 4

The number of files to read ahead. Increasing this number will increase RAM usage but could also improve IO utilization.

use_threads : bool, default True

If enabled, maximum parallelism is used, as determined by the number of available CPU cores.

use_async : bool, default True

This flag is deprecated and is being kept for this release for backwards compatibility. It will be removed in the next release.

memory_pool : MemoryPool, default None

For memory allocations, if required. If not specified, uses the default pool.

fragment_scan_options : FragmentScanOptions, default None

Options specific to a particular scan and fragment type, which can change between different scans of the same dataset.

static from_fragment(Fragment fragment, Schema schema=None, bool use_threads=True, use_async=None, MemoryPool memory_pool=None, columns=None, Expression filter=None, int batch_size=_DEFAULT_BATCH_SIZE, int batch_readahead=_DEFAULT_BATCH_READAHEAD, FragmentScanOptions fragment_scan_options=None)

Create a Scanner from a Fragment.

Parameters:
fragment : Fragment

Fragment to scan.

schema : Schema, optional

The schema of the fragment.

columns : list of str or dict, default None

The columns to project. This can be a list of column names to include (order and duplicates will be preserved), or a dictionary with {new_column_name: expression} values for more advanced projections.

The list of columns or expressions may use the special fields __batch_index (the index of the batch within the fragment), __fragment_index (the index of the fragment within the dataset), __last_in_fragment (whether the batch is last in fragment), and __filename (the name of the source file or a description of the source fragment).

The columns will be passed down to Datasets and corresponding data fragments to avoid loading, copying, and deserializing columns that will not be required further down the compute chain. By default all of the available columns are projected. Raises an exception if any of the referenced column names does not exist in the dataset’s Schema.

filter : Expression, default None

Scan will return only the rows matching the filter. If possible, the predicate will be pushed down to exploit the partition information or internal metadata found in the data source, e.g. Parquet statistics. Otherwise, the loaded RecordBatches are filtered before they are yielded.

batch_size : int, default 128Ki

The maximum row count for scanned record batches. If scanned record batches are overflowing memory, lower this value to reduce their size.

batch_readahead : int, default 16

The number of batches to read ahead in a file. This might not work for all file formats. Increasing this number will increase RAM usage but could also improve IO utilization.

use_threads : bool, default True

If enabled, maximum parallelism is used, as determined by the number of available CPU cores.

use_async : bool, default True

This flag is deprecated and is being kept for this release for backwards compatibility. It will be removed in the next release.

memory_pool : MemoryPool, default None

For memory allocations, if required. If not specified, uses the default pool.

fragment_scan_options : FragmentScanOptions, default None

Options specific to a particular scan and fragment type, which can change between different scans of the same dataset.

head(self, int num_rows)

Load the first N rows of the dataset.

Parameters:
num_rows : int

The number of rows to load.

Returns:
Table
projected_schema

The materialized schema of the data, accounting for projections.

This is the schema of any data returned from the scanner.

scan_batches(self)

Consume a Scanner in record batches with corresponding fragments.

Returns:
record_batches : iterator of TaggedRecordBatch

take(self, indices)

Select rows of data by index.

Will only consume as many batches of the underlying dataset as needed. Otherwise, this is equivalent to to_table().take(indices).

Parameters:
indices : Array or array-like

Indices of rows to select in the dataset.

Returns:
Table
to_batches(self)

Consume a Scanner in record batches.

Returns:
record_batches : iterator of RecordBatch

to_reader(self)

Consume this scanner as a RecordBatchReader.

Returns:
RecordBatchReader
to_table(self)

Convert a Scanner into a Table.

Use this convenience utility with care. This will serially materialize the Scan result in memory before creating the Table.

Returns:
Table