pyarrow.dataset.Scanner
- class pyarrow.dataset.Scanner
Bases: pyarrow.lib._Weakrefable
A materialized scan operation with context and options bound.
A scanner is the class that glues the scan tasks, data fragments and data sources together.
- Parameters
- dataset
Dataset
Dataset to scan.
- columns
list of str or dict, default None
The columns to project. This can be a list of column names to include (order and duplicates will be preserved), or a dictionary of {new_column_name: expression} entries for more advanced projections. The columns are passed down to Datasets and the corresponding data fragments to avoid loading, copying, and deserializing columns that are not required further down the compute chain. By default, all available columns are projected. Raises an exception if any referenced column name does not exist in the dataset’s Schema.
- filter
Expression, default None
Scan will return only the rows matching the filter. If possible, the predicate is pushed down to exploit partition information or internal metadata found in the data source, e.g. Parquet statistics. Otherwise, the loaded RecordBatches are filtered before they are yielded.
- batch_size
int, default 1M
The maximum row count for scanned record batches. If scanned record batches are overflowing memory, set a smaller value to reduce their size.
- use_threads
bool, default True
If enabled, maximum parallelism is used, as determined by the number of available CPU cores.
- use_async
bool, default True
This flag is deprecated and is being kept for this release for backwards compatibility. It will be removed in the next release.
- memory_pool
MemoryPool, default None
For memory allocations, if required. If not specified, uses the default pool.
- __init__(*args, **kwargs)
Methods
- __init__(*args, **kwargs)
- count_rows(self): Count rows matching the scanner filter.
- from_batches(source, Schema schema=None, ...): Create a Scanner from an iterator of batches.
- from_dataset(Dataset dataset, ...[, ...]): Create a Scanner from a Dataset; refer to the Scanner class documentation for additional details.
- from_fragment(Fragment fragment, ...[, ...]): Create a Scanner from a Fragment; refer to the Scanner class documentation for additional details.
- head(self, int num_rows): Load the first N rows of the dataset.
- scan_batches(self): Consume a Scanner in record batches with corresponding fragments.
- take(self, indices): Select rows of data by index.
- to_batches(self): Consume a Scanner in record batches.
- to_reader(self): Consume this scanner as a RecordBatchReader.
- to_table(self): Convert a Scanner into a Table.
Attributes
- dataset_schema: The schema with which batches will be read from fragments.
- projected_schema: The materialized schema of the data, accounting for projections.
- dataset_schema
The schema with which batches will be read from fragments.
- static from_batches(source, Schema schema=None, bool use_threads=True, use_async=None, MemoryPool memory_pool=None, columns=None, Expression filter=None, int batch_size=_DEFAULT_BATCH_SIZE, FragmentScanOptions fragment_scan_options=None)
Create a Scanner from an iterator of batches.
This creates a scanner which can be used only once. It is intended to support writing a dataset (which takes a scanner) from a source which can be read only once (e.g. a RecordBatchReader or generator).
- Parameters
- source
Iterator
The iterator of record batches.
- schema
Schema
The schema of the batches.
- columns
list of str or dict, default None
The columns to project.
- filter
Expression, default None
Scan will return only the rows matching the filter.
- batch_size
int, default 1M
The maximum row count for scanned record batches.
- use_threads
bool, default True
If enabled, maximum parallelism is used, as determined by the number of available CPU cores.
- use_async
bool, default True
This flag is deprecated and is being kept for this release for backwards compatibility. It will be removed in the next release.
- memory_pool
MemoryPool, default None
For memory allocations, if required. If not specified, uses the default pool.
- fragment_scan_options
FragmentScanOptions
The fragment scan options.
- static from_dataset(Dataset dataset, bool use_threads=True, use_async=None, MemoryPool memory_pool=None, columns=None, Expression filter=None, int batch_size=_DEFAULT_BATCH_SIZE, FragmentScanOptions fragment_scan_options=None)
Create a Scanner from a Dataset. Refer to the Scanner class documentation for additional details.
- Parameters
- dataset
Dataset
Dataset to scan.
- columns
list of str or dict, default None
The columns to project.
- filter
Expression, default None
Scan will return only the rows matching the filter.
- batch_size
int, default 1M
The maximum row count for scanned record batches.
- use_threads
bool, default True
If enabled, maximum parallelism is used, as determined by the number of available CPU cores.
- use_async
bool, default N/A
This flag is deprecated and is being kept for this release for backwards compatibility. It will be removed in the next release.
- memory_pool
MemoryPool, default None
For memory allocations, if required. If not specified, uses the default pool.
- fragment_scan_options
FragmentScanOptions
The fragment scan options.
- static from_fragment(Fragment fragment, Schema schema=None, bool use_threads=True, use_async=None, MemoryPool memory_pool=None, columns=None, Expression filter=None, int batch_size=_DEFAULT_BATCH_SIZE, FragmentScanOptions fragment_scan_options=None)
Create a Scanner from a Fragment. Refer to the Scanner class documentation for additional details.
- Parameters
- fragment
Fragment
Fragment to scan.
- schema
Schema
The schema of the fragment.
- columns
list of str or dict, default None
The columns to project.
- filter
Expression, default None
Scan will return only the rows matching the filter.
- batch_size
int, default 1M
The maximum row count for scanned record batches.
- use_threads
bool, default True
If enabled, maximum parallelism is used, as determined by the number of available CPU cores.
- use_async
bool, default N/A
This flag is deprecated and is being kept for this release for backwards compatibility. It will be removed in the next release.
- memory_pool
MemoryPool, default None
For memory allocations, if required. If not specified, uses the default pool.
- fragment_scan_options
FragmentScanOptions
The fragment scan options.
- projected_schema
The materialized schema of the data, accounting for projections.
This is the schema of any data returned from the scanner.
- scan_batches(self)
Consume a Scanner in record batches with corresponding fragments.
- Returns
- record_batches
iterator of TaggedRecordBatch
- take(self, indices)
Select rows of data by index.
Will only consume as many batches of the underlying dataset as needed. Otherwise, this is equivalent to to_table().take(indices).
- Returns
Table
- to_batches(self)
Consume a Scanner in record batches.
- Returns
- record_batches
iterator of RecordBatch
- to_reader(self)
Consume this scanner as a RecordBatchReader.