pyarrow.dataset.Fragment#
- class pyarrow.dataset.Fragment#
Bases:
_Weakrefable
Fragment of data from a Dataset.
- __init__(*args, **kwargs)#
Methods
__init__
(*args, **kwargs)count_rows
(self, Expression filter=None, ...)Count rows matching the scanner filter.
head
(self, int num_rows[, columns])Load the first N rows of the fragment.
scanner
(self, Schema schema=None[, columns])Build a scan operation against the fragment.
take
(self, indices[, columns])Select rows of data by index.
to_batches
(self, Schema schema=None[, columns])Read the fragment as materialized record batches.
to_table
(self, Schema schema=None[, columns])Convert this Fragment into a Table.
Attributes
An Expression which evaluates to true for all data viewed by this Fragment.
Return the physical schema of this Fragment.
- count_rows(self, Expression filter=None, int batch_size=_DEFAULT_BATCH_SIZE, int batch_readahead=_DEFAULT_BATCH_READAHEAD, int fragment_readahead=_DEFAULT_FRAGMENT_READAHEAD, FragmentScanOptions fragment_scan_options=None, bool use_threads=True, MemoryPool memory_pool=None)#
Count rows matching the scanner filter.
- Parameters:
- filter
Expression
, defaultNone
Scan will return only the rows matching the filter. If possible the predicate will be pushed down to exploit the partition information or internal metadata found in the data source, e.g. Parquet statistics. Otherwise filters the loaded RecordBatches before yielding them.
- batch_size
int
, default 131_072 The maximum row count for scanned record batches. If scanned record batches are overflowing memory then this method can be called to reduce their size.
- batch_readahead
int
, default 16 The number of batches to read ahead in a file. This might not work for all file formats. Increasing this number will increase RAM usage but could also improve IO utilization.
- fragment_readahead
int
, default 4 The number of files to read ahead. Increasing this number will increase RAM usage but could also improve IO utilization.
- fragment_scan_options
FragmentScanOptions
, defaultNone
Options specific to a particular scan and fragment type, which can change between different scans of the same dataset.
- use_threadsbool, default
True
If enabled, then maximum parallelism will be used determined by the number of available CPU cores.
- memory_pool
MemoryPool
, defaultNone
For memory allocations, if required. If not specified, uses the default pool.
- filter
- Returns:
- count
int
- count
- head(self, int num_rows, columns=None, Expression filter=None, int batch_size=_DEFAULT_BATCH_SIZE, int batch_readahead=_DEFAULT_BATCH_READAHEAD, int fragment_readahead=_DEFAULT_FRAGMENT_READAHEAD, FragmentScanOptions fragment_scan_options=None, bool use_threads=True, MemoryPool memory_pool=None)#
Load the first N rows of the fragment.
- Parameters:
- num_rows
int
The number of rows to load.
- columns
list
ofstr
, defaultNone
The columns to project. This can be a list of column names to include (order and duplicates will be preserved), or a dictionary with {new_column_name: expression} values for more advanced projections.
The list of columns or expressions may use the special fields __batch_index (the index of the batch within the fragment), __fragment_index (the index of the fragment within the dataset), __last_in_fragment (whether the batch is last in fragment), and __filename (the name of the source file or a description of the source fragment).
The columns will be passed down to Datasets and corresponding data fragments to avoid loading, copying, and deserializing columns that will not be required further down the compute chain. By default all of the available columns are projected. Raises an exception if any of the referenced column names does not exist in the dataset’s Schema.
- filter
Expression
, defaultNone
Scan will return only the rows matching the filter. If possible the predicate will be pushed down to exploit the partition information or internal metadata found in the data source, e.g. Parquet statistics. Otherwise filters the loaded RecordBatches before yielding them.
- batch_size
int
, default 131_072 The maximum row count for scanned record batches. If scanned record batches are overflowing memory then this method can be called to reduce their size.
- batch_readahead
int
, default 16 The number of batches to read ahead in a file. This might not work for all file formats. Increasing this number will increase RAM usage but could also improve IO utilization.
- fragment_readahead
int
, default 4 The number of files to read ahead. Increasing this number will increase RAM usage but could also improve IO utilization.
- fragment_scan_options
FragmentScanOptions
, defaultNone
Options specific to a particular scan and fragment type, which can change between different scans of the same dataset.
- use_threadsbool, default
True
If enabled, then maximum parallelism will be used determined by the number of available CPU cores.
- memory_pool
MemoryPool
, defaultNone
For memory allocations, if required. If not specified, uses the default pool.
- num_rows
- Returns:
- partition_expression#
An Expression which evaluates to true for all data viewed by this Fragment.
- physical_schema#
Return the physical schema of this Fragment. This schema can be different from the dataset read schema.
- scanner(self, Schema schema=None, columns=None, Expression filter=None, int batch_size=_DEFAULT_BATCH_SIZE, int batch_readahead=_DEFAULT_BATCH_READAHEAD, int fragment_readahead=_DEFAULT_FRAGMENT_READAHEAD, FragmentScanOptions fragment_scan_options=None, bool use_threads=True, MemoryPool memory_pool=None)#
Build a scan operation against the fragment.
Data is not loaded immediately. Instead, this produces a Scanner, which exposes further operations (e.g. loading all data as a table, counting rows).
- Parameters:
- schema
Schema
Schema to use for scanning. This is used to unify a Fragment to its Dataset’s schema. If not specified this will use the Fragment’s physical schema which might differ for each Fragment.
- columns
list
ofstr
, defaultNone
The columns to project. This can be a list of column names to include (order and duplicates will be preserved), or a dictionary with {new_column_name: expression} values for more advanced projections.
The list of columns or expressions may use the special fields __batch_index (the index of the batch within the fragment), __fragment_index (the index of the fragment within the dataset), __last_in_fragment (whether the batch is last in fragment), and __filename (the name of the source file or a description of the source fragment).
The columns will be passed down to Datasets and corresponding data fragments to avoid loading, copying, and deserializing columns that will not be required further down the compute chain. By default all of the available columns are projected. Raises an exception if any of the referenced column names does not exist in the dataset’s Schema.
- filter
Expression
, defaultNone
Scan will return only the rows matching the filter. If possible the predicate will be pushed down to exploit the partition information or internal metadata found in the data source, e.g. Parquet statistics. Otherwise filters the loaded RecordBatches before yielding them.
- batch_size
int
, default 131_072 The maximum row count for scanned record batches. If scanned record batches are overflowing memory then this method can be called to reduce their size.
- batch_readahead
int
, default 16 The number of batches to read ahead in a file. This might not work for all file formats. Increasing this number will increase RAM usage but could also improve IO utilization.
- fragment_readahead
int
, default 4 The number of files to read ahead. Increasing this number will increase RAM usage but could also improve IO utilization.
- fragment_scan_options
FragmentScanOptions
, defaultNone
Options specific to a particular scan and fragment type, which can change between different scans of the same dataset.
- use_threadsbool, default
True
If enabled, then maximum parallelism will be used determined by the number of available CPU cores.
- memory_pool
MemoryPool
, defaultNone
For memory allocations, if required. If not specified, uses the default pool.
- schema
- Returns:
- scanner
Scanner
- scanner
- take(self, indices, columns=None, Expression filter=None, int batch_size=_DEFAULT_BATCH_SIZE, int batch_readahead=_DEFAULT_BATCH_READAHEAD, int fragment_readahead=_DEFAULT_FRAGMENT_READAHEAD, FragmentScanOptions fragment_scan_options=None, bool use_threads=True, MemoryPool memory_pool=None)#
Select rows of data by index.
- Parameters:
- indices
Array
orarray-like
The indices of row to select in the dataset.
- columns
list
ofstr
, defaultNone
The columns to project. This can be a list of column names to include (order and duplicates will be preserved), or a dictionary with {new_column_name: expression} values for more advanced projections.
The list of columns or expressions may use the special fields __batch_index (the index of the batch within the fragment), __fragment_index (the index of the fragment within the dataset), __last_in_fragment (whether the batch is last in fragment), and __filename (the name of the source file or a description of the source fragment).
The columns will be passed down to Datasets and corresponding data fragments to avoid loading, copying, and deserializing columns that will not be required further down the compute chain. By default all of the available columns are projected. Raises an exception if any of the referenced column names does not exist in the dataset’s Schema.
- filter
Expression
, defaultNone
Scan will return only the rows matching the filter. If possible the predicate will be pushed down to exploit the partition information or internal metadata found in the data source, e.g. Parquet statistics. Otherwise filters the loaded RecordBatches before yielding them.
- batch_size
int
, default 131_072 The maximum row count for scanned record batches. If scanned record batches are overflowing memory then this method can be called to reduce their size.
- batch_readahead
int
, default 16 The number of batches to read ahead in a file. This might not work for all file formats. Increasing this number will increase RAM usage but could also improve IO utilization.
- fragment_readahead
int
, default 4 The number of files to read ahead. Increasing this number will increase RAM usage but could also improve IO utilization.
- fragment_scan_options
FragmentScanOptions
, defaultNone
Options specific to a particular scan and fragment type, which can change between different scans of the same dataset.
- use_threadsbool, default
True
If enabled, then maximum parallelism will be used determined by the number of available CPU cores.
- memory_pool
MemoryPool
, defaultNone
For memory allocations, if required. If not specified, uses the default pool.
- indices
- Returns:
- to_batches(self, Schema schema=None, columns=None, Expression filter=None, int batch_size=_DEFAULT_BATCH_SIZE, int batch_readahead=_DEFAULT_BATCH_READAHEAD, int fragment_readahead=_DEFAULT_FRAGMENT_READAHEAD, FragmentScanOptions fragment_scan_options=None, bool use_threads=True, MemoryPool memory_pool=None)#
Read the fragment as materialized record batches.
- Parameters:
- schema
Schema
, optional Concrete schema to use for scanning.
- columns
list
ofstr
, defaultNone
The columns to project. This can be a list of column names to include (order and duplicates will be preserved), or a dictionary with {new_column_name: expression} values for more advanced projections.
The list of columns or expressions may use the special fields __batch_index (the index of the batch within the fragment), __fragment_index (the index of the fragment within the dataset), __last_in_fragment (whether the batch is last in fragment), and __filename (the name of the source file or a description of the source fragment).
The columns will be passed down to Datasets and corresponding data fragments to avoid loading, copying, and deserializing columns that will not be required further down the compute chain. By default all of the available columns are projected. Raises an exception if any of the referenced column names does not exist in the dataset’s Schema.
- filter
Expression
, defaultNone
Scan will return only the rows matching the filter. If possible the predicate will be pushed down to exploit the partition information or internal metadata found in the data source, e.g. Parquet statistics. Otherwise filters the loaded RecordBatches before yielding them.
- batch_size
int
, default 131_072 The maximum row count for scanned record batches. If scanned record batches are overflowing memory then this method can be called to reduce their size.
- batch_readahead
int
, default 16 The number of batches to read ahead in a file. This might not work for all file formats. Increasing this number will increase RAM usage but could also improve IO utilization.
- fragment_readahead
int
, default 4 The number of files to read ahead. Increasing this number will increase RAM usage but could also improve IO utilization.
- fragment_scan_options
FragmentScanOptions
, defaultNone
Options specific to a particular scan and fragment type, which can change between different scans of the same dataset.
- use_threadsbool, default
True
If enabled, then maximum parallelism will be used determined by the number of available CPU cores.
- memory_pool
MemoryPool
, defaultNone
For memory allocations, if required. If not specified, uses the default pool.
- schema
- Returns:
- record_batchesiterator of
RecordBatch
- record_batchesiterator of
- to_table(self, Schema schema=None, columns=None, Expression filter=None, int batch_size=_DEFAULT_BATCH_SIZE, int batch_readahead=_DEFAULT_BATCH_READAHEAD, int fragment_readahead=_DEFAULT_FRAGMENT_READAHEAD, FragmentScanOptions fragment_scan_options=None, bool use_threads=True, MemoryPool memory_pool=None)#
Convert this Fragment into a Table.
Use this convenience utility with care. This will serially materialize the Scan result in memory before creating the Table.
- Parameters:
- schema
Schema
, optional Concrete schema to use for scanning.
- columns
list
ofstr
, defaultNone
The columns to project. This can be a list of column names to include (order and duplicates will be preserved), or a dictionary with {new_column_name: expression} values for more advanced projections.
The list of columns or expressions may use the special fields __batch_index (the index of the batch within the fragment), __fragment_index (the index of the fragment within the dataset), __last_in_fragment (whether the batch is last in fragment), and __filename (the name of the source file or a description of the source fragment).
The columns will be passed down to Datasets and corresponding data fragments to avoid loading, copying, and deserializing columns that will not be required further down the compute chain. By default all of the available columns are projected. Raises an exception if any of the referenced column names does not exist in the dataset’s Schema.
- filter
Expression
, defaultNone
Scan will return only the rows matching the filter. If possible the predicate will be pushed down to exploit the partition information or internal metadata found in the data source, e.g. Parquet statistics. Otherwise filters the loaded RecordBatches before yielding them.
- batch_size
int
, default 131_072 The maximum row count for scanned record batches. If scanned record batches are overflowing memory then this method can be called to reduce their size.
- batch_readahead
int
, default 16 The number of batches to read ahead in a file. This might not work for all file formats. Increasing this number will increase RAM usage but could also improve IO utilization.
- fragment_readahead
int
, default 4 The number of files to read ahead. Increasing this number will increase RAM usage but could also improve IO utilization.
- fragment_scan_options
FragmentScanOptions
, defaultNone
Options specific to a particular scan and fragment type, which can change between different scans of the same dataset.
- use_threadsbool, default
True
If enabled, then maximum parallelism will be used determined by the number of available CPU cores.
- memory_pool
MemoryPool
, defaultNone
For memory allocations, if required. If not specified, uses the default pool.
- schema
- Returns:
- table
Table
- table