pyarrow.dataset.Fragment

class pyarrow.dataset.Fragment

Bases: pyarrow.lib._Weakrefable

Fragment of data from a Dataset.

__init__(*args, **kwargs)

Methods

- __init__(*args, **kwargs)
- count_rows(self, **kwargs): Count rows matching the scanner filter.
- head(self, int num_rows, **kwargs): Load the first N rows of the fragment.
- scanner(self, Schema schema=None, **kwargs): Build a scan operation against the fragment.
- take(self, indices, **kwargs): Select rows of data by index.
- to_batches(self, Schema schema=None, **kwargs): Read the fragment as materialized record batches.
- to_table(self, Schema schema=None, **kwargs): Convert this Fragment into a Table.

Attributes

- partition_expression: An Expression which evaluates to true for all data viewed by this Fragment.
- physical_schema: Return the physical schema of this Fragment.
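As a usage sketch: a Fragment is not constructed directly but obtained from a Dataset. This assumes a local Parquet dataset at the hypothetical path "data/".

```python
import pyarrow.dataset as ds

# Hypothetical local Parquet dataset; adjust the path and format to your data.
dataset = ds.dataset("data/", format="parquet")

# Fragments are obtained from a Dataset rather than constructed directly.
for fragment in dataset.get_fragments():
    print(fragment)
```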
count_rows(self, **kwargs)

Count rows matching the scanner filter.

See the scanner() method for documentation of the parameters.

Returns

- count : int
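A minimal sketch, again assuming the hypothetical "data/" Parquet dataset and a column named "x":

```python
import pyarrow.dataset as ds

dataset = ds.dataset("data/", format="parquet")  # hypothetical path
fragment = next(dataset.get_fragments())

print(fragment.count_rows())                          # all rows in the fragment
print(fragment.count_rows(filter=ds.field("x") > 0))  # rows matching a filter
```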
head(self, int num_rows, **kwargs)

Load the first N rows of the fragment.

See the scanner() method for documentation of the parameters.

Returns

- table : Table
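A short sketch under the same assumed "data/" dataset:

```python
import pyarrow.dataset as ds

dataset = ds.dataset("data/", format="parquet")  # hypothetical path
fragment = next(dataset.get_fragments())

# Load only the first 5 rows of the fragment as a Table.
table = fragment.head(5)
print(table)
```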
partition_expression

An Expression which evaluates to true for all data viewed by this Fragment.
physical_schema

Return the physical schema of this Fragment. This schema can be different from the dataset read schema.
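To illustrate both attributes, a sketch assuming a hive-partitioned Parquet dataset at the hypothetical path "data/" (e.g. data/year=2023/part-0.parquet):

```python
import pyarrow.dataset as ds

# Hypothetical hive-partitioned layout, e.g. data/year=2023/part-0.parquet.
dataset = ds.dataset("data/", format="parquet", partitioning="hive")

for fragment in dataset.get_fragments():
    # Evaluates to true for every row in this fragment, e.g. (year == 2023).
    print(fragment.partition_expression)
    # Schema of the file itself; hive partition columns are typically absent
    # here even though they appear in dataset.schema.
    print(fragment.physical_schema)
```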
scanner(self, Schema schema=None, **kwargs)

Build a scan operation against the fragment.

Data is not loaded immediately. Instead, this produces a Scanner, which exposes further operations (e.g. loading all data as a table, counting rows).

Parameters

- schema : Schema
  Schema to use for scanning. This is used to unify a Fragment to its Dataset's schema. If not specified this will use the Fragment's physical schema, which might differ for each Fragment.
- columns : list of str, default None
  The columns to project. This can be a list of column names to include (order and duplicates will be preserved), or a dictionary with {new_column_name: expression} values for more advanced projections. The columns will be passed down to Datasets and corresponding data fragments to avoid loading, copying, and deserializing columns that will not be required further down the compute chain. By default all of the available columns are projected. Raises an exception if any of the referenced column names does not exist in the dataset's Schema.
- filter : Expression, default None
  Scan will return only the rows matching the filter. If possible the predicate will be pushed down to exploit the partition information or internal metadata found in the data source, e.g. Parquet statistics. Otherwise it filters the loaded RecordBatches before yielding them.
- batch_size : int, default 1M
  The maximum row count for scanned record batches. If scanned record batches are overflowing memory then this value can be reduced to shrink them.
- use_threads : bool, default True
  If enabled, maximum parallelism will be used, as determined by the number of available CPU cores.
- memory_pool : MemoryPool, default None
  For memory allocations, if required. If not specified, uses the default pool.
- fragment_scan_options : FragmentScanOptions, default None
  Options specific to a particular scan and fragment type, which can change between different scans of the same dataset.

Returns

- scanner : Scanner
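A sketch of a deferred scan, assuming the hypothetical "data/" dataset with a column "x":

```python
import pyarrow.dataset as ds

dataset = ds.dataset("data/", format="parquet")  # hypothetical path
fragment = next(dataset.get_fragments())

# Building the scanner does not load any data yet.
scanner = fragment.scanner(
    schema=dataset.schema,     # unify this fragment with the dataset schema
    columns=["x"],             # project only the assumed "x" column
    filter=ds.field("x") > 0,  # pushed down to the source where possible
    batch_size=64 * 1024,
)

# Consuming the scanner triggers the actual read.
table = scanner.to_table()
print(table.num_rows)
```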
take(self, indices, **kwargs)

Select rows of data by index.

See the scanner() method for documentation of the parameters.

Returns

- table : Table
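A minimal sketch under the same assumed "data/" dataset:

```python
import pyarrow.dataset as ds

dataset = ds.dataset("data/", format="parquet")  # hypothetical path
fragment = next(dataset.get_fragments())

# Select rows 0, 2 and 5 of the fragment; the result is a Table.
table = fragment.take([0, 2, 5])
print(table)
```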
to_batches(self, Schema schema=None, **kwargs)

Read the fragment as materialized record batches.

See the scanner() method for documentation of the parameters.

Returns

- record_batches : iterator of RecordBatch
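A sketch of batch-wise reading, assuming the hypothetical "data/" dataset:

```python
import pyarrow.dataset as ds

dataset = ds.dataset("data/", format="parquet")  # hypothetical path
fragment = next(dataset.get_fragments())

# Stream the fragment in bounded-size chunks instead of one large Table.
for batch in fragment.to_batches(batch_size=10_000):
    print(batch.num_rows)
```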
 
