pyarrow.dataset.FileSystemDataset
- class pyarrow.dataset.FileSystemDataset(fragments, Schema schema, FileFormat format, FileSystem filesystem=None, root_partition=None)
Bases: pyarrow._dataset.Dataset
A Dataset of file fragments.
A FileSystemDataset is composed of one or more FileFragment.
- Parameters
- fragments : list[Fragment]
List of fragments to consume.
- schema : Schema
The top-level schema of the Dataset.
- format : FileFormat
File format of the fragments, currently only ParquetFileFormat, IpcFileFormat, and CsvFileFormat are supported.
- filesystem : FileSystem
FileSystem of the fragments.
- root_partition : Expression, optional
The top-level partition of the Dataset.
- __init__(*args, **kwargs)
Methods
- __init__(*args, **kwargs)
- count_rows(self, **kwargs): Count rows matching the scanner filter.
- from_paths(type cls, paths[, schema, ...]): A Dataset created from a list of paths on a particular filesystem.
- get_fragments(self, Expression filter=None): Returns an iterator over the fragments in this dataset.
- head(self, int num_rows, **kwargs): Load the first N rows of the dataset.
- replace_schema(self, Schema schema): Return a copy of this Dataset with a different schema.
- scanner(self, **kwargs): Builds a scan operation against the dataset.
- take(self, indices, **kwargs): Select rows of data by index.
- to_batches(self, **kwargs): Read the dataset as materialized record batches.
- to_table(self, **kwargs): Read the dataset to an arrow table.
Attributes
- files: List of the files.
- filesystem
- format: The FileFormat of this source.
- partition_expression: An Expression which evaluates to true for all data viewed by this Dataset.
- partitioning: The partitioning of the Dataset source, if discovered.
- schema: The common schema of the full Dataset.
- count_rows(self, **kwargs)
Count rows matching the scanner filter.
See scanner method parameters documentation.
- Returns
- count : int
- files
List of the files.
- filesystem
- format
The FileFormat of this source.
- from_paths(type cls, paths, schema=None, format=None, filesystem=None, partitions=None, root_partition=None)
A Dataset created from a list of paths on a particular filesystem.
- Parameters
- paths : list of str
List of file paths to create the fragments from.
- schema : Schema
The top-level schema of the Dataset.
- format : FileFormat
File format to create fragments from, currently only ParquetFileFormat, IpcFileFormat, and CsvFileFormat are supported.
- filesystem : FileSystem
The filesystem which files are from.
- partitions : list[Expression], optional
Attach additional partition information for the file paths.
- root_partition : Expression, optional
The top-level partition of the Dataset.
- get_fragments(self, Expression filter=None)
Returns an iterator over the fragments in this dataset.
- Parameters
- filter : Expression, default None
Return fragments matching the optional filter, either using the partition_expression or internal information like Parquet's statistics.
- Returns
- fragments : iterator of Fragment
- head(self, int num_rows, **kwargs)
Load the first N rows of the dataset.
See scanner method parameters documentation.
- Returns
- table : Table
- partition_expression
An Expression which evaluates to true for all data viewed by this Dataset.
- partitioning
The partitioning of the Dataset source, if discovered.
If the FileSystemDataset is created using the dataset() factory function with a partitioning specified, this will return the finalized Partitioning object from the dataset discovery. In all other cases, this returns None.
- replace_schema(self, Schema schema)
Return a copy of this Dataset with a different schema.
The copy will view the same Fragments. If the new schema is not compatible with the original dataset’s schema then an error will be raised.
- scanner(self, **kwargs)
Builds a scan operation against the dataset.
Data is not loaded immediately. Instead, this produces a Scanner, which exposes further operations (e.g. loading all data as a table, counting rows).
- Parameters
- columns : list of str, default None
The columns to project. This can be a list of column names to include (order and duplicates will be preserved), or a dictionary with {new_column_name: expression} values for more advanced projections. The columns will be passed down to Datasets and corresponding data fragments to avoid loading, copying, and deserializing columns that will not be required further down the compute chain. By default all of the available columns are projected. Raises an exception if any of the referenced column names does not exist in the dataset's Schema.
- filter : Expression, default None
Scan will return only the rows matching the filter. If possible the predicate will be pushed down to exploit the partition information or internal metadata found in the data source, e.g. Parquet statistics. Otherwise filters the loaded RecordBatches before yielding them.
- batch_size : int, default 1M
The maximum row count for scanned record batches. If scanned record batches are overflowing memory then this method can be called to reduce their size.
- use_threads : bool, default True
If enabled, then maximum parallelism will be used, determined by the number of available CPU cores.
- use_async : bool, default True
This flag is deprecated and is being kept for this release for backwards compatibility. It will be removed in the next release.
- memory_pool : MemoryPool, default None
For memory allocations, if required. If not specified, uses the default pool.
- fragment_scan_options : FragmentScanOptions, default None
Options specific to a particular scan and fragment type, which can change between different scans of the same dataset.
- Returns
- scanner : Scanner
Examples
>>> import pyarrow.dataset as ds
>>> dataset = ds.dataset("path/to/dataset")
Selecting a subset of the columns:
>>> dataset.scanner(columns=["A", "B"]).to_table()
Projecting selected columns using an expression:
>>> dataset.scanner(columns={
...     "A_int": ds.field("A").cast("int64"),
... }).to_table()
Filtering rows while scanning:
>>> dataset.scanner(filter=ds.field("A") > 0).to_table()
- schema
The common schema of the full Dataset.
- take(self, indices, **kwargs)
Select rows of data by index.
See scanner method parameters documentation.
- Returns
- table : Table
- to_batches(self, **kwargs)
Read the dataset as materialized record batches.
See scanner method parameters documentation.
- Returns
- record_batches : iterator of RecordBatch