pyarrow.dataset.FileSystemDataset¶
- class pyarrow.dataset.FileSystemDataset(fragments, Schema schema, FileFormat format, FileSystem filesystem=None, root_partition=None)¶
Bases:
Dataset
A Dataset of file fragments.
A FileSystemDataset is composed of one or more FileFragment.
- Parameters:
- fragments
list[Fragments]
List of fragments to consume.
- schema
Schema
The top-level schema of the Dataset.
- format
FileFormat
File format of the fragments, currently only ParquetFileFormat, IpcFileFormat, and CsvFileFormat are supported.
- filesystem
FileSystem
FileSystem of the fragments.
- root_partition
Expression, optional
The top-level partition of the Dataset.
- __init__(*args, **kwargs)¶
Methods
__init__(*args, **kwargs)
count_rows(self, **kwargs)
Count rows matching the scanner filter.
filter(self, expression)
Apply a row filter to the dataset.
from_paths(type cls, paths[, schema, ...])
A Dataset created from a list of paths on a particular filesystem.
get_fragments(self, Expression filter=None)
Returns an iterator over the fragments in this dataset.
head(self, int num_rows, **kwargs)
Load the first N rows of the dataset.
join(self, right_dataset, keys[, ...])
Perform a join between this dataset and another one.
replace_schema(self, Schema schema)
Return a copy of this Dataset with a different schema.
scanner(self, **kwargs)
Build a scan operation against the dataset.
sort_by(self, sorting, **kwargs)
Sort the Dataset by one or multiple columns.
take(self, indices, **kwargs)
Select rows of data by index.
to_batches(self, **kwargs)
Read the dataset as materialized record batches.
to_table(self, **kwargs)
Read the dataset to an Arrow table.
Attributes
files
List of the files
filesystem
format
The FileFormat of this source.
partition_expression
An Expression which evaluates to true for all data viewed by this Dataset.
partitioning
The partitioning of the Dataset source, if discovered.
schema
The common schema of the full Dataset
- count_rows(self, **kwargs)¶
Count rows matching the scanner filter.
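A minimal usage sketch, assuming the "dataset_scanner.parquet" file written in the scanner() example further down; the filter expression is pushed into the scan instead of materializing a table first:
>>> import pyarrow.dataset as ds
>>> dataset = ds.dataset("dataset_scanner.parquet")
>>> # Count only the rows matching a pushed-down filter expression
>>> n_recent = dataset.count_rows(filter=ds.field("year") > 2020)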
- files¶
List of the files
- filesystem¶
- filter(self, expression)¶
Apply a row filter to the dataset.
- Parameters:
- expression
Expression
The filter that should be applied to the dataset.
- Returns:
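A short sketch, assuming the "dataset_scanner.parquet" file from the scanner() example below; the returned dataset carries the filter and applies it on every subsequent scan:
>>> import pyarrow.dataset as ds
>>> dataset = ds.dataset("dataset_scanner.parquet")
>>> # The filter is applied lazily; only matching rows are materialized
>>> filtered = dataset.filter(ds.field("n_legs") >= 4)
>>> table = filtered.to_table()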
- format¶
The FileFormat of this source.
- from_paths(type cls, paths, schema=None, format=None, filesystem=None, partitions=None, root_partition=None)¶
A Dataset created from a list of paths on a particular filesystem.
- Parameters:
- paths
list of str
List of file paths to create the fragments from.
- schema
Schema
The top-level schema of the Dataset.
- format
FileFormat
File format to create fragments from, currently only ParquetFileFormat, IpcFileFormat, and CsvFileFormat are supported.
- filesystem
FileSystem
The filesystem which files are from.
- partitions
list[Expression], optional
Attach additional partition information for the file paths.
- root_partition
Expression, optional
The top-level partition of the Dataset.
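A minimal sketch of assembling a FileSystemDataset from explicit paths on the local filesystem; the two Parquet file names are purely illustrative:
>>> import pyarrow as pa
>>> import pyarrow.parquet as pq
>>> import pyarrow.dataset as ds
>>> from pyarrow.fs import LocalFileSystem
>>> # Write two small Parquet files to act as fragments (paths are illustrative)
>>> table = pa.table({'n_legs': [2, 4], 'animal': ["Flamingo", "Dog"]})
>>> pq.write_table(table, "part-0.parquet")
>>> pq.write_table(table, "part-1.parquet")
>>> # Build the dataset directly from the file paths
>>> dataset = ds.FileSystemDataset.from_paths(
...     ["part-0.parquet", "part-1.parquet"],
...     schema=table.schema,
...     format=ds.ParquetFileFormat(),
...     filesystem=LocalFileSystem())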
- get_fragments(self, Expression filter=None)¶
Returns an iterator over the fragments in this dataset.
- Parameters:
- filter
Expression, default None
Return fragments matching the optional filter, either using the partition_expression or internal information like Parquet's statistics.
- Returns:
- fragments
iterator of Fragment
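For example, assuming the "dataset_scanner.parquet" file from the scanner() example below, each yielded FileFragment exposes the path and partition expression of one underlying file:
>>> import pyarrow.dataset as ds
>>> dataset = ds.dataset("dataset_scanner.parquet")
>>> for fragment in dataset.get_fragments():
...     # Each fragment corresponds to one file of the dataset
...     path = fragment.path
...     expr = fragment.partition_expression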
- head(self, int num_rows, **kwargs)¶
Load the first N rows of the dataset.
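A short sketch, assuming the "dataset_scanner.parquet" file from the scanner() example below:
>>> import pyarrow.dataset as ds
>>> dataset = ds.dataset("dataset_scanner.parquet")
>>> # Returns a pyarrow.Table with at most the first 3 rows of the dataset
>>> first_rows = dataset.head(3)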
- join(self, right_dataset, keys, right_keys=None, join_type='left outer', left_suffix=None, right_suffix=None, coalesce_keys=True, use_threads=True)¶
Perform a join between this dataset and another one.
Result of the join will be a new dataset, where further operations can be applied.
- Parameters:
- right_dataset
dataset
The dataset to join to the current one, acting as the right dataset in the join operation.
- keys
str or list[str]
The columns from the current dataset that should be used as keys of the join operation left side.
- right_keys
str or list[str], default None
The columns from the right_dataset that should be used as keys on the join operation right side. When None, use the same key names as the left dataset.
- join_type
str, default "left outer"
The kind of join that should be performed, one of ("left semi", "right semi", "left anti", "right anti", "inner", "left outer", "right outer", "full outer")
- left_suffix
str, default None
Which suffix to add to left column names. This prevents confusion when the columns in left and right datasets have colliding names.
- right_suffix
str, default None
Which suffix to add to the right column names. This prevents confusion when the columns in left and right datasets have colliding names.
- coalesce_keys
bool, default True
If the duplicated keys should be omitted from one of the sides in the join result.
- use_threads
bool, default True
Whether to use multithreading or not.
- Returns:
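A minimal sketch, assuming the "dataset_scanner.parquet" file from the scanner() example below and a small in-memory lookup table passed to ds.dataset() as the right-hand side:
>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> left = ds.dataset("dataset_scanner.parquet")
>>> right = ds.dataset(pa.table({'year': [2020, 2021, 2022],
...                              'label': ["a", "b", "c"]}))
>>> # Left outer join on "year"; the result is a new dataset
>>> joined = left.join(right, keys="year")
>>> table = joined.to_table()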
- partition_expression¶
An Expression which evaluates to true for all data viewed by this Dataset.
- partitioning¶
The partitioning of the Dataset source, if discovered.
If the FileSystemDataset is created using the
dataset()
factory function with a partitioning specified, this will return the finalized Partitioning object from the dataset discovery. In all other cases, this returns None.
- replace_schema(self, Schema schema)¶
Return a copy of this Dataset with a different schema.
The copy will view the same Fragments. If the new schema is not compatible with the original dataset’s schema then an error will be raised.
- Parameters:
- schema
Schema
The new dataset schema.
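For example, assuming the "dataset_scanner.parquet" file from the scanner() example below, the schema can be narrowed to a compatible subset of the original fields:
>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> dataset = ds.dataset("dataset_scanner.parquet")
>>> # The new schema must be compatible with the original one
>>> narrowed = dataset.replace_schema(pa.schema([('year', pa.int64()),
...                                              ('n_legs', pa.int64())]))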
- scanner(self, **kwargs)¶
Build a scan operation against the dataset.
Data is not loaded immediately. Instead, this produces a Scanner, which exposes further operations (e.g. loading all data as a table, counting rows).
See the
Scanner.from_dataset()
method for further information.
Examples
>>> import pyarrow as pa
>>> table = pa.table({'year': [2020, 2022, 2021, 2022, 2019, 2021],
...                   'n_legs': [2, 2, 4, 4, 5, 100],
...                   'animal': ["Flamingo", "Parrot", "Dog", "Horse",
...                              "Brittle stars", "Centipede"]})
>>>
>>> import pyarrow.parquet as pq
>>> pq.write_table(table, "dataset_scanner.parquet")
>>> import pyarrow.dataset as ds
>>> dataset = ds.dataset("dataset_scanner.parquet")
Selecting a subset of the columns:
>>> dataset.scanner(columns=["year", "n_legs"]).to_table()
pyarrow.Table
year: int64
n_legs: int64
----
year: [[2020,2022,2021,2022,2019,2021]]
n_legs: [[2,2,4,4,5,100]]
Projecting selected columns using an expression:
>>> dataset.scanner(columns={
...     "n_legs_uint": ds.field("n_legs").cast("uint8"),
... }).to_table()
pyarrow.Table
n_legs_uint: uint8
----
n_legs_uint: [[2,2,4,4,5,100]]
Filtering rows while scanning:
>>> dataset.scanner(filter=ds.field("year") > 2020).to_table()
pyarrow.Table
year: int64
n_legs: int64
animal: string
----
year: [[2022,2021,2022,2021]]
n_legs: [[2,4,4,100]]
animal: [["Parrot","Dog","Horse","Centipede"]]
- schema¶
The common schema of the full Dataset
- sort_by(self, sorting, **kwargs)¶
Sort the Dataset by one or multiple columns.
- Parameters:
- sorting
str or list[tuple(name, order)]
Name of the column to use to sort (ascending), or a list of multiple sorting conditions where each entry is a tuple with column name and sorting order ("ascending" or "descending").
- Returns:
InMemoryDataset
A new dataset sorted according to the sort keys.
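A short sketch, assuming the "dataset_scanner.parquet" file from the scanner() example above:
>>> import pyarrow.dataset as ds
>>> dataset = ds.dataset("dataset_scanner.parquet")
>>> # Single column, ascending by default
>>> by_year = dataset.sort_by("year")
>>> # Multiple sort keys with explicit orders
>>> by_year_desc = dataset.sort_by([("year", "descending"), ("n_legs", "ascending")])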
- take(self, indices, **kwargs)¶
Select rows of data by index.
- Parameters:
- indices
Array or array-like
Indices of rows to select in the dataset.
- **kwargs
dict, optional
See scanner() method for full parameter description.
- Returns:
- table
Table
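For example, assuming the "dataset_scanner.parquet" file from the scanner() example above:
>>> import pyarrow.dataset as ds
>>> dataset = ds.dataset("dataset_scanner.parquet")
>>> # Select rows 0, 2 and 5 of the whole dataset; returns a pyarrow.Table
>>> subset = dataset.take([0, 2, 5])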
- to_batches(self, **kwargs)¶
Read the dataset as materialized record batches.
- Parameters:
- **kwargs
dict, optional
Arguments for Scanner.from_dataset.
- Returns:
- record_batches
iterator of RecordBatch
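A short sketch, assuming the "dataset_scanner.parquet" file from the scanner() example above; scanner keyword arguments such as columns or filter can be forwarded:
>>> import pyarrow.dataset as ds
>>> dataset = ds.dataset("dataset_scanner.parquet")
>>> # Stream the data as RecordBatches instead of one large Table
>>> for batch in dataset.to_batches(columns=["year", "n_legs"]):
...     num_rows = batch.num_rows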