pyarrow.dataset.FileSystemDataset¶
- class pyarrow.dataset.FileSystemDataset(fragments, Schema schema, FileFormat format, FileSystem filesystem=None, root_partition=None)¶
Bases:
Dataset
A Dataset of file fragments.
A FileSystemDataset is composed of one or more FileFragment.
- Parameters:
- fragments
list[Fragments]
List of fragments to consume.
- schema
Schema
The top-level schema of the Dataset.
- format
FileFormat
File format of the fragments, currently only ParquetFileFormat, IpcFileFormat, and CsvFileFormat are supported.
- filesystem
FileSystem
FileSystem of the fragments.
- root_partition
Expression, optional
The top-level partition of the Dataset.
- __init__(*args, **kwargs)¶
Methods
__init__(*args, **kwargs)
count_rows(self, **kwargs)
Count rows matching the scanner filter.
filter(self, expression)
Apply a row filter to the dataset.
from_paths(type cls, paths[, schema, ...])
A Dataset created from a list of paths on a particular filesystem.
get_fragments(self, Expression filter=None)
Returns an iterator over the fragments in this dataset.
head(self, int num_rows, **kwargs)
Load the first N rows of the dataset.
join(self, right_dataset, keys[, ...])
Perform a join between this dataset and another one.
replace_schema(self, Schema schema)
Return a copy of this Dataset with a different schema.
scanner(self, **kwargs)
Build a scan operation against the dataset.
sort_by(self, sorting, **kwargs)
Sort the Dataset by one or multiple columns.
take(self, indices, **kwargs)
Select rows of data by index.
to_batches(self, **kwargs)
Read the dataset as materialized record batches.
to_table(self, **kwargs)
Read the dataset to an Arrow table.
Attributes
files
List of the files
filesystem
format
The FileFormat of this source.
partition_expression
An Expression which evaluates to true for all data viewed by this Dataset.
partitioning
The partitioning of the Dataset source, if discovered.
schema
The common schema of the full Dataset
- count_rows(self, **kwargs)¶
Count rows matching the scanner filter.
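A minimal usage sketch, assuming the "dataset_scanner.parquet" file written in the scanner() example further down; the filter expression is pushed into the scan instead of materializing a table first:
>>> import pyarrow.dataset as ds
>>> dataset = ds.dataset("dataset_scanner.parquet")
>>> # Count only the rows matching a pushed-down filter expression
>>> n_recent = dataset.count_rows(filter=ds.field("year") > 2020)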
- files¶
List of the files
- filesystem¶
- filter(self, expression)¶
Apply a row filter to the dataset.
- Parameters:
- expression
Expression
The filter that should be applied to the dataset.
- Returns:
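A short sketch, assuming the "dataset_scanner.parquet" file from the scanner() example below; the returned dataset carries the filter and applies it on every subsequent scan:
>>> import pyarrow.dataset as ds
>>> dataset = ds.dataset("dataset_scanner.parquet")
>>> # The filter is applied lazily; only matching rows are materialized
>>> filtered = dataset.filter(ds.field("n_legs") >= 4)
>>> table = filtered.to_table()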
- format¶
The FileFormat of this source.
- from_paths(type cls, paths, schema=None, format=None, filesystem=None, partitions=None, root_partition=None)¶
A Dataset created from a list of paths on a particular filesystem.
- Parameters:
- paths
list of str
List of file paths to create the fragments from.
- schema
Schema
The top-level schema of the Dataset.
- format
FileFormat
File format to create fragments from, currently only ParquetFileFormat, IpcFileFormat, and CsvFileFormat are supported.
- filesystem
FileSystem
The filesystem which files are from.
- partitions
list[Expression], optional
Attach additional partition information for the file paths.
- root_partition
Expression, optional
The top-level partition of the Dataset.
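A minimal sketch of assembling a FileSystemDataset from explicit paths on the local filesystem; the two Parquet file names are purely illustrative:
>>> import pyarrow as pa
>>> import pyarrow.parquet as pq
>>> import pyarrow.dataset as ds
>>> from pyarrow.fs import LocalFileSystem
>>> # Write two small Parquet files to act as fragments (paths are illustrative)
>>> table = pa.table({'n_legs': [2, 4], 'animal': ["Flamingo", "Dog"]})
>>> pq.write_table(table, "part-0.parquet")
>>> pq.write_table(table, "part-1.parquet")
>>> # Build the dataset directly from the file paths
>>> dataset = ds.FileSystemDataset.from_paths(
...     ["part-0.parquet", "part-1.parquet"],
...     schema=table.schema,
...     format=ds.ParquetFileFormat(),
...     filesystem=LocalFileSystem())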
- get_fragments(self, Expression filter=None)¶
Returns an iterator over the fragments in this dataset.
- Parameters:
- filter
Expression, default None
Return fragments matching the optional filter, either using the partition_expression or internal information like Parquet's statistics.
- Returns:
- fragments
iterator of Fragment
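For example, assuming the "dataset_scanner.parquet" file from the scanner() example below, each yielded FileFragment exposes the path and partition expression of one underlying file:
>>> import pyarrow.dataset as ds
>>> dataset = ds.dataset("dataset_scanner.parquet")
>>> for fragment in dataset.get_fragments():
...     # Each fragment corresponds to one file of the dataset
...     path = fragment.path
...     expr = fragment.partition_expression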
- head(self, int num_rows, **kwargs)¶
Load the first N rows of the dataset.
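A short sketch, assuming the "dataset_scanner.parquet" file from the scanner() example below:
>>> import pyarrow.dataset as ds
>>> dataset = ds.dataset("dataset_scanner.parquet")
>>> # Returns a pyarrow.Table with at most the first 3 rows of the dataset
>>> first_rows = dataset.head(3)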
- join(self, right_dataset, keys, right_keys=None, join_type='left outer', left_suffix=None, right_suffix=None, coalesce_keys=True, use_threads=True)¶
Perform a join between this dataset and another one.
Result of the join will be a new dataset, where further operations can be applied.
- Parameters:
- right_dataset
dataset
The dataset to join to the current one, acting as the right dataset in the join operation.
- keys
str or list[str]
The columns from the current dataset that should be used as keys of the join operation left side.
- right_keys
str or list[str], default None
The columns from the right_dataset that should be used as keys on the join operation right side. When None, use the same key names as the left dataset.
- join_type
str, default "left outer"
The kind of join that should be performed, one of ("left semi", "right semi", "left anti", "right anti", "inner", "left outer", "right outer", "full outer")
- left_suffix
str, default None
Which suffix to add to left column names. This prevents confusion when the columns in left and right datasets have colliding names.
- right_suffix
str, default None
Which suffix to add to the right column names. This prevents confusion when the columns in left and right datasets have colliding names.
- coalesce_keys
bool, default True
If the duplicated keys should be omitted from one of the sides in the join result.
- use_threads
bool, default True
Whether to use multithreading or not.
- Returns:
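A minimal sketch, assuming the "dataset_scanner.parquet" file from the scanner() example below and a small in-memory lookup table passed to ds.dataset() as the right-hand side:
>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> left = ds.dataset("dataset_scanner.parquet")
>>> right = ds.dataset(pa.table({'year': [2020, 2021, 2022],
...                              'label': ["a", "b", "c"]}))
>>> # Left outer join on "year"; the result is a new dataset
>>> joined = left.join(right, keys="year")
>>> table = joined.to_table()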
- partition_expression¶
An Expression which evaluates to true for all data viewed by this Dataset.
- partitioning¶
The partitioning of the Dataset source, if discovered.
If the FileSystemDataset is created using the
dataset()
factory function with a partitioning specified, this will return the finalized Partitioning object from the dataset discovery. In all other cases, this returns None.
- replace_schema(self, Schema schema)¶
Return a copy of this Dataset with a different schema.
The copy will view the same Fragments. If the new schema is not compatible with the original dataset’s schema then an error will be raised.
- Parameters:
- schema
Schema
The new dataset schema.
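For example, assuming the "dataset_scanner.parquet" file from the scanner() example below, the schema can be narrowed to a compatible subset of the original fields:
>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> dataset = ds.dataset("dataset_scanner.parquet")
>>> # The new schema must be compatible with the original one
>>> narrowed = dataset.replace_schema(pa.schema([('year', pa.int64()),
...                                              ('n_legs', pa.int64())]))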
- scanner(self, **kwargs)¶
Build a scan operation against the dataset.
Data is not loaded immediately. Instead, this produces a Scanner, which exposes further operations (e.g. loading all data as a table, counting rows).
See the
Scanner.from_dataset()
method for further information.
Examples
>>> import pyarrow as pa
>>> table = pa.table({'year': [2020, 2022, 2021, 2022, 2019, 2021],
...                   'n_legs': [2, 2, 4, 4, 5, 100],
...                   'animal': ["Flamingo", "Parrot", "Dog", "Horse",
...                              "Brittle stars", "Centipede"]})
>>>
>>> import pyarrow.parquet as pq
>>> pq.write_table(table, "dataset_scanner.parquet")
>>> import pyarrow.dataset as ds
>>> dataset = ds.dataset("dataset_scanner.parquet")
Selecting a subset of the columns:
>>> dataset.scanner(columns=["year", "n_legs"]).to_table()
pyarrow.Table
year: int64
n_legs: int64
----
year: [[2020,2022,2021,2022,2019,2021]]
n_legs: [[2,2,4,4,5,100]]
Projecting selected columns using an expression:
>>> dataset.scanner(columns={
...     "n_legs_uint": ds.field("n_legs").cast("uint8"),
... }).to_table()
pyarrow.Table
n_legs_uint: uint8
----
n_legs_uint: [[2,2,4,4,5,100]]
Filtering rows while scanning:
>>> dataset.scanner(filter=ds.field("year") > 2020).to_table()
pyarrow.Table
year: int64
n_legs: int64
animal: string
----
year: [[2022,2021,2022,2021]]
n_legs: [[2,4,4,100]]
animal: [["Parrot","Dog","Horse","Centipede"]]
- schema¶
The common schema of the full Dataset
- sort_by(self, sorting, **kwargs)¶
Sort the Dataset by one or multiple columns.
- Parameters:
- sorting
str or list[tuple(name, order)]
Name of the column to use to sort (ascending), or a list of multiple sorting conditions where each entry is a tuple with column name and sorting order ("ascending" or "descending").
- Returns:
InMemoryDataset
A new dataset sorted according to the sort keys.
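A short sketch, assuming the "dataset_scanner.parquet" file from the scanner() example above:
>>> import pyarrow.dataset as ds
>>> dataset = ds.dataset("dataset_scanner.parquet")
>>> # Single column, ascending by default
>>> by_year = dataset.sort_by("year")
>>> # Multiple sort keys with explicit orders
>>> by_year_desc = dataset.sort_by([("year", "descending"), ("n_legs", "ascending")])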
- take(self, indices, **kwargs)¶
Select rows of data by index.
- Parameters:
- indices
Array or array-like
Indices of rows to select in the dataset.
- **kwargs
dict, optional
See scanner() method for full parameter description.
- Returns:
- table
Table
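For example, assuming the "dataset_scanner.parquet" file from the scanner() example above:
>>> import pyarrow.dataset as ds
>>> dataset = ds.dataset("dataset_scanner.parquet")
>>> # Select rows 0, 2 and 5 of the whole dataset; returns a pyarrow.Table
>>> subset = dataset.take([0, 2, 5])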
- to_batches(self, **kwargs)¶
Read the dataset as materialized record batches.
- Parameters:
- **kwargs
dict, optional
Arguments for Scanner.from_dataset.
- Returns:
- record_batches
iterator of RecordBatch
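A short sketch, assuming the "dataset_scanner.parquet" file from the scanner() example above; scanner keyword arguments such as columns or filter can be forwarded:
>>> import pyarrow.dataset as ds
>>> dataset = ds.dataset("dataset_scanner.parquet")
>>> # Stream the data as RecordBatches instead of one large Table
>>> for batch in dataset.to_batches(columns=["year", "n_legs"]):
...     num_rows = batch.num_rows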