pyarrow.dataset.dataset#

pyarrow.dataset.dataset(source, schema=None, format=None, filesystem=None, partitioning=None, partition_base_dir=None, exclude_invalid_files=None, ignore_prefixes=None)[source]#

Open a dataset.

Datasets provide functionality to efficiently work with tabular, potentially larger-than-memory, multi-file datasets.

  • A unified interface for different sources, like Parquet and Feather

  • Discovery of sources (crawling directories, handling directory-based partitioned datasets, basic schema normalization)

  • Optimized reading with predicate pushdown (filtering rows), projection (selecting columns), parallel reading, or fine-grained management of tasks.

Note that this is the high-level API; for more control over dataset construction, use the low-level API classes (FileSystemDataset, FileSystemDatasetFactory, etc.)

Parameters:
source : path, list of paths, dataset, list of datasets, (list of) RecordBatch or Table, iterable of RecordBatch, RecordBatchReader, or URI
Path pointing to a single file:

Open a FileSystemDataset from a single file.

Path pointing to a directory:

The directory gets discovered recursively according to a partitioning scheme if given.

List of file paths:

Create a FileSystemDataset from explicitly given files. The files must be located on the same filesystem given by the filesystem parameter. Note that, in contrast to construction from a single file, passing URIs as paths is not allowed.

List of datasets:

A nested UnionDataset gets constructed; it allows arbitrary composition of other datasets. Note that additional keyword arguments are not allowed.

(List of) batches or tables, iterable of batches, or RecordBatchReader:

Create an InMemoryDataset. If an iterable or empty list is given, a schema must also be given. If an iterable or RecordBatchReader is given, the resulting dataset can only be scanned once; further attempts will raise an error.

schema : Schema, optional

Optionally provide the Schema for the Dataset, in which case it will not be inferred from the source.

format : FileFormat or str

Currently “parquet”, “ipc”/”arrow”/”feather”, “csv”, “json”, and “orc” are supported. For Feather, only version 2 files are supported.

filesystem : FileSystem or URI str, default None

If a single path is given as source and filesystem is None, then the filesystem will be inferred from the path. If a URI string is passed, then a filesystem object is constructed using the URI’s optional path component as a directory prefix. See the examples below. Note that URIs on Windows must follow ‘file:///C:…’ or ‘file:/C:…’ patterns.

partitioning : Partitioning, PartitioningFactory, str, list of str

The partitioning scheme specified with the partitioning() function. A flavor string can be used as a shortcut, and with a list of field names a DirectoryPartitioning will be inferred.

partition_base_dir : str, optional

For the purposes of applying the partitioning, paths will be stripped of the partition_base_dir. Files not matching the partition_base_dir prefix will be skipped for partitioning discovery. The ignored files will still be part of the Dataset, but will not have partition information.

exclude_invalid_files : bool, optional (default True)

If True, invalid files will be excluded (file format specific check). This will incur IO for each file in a serial and single-threaded fashion. Disabling this feature will skip the IO, but unsupported files may be present in the Dataset (resulting in an error at scan time).

ignore_prefixes : list, optional

Files matching any of these prefixes will be ignored by the discovery process. This is matched to the basename of a path. By default this is [‘.’, ‘_’]. Note that discovery happens only if a directory is passed as source.

Returns:
dataset : Dataset

Either a FileSystemDataset or a UnionDataset depending on the source parameter.

Examples

Creating an example Table:

>>> import pyarrow as pa
>>> import pyarrow.parquet as pq
>>> table = pa.table({'year': [2020, 2022, 2021, 2022, 2019, 2021],
...                   'n_legs': [2, 2, 4, 4, 5, 100],
...                   'animal': ["Flamingo", "Parrot", "Dog", "Horse",
...                              "Brittle stars", "Centipede"]})
>>> pq.write_table(table, "file.parquet")

Opening a single file:

>>> import pyarrow.dataset as ds
>>> dataset = ds.dataset("file.parquet", format="parquet")
>>> dataset.to_table()
pyarrow.Table
year: int64
n_legs: int64
animal: string
----
year: [[2020,2022,2021,2022,2019,2021]]
n_legs: [[2,2,4,4,5,100]]
animal: [["Flamingo","Parrot","Dog","Horse","Brittle stars","Centipede"]]

Opening a single file with an explicit schema:

>>> myschema = pa.schema([
...     ('n_legs', pa.int64()),
...     ('animal', pa.string())])
>>> dataset = ds.dataset("file.parquet", schema=myschema, format="parquet")
>>> dataset.to_table()
pyarrow.Table
n_legs: int64
animal: string
----
n_legs: [[2,2,4,4,5,100]]
animal: [["Flamingo","Parrot","Dog","Horse","Brittle stars","Centipede"]]

Opening a dataset for a single directory:

>>> ds.write_dataset(table, "partitioned_dataset", format="parquet",
...                  partitioning=['year'])
>>> dataset = ds.dataset("partitioned_dataset", format="parquet")
>>> dataset.to_table()
pyarrow.Table
n_legs: int64
animal: string
----
n_legs: [[5],[2],[4,100],[2,4]]
animal: [["Brittle stars"],["Flamingo"],...["Parrot","Horse"]]

For a single directory from an S3 bucket:

>>> ds.dataset("s3://mybucket/nyc-taxi/",
...            format="parquet") 

Opening a dataset from a list of relative local paths:

>>> dataset = ds.dataset([
...     "partitioned_dataset/2019/part-0.parquet",
...     "partitioned_dataset/2020/part-0.parquet",
...     "partitioned_dataset/2021/part-0.parquet",
... ], format='parquet')
>>> dataset.to_table()
pyarrow.Table
n_legs: int64
animal: string
----
n_legs: [[5],[2],[4,100]]
animal: [["Brittle stars"],["Flamingo"],["Dog","Centipede"]]

With filesystem provided:

>>> paths = [
...     'part0/data.parquet',
...     'part1/data.parquet',
...     'part3/data.parquet',
... ]
>>> ds.dataset(paths, filesystem='file:///directory/prefix',
...            format='parquet') 

Which is equivalent to:

>>> from pyarrow.fs import LocalFileSystem, SubTreeFileSystem
>>> fs = SubTreeFileSystem("/directory/prefix",
...                        LocalFileSystem()) 
>>> ds.dataset(paths, filesystem=fs, format='parquet') 

With a remote filesystem URI:

>>> paths = [
...     'nested/directory/part0/data.parquet',
...     'nested/directory/part1/data.parquet',
...     'nested/directory/part3/data.parquet',
... ]
>>> ds.dataset(paths, filesystem='s3://bucket/',
...            format='parquet') 

Similarly to the local example, the directory prefix may be included in the filesystem URI:

>>> ds.dataset(paths, filesystem='s3://bucket/nested/directory',
...            format='parquet') 

Construction of a nested dataset:

>>> ds.dataset([
...     ds.dataset("s3://old-taxi-data", format="parquet"),
...     ds.dataset("local/path/to/data", format="ipc")
... ])