pyarrow.dataset.dataset

pyarrow.dataset.dataset(source, schema=None, format=None, filesystem=None, partitioning=None, partition_base_dir=None, exclude_invalid_files=None, ignore_prefixes=None)

Open a dataset.

Datasets provide functionality to work efficiently with tabular, potentially larger-than-memory, multi-file data:

  • A unified interface for different sources, like Parquet and Feather

  • Discovery of sources (crawling directories, handling directory-based partitioned datasets, basic schema normalization)

  • Optimized reading with predicate pushdown (filtering rows), projection (selecting columns), parallel reading, and fine-grained task management.

Note that this is the high-level API. For more control over dataset construction, use the low-level API classes (FileSystemDataset, FileSystemDatasetFactory, etc.).

Parameters
  • source (path, list of paths, dataset, list of datasets, (list of) batches or tables, iterable of batches, RecordBatchReader, or URI) –

    Path pointing to a single file:

    Open a FileSystemDataset from a single file.

    Path pointing to a directory:

    The directory gets discovered recursively according to a partitioning scheme if given.

    List of file paths:

    Create a FileSystemDataset from explicitly given files. The files must be located on the same filesystem given by the filesystem parameter. Note that, in contrast to construction from a single file, passing URIs as paths is not allowed.

    List of datasets:

    A nested UnionDataset is constructed, which allows arbitrary composition of other datasets. Note that additional keyword arguments are not allowed.

    (List of) batches or tables, iterable of batches, or RecordBatchReader:

    Create an InMemoryDataset. If an iterable or empty list is given, a schema must also be given. If an iterable or RecordBatchReader is given, the resulting dataset can only be scanned once; further attempts will raise an error.

  • schema (Schema, optional) – Optionally provide the Schema for the Dataset, in which case it will not be inferred from the source.

  • format (FileFormat or str) – Currently “parquet” and “ipc”/“arrow”/“feather” are supported. For Feather, only version 2 files are supported.

  • filesystem (FileSystem or URI string, default None) – If a single path is given as source and filesystem is None, then the filesystem will be inferred from the path. If a URI string is passed, then a filesystem object is constructed using the URI’s optional path component as a directory prefix. See the examples below. Note that URIs on Windows must follow the ‘file:///C:…’ or ‘file:/C:…’ patterns.

  • partitioning (Partitioning, PartitioningFactory, str, list of str) – The partitioning scheme specified with the partitioning() function. A flavor string can be used as a shortcut, and with a list of field names a DirectoryPartitioning will be inferred.

  • partition_base_dir (str, optional) – For the purposes of applying the partitioning, paths will be stripped of the partition_base_dir. Files not matching the partition_base_dir prefix will be skipped for partitioning discovery. The ignored files will still be part of the Dataset, but will not have partition information.

  • exclude_invalid_files (bool, optional (default True)) – If True, invalid files will be excluded (file format specific check). This incurs IO for each file, performed serially in a single thread. Disabling this feature skips the IO, but unsupported files may then be present in the Dataset (resulting in an error at scan time).

  • ignore_prefixes (list, optional) – Files matching any of these prefixes will be ignored by the discovery process. This is matched to the basename of a path. By default this is [‘.’, ‘_’]. Note that discovery happens only if a directory is passed as source.

Returns

dataset (Dataset) – A FileSystemDataset, a UnionDataset, or an InMemoryDataset, depending on the source parameter.

Examples

Opening a single file:

>>> from pyarrow.dataset import dataset
>>> dataset("path/to/file.parquet", format="parquet")

Opening a single file with an explicit schema:

>>> dataset("path/to/file.parquet", schema=myschema, format="parquet")

Opening a dataset for a single directory:

>>> dataset("path/to/nyc-taxi/", format="parquet")
>>> dataset("s3://mybucket/nyc-taxi/", format="parquet")

Opening a dataset from a list of relative local paths:

>>> dataset([
...     "part0/data.parquet",
...     "part1/data.parquet",
...     "part3/data.parquet",
... ], format='parquet')

With filesystem provided:

>>> paths = [
...     'part0/data.parquet',
...     'part1/data.parquet',
...     'part3/data.parquet',
... ]
>>> dataset(paths, filesystem='file:///directory/prefix', format='parquet')

This is equivalent to:

>>> from pyarrow.fs import LocalFileSystem, SubTreeFileSystem
>>> fs = SubTreeFileSystem("/directory/prefix", LocalFileSystem())
>>> dataset(paths, filesystem=fs, format='parquet')

With a remote filesystem URI:

>>> paths = [
...     'nested/directory/part0/data.parquet',
...     'nested/directory/part1/data.parquet',
...     'nested/directory/part3/data.parquet',
... ]
>>> dataset(paths, filesystem='s3://bucket/', format='parquet')

Similarly to the local example, the directory prefix may be included in the filesystem URI:

>>> dataset(paths, filesystem='s3://bucket/nested/directory',
...         format='parquet')

Construction of a nested dataset:

>>> dataset([
...     dataset("s3://old-taxi-data", format="parquet"),
...     dataset("local/path/to/data", format="ipc")
... ])