pyarrow.dataset.dataset
pyarrow.dataset.dataset(source, schema=None, format=None, filesystem=None, partitioning=None, partition_base_dir=None, exclude_invalid_files=None, ignore_prefixes=None)
Open a dataset.
Datasets provide functionality to work efficiently with tabular, potentially larger-than-memory, multi-file data:
- A unified interface for different sources, such as Parquet and Feather.
- Discovery of sources (crawling directories, handling directory-based partitioned datasets, basic schema normalization).
- Optimized reading with predicate pushdown (filtering rows), projection (selecting columns), parallel reading, or fine-grained managing of tasks.
Note that this is the high-level API. For more control over dataset construction, use the low-level API classes (FileSystemDataset, FileSystemDatasetFactory, etc.).
- Parameters
source (path, list of paths, dataset, list of datasets, (list of) batches or tables, iterable of batches, RecordBatchReader, or URI) –
- Path pointing to a single file:
Open a FileSystemDataset from a single file.
- Path pointing to a directory:
The directory gets discovered recursively according to a partitioning scheme if given.
- List of file paths:
Create a FileSystemDataset from explicitly given files. The files must be located on the same filesystem, given by the filesystem parameter. Note that, in contrast to construction from a single file, passing URIs as paths is not allowed.
- List of datasets:
A nested UnionDataset gets constructed, which allows arbitrary composition of other datasets. Note that additional keyword arguments are not allowed.
- (List of) batches or tables, iterable of batches, or RecordBatchReader:
Create an InMemoryDataset. If an iterable or empty list is given, a schema must also be given. If an iterable or RecordBatchReader is given, the resulting dataset can only be scanned once; further attempts will raise an error.
schema (Schema, optional) – Optionally provide the Schema for the Dataset, in which case it will not be inferred from the source.
format (FileFormat or str) – Currently “parquet” and “ipc”/“arrow”/“feather” are supported. For Feather, only version 2 files are supported.
filesystem (FileSystem or URI string, default None) – If a single path is given as source and filesystem is None, then the filesystem will be inferred from the path. If a URI string is passed, then a filesystem object is constructed using the URI’s optional path component as a directory prefix. See the examples below. Note that URIs on Windows must follow the ‘file:///C:…’ or ‘file:/C:…’ patterns.
partitioning (Partitioning, PartitioningFactory, str, list of str) – The partitioning scheme specified with the partitioning() function. A flavor string can be used as a shortcut, and with a list of field names a DirectoryPartitioning will be inferred.
partition_base_dir (str, optional) – For the purposes of applying the partitioning, paths will be stripped of the partition_base_dir. Files not matching the partition_base_dir prefix will be skipped for partitioning discovery. The ignored files will still be part of the Dataset, but will not have partition information.
exclude_invalid_files (bool, optional (default True)) – If True, invalid files will be excluded (file format specific check). This will incur IO for each file in a serial and single-threaded fashion. Disabling this feature will skip the IO, but unsupported files may be present in the Dataset (resulting in an error at scan time).
ignore_prefixes (list, optional) – Files matching any of these prefixes will be ignored by the discovery process. This is matched to the basename of a path. By default this is [‘.’, ‘_’]. Note that discovery happens only if a directory is passed as source.
- Returns
dataset (Dataset) – A FileSystemDataset, a UnionDataset, or an InMemoryDataset, depending on the source parameter.
Examples
Opening a single file:
>>> from pyarrow.dataset import dataset
>>> dataset("path/to/file.parquet", format="parquet")
Opening a single file with an explicit schema:
>>> dataset("path/to/file.parquet", schema=myschema, format="parquet")
Opening a dataset for a single directory:
>>> dataset("path/to/nyc-taxi/", format="parquet")
>>> dataset("s3://mybucket/nyc-taxi/", format="parquet")
Opening a dataset from a list of relative local paths:
>>> dataset([
...     "part0/data.parquet",
...     "part1/data.parquet",
...     "part3/data.parquet",
... ], format='parquet')
With filesystem provided:
>>> paths = [
...     'part0/data.parquet',
...     'part1/data.parquet',
...     'part3/data.parquet',
... ]
>>> dataset(paths, filesystem='file:///directory/prefix', format='parquet')
This is equivalent to:
>>> from pyarrow.fs import LocalFileSystem, SubTreeFileSystem
>>> fs = SubTreeFileSystem("/directory/prefix", LocalFileSystem())
>>> dataset(paths, filesystem=fs, format='parquet')
With a remote filesystem URI:
>>> paths = [
...     'nested/directory/part0/data.parquet',
...     'nested/directory/part1/data.parquet',
...     'nested/directory/part3/data.parquet',
... ]
>>> dataset(paths, filesystem='s3://bucket/', format='parquet')
As in the local example, the directory prefix may be included in the filesystem URI:
>>> dataset(paths, filesystem='s3://bucket/nested/directory',
...         format='parquet')
Construction of a nested dataset:
>>> dataset([
...     dataset("s3://old-taxi-data", format="parquet"),
...     dataset("local/path/to/data", format="ipc")
... ])