pyarrow.dataset.dataset
- pyarrow.dataset.dataset(source, schema=None, format=None, filesystem=None, partitioning=None, partition_base_dir=None, exclude_invalid_files=None, ignore_prefixes=None)
Open a dataset.
Datasets provide functionality to work efficiently with tabular, potentially larger-than-memory, multi-file data:
- A unified interface for different sources, such as Parquet and Feather
- Discovery of sources (crawling directories, handling directory-based partitioned datasets, basic schema normalization)
- Optimized reading with predicate pushdown (filtering rows), projection (selecting columns), parallel reading, and fine-grained management of tasks
Note that this is the high-level API. For more control over dataset construction, use the low-level API classes (FileSystemDataset, FileSystemDatasetFactory, etc.).
- Parameters
- source : path, list of paths, dataset, list of datasets, (list of) RecordBatch or Table, iterable of RecordBatch, RecordBatchReader, or URI
- Path pointing to a single file:
Open a FileSystemDataset from a single file.
- Path pointing to a directory:
The directory gets discovered recursively according to a partitioning scheme if given.
- List of file paths:
Create a FileSystemDataset from explicitly given files. The files must be located on the same filesystem given by the filesystem parameter. Note that, in contrast to construction from a single file, passing URIs as paths is not allowed.
- List of datasets:
A nested UnionDataset is constructed, which allows arbitrary composition of other datasets. Note that additional keyword arguments are not allowed.
- (List of) batches or tables, iterable of batches, or RecordBatchReader:
Create an InMemoryDataset. If an iterable or empty list is given, a schema must also be given. If an iterable or RecordBatchReader is given, the resulting dataset can only be scanned once; further attempts will raise an error.
- schema : Schema, optional
Optionally provide the Schema for the Dataset, in which case it will not be inferred from the source.
- format : FileFormat or str
Currently "parquet", "ipc"/"arrow"/"feather", "csv", and "orc" are supported. For Feather, only version 2 files are supported.
- filesystem : FileSystem or URI str, default None
If a single path is given as source and filesystem is None, then the filesystem will be inferred from the path. If a URI string is passed, then a filesystem object is constructed using the URI's optional path component as a directory prefix. See the examples below. Note that URIs on Windows must follow the 'file:///C:…' or 'file:/C:…' patterns.
- partitioning : Partitioning, PartitioningFactory, str, list of str
The partitioning scheme specified with the partitioning() function. A flavor string can be used as a shortcut, and with a list of field names a DirectoryPartitioning will be inferred.
- partition_base_dir : str, optional
For the purposes of applying the partitioning, paths will be stripped of the partition_base_dir. Files not matching the partition_base_dir prefix will be skipped for partitioning discovery. The ignored files will still be part of the Dataset, but will not have partition information.
- exclude_invalid_files : bool, optional (default True)
If True, invalid files will be excluded (file format specific check). This will incur IO for each file in a serial and single-threaded fashion. Disabling this feature will skip the IO, but unsupported files may be present in the Dataset (resulting in an error at scan time).
- ignore_prefixes : list, optional
Files matching any of these prefixes will be ignored by the discovery process. This is matched to the basename of a path. By default this is ['.', '_']. Note that discovery happens only if a directory is passed as source.
- Returns
- dataset : Dataset
A FileSystemDataset, a UnionDataset, or an InMemoryDataset, depending on the source parameter.
Examples
Opening a single file:
>>> dataset("path/to/file.parquet", format="parquet")
Opening a single file with an explicit schema:
>>> dataset("path/to/file.parquet", schema=myschema, format="parquet")
Opening a dataset for a single directory:
>>> dataset("path/to/nyc-taxi/", format="parquet")
>>> dataset("s3://mybucket/nyc-taxi/", format="parquet")
Opening a dataset from a list of relative local paths:
>>> dataset([
...     "part0/data.parquet",
...     "part1/data.parquet",
...     "part3/data.parquet",
... ], format='parquet')
With filesystem provided:
>>> paths = [
...     'part0/data.parquet',
...     'part1/data.parquet',
...     'part3/data.parquet',
... ]
>>> dataset(paths, filesystem='file:///directory/prefix', format='parquet')
This is equivalent to:
>>> fs = SubTreeFileSystem("/directory/prefix", LocalFileSystem())
>>> dataset(paths, filesystem=fs, format='parquet')
With a remote filesystem URI:
>>> paths = [
...     'nested/directory/part0/data.parquet',
...     'nested/directory/part1/data.parquet',
...     'nested/directory/part3/data.parquet',
... ]
>>> dataset(paths, filesystem='s3://bucket/', format='parquet')
Similarly to the local example, the directory prefix may be included in the filesystem URI:
>>> dataset(paths, filesystem='s3://bucket/nested/directory',
...         format='parquet')
Construction of a nested dataset:
>>> dataset([
...     dataset("s3://old-taxi-data", format="parquet"),
...     dataset("local/path/to/data", format="ipc")
... ])