pyarrow.dataset.dataset
- pyarrow.dataset.dataset(source, schema=None, format=None, filesystem=None, partitioning=None, partition_base_dir=None, exclude_invalid_files=None, ignore_prefixes=None)
- Open a dataset.
- Datasets provide functionality to efficiently work with tabular, potentially larger-than-memory, and multi-file datasets:
- A unified interface for different sources, like Parquet and Feather files.
- Discovery of sources (crawling directories, handling directory-based partitioned datasets, basic schema normalization).
- Optimized reading with predicate pushdown (filtering rows), projection (selecting columns), parallel reading, or fine-grained management of tasks.
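- A minimal sketch of the last point, assuming a hypothetical Parquet directory with columns named total_amount and passenger_count; the column selection and filter are pushed down to the scan:
>>> import pyarrow.dataset as ds
>>> d = ds.dataset("path/to/data/", format="parquet")  # hypothetical path
>>> # only the requested column is read, and rows are filtered during the scan
>>> d.to_table(columns=["total_amount"], filter=ds.field("passenger_count") > 2)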
 - Note that this is the high-level API; to have more control over the dataset construction, use the low-level API classes (FileSystemDataset, FileSystemDatasetFactory, etc.). - Parameters
- source : path, list of paths, dataset, list of datasets, (list of) RecordBatch or Table, iterable of RecordBatch, RecordBatchReader, or URI
- Path pointing to a single file:
- Open a FileSystemDataset from a single file. 
- Path pointing to a directory:
- The directory gets discovered recursively according to a partitioning scheme if given. 
- List of file paths:
- Create a FileSystemDataset from explicitly given files. The files must be located on the same filesystem given by the filesystem parameter. Note that, in contrast to construction from a single file, passing URIs as paths is not allowed. 
- List of datasets:
- A nested UnionDataset gets constructed; it allows arbitrary composition of other datasets. Note that additional keyword arguments are not allowed. 
- (List of) batches or tables, iterable of batches, or RecordBatchReader:
- Create an InMemoryDataset. If an iterable or empty list is given, a schema must also be given. If an iterable or RecordBatchReader is given, the resulting dataset can only be scanned once; further attempts will raise an error. 
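- An illustrative sketch of the in-memory case (all names here are hypothetical): a dataset built from a Table can be scanned repeatedly, while one built from a RecordBatchReader is consumed by its first scan.
>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> table = pa.table({"a": [1, 2, 3]})
>>> ds.dataset(table).to_table()  # an InMemoryDataset; can be rescanned
>>> reader = pa.RecordBatchReader.from_batches(table.schema, table.to_batches())
>>> once = ds.dataset(reader)
>>> once.to_table()  # first scan succeeds
>>> once.to_table()  # a second scan raises an error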
 
- schema : Schema, optional
- Optionally provide the Schema for the Dataset, in which case it will not be inferred from the source. 
- format : FileFormat or str
- Currently "parquet", "ipc"/"arrow"/"feather", "csv", and "orc" are supported. For Feather, only version 2 files are supported. 
- filesystem : FileSystem or URI str, default None
- If a single path is given as source and filesystem is None, then the filesystem will be inferred from the path. If a URI string is passed, then a filesystem object is constructed using the URI's optional path component as a directory prefix. See the examples below. Note that URIs on Windows must follow 'file:///C:…' or 'file:/C:…' patterns. 
- partitioning : Partitioning, PartitioningFactory, str, or list of str
- The partitioning scheme specified with the partitioning() function. A flavor string can be used as a shortcut, and with a list of field names a DirectoryPartitioning will be inferred (see the sketch after this parameter list).
- partition_base_dir : str, optional
- For the purposes of applying the partitioning, paths will be stripped of the partition_base_dir. Files not matching the partition_base_dir prefix will be skipped for partitioning discovery. The ignored files will still be part of the Dataset, but will not have partition information. 
- exclude_invalid_files : bool, optional (default True)
- If True, invalid files will be excluded (file format specific check). This will incur IO for each file in a serial and single-threaded fashion. Disabling this feature will skip the IO, but unsupported files may be present in the Dataset (resulting in an error at scan time). 
- ignore_prefixes : list, optional
- Files matching any of these prefixes will be ignored by the discovery process. This is matched to the basename of a path. By default this is ['.', '_']. Note that discovery happens only if a directory is passed as source. 
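- A brief sketch of the partitioning options, using hypothetical directory layouts:
>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> # list of field names: a DirectoryPartitioning is inferred, so a path like
>>> # base/2021/1/data.parquet yields year=2021, month=1
>>> ds.dataset("path/to/base", format="parquet", partitioning=["year", "month"])
>>> # flavor string shortcut for hive-style paths like base/year=2021/month=1/
>>> ds.dataset("path/to/base", format="parquet", partitioning="hive")
>>> # explicit scheme built with the partitioning() function
>>> part = ds.partitioning(pa.schema([("year", pa.int16()), ("month", pa.int8())]))
>>> ds.dataset("path/to/base", format="parquet", partitioning=part)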
 
- Returns
- dataset : Dataset
- Either a FileSystemDataset or a UnionDataset depending on the source parameter. 
 
 - Examples
- Opening a single file:
>>> dataset("path/to/file.parquet", format="parquet")
- Opening a single file with an explicit schema:
>>> dataset("path/to/file.parquet", schema=myschema, format="parquet")
- Opening a dataset for a single directory:
>>> dataset("path/to/nyc-taxi/", format="parquet")
>>> dataset("s3://mybucket/nyc-taxi/", format="parquet")
- Opening a dataset from a list of relative local paths:
>>> dataset([
...     "part0/data.parquet",
...     "part1/data.parquet",
...     "part3/data.parquet",
... ], format='parquet')
- With filesystem provided:
>>> paths = [
...     'part0/data.parquet',
...     'part1/data.parquet',
...     'part3/data.parquet',
... ]
>>> dataset(paths, filesystem='file:///directory/prefix', format='parquet')
- Which is equivalent to:
>>> fs = SubTreeFileSystem("/directory/prefix", LocalFileSystem())
>>> dataset(paths, filesystem=fs, format='parquet')
- With a remote filesystem URI:
>>> paths = [
...     'nested/directory/part0/data.parquet',
...     'nested/directory/part1/data.parquet',
...     'nested/directory/part3/data.parquet',
... ]
>>> dataset(paths, filesystem='s3://bucket/', format='parquet')
- Similarly to the local example, the directory prefix may be included in the filesystem URI:
>>> dataset(paths, filesystem='s3://bucket/nested/directory',
...         format='parquet')
- Construction of a nested dataset:
>>> dataset([
...     dataset("s3://old-taxi-data", format="parquet"),
...     dataset("local/path/to/data", format="ipc")
... ])
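- The format argument works the same way for the other supported file formats; a brief sketch with hypothetical paths:
>>> from pyarrow.dataset import dataset
>>> dataset("path/to/table.csv", format="csv")
>>> dataset("path/to/table.orc", format="orc")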
