Dataset file formats — FileFormat • Arrow R Package

A FileFormat holds information about how to read and parse the files included in a Dataset. There are subclasses corresponding to the supported file formats (for example, ParquetFileFormat, IpcFileFormat, and CsvFileFormat).

Factory

FileFormat$create() takes the following arguments:

format: A string identifier of the file format. Currently supported values:
- "parquet"
- "ipc"/"arrow"/"feather", all aliases for each other; for Feather, note that only version 2 files are supported
- "csv"/"text", aliases for the same thing (because comma is the default delimiter for text files)
- "tsv", equivalent to passing format = "text", delimiter = "\t"
...: Additional format-specific options

format = "parquet":
- use_buffered_stream: Read files through buffered input streams rather than loading entire row groups at once. This may be enabled to reduce memory overhead. Disabled by default.
- buffer_size: Size of buffered stream, if enabled. Default is 8KB.
- dict_columns: Names of columns which should be read as dictionaries.
format = "text": see CsvReadOptions. Note that you can specify these options either with the Arrow C++ library naming ("delimiter", "quoting", etc.) or with the readr-style naming used in read_csv_arrow() ("delim", "quote", etc.).
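
The format-specific options above can be sketched as follows. This is a minimal illustration, assuming an arrow installation built with dataset and Parquet support; the column name "cyl" and the 64 KiB buffer size are hypothetical values chosen only for the example.

```r
library(arrow)

# Parquet: read through a buffered input stream with a 64 KiB buffer,
# and read one column as a dictionary.
# "cyl" is a hypothetical column name used only for illustration.
parquet_format <- FileFormat$create(
  "parquet",
  use_buffered_stream = TRUE,
  buffer_size = 65536,
  dict_columns = "cyl"
)

# Text: the same pipe delimiter, spelled once with the Arrow C++ name
# and once with the readr-style name.
pipe_cpp_style <- FileFormat$create("text", delimiter = "|")
pipe_readr_style <- FileFormat$create("text", delim = "|")
```

Both text calls should describe the same delimiter; it is probably safest to stick to one naming style within a single call rather than mixing the two.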

FileFormat$create() returns the appropriate subclass of FileFormat (e.g. ParquetFileFormat).
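
As a rough check of how the string identifiers map to subclasses, the sketch below creates a format for each identifier listed above and records the first element of its class vector. It assumes an arrow installation built with dataset and Parquet support; the variable names are illustrative only.

```r
library(arrow)

# Every identifier documented for the `format` argument.
ids <- c("parquet", "ipc", "arrow", "feather", "csv", "text", "tsv")

# The first element of class() on the returned object is the concrete subclass.
classes <- vapply(
  ids,
  function(id) class(FileFormat$create(id))[1],
  character(1)
)

print(classes)
```

Aliases should resolve to the same subclass, so "ipc", "arrow", and "feather" are expected to share one class, as are "csv", "text", and "tsv".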