A FileFormat holds information about how to read and parse the files included in a Dataset. There are subclasses corresponding to the supported file formats (ParquetFileFormat and IpcFileFormat).
FileFormat$create() takes the following arguments:
format: A string identifier of the file format. Currently supported values:
"parquet"
"ipc"/"arrow"/"feather", all aliases for each other; for Feather, note that only version 2 files are supported
"csv"/"text", aliases for the same thing (because comma is the default delimiter for text files)
"tsv", equivalent to passing format = "text", delimiter = "\t"
...: Additional format-specific options
format = "parquet":
use_buffered_stream: Read files through buffered input streams rather than loading entire row groups at once. This may be enabled to reduce memory overhead. Disabled by default.
buffer_size: Size of buffered stream, if enabled. Default is 8KB.
dict_columns: Names of columns which should be read as dictionaries.
format = "text": see CsvReadOptions. Note that you can specify these options using either the Arrow C++ library naming ("delimiter", "quoting", etc.) or the readr-style naming used in read_csv_arrow() ("delim", "quote", etc.).
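The format-specific options above might be passed as in the following minimal sketch. It assumes the arrow package is installed with dataset support. The column name "cyl", the 64 KB buffer size, the pipe delimiter, and the temporary file "part-0.txt" are all hypothetical values chosen for illustration.

```r
library(arrow)

# Parquet: enable buffered reads with a 64 KB buffer and read the
# (hypothetical) column "cyl" as a dictionary.
pq_fmt <- FileFormat$create(
  "parquet",
  use_buffered_stream = TRUE,
  buffer_size = 65536,
  dict_columns = "cyl"
)

# Text: the same delimiter expressed with Arrow C++ naming ("delimiter")
# and with readr-style naming ("delim").
fmt_cpp <- FileFormat$create("text", delimiter = "|")
fmt_readr <- FileFormat$create("text", delim = "|")

# Illustrative use: write a small pipe-delimited file to a temporary
# directory and open it as a Dataset with the format created above.
tmp <- tempfile()
dir.create(tmp)
writeLines(c("x|y", "1|a", "2|b"), file.path(tmp, "part-0.txt"))

ds <- open_dataset(tmp, format = fmt_cpp)
col_names <- ds$schema$names
```

Passing a FileFormat object to open_dataset() keeps the parsing options in one place, which can help when the same options apply to several datasets.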
It returns the appropriate subclass of FileFormat (e.g. ParquetFileFormat).
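As a rough illustration of the return value, the sketch below creates a few formats by their string identifiers and checks the class of each result with inherits(). It assumes the arrow package is installed; the variable names are arbitrary.

```r
library(arrow)

# Create formats from string identifiers.
pq_fmt <- FileFormat$create("parquet")
ipc_fmt <- FileFormat$create("feather")  # alias for "ipc" and "arrow"
tsv_fmt <- FileFormat$create("tsv")      # text format with a tab delimiter

# Each result is a subclass of FileFormat.
is_parquet <- inherits(pq_fmt, "ParquetFileFormat")
is_ipc <- inherits(ipc_fmt, "IpcFileFormat")
is_format <- inherits(tsv_fmt, "FileFormat")
```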