Read a CSV or other delimited file with Arrow
These functions use the Arrow C++ CSV reader to read into a
Arrow C++ options have been mapped to argument names that follow those of
col_select was inspired by
read_delim_arrow(
  file,
  delim = ",",
  quote = "\"",
  escape_double = TRUE,
  escape_backslash = FALSE,
  schema = NULL,
  col_names = TRUE,
  col_types = NULL,
  col_select = NULL,
  na = c("", "NA"),
  quoted_na = TRUE,
  skip_empty_rows = TRUE,
  skip = 0L,
  parse_options = NULL,
  convert_options = NULL,
  read_options = NULL,
  as_data_frame = TRUE,
  timestamp_parsers = NULL
)

read_csv_arrow(
  file,
  quote = "\"",
  escape_double = TRUE,
  escape_backslash = FALSE,
  schema = NULL,
  col_names = TRUE,
  col_types = NULL,
  col_select = NULL,
  na = c("", "NA"),
  quoted_na = TRUE,
  skip_empty_rows = TRUE,
  skip = 0L,
  parse_options = NULL,
  convert_options = NULL,
  read_options = NULL,
  as_data_frame = TRUE,
  timestamp_parsers = NULL
)

read_tsv_arrow(
  file,
  quote = "\"",
  escape_double = TRUE,
  escape_backslash = FALSE,
  schema = NULL,
  col_names = TRUE,
  col_types = NULL,
  col_select = NULL,
  na = c("", "NA"),
  quoted_na = TRUE,
  skip_empty_rows = TRUE,
  skip = 0L,
  parse_options = NULL,
  convert_options = NULL,
  read_options = NULL,
  as_data_frame = TRUE,
  timestamp_parsers = NULL
)
A character file name or URI, raw vector, an Arrow input stream, or a FileSystem with path (SubTreeFileSystem). If a file name, a memory-mapped Arrow InputStream will be opened and closed when finished; compression will be detected from the file extension and handled automatically. If an input stream is provided, it will be left open.
Single character used to separate fields within a record.
Single character used to quote strings.
Does the file escape quotes by doubling them? That is, if this option is TRUE, the value """" represents a single quote,
Does the file use backslashes to escape special characters? This is more general than escape_double, as backslashes can be used to escape the delimiter character, the quote character, or to add special characters like
Schema that describes the table. If provided, it will be used to satisfy both col_names and col_types.
TRUE, the first row of the input will be used as the column names and will not be included in the data frame. If
FALSE, column names will be generated by Arrow, starting with "f0", "f1", ..., "fN". Alternatively, you can specify a character vector of column names.
A compact string representation of the column types, an Arrow Schema, or NULL (the default) to infer types from the data.
A character vector of strings to interpret as missing values.
Should missing values inside quotes be treated as missing values (the default) or as strings? (Note that this is different from the Arrow C++ default for the corresponding convert option,
Should blank rows be ignored altogether? If
TRUE, blank rows will not be represented at all. If
FALSE, they will be filled with missings.
Number of lines to skip before reading data.
see file reader options. If given, this overrides any parsing options provided in other arguments (e.g.
Should the function return a data.frame (default) or an Arrow Table?
User-defined timestamp parsers. If more than one parser is specified, the CSV conversion logic will try parsing values starting from the beginning of this vector. Possible values are:
read_csv_arrow() and read_tsv_arrow() are wrappers around
read_delim_arrow() that specify a delimiter.
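The wrapper relationship can be checked with a short sketch. The file and its contents below are hypothetical, made up for illustration; the sketch assumes the arrow package is installed and only compares the tab-separated wrapper with an explicit `delim = "\t"` call.

```r
library(arrow)

# Hypothetical two-column data set written as a tab-separated file
tsv_file <- tempfile(fileext = ".tsv")
write.table(
  data.frame(id = 1:3, label = c("a", "b", "c")),
  file = tsv_file, sep = "\t", row.names = FALSE, quote = FALSE
)

# Read once through the wrapper and once with an explicit delimiter
via_wrapper <- read_tsv_arrow(tsv_file)
via_delim <- read_delim_arrow(tsv_file, delim = "\t")

# Both calls are expected to produce the same columns and values
all.equal(as.data.frame(via_wrapper), as.data.frame(via_delim))
```

For any other single-character separator, such as ";" or "|", call read_delim_arrow() directly with the matching delim value.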
Note that not all
readr options are currently implemented here. Please file
an issue if you encounter one that
arrow should support.
If you need to control Arrow-specific reader parameters that don't have an
equivalent in readr::read_csv(), you can either provide them in the
parse_options, convert_options, or read_options arguments, or you can
use CsvTableReader directly for lower-level access.
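As a rough sketch of the options route, the snippet below builds a parse options object by hand and passes it to the reader. It assumes that `CsvParseOptions$create()` accepts a `delimiter` argument, as it does in the arrow releases we are aware of; the pipe-delimited file is hypothetical.

```r
library(arrow)

# Hypothetical pipe-delimited file, used only for illustration
pipe_file <- tempfile(fileext = ".txt")
writeLines(c("id|label", "1|a", "2|b"), pipe_file)

# Assumption: CsvParseOptions$create() takes the separator as `delimiter`
opts <- CsvParseOptions$create(delimiter = "|")

# The explicit parse options take the place of the `delim` argument
piped <- read_delim_arrow(pipe_file, parse_options = opts)

names(piped)
```

For a case this simple, `delim = "|"` does the same job; the options objects matter mainly for settings that have no readr-style argument.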
By default, the CSV reader will infer the column names and data types from the file, but there are a few ways you can specify them directly.
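To see what the reader infers before overriding anything, one option is to return an Arrow Table and inspect its column types. This is a minimal sketch with a hypothetical three-column file; it assumes the arrow package is installed.

```r
library(arrow)

# Hypothetical file with an integer, a floating-point, and a text column
csv_file <- tempfile(fileext = ".csv")
write.csv(
  data.frame(n = c(1L, 2L), x = c(1.5, 2.5), s = c("a", "b")),
  file = csv_file, row.names = FALSE
)

# Keep the result as an Arrow Table so the inferred Arrow types stay visible
tab <- read_csv_arrow(csv_file, as_data_frame = FALSE)

# Column names come from the header row, types from the data
tab$schema
```

If an inferred type is not what you want, the approaches described next let you state it explicitly.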
One way is to provide an Arrow Schema in the schema argument,
which is an ordered map of column name to type.
When provided, it satisfies both the col_names and col_types arguments.
This is good if you know all of this information up front.
You can also pass a
Schema to the
col_types argument. If you do this,
column names will still be inferred from the file unless you also specify
col_names. In either case, the column names in the
Schema must match the
data's column names, whether they are explicitly provided or inferred. That
Schema does not have to reference all columns: those omitted
will have their types inferred.
Alternatively, you can declare column types by providing the compact string representation
that readr uses to the
col_types argument. This means you provide a
single string, one character per column, where the characters map to Arrow
types analogously to the
readr type mapping:
unit arg is set to the default value
"?": infer the type from the data
If you use the compact string representation for
col_types, you must also
Regardless of how types are specified, all columns with a
null() type will
Note that if you are specifying column names, whether by
col_names, and the CSV file has a header row that would otherwise be used
to identify column names, you'll need to add
skip = 1 to skip that row.
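The effect of the header row can be seen in a small sketch. The file and the replacement names "a" and "b" are hypothetical; the sketch assumes the arrow package is installed.

```r
library(arrow)

# Hypothetical file whose header row is x,y
csv_file <- tempfile(fileext = ".csv")
write.csv(data.frame(x = c(1, 3), y = c(2, 4)), file = csv_file, row.names = FALSE)

# Without skip = 1 the header line is read as the first row of data
with_header_as_data <- read_csv_arrow(csv_file, col_names = c("a", "b"))

# With skip = 1 the header line is dropped and only the data rows remain
header_skipped <- read_csv_arrow(csv_file, col_names = c("a", "b"), skip = 1)

nrow(with_header_as_data)
nrow(header_skipped)
```

In the first call the stray header values also push both columns to character, because "x" and "y" cannot be parsed as numbers.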
tf <- tempfile()
on.exit(unlink(tf))
write.csv(mtcars, file = tf)
df <- read_csv_arrow(tf)
dim(df)
#> [1] 32 12
# Can select columns
df <- read_csv_arrow(tf, col_select = starts_with("d"))

# Specifying column types and names
write.csv(data.frame(x = c(1, 3), y = c(2, 4)), file = tf, row.names = FALSE)
read_csv_arrow(tf, schema = schema(x = int32(), y = utf8()), skip = 1)
#> # A tibble: 2 x 2
#>       x y
#>   <int> <chr>
#> 1     1 2
#> 2     3 4
read_csv_arrow(tf, col_types = schema(y = utf8()))
#> # A tibble: 2 x 2
#>       x y
#>   <int> <chr>
#> 1     1 2
#> 2     3 4
read_csv_arrow(tf, col_types = "ic", col_names = c("x", "y"), skip = 1)
#> # A tibble: 2 x 2
#>       x y
#>   <int> <chr>
#> 1     1 2
#> 2     3 4

# Note that if a timestamp column contains time zones,
# the string "T" `col_types` specification won't work.
# To parse timestamps with time zones, provide a [Schema] to `col_types`
# and specify the time zone in the type object:
tf <- tempfile()
write.csv(data.frame(x = "1970-01-01T12:00:00+12:00"), file = tf, row.names = FALSE)
read_csv_arrow(
  tf,
  col_types = schema(x = timestamp(unit = "us", timezone = "UTC"))
)
#> # A tibble: 1 x 1
#>   x
#>   <dttm>
#> 1 1970-01-01 00:00:00