Reading and Writing CSV files#
Arrow provides a fast CSV reader allowing ingestion of external data as Arrow tables.
Basic usage#
A CSV file is read from an InputStream.
```cpp
#include "arrow/csv/api.h"

{
  // ...
  arrow::io::IOContext io_context = arrow::io::default_io_context();
  std::shared_ptr<arrow::io::InputStream> input = ...;

  auto read_options = arrow::csv::ReadOptions::Defaults();
  auto parse_options = arrow::csv::ParseOptions::Defaults();
  auto convert_options = arrow::csv::ConvertOptions::Defaults();

  // Instantiate TableReader from input stream and options
  auto maybe_reader =
      arrow::csv::TableReader::Make(io_context,
                                    input,
                                    read_options,
                                    parse_options,
                                    convert_options);
  if (!maybe_reader.ok()) {
    // Handle TableReader instantiation error...
  }
  std::shared_ptr<arrow::csv::TableReader> reader = *maybe_reader;

  // Read table from CSV file
  auto maybe_table = reader->Read();
  if (!maybe_table.ok()) {
    // Handle CSV read error
    // (for example a CSV syntax error or failed type conversion)
  }
  std::shared_ptr<arrow::Table> table = *maybe_table;
}
```
A CSV file is written to an OutputStream.
```cpp
#include <arrow/csv/api.h>

{
  // Oneshot write
  // ...
  std::shared_ptr<arrow::io::OutputStream> output = ...;
  auto write_options = arrow::csv::WriteOptions::Defaults();
  if (!WriteCSV(table, write_options, output.get()).ok()) {
    // Handle writer error...
  }
}
{
  // Write incrementally
  // ...
  std::shared_ptr<arrow::io::OutputStream> output = ...;
  auto write_options = arrow::csv::WriteOptions::Defaults();
  auto maybe_writer = arrow::csv::MakeCSVWriter(output, schema, write_options);
  if (!maybe_writer.ok()) {
    // Handle writer instantiation error...
  }
  std::shared_ptr<arrow::ipc::RecordBatchWriter> writer = *maybe_writer;

  // Write batches...
  if (!writer->WriteRecordBatch(*batch).ok()) {
    // Handle write error...
  }

  if (!writer->Close().ok()) {
    // Handle close error...
  }
  if (!output->Close().ok()) {
    // Handle file close error...
  }
}
```
Note
The writer does not yet support all Arrow types.
Column names#
There are three possible ways to infer column names from the CSV file:

- By default, the column names are read from the first row in the CSV file.
- If ReadOptions::column_names is set, it forces the column names in the table to these values (the first row in the CSV file is read as data).
- If ReadOptions::autogenerate_column_names is true, column names will be autogenerated with the pattern "f0", "f1"... (the first row in the CSV file is read as data).
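For instance, explicit column names might be supplied like this (a sketch; the column names here are hypothetical, and the reader is assumed to be set up as in the basic usage example above):

```cpp
// Sketch: force explicit column names (the first CSV row is then read as data).
auto read_options = arrow::csv::ReadOptions::Defaults();
read_options.column_names = {"date", "quantity", "price"};  // hypothetical names

// Or autogenerate "f0", "f1", ... instead:
// read_options.autogenerate_column_names = true;
```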
Column selection#
By default, Arrow reads all columns in the CSV file. You can narrow the selection of columns with the ConvertOptions::include_columns option. If some columns in ConvertOptions::include_columns are missing from the CSV file, an error will be emitted unless ConvertOptions::include_missing_columns is true, in which case the missing columns are assumed to contain all-null values.
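As a sketch (the column names are hypothetical), a narrowed selection that tolerates a missing column might look like:

```cpp
// Sketch: read only two columns from the file.
auto convert_options = arrow::csv::ConvertOptions::Defaults();
convert_options.include_columns = {"price", "discount"};  // hypothetical names
// If "discount" is absent from the CSV file, produce an all-null column
// instead of erroring out.
convert_options.include_missing_columns = true;
```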
Interaction with column names#
If both ReadOptions::column_names and ConvertOptions::include_columns are specified, the ReadOptions::column_names are assumed to map to CSV columns, and ConvertOptions::include_columns is a subset of those column names that will be part of the Arrow Table.
Data types#
By default, the CSV reader infers the most appropriate data type for each column. Type inference considers the following data types, in order:
- Null
- Int64
- Boolean
- Date32
- Time32 (with seconds unit)
- Timestamp (with seconds unit)
- Timestamp (with nanoseconds unit)
- Float64
- Dictionary<String> (if ConvertOptions::auto_dict_encode is true)
- Dictionary<Binary> (if ConvertOptions::auto_dict_encode is true)
- String
- Binary
It is possible to override type inference for select columns by setting the ConvertOptions::column_types option. Explicit data types can be chosen from the following list:

- Null
- All Integer types
- Float32 and Float64
- Decimal128
- Boolean
- Date32 and Date64
- Time32 and Time64
- Timestamp
- Binary and Large Binary
- String and Large String (with optional UTF8 input validation)
- Fixed-Size Binary
- Dictionary with index type Int32 and value type one of the following: Binary, String, LargeBinary, LargeString, Int32, UInt32, Int64, UInt64, Float32, Float64, Decimal128
Other data types do not support conversion from CSV values and will error out.
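For example, explicit types for a few columns might be set as follows (a sketch; the column names are hypothetical):

```cpp
// Sketch: pin down the types of selected columns; the rest are still inferred.
auto convert_options = arrow::csv::ConvertOptions::Defaults();
convert_options.column_types["id"] = arrow::int32();
convert_options.column_types["price"] = arrow::decimal128(10, 2);
convert_options.column_types["when"] = arrow::timestamp(arrow::TimeUnit::SECOND);
```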
Dictionary inference#
If type inference is enabled and ConvertOptions::auto_dict_encode is true, the CSV reader first tries to convert string-like columns to a dictionary-encoded string-like array. It switches to a plain string-like array when the threshold in ConvertOptions::auto_dict_max_cardinality is reached.
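A sketch of enabling dictionary inference with a custom cardinality threshold:

```cpp
auto convert_options = arrow::csv::ConvertOptions::Defaults();
convert_options.auto_dict_encode = true;
// Columns whose number of distinct values exceeds this threshold
// fall back to a plain string-like array.
convert_options.auto_dict_max_cardinality = 1000;
```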
Timestamp inference/parsing#
If type inference is enabled, the CSV reader first tries to interpret string-like columns as timestamps. If all rows have some zone offset (e.g. Z or +0100), even if the offsets are inconsistent, then the inferred type will be UTC timestamp. If no rows have a zone offset, then the inferred type will be timestamp without timezone. A mix of rows with/without offsets will result in a string column.

If the type is explicitly specified as a timestamp with/without timezone, then the reader will error on values without/with zone offsets in that column. Note that this means it isn't currently possible to have the reader parse a column of timestamps without zone offsets as local times in a particular timezone; instead, parse the column as timestamp without timezone, then convert the values afterwards using the assume_timezone compute function.
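The workaround above might be sketched as follows, assuming `column` holds values already parsed as timestamp without timezone:

```cpp
#include "arrow/compute/api.h"

// Sketch: reinterpret zone-less timestamps as local times in a given timezone.
arrow::compute::AssumeTimezoneOptions options("America/New_York");
auto maybe_result =
    arrow::compute::CallFunction("assume_timezone", {column}, &options);
if (!maybe_result.ok()) {
  // Handle error (e.g. ambiguous or nonexistent local times)...
}
```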
| Specified Type | Input CSV | Result Type |
|---|---|---|
| (inferred) | 2021-01-01T00:00:00 | timestamp[s] |
| (inferred) | 2021-01-01T00:00:00Z | timestamp[s, UTC] |
| (inferred) | 2021-01-01T00:00:00, 2021-01-01T00:00:00Z | string |
| timestamp[s] | 2021-01-01T00:00:00 | timestamp[s] |
| timestamp[s] | 2021-01-01T00:00:00Z | (error) |
| timestamp[s] | 2021-01-01T00:00:00, 2021-01-01T00:00:00Z | (error) |
| timestamp[s, UTC] | 2021-01-01T00:00:00 | (error) |
| timestamp[s, UTC] | 2021-01-01T00:00:00Z | timestamp[s, UTC] |
| timestamp[s, UTC] | 2021-01-01T00:00:00, 2021-01-01T00:00:00Z | (error) |
| timestamp[s, America/New_York] | 2021-01-01T00:00:00 | (error) |
| timestamp[s, America/New_York] | 2021-01-01T00:00:00Z | timestamp[s, America/New_York] |
| timestamp[s, America/New_York] | 2021-01-01T00:00:00, 2021-01-01T00:00:00Z | (error) |
Nulls#
Null values are recognized from the spellings stored in ConvertOptions::null_values. The ConvertOptions::Defaults() factory method will initialize a number of conventional null spellings such as N/A.
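Custom null spellings can be added on top of the defaults, for example (the marker here is hypothetical):

```cpp
// Sketch: also treat the marker "MISSING" as null.
auto convert_options = arrow::csv::ConvertOptions::Defaults();
convert_options.null_values.push_back("MISSING");
```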
Character encoding#
CSV files are expected to be encoded in UTF8. However, non-UTF8 data is accepted for Binary columns.
Write Options#
The format of written CSV files can be customized via WriteOptions. Currently few options are available; more will be added in future releases.
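As a sketch of the kind of customization available (exact field availability may vary across Arrow releases):

```cpp
auto write_options = arrow::csv::WriteOptions::Defaults();
write_options.include_header = false;  // omit the header row
write_options.batch_size = 2048;       // number of rows processed at a time
```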
Performance#
By default, the CSV reader will parallelize reads in order to exploit all CPU cores on your machine. You can change this setting in ReadOptions::use_threads. A reasonable expectation is at least 100 MB/s per core on a performant desktop or laptop computer (measured in source CSV bytes, not target Arrow data bytes).
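For example, threading can be disabled when the application manages its own parallelism:

```cpp
auto read_options = arrow::csv::ReadOptions::Defaults();
read_options.use_threads = false;  // read serially on the calling thread
```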