Reading and Writing CSV files

Arrow provides a fast CSV reader allowing ingestion of external data as Arrow tables.

Basic usage

A CSV file is read from an InputStream.

#include "arrow/csv/api.h"

{
   // ...
   arrow::io::IOContext io_context = arrow::io::default_io_context();
   std::shared_ptr<arrow::io::InputStream> input = ...;

   auto read_options = arrow::csv::ReadOptions::Defaults();
   auto parse_options = arrow::csv::ParseOptions::Defaults();
   auto convert_options = arrow::csv::ConvertOptions::Defaults();

   // Instantiate TableReader from input stream and options
   auto maybe_reader =
     arrow::csv::TableReader::Make(io_context,
                                   input,
                                   read_options,
                                   parse_options,
                                   convert_options);
   if (!maybe_reader.ok()) {
      // Handle TableReader instantiation error...
   }
   std::shared_ptr<arrow::csv::TableReader> reader = *maybe_reader;

   // Read table from CSV file
   auto maybe_table = reader->Read();
   if (!maybe_table.ok()) {
      // Handle CSV read error
      // (for example a CSV syntax error or failed type conversion)
   }
   std::shared_ptr<arrow::Table> table = *maybe_table;
}

A CSV file is written to an OutputStream.

#include <arrow/csv/api.h>
{
    // Oneshot write
    // ...
    std::shared_ptr<arrow::io::OutputStream> output = ...;
    auto write_options = arrow::csv::WriteOptions::Defaults();
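    // Assumes `table` is a std::shared_ptr<arrow::Table>, e.g. the result of TableReader::Read()
    if (!arrow::csv::WriteCSV(*table, write_options, output.get()).ok()) {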
        // Handle writer error...
    }
}
{
    // Write incrementally
    // ...
    std::shared_ptr<arrow::io::OutputStream> output = ...;
    auto write_options = arrow::csv::WriteOptions::Defaults();
    auto maybe_writer = arrow::csv::MakeCSVWriter(output, schema, write_options);
    if (!maybe_writer.ok()) {
        // Handle writer instantiation error...
    }
    std::shared_ptr<arrow::ipc::RecordBatchWriter> writer = *maybe_writer;

    // Write batches...
    if (!writer->WriteRecordBatch(*batch).ok()) {
        // Handle write error...
    }

    if (!writer->Close().ok()) {
        // Handle close error...
    }
    if (!output->Close().ok()) {
        // Handle file close error...
    }
}

Note

The writer does not yet support all Arrow types.

Column names

There are three ways to determine the column names of the resulting table:

  • By default, the column names are read from the first row in the CSV file

  • If ReadOptions::column_names is set, it forces the column names in the table to these values (the first row in the CSV file is read as data)

  • If ReadOptions::autogenerate_column_names is true, column names will be autogenerated with the pattern “f0”, “f1”… (the first row in the CSV file is read as data)
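The three rules can be modeled with a small standalone function. This is only a sketch of the behaviour described above, not Arrow's implementation: ResolvedHeader and ResolveColumnNames are hypothetical names, explicit_names stands in for ReadOptions::column_names, and autogenerate stands in for ReadOptions::autogenerate_column_names. It assumes that explicit names win when both options are set.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical result type for this sketch (not an Arrow type).
struct ResolvedHeader {
  std::vector<std::string> names;  // column names of the resulting table
  bool first_row_is_data;          // true if the first CSV row is kept as data
};

// Models the three column-naming rules.
// Assumption: explicit names take precedence over autogeneration.
ResolvedHeader ResolveColumnNames(const std::vector<std::string>& first_row,
                                  const std::vector<std::string>& explicit_names,
                                  bool autogenerate) {
  if (!explicit_names.empty()) {
    // Rule 2: forced names, the first row is ordinary data.
    return {explicit_names, true};
  }
  if (autogenerate) {
    // Rule 3: "f0", "f1", ... with one name per column in the first row.
    std::vector<std::string> names;
    for (std::size_t i = 0; i < first_row.size(); ++i) {
      names.push_back("f" + std::to_string(i));
    }
    return {names, true};
  }
  // Rule 1 (default): the first row supplies the names and is not data.
  return {first_row, false};
}
```

For example, a first row of 1,x with autogeneration enabled yields the names f0 and f1, and the row 1,x stays in the data.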

Column selection

By default, Arrow reads all columns in the CSV file. You can narrow the selection of columns with the ConvertOptions::include_columns option. If some columns in ConvertOptions::include_columns are missing from the CSV file, an error will be emitted unless ConvertOptions::include_missing_columns is true, in which case the missing columns are assumed to contain all-null values.

Interaction with column names

If both ReadOptions::column_names and ConvertOptions::include_columns are specified, the ReadOptions::column_names are assumed to map to CSV columns, and ConvertOptions::include_columns is a subset of those column names that will be part of the Arrow Table.

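Data types

By default, the CSV reader infers the most appropriate data type for each column. Type inference considers the following data types, in order:

  • Null

  • Int64

  • Boolean

  • Date32

  • Time32 (with seconds unit)

  • Timestamp (with seconds unit)

  • Timestamp (with nanoseconds unit)

  • Float64

  • Dictionary<String> (if ConvertOptions::auto_dict_encode is true)

  • Dictionary<Binary> (if ConvertOptions::auto_dict_encode is true)

  • String

  • Binary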

It is possible to override type inference for select columns by setting the ConvertOptions::column_types option. Explicit data types can be chosen from the following list:

  • Null

  • All Integer types

  • Float32 and Float64

  • Decimal128

  • Boolean

  • Date32 and Date64

  • Time32 and Time64

  • Timestamp

  • Binary and Large Binary

  • String and Large String (with optional UTF8 input validation)

  • Fixed-Size Binary

  • Dictionary with index type Int32 and value type one of the following: Binary, String, LargeBinary, LargeString, Int32, UInt32, Int64, UInt64, Float32, Float64, Decimal128

Other data types do not support conversion from CSV values and will error out.

Dictionary inference

If type inference is enabled and ConvertOptions::auto_dict_encode is true, the CSV reader first tries to convert string-like columns to a dictionary-encoded string-like array. It switches to a plain string-like array when the threshold in ConvertOptions::auto_dict_max_cardinality is reached.

Timestamp inference/parsing

If type inference is enabled, the CSV reader first tries to interpret string-like columns as timestamps. If all rows have some zone offset (e.g. Z or +0100), even if the offsets are inconsistent, then the inferred type will be UTC timestamp. If no rows have a zone offset, then the inferred type will be timestamp without timezone. A mix of rows with/without offsets will result in a string column.

If the type is explicitly specified as a timestamp with/without timezone, then the reader will error on values without/with zone offsets in that column. Note that this means it isn’t currently possible to have the reader parse a column of timestamps without zone offsets as local times in a particular timezone; instead, parse the column as timestamp without timezone, then convert the values afterwards using the assume_timezone compute function.

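In the table below, a row listing two input values describes a column that contains both.

Specified Type                  Input CSV                 Result Type
------------------------------  ------------------------  ------------------------------
(inferred)                      2021-01-01T00:00:00       timestamp[s]
(inferred)                      2021-01-01T00:00:00Z      timestamp[s, UTC]
(inferred)                      2021-01-01T00:00:00+0100  timestamp[s, UTC]
(inferred)                      2021-01-01T00:00:00       string
                                2021-01-01T00:00:00Z
timestamp[s]                    2021-01-01T00:00:00       timestamp[s]
timestamp[s]                    2021-01-01T00:00:00Z      (error)
timestamp[s]                    2021-01-01T00:00:00+0100  (error)
timestamp[s]                    2021-01-01T00:00:00       (error)
                                2021-01-01T00:00:00Z
timestamp[s, UTC]               2021-01-01T00:00:00       (error)
timestamp[s, UTC]               2021-01-01T00:00:00Z      timestamp[s, UTC]
timestamp[s, UTC]               2021-01-01T00:00:00+0100  timestamp[s, UTC]
timestamp[s, UTC]               2021-01-01T00:00:00       (error)
                                2021-01-01T00:00:00Z
timestamp[s, America/New_York]  2021-01-01T00:00:00       (error)
timestamp[s, America/New_York]  2021-01-01T00:00:00Z      timestamp[s, America/New_York]
timestamp[s, America/New_York]  2021-01-01T00:00:00+0100  timestamp[s, America/New_York]
timestamp[s, America/New_York]  2021-01-01T00:00:00       (error)
                                2021-01-01T00:00:00Z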
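The inference rule for zone offsets can be expressed as a small standalone model. This is a simplified sketch and not Arrow's parser: InferredKind, HasZoneOffset and InferTimestampKind are hypothetical names, and the code assumes a non-empty column in which every value is already a well-formed ISO 8601 timestamp.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical outcome labels for this sketch.
enum class InferredKind { kTimestampNaive, kTimestampUtc, kString };

// True if the value ends in "Z" or carries a numeric offset such as "+0100"
// after the time part. The date part also contains '-', so the search
// starts at the date/time separator.
bool HasZoneOffset(const std::string& value) {
  if (value.empty()) return false;
  if (value.back() == 'Z') return true;
  const std::size_t time_start = value.find_first_of("T ");
  if (time_start == std::string::npos) return false;
  return value.find_first_of("+-", time_start) != std::string::npos;
}

// All values with an offset -> UTC timestamp; none -> timestamp without
// timezone; a mix -> the column falls back to string.
InferredKind InferTimestampKind(const std::vector<std::string>& column) {
  std::size_t with_offset = 0;
  for (const std::string& value : column) {
    if (HasZoneOffset(value)) ++with_offset;
  }
  if (with_offset == 0) return InferredKind::kTimestampNaive;
  if (with_offset == column.size()) return InferredKind::kTimestampUtc;
  return InferredKind::kString;
}
```

Note that inconsistent offsets (for example Z in one row and +0100 in another) still count as "all rows have an offset" and therefore map to a UTC timestamp.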

Nulls

Null values are recognized from the spellings stored in ConvertOptions::null_values. The ConvertOptions::Defaults() factory method will initialize a number of conventional null spellings such as N/A.

Character encoding

CSV files are expected to be encoded in UTF8. However, non-UTF8 data is accepted for Binary columns.

Write Options

The format of written CSV files can be customized via WriteOptions. Currently few options are available; more will be added in future releases.

Performance

By default, the CSV reader will parallelize reads in order to exploit all CPU cores on your machine. You can change this setting in ReadOptions::use_threads. A reasonable expectation is at least 100 MB/s per core on a performant desktop or laptop computer (measured in source CSV bytes, not target Arrow data bytes).