Arrow File I/O#

Apache Arrow provides file I/O functions to facilitate use of Arrow from the start to end of an application. In this article, you will:

Read an Arrow file into a RecordBatch and write it back out afterwards
Read a CSV file into a Table and write it back out afterwards
Read a Parquet file into a Table and write it back out afterwards

Pre-requisites#

Before continuing, make sure you have:

An Arrow installation, which you can set up here: Using Arrow C++ in your own project
An understanding of basic Arrow data structures from Basic Arrow Data Structures
A directory to run the final application in – this program will generate some files, so be prepared for that.

Setup#

Before writing out some file I/O, we need to fill in a couple of gaps:

We need to include necessary headers.
A main() is needed to glue things together.
We need files to play with.

Includes#

Before writing C++ code, we need some includes. We’ll get iostream for output, then import Arrow’s I/O functionality for each file type we’ll work with in this article:

#include <arrow/api.h>
#include <arrow/csv/api.h>
#include <arrow/io/api.h>
#include <arrow/ipc/api.h>
#include <parquet/arrow/reader.h>
#include <parquet/arrow/writer.h>

#include <iostream>

Main()#

For our glue, we’ll use the main() pattern from the previous tutorial on data structures:

int main() {
  arrow::Status st = RunMain();
  if (!st.ok()) {
    std::cerr << st << std::endl;
    return 1;
  }
  return 0;
}

Which, like when we used it before, is paired with a RunMain():

arrow::Status RunMain() {

Generating Files for Reading#

We need some files to actually play with. In practice, you’ll likely have some input for your own application. Here, however, we want to explore doing I/O for the sake of it, so let’s generate some files to make this easy to follow. To create those, we’ll define a helper function that we’ll run first. Feel free to read through this, but the concepts used will be explained later in this article. Note that we’re using the day/month/year data from the previous tutorial. For now, just copy the function in:

arrow::Status GenInitialFile() {
  // Make a couple 8-bit integer arrays and a 16-bit integer array -- just like
  // basic Arrow example.
  arrow::Int8Builder int8builder;
  int8_t days_raw[5] = {1, 12, 17, 23, 28};
  ARROW_RETURN_NOT_OK(int8builder.AppendValues(days_raw, 5));
  std::shared_ptr<arrow::Array> days;
  ARROW_ASSIGN_OR_RAISE(days, int8builder.Finish());

  int8_t months_raw[5] = {1, 3, 5, 7, 1};
  ARROW_RETURN_NOT_OK(int8builder.AppendValues(months_raw, 5));
  std::shared_ptr<arrow::Array> months;
  ARROW_ASSIGN_OR_RAISE(months, int8builder.Finish());

  arrow::Int16Builder int16builder;
  int16_t years_raw[5] = {1990, 2000, 1995, 2000, 1995};
  ARROW_RETURN_NOT_OK(int16builder.AppendValues(years_raw, 5));
  std::shared_ptr<arrow::Array> years;
  ARROW_ASSIGN_OR_RAISE(years, int16builder.Finish());

  // Get a vector of our Arrays
  std::vector<std::shared_ptr<arrow::Array>> columns = {days, months, years};

  // Make a schema to initialize the Table with
  std::shared_ptr<arrow::Field> field_day, field_month, field_year;
  std::shared_ptr<arrow::Schema> schema;

  field_day = arrow::field("Day", arrow::int8());
  field_month = arrow::field("Month", arrow::int8());
  field_year = arrow::field("Year", arrow::int16());

  schema = arrow::schema({field_day, field_month, field_year});
  // With the schema and data, create a Table
  std::shared_ptr<arrow::Table> table;
  table = arrow::Table::Make(schema, columns);

  // Write out test files in IPC, CSV, and Parquet for the example to use.
  std::shared_ptr<arrow::io::FileOutputStream> outfile;
  ARROW_ASSIGN_OR_RAISE(outfile, arrow::io::FileOutputStream::Open("test_in.arrow"));
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::ipc::RecordBatchWriter> ipc_writer,
                        arrow::ipc::MakeFileWriter(outfile, schema));
  ARROW_RETURN_NOT_OK(ipc_writer->WriteTable(*table));
  ARROW_RETURN_NOT_OK(ipc_writer->Close());

  ARROW_ASSIGN_OR_RAISE(outfile, arrow::io::FileOutputStream::Open("test_in.csv"));
  ARROW_ASSIGN_OR_RAISE(auto csv_writer,
                        arrow::csv::MakeCSVWriter(outfile, table->schema()));
  ARROW_RETURN_NOT_OK(csv_writer->WriteTable(*table));
  ARROW_RETURN_NOT_OK(csv_writer->Close());

  ARROW_ASSIGN_OR_RAISE(outfile, arrow::io::FileOutputStream::Open("test_in.parquet"));
  PARQUET_THROW_NOT_OK(
      parquet::arrow::WriteTable(*table, arrow::default_memory_pool(), outfile, 5));

  return arrow::Status::OK();
}

To get the files for the rest of your code to function, make sure to call GenInitialFile() as the very first line in RunMain() to initialize the environment:

  // Generate initial files for each format with a helper function -- don't worry,
  // we'll also write a table in this example.
  ARROW_RETURN_NOT_OK(GenInitialFile());

I/O with Arrow Files#

We’re going to go through this step by step, reading then writing, as follows:

Reading a file
1. Open the file
2. Bind file to ipc::RecordBatchFileReader
3. Read file to RecordBatch
Writing a file
1. Get a io::FileOutputStream
2. Write to file from RecordBatch

Opening a File#

To read a file, we first need a way to point to it. In Arrow, that means getting an io::ReadableFile object. Much like an ArrayBuilder can be cleared and reused to make new arrays, this object can be reassigned to new files, so we’ll use this one instance throughout the examples:

  // First, we have to set up a ReadableFile object, which just lets us point our
  // readers to the right data on disk. We'll be reusing this object, and rebinding
  // it to multiple files throughout the example.
  std::shared_ptr<arrow::io::ReadableFile> infile;

An io::ReadableFile does little alone – we actually have it bind to a file with io::ReadableFile::Open(). Here, we pass arrow::default_memory_pool() explicitly, which is also the default for that argument:

  // Get "test_in.arrow" into our file pointer
  ARROW_ASSIGN_OR_RAISE(infile, arrow::io::ReadableFile::Open(
                                    "test_in.arrow", arrow::default_memory_pool()));

Opening an Arrow file Reader#

An io::ReadableFile is too generic to offer all functionality to read an Arrow file. We need to use it to get an ipc::RecordBatchFileReader object. This object implements all the logic needed to read an Arrow file with correct formatting. We get one through ipc::RecordBatchFileReader::Open():

  // Open up the file with the IPC features of the library, gives us a reader object.
  ARROW_ASSIGN_OR_RAISE(auto ipc_reader, arrow::ipc::RecordBatchFileReader::Open(infile));

Reading an Open Arrow File to RecordBatch#

Arrow files are read one RecordBatch at a time, so we declare a RecordBatch to read into. Arrow files can hold multiple RecordBatches, so we must pass an index. This file has only one, so pass 0:

  // Using the reader, we can read Record Batches. Note that this is specific to IPC;
  // for other formats, we focus on Tables, but here, RecordBatches are used.
  std::shared_ptr<arrow::RecordBatch> rbatch;
  ARROW_ASSIGN_OR_RAISE(rbatch, ipc_reader->ReadRecordBatch(0));

Prepare a FileOutputStream#

For output, we need an io::FileOutputStream. Just like our io::ReadableFile, we’ll be reusing this object for each output file. We open files the same way as when reading:

  // Just like with input, we get an object for the output file.
  std::shared_ptr<arrow::io::FileOutputStream> outfile;
  // Bind it to "test_out.arrow"
  ARROW_ASSIGN_OR_RAISE(outfile, arrow::io::FileOutputStream::Open("test_out.arrow"));

Write Arrow File from RecordBatch#

Now, we take the RecordBatch we read into previously and use it, along with our target file, to create an ipc::RecordBatchWriter. The ipc::RecordBatchWriter needs two things:

the target file
the Schema for our RecordBatch (in case we need to write more RecordBatches of the same format.)

The Schema comes from our existing RecordBatch and the target file is the output stream we just created.

  // Set up a writer with the output file -- and the schema! Everything is defined
  // here, so the writer is ready to go.
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::ipc::RecordBatchWriter> ipc_writer,
                        arrow::ipc::MakeFileWriter(outfile, rbatch->schema()));

We can just call ipc::RecordBatchWriter::WriteRecordBatch() with our RecordBatch to fill up our file:

  // Write the record batch.
  ARROW_RETURN_NOT_OK(ipc_writer->WriteRecordBatch(*rbatch));

For IPC in particular, the writer has to be closed since it anticipates more than one batch may be written. To do that:

  // Specifically for IPC, the writer needs to be explicitly closed.
  ARROW_RETURN_NOT_OK(ipc_writer->Close());

Now we’ve read and written an IPC file!

I/O with CSV#

We’re going to go through this step by step, reading then writing, as follows:

Reading a file
1. Open the file
2. Prepare Table
3. Read File using csv::TableReader
Writing a file
1. Get a io::FileOutputStream
2. Write to file from Table

Opening a CSV File#

For a CSV file, we need to open an io::ReadableFile, just as we did for the Arrow file. We reuse our io::ReadableFile object from before to do so:

  // Bind our input file to "test_in.csv"
  ARROW_ASSIGN_OR_RAISE(infile, arrow::io::ReadableFile::Open("test_in.csv"));

Preparing a Table#

CSV can be read into a Table, so declare a pointer to a Table:

  std::shared_ptr<arrow::Table> csv_table;

Read a CSV File to Table#

The CSV reader has option structs which need to be passed – luckily, there are defaults for these which we can pass directly. For reference on the other options, go here: File Formats. Our file has no special delimiters and is small, so we can make our reader with defaults:

  // The CSV reader has several objects for various options. For now, we'll use defaults.
  ARROW_ASSIGN_OR_RAISE(
      auto csv_reader,
      arrow::csv::TableReader::Make(
          arrow::io::default_io_context(), infile, arrow::csv::ReadOptions::Defaults(),
          arrow::csv::ParseOptions::Defaults(), arrow::csv::ConvertOptions::Defaults()));

With the CSV reader primed, we can use its csv::TableReader::Read() method to fill our Table:

  // Read the table.
  ARROW_ASSIGN_OR_RAISE(csv_table, csv_reader->Read());

Write a CSV File from Table#

Writing a Table to CSV looks exactly like writing a RecordBatch to IPC, except that we pass our Table and call ipc::RecordBatchWriter::WriteTable() instead of ipc::RecordBatchWriter::WriteRecordBatch(). Note that the same writer class is used: csv::MakeCSVWriter() gives us an ipc::RecordBatchWriter, and we call WriteTable() on it because we have a Table. We’ll target a file, use our Table’s Schema, and then write the Table:

  // Bind our output file to "test_out.csv"
  ARROW_ASSIGN_OR_RAISE(outfile, arrow::io::FileOutputStream::Open("test_out.csv"));
  // The CSV writer has simpler defaults, review API documentation for more complex usage.
  ARROW_ASSIGN_OR_RAISE(auto csv_writer,
                        arrow::csv::MakeCSVWriter(outfile, csv_table->schema()));
  ARROW_RETURN_NOT_OK(csv_writer->WriteTable(*csv_table));
  // Not necessary, but a safe practice.
  ARROW_RETURN_NOT_OK(csv_writer->Close());

Now, we’ve read and written a CSV file!

File I/O with Parquet#

We’re going to go through this step by step, reading then writing, as follows:

Reading a file
1. Open the file
2. Prepare parquet::arrow::FileReader
3. Read file to Table
Writing a file
1. Write Table to file

Opening a Parquet File#

Once more, this file format, Parquet, needs an io::ReadableFile, which we already have. We call io::ReadableFile::Open() to bind it to the Parquet file:

  // Bind our input file to "test_in.parquet"
  ARROW_ASSIGN_OR_RAISE(infile, arrow::io::ReadableFile::Open("test_in.parquet"));

Setting up a Parquet Reader#

As always, we need a Reader to actually read the file. We’ve been getting Readers for each file format from the Arrow namespace. This time, we enter the Parquet namespace to get the parquet::arrow::FileReader:

  std::unique_ptr<parquet::arrow::FileReader> reader;

Now, to set up our reader, we call parquet::arrow::OpenFile(). Yes, this is necessary even though we used io::ReadableFile::Open(). Note that, as called here, OpenFile() returns the reader wrapped in a Result, so we assign it with PARQUET_ASSIGN_OR_THROW, which throws an exception on failure rather than returning a Status:

  // Note that Parquet's OpenFile() returns the reader inside a Result here, and
  // PARQUET_ASSIGN_OR_THROW throws on failure instead of returning a Status.
  PARQUET_ASSIGN_OR_THROW(reader,
                          parquet::arrow::OpenFile(infile, arrow::default_memory_pool()));

Reading a Parquet File to Table#

With a prepared parquet::arrow::FileReader in hand, we can read to a Table. Unlike the readers above, ReadTable() fills the Table through an output pointer instead of returning it:

  std::shared_ptr<arrow::Table> parquet_table;
  // Read the table.
  PARQUET_THROW_NOT_OK(reader->ReadTable(&parquet_table));

Writing a Parquet File from Table#

For single-shot writes, writing a Parquet file does not need a writer object. Instead, we give it our table, point to the memory pool it will use for any necessary memory consumption, tell it where to write, and give it the chunk size (the maximum number of rows per row group) to use if it needs to break up the file at all:

  // Parquet writing does not need a declared writer object. Just get the output
  // file bound, then pass in the table, memory pool, output, and chunk size for
  // breaking up the Table on-disk.
  ARROW_ASSIGN_OR_RAISE(outfile, arrow::io::FileOutputStream::Open("test_out.parquet"));
  PARQUET_THROW_NOT_OK(parquet::arrow::WriteTable(
      *parquet_table, arrow::default_memory_pool(), outfile, 5));

Ending Program#

At the end, we just return Status::OK(), so that main() knows we’re done and that everything’s okay, just like in the first tutorial.

  return arrow::Status::OK();
}

With that, you’ve read and written IPC, CSV, and Parquet in Arrow, and can properly load data and write output! Now, we can move into processing data with compute functions in the next article.

See below for a copy of the complete code:

// (Doc section: Includes)
#include <arrow/api.h>
#include <arrow/csv/api.h>
#include <arrow/io/api.h>
#include <arrow/ipc/api.h>
#include <parquet/arrow/reader.h>
#include <parquet/arrow/writer.h>

#include <iostream>
// (Doc section: Includes)

// (Doc section: GenInitialFile)
arrow::Status GenInitialFile() {
  // Make a couple 8-bit integer arrays and a 16-bit integer array -- just like
  // basic Arrow example.
  arrow::Int8Builder int8builder;
  int8_t days_raw[5] = {1, 12, 17, 23, 28};
  ARROW_RETURN_NOT_OK(int8builder.AppendValues(days_raw, 5));
  std::shared_ptr<arrow::Array> days;
  ARROW_ASSIGN_OR_RAISE(days, int8builder.Finish());

  int8_t months_raw[5] = {1, 3, 5, 7, 1};
  ARROW_RETURN_NOT_OK(int8builder.AppendValues(months_raw, 5));
  std::shared_ptr<arrow::Array> months;
  ARROW_ASSIGN_OR_RAISE(months, int8builder.Finish());

  arrow::Int16Builder int16builder;
  int16_t years_raw[5] = {1990, 2000, 1995, 2000, 1995};
  ARROW_RETURN_NOT_OK(int16builder.AppendValues(years_raw, 5));
  std::shared_ptr<arrow::Array> years;
  ARROW_ASSIGN_OR_RAISE(years, int16builder.Finish());

  // Get a vector of our Arrays
  std::vector<std::shared_ptr<arrow::Array>> columns = {days, months, years};

  // Make a schema to initialize the Table with
  std::shared_ptr<arrow::Field> field_day, field_month, field_year;
  std::shared_ptr<arrow::Schema> schema;

  field_day = arrow::field("Day", arrow::int8());
  field_month = arrow::field("Month", arrow::int8());
  field_year = arrow::field("Year", arrow::int16());

  schema = arrow::schema({field_day, field_month, field_year});
  // With the schema and data, create a Table
  std::shared_ptr<arrow::Table> table;
  table = arrow::Table::Make(schema, columns);

  // Write out test files in IPC, CSV, and Parquet for the example to use.
  std::shared_ptr<arrow::io::FileOutputStream> outfile;
  ARROW_ASSIGN_OR_RAISE(outfile, arrow::io::FileOutputStream::Open("test_in.arrow"));
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::ipc::RecordBatchWriter> ipc_writer,
                        arrow::ipc::MakeFileWriter(outfile, schema));
  ARROW_RETURN_NOT_OK(ipc_writer->WriteTable(*table));
  ARROW_RETURN_NOT_OK(ipc_writer->Close());

  ARROW_ASSIGN_OR_RAISE(outfile, arrow::io::FileOutputStream::Open("test_in.csv"));
  ARROW_ASSIGN_OR_RAISE(auto csv_writer,
                        arrow::csv::MakeCSVWriter(outfile, table->schema()));
  ARROW_RETURN_NOT_OK(csv_writer->WriteTable(*table));
  ARROW_RETURN_NOT_OK(csv_writer->Close());

  ARROW_ASSIGN_OR_RAISE(outfile, arrow::io::FileOutputStream::Open("test_in.parquet"));
  PARQUET_THROW_NOT_OK(
      parquet::arrow::WriteTable(*table, arrow::default_memory_pool(), outfile, 5));

  return arrow::Status::OK();
}
// (Doc section: GenInitialFile)

// (Doc section: RunMain)
arrow::Status RunMain() {
  // (Doc section: RunMain)
  // (Doc section: Gen Files)
  // Generate initial files for each format with a helper function -- don't worry,
  // we'll also write a table in this example.
  ARROW_RETURN_NOT_OK(GenInitialFile());
  // (Doc section: Gen Files)

  // (Doc section: ReadableFile Definition)
  // First, we have to set up a ReadableFile object, which just lets us point our
  // readers to the right data on disk. We'll be reusing this object, and rebinding
  // it to multiple files throughout the example.
  std::shared_ptr<arrow::io::ReadableFile> infile;
  // (Doc section: ReadableFile Definition)
  // (Doc section: Arrow ReadableFile Open)
  // Get "test_in.arrow" into our file pointer
  ARROW_ASSIGN_OR_RAISE(infile, arrow::io::ReadableFile::Open(
                                    "test_in.arrow", arrow::default_memory_pool()));
  // (Doc section: Arrow ReadableFile Open)
  // (Doc section: Arrow Read Open)
  // Open up the file with the IPC features of the library, gives us a reader object.
  ARROW_ASSIGN_OR_RAISE(auto ipc_reader, arrow::ipc::RecordBatchFileReader::Open(infile));
  // (Doc section: Arrow Read Open)
  // (Doc section: Arrow Read)
  // Using the reader, we can read Record Batches. Note that this is specific to IPC;
  // for other formats, we focus on Tables, but here, RecordBatches are used.
  std::shared_ptr<arrow::RecordBatch> rbatch;
  ARROW_ASSIGN_OR_RAISE(rbatch, ipc_reader->ReadRecordBatch(0));
  // (Doc section: Arrow Read)

  // (Doc section: Arrow Write Open)
  // Just like with input, we get an object for the output file.
  std::shared_ptr<arrow::io::FileOutputStream> outfile;
  // Bind it to "test_out.arrow"
  ARROW_ASSIGN_OR_RAISE(outfile, arrow::io::FileOutputStream::Open("test_out.arrow"));
  // (Doc section: Arrow Write Open)
  // (Doc section: Arrow Writer)
  // Set up a writer with the output file -- and the schema! Everything is defined
  // here, so the writer is ready to go.
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::ipc::RecordBatchWriter> ipc_writer,
                        arrow::ipc::MakeFileWriter(outfile, rbatch->schema()));
  // (Doc section: Arrow Writer)
  // (Doc section: Arrow Write)
  // Write the record batch.
  ARROW_RETURN_NOT_OK(ipc_writer->WriteRecordBatch(*rbatch));
  // (Doc section: Arrow Write)
  // (Doc section: Arrow Close)
  // Specifically for IPC, the writer needs to be explicitly closed.
  ARROW_RETURN_NOT_OK(ipc_writer->Close());
  // (Doc section: Arrow Close)

  // (Doc section: CSV Read Open)
  // Bind our input file to "test_in.csv"
  ARROW_ASSIGN_OR_RAISE(infile, arrow::io::ReadableFile::Open("test_in.csv"));
  // (Doc section: CSV Read Open)
  // (Doc section: CSV Table Declare)
  std::shared_ptr<arrow::Table> csv_table;
  // (Doc section: CSV Table Declare)
  // (Doc section: CSV Reader Make)
  // The CSV reader has several objects for various options. For now, we'll use defaults.
  ARROW_ASSIGN_OR_RAISE(
      auto csv_reader,
      arrow::csv::TableReader::Make(
          arrow::io::default_io_context(), infile, arrow::csv::ReadOptions::Defaults(),
          arrow::csv::ParseOptions::Defaults(), arrow::csv::ConvertOptions::Defaults()));
  // (Doc section: CSV Reader Make)
  // (Doc section: CSV Read)
  // Read the table.
  ARROW_ASSIGN_OR_RAISE(csv_table, csv_reader->Read());
  // (Doc section: CSV Read)

  // (Doc section: CSV Write)
  // Bind our output file to "test_out.csv"
  ARROW_ASSIGN_OR_RAISE(outfile, arrow::io::FileOutputStream::Open("test_out.csv"));
  // The CSV writer has simpler defaults, review API documentation for more complex usage.
  ARROW_ASSIGN_OR_RAISE(auto csv_writer,
                        arrow::csv::MakeCSVWriter(outfile, csv_table->schema()));
  ARROW_RETURN_NOT_OK(csv_writer->WriteTable(*csv_table));
  // Not necessary, but a safe practice.
  ARROW_RETURN_NOT_OK(csv_writer->Close());
  // (Doc section: CSV Write)

  // (Doc section: Parquet Read Open)
  // Bind our input file to "test_in.parquet"
  ARROW_ASSIGN_OR_RAISE(infile, arrow::io::ReadableFile::Open("test_in.parquet"));
  // (Doc section: Parquet Read Open)
  // (Doc section: Parquet FileReader)
  std::unique_ptr<parquet::arrow::FileReader> reader;
  // (Doc section: Parquet FileReader)
  // (Doc section: Parquet OpenFile)
  // Note that Parquet's OpenFile() returns the reader inside a Result here, and
  // PARQUET_ASSIGN_OR_THROW throws on failure instead of returning a Status.
  PARQUET_ASSIGN_OR_THROW(reader,
                          parquet::arrow::OpenFile(infile, arrow::default_memory_pool()));
  // (Doc section: Parquet OpenFile)

  // (Doc section: Parquet Read)
  std::shared_ptr<arrow::Table> parquet_table;
  // Read the table.
  PARQUET_THROW_NOT_OK(reader->ReadTable(&parquet_table));
  // (Doc section: Parquet Read)

  // (Doc section: Parquet Write)
  // Parquet writing does not need a declared writer object. Just get the output
  // file bound, then pass in the table, memory pool, output, and chunk size for
  // breaking up the Table on-disk.
  ARROW_ASSIGN_OR_RAISE(outfile, arrow::io::FileOutputStream::Open("test_out.parquet"));
  PARQUET_THROW_NOT_OK(parquet::arrow::WriteTable(
      *parquet_table, arrow::default_memory_pool(), outfile, 5));
  // (Doc section: Parquet Write)
  // (Doc section: Return)
  return arrow::Status::OK();
}
// (Doc section: Return)

// (Doc section: Main)
int main() {
  arrow::Status st = RunMain();
  if (!st.ok()) {
    std::cerr << st << std::endl;
    return 1;
  }
  return 0;
}
// (Doc section: Main)