Arrow File I/O#
Apache Arrow provides file I/O functions to facilitate use of Arrow from the start to end of an application. In this article, you will:
Read an Arrow file into a RecordBatch and write it back out afterwards
Read a CSV file into a Table and write it back out afterwards
Read a Parquet file into a Table and write it back out afterwards
Pre-requisites#
Before continuing, make sure you have:
An Arrow installation, which you can set up here: Using Arrow C++ in your own project
An understanding of basic Arrow data structures from Basic Arrow Data Structures
A directory to run the final application in – this program will generate some files, so be prepared for that.
Setup#
Before writing any file I/O code, we need to fill in a couple of gaps:
We need to include the necessary headers.
A main() is needed to glue things together.
We need files to play with.
Includes#
Before writing C++ code, we need some includes. We’ll get iostream
for output, then import Arrow’s
I/O functionality for each file type we’ll work with in this article:
#include <arrow/api.h>
#include <arrow/csv/api.h>
#include <arrow/io/api.h>
#include <arrow/ipc/api.h>
#include <parquet/arrow/reader.h>
#include <parquet/arrow/writer.h>
#include <iostream>
Main()#
For our glue, we’ll use the main()
pattern from the previous tutorial on
data structures:
int main() {
  arrow::Status st = RunMain();
  if (!st.ok()) {
    std::cerr << st << std::endl;
    return 1;
  }
  return 0;
}
Which, like when we used it before, is paired with a RunMain():
arrow::Status RunMain() {
Generating Files for Reading#
We need some files to actually play with. In practice, you’ll likely have some input for your own application. Here, however, we want to explore doing I/O for the sake of it, so let’s generate some files to make this easy to follow. To create those, we’ll define a helper function that we’ll run first. Feel free to read through this, but the concepts used will be explained later in this article. Note that we’re using the day/month/year data from the previous tutorial. For now, just copy the function in:
arrow::Status GenInitialFile() {
  // Make a couple 8-bit integer arrays and a 16-bit integer array -- just like
  // basic Arrow example.
  arrow::Int8Builder int8builder;
  int8_t days_raw[5] = {1, 12, 17, 23, 28};
  ARROW_RETURN_NOT_OK(int8builder.AppendValues(days_raw, 5));
  std::shared_ptr<arrow::Array> days;
  ARROW_ASSIGN_OR_RAISE(days, int8builder.Finish());

  int8_t months_raw[5] = {1, 3, 5, 7, 1};
  ARROW_RETURN_NOT_OK(int8builder.AppendValues(months_raw, 5));
  std::shared_ptr<arrow::Array> months;
  ARROW_ASSIGN_OR_RAISE(months, int8builder.Finish());

  arrow::Int16Builder int16builder;
  int16_t years_raw[5] = {1990, 2000, 1995, 2000, 1995};
  ARROW_RETURN_NOT_OK(int16builder.AppendValues(years_raw, 5));
  std::shared_ptr<arrow::Array> years;
  ARROW_ASSIGN_OR_RAISE(years, int16builder.Finish());

  // Get a vector of our Arrays
  std::vector<std::shared_ptr<arrow::Array>> columns = {days, months, years};

  // Make a schema to initialize the Table with
  std::shared_ptr<arrow::Field> field_day, field_month, field_year;
  std::shared_ptr<arrow::Schema> schema;

  field_day = arrow::field("Day", arrow::int8());
  field_month = arrow::field("Month", arrow::int8());
  field_year = arrow::field("Year", arrow::int16());

  schema = arrow::schema({field_day, field_month, field_year});
  // With the schema and data, create a Table
  std::shared_ptr<arrow::Table> table;
  table = arrow::Table::Make(schema, columns);

  // Write out test files in IPC, CSV, and Parquet for the example to use.
  std::shared_ptr<arrow::io::FileOutputStream> outfile;
  ARROW_ASSIGN_OR_RAISE(outfile, arrow::io::FileOutputStream::Open("test_in.arrow"));
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::ipc::RecordBatchWriter> ipc_writer,
                        arrow::ipc::MakeFileWriter(outfile, schema));
  ARROW_RETURN_NOT_OK(ipc_writer->WriteTable(*table));
  ARROW_RETURN_NOT_OK(ipc_writer->Close());

  ARROW_ASSIGN_OR_RAISE(outfile, arrow::io::FileOutputStream::Open("test_in.csv"));
  ARROW_ASSIGN_OR_RAISE(auto csv_writer,
                        arrow::csv::MakeCSVWriter(outfile, table->schema()));
  ARROW_RETURN_NOT_OK(csv_writer->WriteTable(*table));
  ARROW_RETURN_NOT_OK(csv_writer->Close());

  ARROW_ASSIGN_OR_RAISE(outfile, arrow::io::FileOutputStream::Open("test_in.parquet"));
  PARQUET_THROW_NOT_OK(
      parquet::arrow::WriteTable(*table, arrow::default_memory_pool(), outfile, 5));

  return arrow::Status::OK();
}
To get the files for the rest of your code to function, make sure to
call GenInitialFile()
as the very first line in RunMain()
to initialize
the environment:
// Generate initial files for each format with a helper function -- don't worry,
// we'll also write a table in this example.
ARROW_RETURN_NOT_OK(GenInitialFile());
I/O with Arrow Files#
We’re going to go through this step by step, reading then writing, as follows:
Reading a file
  Open the file
  Bind file to ipc::RecordBatchFileReader
  Read file to RecordBatch
Writing a file
  Get an io::FileOutputStream
  Write to file from RecordBatch
Opening a File#
To read a file, we first need a handle that points to it. In Arrow, that
handle is an io::ReadableFile object. Much like an ArrayBuilder can be
cleared and reused to make new arrays, this object can be reassigned to new
files, so we’ll use one instance throughout the examples:
// First, we have to set up a ReadableFile object, which just lets us point our
// readers to the right data on disk. We'll be reusing this object, and rebinding
// it to multiple files throughout the example.
std::shared_ptr<arrow::io::ReadableFile> infile;
An io::ReadableFile does little on its own; we bind it to a file with
io::ReadableFile::Open(). For our purposes here, the default memory pool
suffices:
// Get "test_in.arrow" into our file pointer
ARROW_ASSIGN_OR_RAISE(infile, arrow::io::ReadableFile::Open(
"test_in.arrow", arrow::default_memory_pool()));
Opening an Arrow file Reader#
An io::ReadableFile is too generic to offer all the functionality needed to read an Arrow file.
We use it to get an ipc::RecordBatchFileReader object, which implements
all the logic needed to read an Arrow file with correct formatting. We get one through
ipc::RecordBatchFileReader::Open():
// Open up the file with the IPC features of the library, gives us a reader object.
ARROW_ASSIGN_OR_RAISE(auto ipc_reader, arrow::ipc::RecordBatchFileReader::Open(infile));
Reading an Open Arrow File to RecordBatch#
The IPC file reader hands data back as RecordBatches, so we declare a
RecordBatch pointer to read into. Arrow files can hold multiple
RecordBatches, so we must pass an index. This file only has one, so we
pass 0:
// Using the reader, we can read Record Batches. Note that this is specific to IPC;
// for other formats, we focus on Tables, but here, RecordBatches are used.
std::shared_ptr<arrow::RecordBatch> rbatch;
ARROW_ASSIGN_OR_RAISE(rbatch, ipc_reader->ReadRecordBatch(0));
Prepare a FileOutputStream#
For output, we need an io::FileOutputStream. Just like our
io::ReadableFile, we’ll be reusing this object for every output file. We
open files the same way as when reading:
// Just like with input, we get an object for the output file.
std::shared_ptr<arrow::io::FileOutputStream> outfile;
// Bind it to "test_out.arrow"
ARROW_ASSIGN_OR_RAISE(outfile, arrow::io::FileOutputStream::Open("test_out.arrow"));
Write Arrow File from RecordBatch#
Now, we take the RecordBatch we read previously and use it, along with our
target file, to create an ipc::RecordBatchWriter. The
ipc::RecordBatchWriter needs two things:
the target file
the Schema for our RecordBatch (in case we need to write more RecordBatches of the same format)
The Schema comes from our existing RecordBatch, and the target file is
the output stream we just created.
// Set up a writer with the output file -- and the schema! Everything the writer
// needs is defined here, before any data is written.
ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::ipc::RecordBatchWriter> ipc_writer,
arrow::ipc::MakeFileWriter(outfile, rbatch->schema()));
We can just call ipc::RecordBatchWriter::WriteRecordBatch()
with our RecordBatch
to fill up our
file:
// Write the record batch.
ARROW_RETURN_NOT_OK(ipc_writer->WriteRecordBatch(*rbatch));
For IPC in particular, the writer has to be closed since it anticipates more than one batch may be written. To do that:
// Specifically for IPC, the writer needs to be explicitly closed.
ARROW_RETURN_NOT_OK(ipc_writer->Close());
Now we’ve read and written an IPC file!
I/O with CSV#
We’re going to go through this step by step, reading then writing, as follows:
Reading a file
  Open the file
  Prepare Table
  Read File using csv::TableReader
Writing a file
  Get an io::FileOutputStream
  Write to file from Table
Opening a CSV File#
For a CSV file, we need to open an io::ReadableFile, just like for an Arrow file.
We reuse our io::ReadableFile object from before to do so:
// Bind our input file to "test_in.csv"
ARROW_ASSIGN_OR_RAISE(infile, arrow::io::ReadableFile::Open("test_in.csv"));
Preparing a Table#
CSV can be read into a Table, so declare a pointer to a Table:
std::shared_ptr<arrow::Table> csv_table;
Read a CSV File to Table#
The CSV reader has option structs which need to be passed. Luckily, there are defaults for these which we can pass directly. For reference on the other options, go here: File Formats. Our file has no special delimiters and is small, so we can make our reader with defaults:
// The CSV reader has several objects for various options. For now, we'll use defaults.
ARROW_ASSIGN_OR_RAISE(
auto csv_reader,
arrow::csv::TableReader::Make(
arrow::io::default_io_context(), infile, arrow::csv::ReadOptions::Defaults(),
arrow::csv::ParseOptions::Defaults(), arrow::csv::ConvertOptions::Defaults()));
With the CSV reader primed, we can use its csv::TableReader::Read()
method to fill our Table:
// Read the table.
ARROW_ASSIGN_OR_RAISE(csv_table, csv_reader->Read());
Write a CSV File from Table#
Writing a CSV file from a Table looks exactly like writing an IPC file from a
RecordBatch, except that we use our Table and call
ipc::RecordBatchWriter::WriteTable() instead of
ipc::RecordBatchWriter::WriteRecordBatch(). Note that the same writer class is used;
we call ipc::RecordBatchWriter::WriteTable() because we have a Table. We’ll target
a file, use our Table’s Schema, and then write the Table:
// Bind our output file to "test_out.csv"
ARROW_ASSIGN_OR_RAISE(outfile, arrow::io::FileOutputStream::Open("test_out.csv"));
// The CSV writer has simpler defaults, review API documentation for more complex usage.
ARROW_ASSIGN_OR_RAISE(auto csv_writer,
arrow::csv::MakeCSVWriter(outfile, csv_table->schema()));
ARROW_RETURN_NOT_OK(csv_writer->WriteTable(*csv_table));
// Not necessary, but a safe practice.
ARROW_RETURN_NOT_OK(csv_writer->Close());
Now, we’ve read and written a CSV file!
File I/O with Parquet#
We’re going to go through this step by step, reading then writing, as follows:
Reading a file
  Open the file
  Prepare parquet::arrow::FileReader
  Read file to Table
Writing a file
  Write Table to file
Opening a Parquet File#
Once more, Parquet needs an io::ReadableFile, which we already have,
bound to the file with io::ReadableFile::Open():
// Bind our input file to "test_in.parquet"
ARROW_ASSIGN_OR_RAISE(infile, arrow::io::ReadableFile::Open("test_in.parquet"));
Setting up a Parquet Reader#
As always, we need a Reader to actually read the file. We’ve been
getting Readers for each file format from the Arrow namespace. This
time, we enter the Parquet namespace to get the parquet::arrow::FileReader:
std::unique_ptr<parquet::arrow::FileReader> reader;
Now, to set up our reader, we call parquet::arrow::OpenFile(). Yes, this is necessary
even though we used io::ReadableFile::Open(). Note that OpenFile() returns the
parquet::arrow::FileReader wrapped in a Result, so we assign it to the reader we just declared:
// Parquet's OpenFile() returns the reader wrapped in a Result, so we assign the
// result to the reader declared above.
PARQUET_ASSIGN_OR_THROW(reader,
parquet::arrow::OpenFile(infile, arrow::default_memory_pool()));
Reading a Parquet File to Table#
With a prepared parquet::arrow::FileReader in hand, we can read to a
Table. Unlike the CSV reader, ReadTable() fills the Table through a pointer
we pass in rather than returning it:
std::shared_ptr<arrow::Table> parquet_table;
// Read the table.
PARQUET_THROW_NOT_OK(reader->ReadTable(&parquet_table));
Writing a Parquet File from Table#
For single-shot writes, writing a Parquet file does not need a writer object. Instead, we pass parquet::arrow::WriteTable() our table, the memory pool to use for any allocations, the output stream to write to, and the chunk size, which caps the number of rows per row group if the table needs to be split up on disk:
// Parquet writing does not need a declared writer object. Just get the output
// file bound, then pass in the table, memory pool, output, and chunk size for
// breaking up the Table on-disk.
ARROW_ASSIGN_OR_RAISE(outfile, arrow::io::FileOutputStream::Open("test_out.parquet"));
PARQUET_THROW_NOT_OK(parquet::arrow::WriteTable(
*parquet_table, arrow::default_memory_pool(), outfile, 5));
Ending Program#
At the end, we just return Status::OK(), so the main() knows that
we’re done, and that everything’s okay, just like in the first tutorial.
return arrow::Status::OK();
}
With that, you’ve read and written IPC, CSV, and Parquet in Arrow, and can properly load data and write output! Now, we can move into processing data with compute functions in the next article.
Refer to the below for a copy of the complete code:
// (Doc section: Includes)
#include <arrow/api.h>
#include <arrow/csv/api.h>
#include <arrow/io/api.h>
#include <arrow/ipc/api.h>
#include <parquet/arrow/reader.h>
#include <parquet/arrow/writer.h>

#include <iostream>
// (Doc section: Includes)

// (Doc section: GenInitialFile)
arrow::Status GenInitialFile() {
  // Make a couple 8-bit integer arrays and a 16-bit integer array -- just like
  // basic Arrow example.
  arrow::Int8Builder int8builder;
  int8_t days_raw[5] = {1, 12, 17, 23, 28};
  ARROW_RETURN_NOT_OK(int8builder.AppendValues(days_raw, 5));
  std::shared_ptr<arrow::Array> days;
  ARROW_ASSIGN_OR_RAISE(days, int8builder.Finish());

  int8_t months_raw[5] = {1, 3, 5, 7, 1};
  ARROW_RETURN_NOT_OK(int8builder.AppendValues(months_raw, 5));
  std::shared_ptr<arrow::Array> months;
  ARROW_ASSIGN_OR_RAISE(months, int8builder.Finish());

  arrow::Int16Builder int16builder;
  int16_t years_raw[5] = {1990, 2000, 1995, 2000, 1995};
  ARROW_RETURN_NOT_OK(int16builder.AppendValues(years_raw, 5));
  std::shared_ptr<arrow::Array> years;
  ARROW_ASSIGN_OR_RAISE(years, int16builder.Finish());

  // Get a vector of our Arrays
  std::vector<std::shared_ptr<arrow::Array>> columns = {days, months, years};

  // Make a schema to initialize the Table with
  std::shared_ptr<arrow::Field> field_day, field_month, field_year;
  std::shared_ptr<arrow::Schema> schema;

  field_day = arrow::field("Day", arrow::int8());
  field_month = arrow::field("Month", arrow::int8());
  field_year = arrow::field("Year", arrow::int16());

  schema = arrow::schema({field_day, field_month, field_year});
  // With the schema and data, create a Table
  std::shared_ptr<arrow::Table> table;
  table = arrow::Table::Make(schema, columns);

  // Write out test files in IPC, CSV, and Parquet for the example to use.
  std::shared_ptr<arrow::io::FileOutputStream> outfile;
  ARROW_ASSIGN_OR_RAISE(outfile, arrow::io::FileOutputStream::Open("test_in.arrow"));
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::ipc::RecordBatchWriter> ipc_writer,
                        arrow::ipc::MakeFileWriter(outfile, schema));
  ARROW_RETURN_NOT_OK(ipc_writer->WriteTable(*table));
  ARROW_RETURN_NOT_OK(ipc_writer->Close());

  ARROW_ASSIGN_OR_RAISE(outfile, arrow::io::FileOutputStream::Open("test_in.csv"));
  ARROW_ASSIGN_OR_RAISE(auto csv_writer,
                        arrow::csv::MakeCSVWriter(outfile, table->schema()));
  ARROW_RETURN_NOT_OK(csv_writer->WriteTable(*table));
  ARROW_RETURN_NOT_OK(csv_writer->Close());

  ARROW_ASSIGN_OR_RAISE(outfile, arrow::io::FileOutputStream::Open("test_in.parquet"));
  PARQUET_THROW_NOT_OK(
      parquet::arrow::WriteTable(*table, arrow::default_memory_pool(), outfile, 5));

  return arrow::Status::OK();
}
// (Doc section: GenInitialFile)

// (Doc section: RunMain)
arrow::Status RunMain() {
  // (Doc section: RunMain)
  // (Doc section: Gen Files)
  // Generate initial files for each format with a helper function -- don't worry,
  // we'll also write a table in this example.
  ARROW_RETURN_NOT_OK(GenInitialFile());
  // (Doc section: Gen Files)

  // (Doc section: ReadableFile Definition)
  // First, we have to set up a ReadableFile object, which just lets us point our
  // readers to the right data on disk. We'll be reusing this object, and rebinding
  // it to multiple files throughout the example.
  std::shared_ptr<arrow::io::ReadableFile> infile;
  // (Doc section: ReadableFile Definition)
  // (Doc section: Arrow ReadableFile Open)
  // Get "test_in.arrow" into our file pointer
  ARROW_ASSIGN_OR_RAISE(infile, arrow::io::ReadableFile::Open(
                                    "test_in.arrow", arrow::default_memory_pool()));
  // (Doc section: Arrow ReadableFile Open)
  // (Doc section: Arrow Read Open)
  // Open up the file with the IPC features of the library, gives us a reader object.
  ARROW_ASSIGN_OR_RAISE(auto ipc_reader, arrow::ipc::RecordBatchFileReader::Open(infile));
  // (Doc section: Arrow Read Open)
  // (Doc section: Arrow Read)
  // Using the reader, we can read Record Batches. Note that this is specific to IPC;
  // for other formats, we focus on Tables, but here, RecordBatches are used.
  std::shared_ptr<arrow::RecordBatch> rbatch;
  ARROW_ASSIGN_OR_RAISE(rbatch, ipc_reader->ReadRecordBatch(0));
  // (Doc section: Arrow Read)

  // (Doc section: Arrow Write Open)
  // Just like with input, we get an object for the output file.
  std::shared_ptr<arrow::io::FileOutputStream> outfile;
  // Bind it to "test_out.arrow"
  ARROW_ASSIGN_OR_RAISE(outfile, arrow::io::FileOutputStream::Open("test_out.arrow"));
  // (Doc section: Arrow Write Open)
  // (Doc section: Arrow Writer)
  // Set up a writer with the output file -- and the schema! Everything the writer
  // needs is defined here, before any data is written.
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::ipc::RecordBatchWriter> ipc_writer,
                        arrow::ipc::MakeFileWriter(outfile, rbatch->schema()));
  // (Doc section: Arrow Writer)
  // (Doc section: Arrow Write)
  // Write the record batch.
  ARROW_RETURN_NOT_OK(ipc_writer->WriteRecordBatch(*rbatch));
  // (Doc section: Arrow Write)
  // (Doc section: Arrow Close)
  // Specifically for IPC, the writer needs to be explicitly closed.
  ARROW_RETURN_NOT_OK(ipc_writer->Close());
  // (Doc section: Arrow Close)

  // (Doc section: CSV Read Open)
  // Bind our input file to "test_in.csv"
  ARROW_ASSIGN_OR_RAISE(infile, arrow::io::ReadableFile::Open("test_in.csv"));
  // (Doc section: CSV Read Open)
  // (Doc section: CSV Table Declare)
  std::shared_ptr<arrow::Table> csv_table;
  // (Doc section: CSV Table Declare)
  // (Doc section: CSV Reader Make)
  // The CSV reader has several objects for various options. For now, we'll use defaults.
  ARROW_ASSIGN_OR_RAISE(
      auto csv_reader,
      arrow::csv::TableReader::Make(
          arrow::io::default_io_context(), infile, arrow::csv::ReadOptions::Defaults(),
          arrow::csv::ParseOptions::Defaults(), arrow::csv::ConvertOptions::Defaults()));
  // (Doc section: CSV Reader Make)
  // (Doc section: CSV Read)
  // Read the table.
  ARROW_ASSIGN_OR_RAISE(csv_table, csv_reader->Read());
  // (Doc section: CSV Read)

  // (Doc section: CSV Write)
  // Bind our output file to "test_out.csv"
  ARROW_ASSIGN_OR_RAISE(outfile, arrow::io::FileOutputStream::Open("test_out.csv"));
  // The CSV writer has simpler defaults, review API documentation for more complex usage.
  ARROW_ASSIGN_OR_RAISE(auto csv_writer,
                        arrow::csv::MakeCSVWriter(outfile, csv_table->schema()));
  ARROW_RETURN_NOT_OK(csv_writer->WriteTable(*csv_table));
  // Not necessary, but a safe practice.
  ARROW_RETURN_NOT_OK(csv_writer->Close());
  // (Doc section: CSV Write)

  // (Doc section: Parquet Read Open)
  // Bind our input file to "test_in.parquet"
  ARROW_ASSIGN_OR_RAISE(infile, arrow::io::ReadableFile::Open("test_in.parquet"));
  // (Doc section: Parquet Read Open)
  // (Doc section: Parquet FileReader)
  std::unique_ptr<parquet::arrow::FileReader> reader;
  // (Doc section: Parquet FileReader)
  // (Doc section: Parquet OpenFile)
  // Parquet's OpenFile() returns the reader wrapped in a Result, so we assign the
  // result to the reader declared above.
  PARQUET_ASSIGN_OR_THROW(reader,
                          parquet::arrow::OpenFile(infile, arrow::default_memory_pool()));
  // (Doc section: Parquet OpenFile)

  // (Doc section: Parquet Read)
  std::shared_ptr<arrow::Table> parquet_table;
  // Read the table.
  PARQUET_THROW_NOT_OK(reader->ReadTable(&parquet_table));
  // (Doc section: Parquet Read)

  // (Doc section: Parquet Write)
  // Parquet writing does not need a declared writer object. Just get the output
  // file bound, then pass in the table, memory pool, output, and chunk size for
  // breaking up the Table on-disk.
  ARROW_ASSIGN_OR_RAISE(outfile, arrow::io::FileOutputStream::Open("test_out.parquet"));
  PARQUET_THROW_NOT_OK(parquet::arrow::WriteTable(
      *parquet_table, arrow::default_memory_pool(), outfile, 5));
  // (Doc section: Parquet Write)
  // (Doc section: Return)
  return arrow::Status::OK();
}
// (Doc section: Return)

// (Doc section: Main)
int main() {
  arrow::Status st = RunMain();
  if (!st.ok()) {
    std::cerr << st << std::endl;
    return 1;
  }
  return 0;
}
// (Doc section: Main)