Reading JSON files#
Line-separated JSON files can either be read as a single Arrow Table with a TableReader or streamed as RecordBatches with a StreamingReader.
Both of these readers require an arrow::io::InputStream instance representing the input file. Their behavior can be customized using a combination of ReadOptions, ParseOptions, and other parameters.
TableReader#
TableReader reads an entire file in one shot as a Table. Each independent JSON object in the input file is converted to a row in the output table.
#include "arrow/json/api.h"
{
   // ...
   arrow::MemoryPool* pool = arrow::default_memory_pool();
   std::shared_ptr<arrow::io::InputStream> input = ...;
   auto read_options = arrow::json::ReadOptions::Defaults();
   auto parse_options = arrow::json::ParseOptions::Defaults();
   // Instantiate TableReader from input stream and options
   auto maybe_reader = arrow::json::TableReader::Make(pool, input, read_options, parse_options);
   if (!maybe_reader.ok()) {
      // Handle TableReader instantiation error...
   }
   auto reader = *maybe_reader;
   // Read table from JSON file
   auto maybe_table = reader->Read();
   if (!maybe_table.ok()) {
      // Handle JSON read error
      // (for example a JSON syntax error or failed type conversion)
   }
   auto table = *maybe_table;
}
StreamingReader#
StreamingReader reads a file incrementally from blocks of a roughly equal byte size, each yielding a RecordBatch. Each independent JSON object in a block is converted to a row in the output batch.
All batches adhere to a consistent Schema, which is derived from the first loaded batch. Alternatively, an explicit schema may be passed via ParseOptions.
#include "arrow/json/api.h"
{
   // ...
   auto read_options = arrow::json::ReadOptions::Defaults();
   auto parse_options = arrow::json::ParseOptions::Defaults();
   // Must point to an open input stream before the reader is created
   std::shared_ptr<arrow::io::InputStream> stream;
   auto result = arrow::json::StreamingReader::Make(stream,
                                                    read_options,
                                                    parse_options);
   if (!result.ok()) {
      // Handle instantiation error
   }
   std::shared_ptr<arrow::json::StreamingReader> reader = *result;
   // Iterate over the batches; each iteration yields a Result wrapping a RecordBatch
   for (arrow::Result<std::shared_ptr<arrow::RecordBatch>> maybe_batch : *reader) {
      if (!maybe_batch.ok()) {
         // Handle read/parse error
      }
      std::shared_ptr<arrow::RecordBatch> batch = *maybe_batch;
      // Operate on each batch...
   }
}
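To make the idea of "blocks of a roughly equal byte size" concrete, the following standalone sketch splits line-separated text into blocks of about a target size, extending each block to the next newline so that no JSON object is cut in half. It does not use Arrow and is not Arrow's actual chunking code: the function name `SplitIntoBlocks` is hypothetical, and the sketch assumes exactly one JSON object per line. In Arrow itself, the target size is set through `ReadOptions::block_size`.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper (not part of Arrow): splits newline-delimited text into
// blocks of roughly `block_size` bytes. Every block ends right after a '\n'
// (or at the end of the input), so each block holds only whole lines.
inline std::vector<std::string_view> SplitIntoBlocks(std::string_view data,
                                                     std::size_t block_size) {
  // Treat a zero block size as one byte so the loop always makes progress.
  const std::size_t size = block_size == 0 ? 1 : block_size;
  std::vector<std::string_view> blocks;
  std::size_t start = 0;
  while (start < data.size()) {
    std::size_t end = data.size();
    if (size < data.size() - start) {
      // Look for the first newline at or after the byte-size boundary.
      const std::size_t newline = data.find('\n', start + size - 1);
      if (newline != std::string_view::npos) {
        end = newline + 1;
      }
    }
    blocks.push_back(data.substr(start, end - start));
    start = end;
  }
  return blocks;
}
```

Because blocks are extended to a line boundary, their sizes are only approximately equal: a block can exceed the target by up to one line.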
Data types#
Since JSON values are typed, the possible Arrow data types on output depend on the input value types. Top-level JSON values should always be objects. The fields of top-level objects are taken to represent columns in the Arrow data. For each name/value pair in a JSON object, there are two possible modes of deciding the output data type:
- if the name is in ParseOptions::explicit_schema, conversion of the JSON value to the corresponding Arrow data type is attempted;
- otherwise, the Arrow data type is determined via type inference on the JSON value, trying out a number of Arrow data types in order.
The following tables show the possible combinations for each of those two modes.
JSON value type | Allowed Arrow data types |
---|---|
Null | Any (including Null) |
Number | All Integer types, Float32, Float64, Date32, Date64, Time32, Time64 |
Boolean | Boolean |
String | Binary, LargeBinary, String, LargeString, Timestamp |
Array | List |
Object (nested) | Struct |
JSON value type | Inferred Arrow data types (in order) |
---|---|
Null | Null, any other |
Number | Int64, Float64 |
Boolean | Boolean |
String | Timestamp (with seconds unit), String |
Array | List |
Object (nested) | Struct |
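The "in order" rule for scalar values can be illustrated with a small standalone sketch: each candidate type is tried in turn, and the first one that accepts the value wins. This is not Arrow's implementation and does not use Arrow. The names `JsonKind`, `ParsesAsInt64`, `LooksLikeTimestamp`, and `InferArrowType` are hypothetical, and the timestamp check assumes a single `YYYY-MM-DD hh:mm:ss` layout (with a space or `T` separator), which is narrower than what a real parser may accept.

```cpp
#include <cctype>
#include <charconv>
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>
#include <system_error>

// Scalar JSON value kinds covered by this sketch (arrays and objects omitted).
enum class JsonKind { Null, Number, Boolean, String };

// True if the whole token parses as a signed 64-bit integer.
inline bool ParsesAsInt64(std::string_view text) {
  std::int64_t value = 0;
  const char* end = text.data() + text.size();
  const auto [ptr, ec] = std::from_chars(text.data(), end, value);
  return ec == std::errc() && ptr == end;
}

// Assumption: only "YYYY-MM-DD hh:mm:ss" (or with 'T') counts as a timestamp.
inline bool LooksLikeTimestamp(std::string_view text) {
  constexpr std::string_view pattern = "dddd-dd-dd dd:dd:dd";
  if (text.size() != pattern.size()) return false;
  for (std::size_t i = 0; i < text.size(); ++i) {
    const char expected = pattern[i];
    if (expected == 'd') {
      if (!std::isdigit(static_cast<unsigned char>(text[i]))) return false;
    } else if (expected == ' ') {
      if (text[i] != ' ' && text[i] != 'T') return false;
    } else if (text[i] != expected) {
      return false;
    }
  }
  return true;
}

// Returns the name of the first candidate type that accepts the value,
// following the order listed in the inference table.
inline std::string InferArrowType(JsonKind kind, std::string_view text) {
  switch (kind) {
    case JsonKind::Null:
      return "Null";
    case JsonKind::Boolean:
      return "Boolean";
    case JsonKind::Number:
      return ParsesAsInt64(text) ? "Int64" : "Float64";
    case JsonKind::String:
      return LooksLikeTimestamp(text) ? "Timestamp[s]" : "String";
  }
  return "Unknown";
}
```

In this sketch, a number token such as `42` resolves to Int64, while `4.2` or `1e3` falls through to Float64 because the integer parse does not consume the whole token.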