Reading JSON files#
Line-separated JSON files can either be read as a single Arrow Table
with a TableReader or streamed as RecordBatches with a
StreamingReader.
Both of these readers require an arrow::io::InputStream instance
representing the input file. Their behavior can be customized using a
combination of ReadOptions, ParseOptions, and
other parameters.
TableReader#
TableReader reads an entire file in one shot as a Table. Each
independent JSON object in the input file is converted to a row in
the output table.
#include "arrow/json/api.h"
{
   // ...
   arrow::MemoryPool* pool = arrow::default_memory_pool();
   std::shared_ptr<arrow::io::InputStream> input = ...;
   auto read_options = arrow::json::ReadOptions::Defaults();
   auto parse_options = arrow::json::ParseOptions::Defaults();
   // Instantiate TableReader from input stream and options
   auto maybe_reader = arrow::json::TableReader::Make(pool, input, read_options, parse_options);
   if (!maybe_reader.ok()) {
      // Handle TableReader instantiation error...
   }
   auto reader = *maybe_reader;
   // Read table from JSON file
   auto maybe_table = reader->Read();
   if (!maybe_table.ok()) {
      // Handle JSON read error
      // (for example a JSON syntax error or failed type conversion)
   }
   auto table = *maybe_table;
}
StreamingReader#
StreamingReader reads a file incrementally in blocks of roughly equal
byte size, each yielding a RecordBatch. Each independent JSON object
in a block is converted to a row in the output batch.
All batches adhere to a consistent Schema, which is
derived from the first loaded batch. Alternatively, an explicit schema
may be passed via ParseOptions.
#include "arrow/json/api.h"
{
   // ...
   auto read_options = arrow::json::ReadOptions::Defaults();
   auto parse_options = arrow::json::ParseOptions::Defaults();
   std::shared_ptr<arrow::io::InputStream> stream = ...;
   auto result = arrow::json::StreamingReader::Make(stream,
                                                    read_options,
                                                    parse_options);
   if (!result.ok()) {
      // Handle instantiation error
   }
   std::shared_ptr<arrow::json::StreamingReader> reader = *result;
   for (arrow::Result<std::shared_ptr<arrow::RecordBatch>> maybe_batch : *reader) {
      if (!maybe_batch.ok()) {
         // Handle read/parse error
      }
      std::shared_ptr<arrow::RecordBatch> batch = *maybe_batch;
      // Operate on each batch...
   }
}
Data types#
Since JSON values are typed, the possible Arrow data types on output depend on the input value types. Top-level JSON values should always be objects. The fields of top-level objects are taken to represent columns in the Arrow data. For each name/value pair in a JSON object, there are two possible modes of deciding the output data type:
- if the name is in ParseOptions::explicit_schema, conversion of the JSON value to the corresponding Arrow data type is attempted;
- otherwise, the Arrow data type is determined via type inference on the JSON value, trying out a number of Arrow data types in order. 
The following tables show the possible combinations for each of those two modes. The first lists the Arrow data types allowed when a field is converted against an explicit schema:
| JSON value type | Allowed Arrow data types | 
|---|---|
| Null | Any (including Null) | 
| Number | All Integer types, Float32, Float64, Date32, Date64, Time32, Time64 | 
| Boolean | Boolean | 
| String | Binary, LargeBinary, String, LargeString, Timestamp | 
| Array | List | 
| Object (nested) | Struct | 
The second lists the data types tried, in order, during type inference:
| JSON value type | Inferred Arrow data types (in order) | 
|---|---|
| Null | Null, any other | 
| Number | Int64, Float64 | 
| Boolean | Boolean | 
| String | Timestamp (with seconds unit), String | 
| Array | List | 
| Object (nested) | Struct | 
 
    