File Formats

CSV

struct ReadOptions

Public Members

bool use_threads = true

Whether to use the global CPU thread pool.

int32_t block_size = 1 << 20

Block size we request from the IO layer; also determines the size of chunks when use_threads is true.

int32_t skip_rows = 0

Number of header rows to skip (not including the row of column names, if any)

std::vector<std::string> column_names

Column names for the target table.

If empty, fall back on autogenerate_column_names.

bool autogenerate_column_names = false

Whether to autogenerate column names if column_names is empty.

If true, column names will be of the form “f0”, “f1”… If false, column names will be read from the first CSV row after skip_rows.

Public Static Functions

static ReadOptions Defaults()

Create read options with default values.

struct ParseOptions

Public Members

char delimiter = ','

Field delimiter.

bool quoting = true

Whether quoting is used.

char quote_char = '"'

Quoting character (if quoting is true)

bool double_quote = true

Whether a quote inside a value is double-quoted.

bool escaping = false

Whether escaping is used.

char escape_char = kDefaultEscapeChar

Escaping character (if escaping is true)

bool newlines_in_values = false

Whether values are allowed to contain CR (0x0d) and LF (0x0a) characters.

bool ignore_empty_lines = true

Whether empty lines are ignored.

If false, an empty line represents a single empty value (assuming a one-column CSV file).

Public Static Functions

static ParseOptions Defaults()

Create parsing options with default values.

struct ConvertOptions

Public Members

bool check_utf8 = true

Whether to check UTF8 validity of string columns.

std::unordered_map<std::string, std::shared_ptr<DataType>> column_types

Optional per-column types (disabling type inference on those columns)

std::vector<std::string> null_values

Recognized spellings for null values.

std::vector<std::string> true_values

Recognized spellings for boolean true values.

std::vector<std::string> false_values

Recognized spellings for boolean false values.

bool strings_can_be_null = false

Whether string / binary columns can have null values.

If true, then strings in “null_values” are considered null for string columns. If false, then all strings are valid string values.

std::vector<std::string> include_columns

If non-empty, indicates the names of columns from the CSV file that should be actually read and converted (in the vector’s order).

Columns not in this vector will be ignored.

bool include_missing_columns = false

If false, columns in include_columns but not in the CSV file will error out.

If true, columns in include_columns but not in the CSV file will produce a column of nulls (whose type is selected using column_types, or null by default) This option is ignored if include_columns is empty.

Public Static Functions

static ConvertOptions Defaults()

Create conversion options with default values, including conventional values for null_values, true_values and false_values

class TableReader

A class that reads an entire CSV file into an Arrow Table.

Public Functions

virtual Status Read(std::shared_ptr<Table> *out) = 0

Read the entire CSV file and convert it to an Arrow Table.

Public Static Functions

static Status Make(MemoryPool *pool, std::shared_ptr<io::InputStream> input, const ReadOptions&, const ParseOptions&, const ConvertOptions&, std::shared_ptr<TableReader> *out)

Create a TableReader instance.

Line-separated JSON

enum arrow::json::UnexpectedFieldBehavior

Values:

Ignore

Unexpected JSON fields are ignored.

Error

Unexpected JSON fields error out.

InferType

Unexpected JSON fields are type-inferred and included in the output.

struct ReadOptions

Public Members

bool use_threads = true

Whether to use the global CPU thread pool.

int32_t block_size = 1 << 20

Block size we request from the IO layer; also determines the size of chunks when use_threads is true.

Public Static Functions

static ReadOptions Defaults()

Create read options with default values.

struct ParseOptions

Public Members

std::shared_ptr<Schema> explicit_schema

Optional explicit schema (disables type inference on those fields)

bool newlines_in_values = false

Whether objects may be printed across multiple lines (for example pretty-printed)

If true, parsing may be slower.

UnexpectedFieldBehavior unexpected_field_behavior = UnexpectedFieldBehavior::InferType

How JSON fields outside of explicit_schema (if given) are treated.

Public Static Functions

static ParseOptions Defaults()

Create parsing options with default values.

class TableReader

A class that reads an entire JSON file into an Arrow Table.

The file is expected to consist of individual line-separated JSON objects.

Public Functions

virtual Status Read(std::shared_ptr<Table> *out) = 0

Read the entire JSON file and convert it to an Arrow Table.

Public Static Functions

static Status Make(MemoryPool *pool, std::shared_ptr<io::InputStream> input, const ReadOptions&, const ParseOptions&, std::shared_ptr<TableReader> *out)

Create a TableReader instance.

Parquet reader

class ReaderProperties
class ArrowReaderProperties

EXPERIMENTAL: Properties for configuring FileReader behavior.

class ParquetFileReader
class FileReader

Arrow read adapter class for deserializing Parquet files as Arrow row batches.

This interface caters to different use cases and thus provides several entry points. In its simplest form, it caters to a user who wants to read the whole Parquet file at once with the FileReader::ReadTable method.

More advanced users who want to implement parallelism on top of a single Parquet file should do so at the RowGroup level. For this, they can call FileReader::RowGroup(i)->ReadTable to receive only the specified RowGroup as a table.

In the most advanced situation, where a consumer wants to independently read RowGroups in parallel and consume each column individually, they can call FileReader::RowGroup(i)->Column(j)->Read and receive an arrow::Column instance.

Public Functions

virtual arrow::Status GetSchema(std::shared_ptr<arrow::Schema> *out) = 0

Return arrow schema for all the columns.

virtual arrow::Status ReadColumn(int i, std::shared_ptr<arrow::ChunkedArray> *out) = 0

Read column as a whole into a chunked array.

The indicated column index is relative to the schema

virtual arrow::Status GetRecordBatchReader(const std::vector<int> &row_group_indices, std::unique_ptr<arrow::RecordBatchReader> *out) = 0

Return a RecordBatchReader of row groups selected from row_group_indices; the ordering in row_group_indices matters.

Return

An error Status if row_group_indices contains an invalid index

virtual arrow::Status GetRecordBatchReader(const std::vector<int> &row_group_indices, const std::vector<int> &column_indices, std::unique_ptr<arrow::RecordBatchReader> *out) = 0

Return a RecordBatchReader of row groups selected from row_group_indices, whose columns are selected by column_indices.

The ordering in both row_group_indices and column_indices matters.

Return

An error Status if either row_group_indices or column_indices contains an invalid index

virtual arrow::Status ReadTable(std::shared_ptr<arrow::Table> *out) = 0

Read all columns into a Table.

virtual arrow::Status ReadTable(const std::vector<int> &column_indices, std::shared_ptr<arrow::Table> *out) = 0

Read the given columns into a Table.

The indicated column indices are relative to the schema

virtual arrow::Status ScanContents(std::vector<int> columns, const int32_t column_batch_size, int64_t *num_rows) = 0

Scan file contents with one thread, return number of rows.

virtual std::shared_ptr<RowGroupReader> RowGroup(int row_group_index) = 0

Return a reader for the given RowGroup; this object must not outlive the FileReader.

virtual int num_row_groups() const = 0

The number of row groups in the file.

virtual void set_use_threads(bool use_threads) = 0

Set whether to use multiple threads during reads of multiple columns.

By default only one thread is used.

Public Static Functions

arrow::Status Make(arrow::MemoryPool *pool, std::unique_ptr<ParquetFileReader> reader, const ArrowReaderProperties &properties, std::unique_ptr<FileReader> *out)

Factory function to create a FileReader from a ParquetFileReader and properties.

arrow::Status Make(arrow::MemoryPool *pool, std::unique_ptr<ParquetFileReader> reader, std::unique_ptr<FileReader> *out)

Factory function to create a FileReader from a ParquetFileReader.

class FileReaderBuilder

Experimental helper class for bindings (like Python) that struggle either with std::move or C++ exceptions.

Public Functions

arrow::Status Open(const std::shared_ptr<arrow::io::RandomAccessFile> &file, const ReaderProperties &properties = default_reader_properties(), const std::shared_ptr<FileMetaData> &metadata = NULLPTR)

Create FileReaderBuilder from Arrow file and optional properties / metadata.

FileReaderBuilder *memory_pool(arrow::MemoryPool *pool)

Set Arrow MemoryPool for memory allocation.

FileReaderBuilder *properties(const ArrowReaderProperties &arg_properties)

Set Arrow reader properties.

arrow::Status Build(std::unique_ptr<FileReader> *out)

Build FileReader instance.

arrow::Status parquet::arrow::OpenFile(const std::shared_ptr<arrow::io::RandomAccessFile> &file, arrow::MemoryPool *allocator, std::unique_ptr<FileReader> *reader)

Build FileReader from Arrow file and MemoryPool.

Advanced settings are supported through the FileReaderBuilder class.

arrow::Status parquet::arrow::OpenFile(const std::shared_ptr<arrow::io::RandomAccessFile> &file, arrow::MemoryPool *allocator, const ReaderProperties &properties, const std::shared_ptr<FileMetaData> &metadata, std::unique_ptr<FileReader> *reader)
arrow::Status parquet::arrow::OpenFile(const std::shared_ptr<arrow::io::RandomAccessFile> &file, arrow::MemoryPool *allocator, const ArrowReaderProperties &properties, std::unique_ptr<FileReader> *reader)

Parquet writer

class WriterProperties
class Builder

Public Functions

Builder *encoding(Encoding::type encoding_type)

Define the encoding that is used when we don’t utilise dictionary encoding.

This applies either if dictionary encoding is disabled or if we fall back because the dictionary grew too large.

Builder *encoding(const std::string &path, Encoding::type encoding_type)

Define the encoding that is used when we don’t utilise dictionary encoding.

This applies either if dictionary encoding is disabled or if we fall back because the dictionary grew too large.

Builder *encoding(const std::shared_ptr<schema::ColumnPath> &path, Encoding::type encoding_type)

Define the encoding that is used when we don’t utilise dictionary encoding.

This applies either if dictionary encoding is disabled or if we fall back because the dictionary grew too large.

Builder *compression_level(int compression_level)

Specify the default compression level for the compressor in every column.

If a column does not have an explicitly specified compression level, this default is used.

The provided compression level is compressor-specific. Users must familiarize themselves with the available levels for the selected compressor. If the compressor does not allow selecting different compression levels, calling this function has no effect. Parquet and Arrow do not validate the passed compression level. If no level is selected by the user, or if the special std::numeric_limits<int>::min() value is passed, Arrow selects the compression level.

Builder *compression_level(const std::string &path, int compression_level)

Specify a compression level for the compressor for the column described by path.

The provided compression level is compressor-specific. Users must familiarize themselves with the available levels for the selected compressor. If the compressor does not allow selecting different compression levels, calling this function has no effect. Parquet and Arrow do not validate the passed compression level. If no level is selected by the user, or if the special std::numeric_limits<int>::min() value is passed, Arrow selects the compression level.

Builder *compression_level(const std::shared_ptr<schema::ColumnPath> &path, int compression_level)

Specify a compression level for the compressor for the column described by path.

The provided compression level is compressor-specific. Users must familiarize themselves with the available levels for the selected compressor. If the compressor does not allow selecting different compression levels, calling this function has no effect. Parquet and Arrow do not validate the passed compression level. If no level is selected by the user, or if the special std::numeric_limits<int>::min() value is passed, Arrow selects the compression level.

class ArrowWriterProperties
class Builder

Public Functions

Builder *store_schema()

EXPERIMENTAL: Write binary serialized Arrow schema to the file, to enable certain read options (like “read_dictionary”) to be set automatically.

class FileWriter

Iterative FileWriter class.

Start a new RowGroup or Chunk with NewRowGroup, then write the whole column chunk column by column.

Public Functions

virtual arrow::Status WriteTable(const arrow::Table &table, int64_t chunk_size) = 0

Write a Table to Parquet.

virtual arrow::Status WriteColumnChunk(const std::shared_ptr<arrow::ChunkedArray> &data, int64_t offset, int64_t size) = 0

Write ColumnChunk in row group using slice of a ChunkedArray.

arrow::Status parquet::arrow::WriteTable(const arrow::Table &table, MemoryPool *pool, const std::shared_ptr<arrow::io::OutputStream> &sink, int64_t chunk_size, const std::shared_ptr<WriterProperties> &properties = default_writer_properties(), const std::shared_ptr<ArrowWriterProperties> &arrow_properties = default_arrow_writer_properties())

Write a Table to Parquet.

The table shall only consist of columns of primitive type or of lists of primitive type.