File Formats¶
CSV¶
-
struct
arrow::csv
::
ConvertOptions
¶ -
Public Members
-
bool
check_utf8
= true¶ Whether to check UTF8 validity of string columns.
-
std::unordered_map<std::string, std::shared_ptr<DataType>>
column_types
¶ Optional per-column types (disabling type inference on those columns)
-
std::vector<std::string>
null_values
¶ Recognized spellings for null values.
-
std::vector<std::string>
true_values
¶ Recognized spellings for boolean true values.
-
std::vector<std::string>
false_values
¶ Recognized spellings for boolean false values.
-
bool
strings_can_be_null
= false¶ Whether string / binary columns can have null values.
If true, then strings in “null_values” are considered null for string columns. If false, then all strings are valid string values.
-
bool
quoted_strings_can_be_null
= true¶ Whether string / binary columns can have quoted null values.
If true and
strings_can_be_null
is true, then quoted strings in “null_values” are also considered null for string columns. Otherwise, quoted strings are never considered null.
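The interaction of the two flags can be sketched as a small stand-alone model. This is not Arrow's implementation: NullRules and IsNullString are hypothetical names, and the `quoted` argument is assumed to be supplied by the parser.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for the three ConvertOptions members involved.
struct NullRules {
  std::vector<std::string> null_values;
  bool strings_can_be_null = false;
  bool quoted_strings_can_be_null = true;
};

// Returns true if a cell of a string column should be read as null.
// `quoted` tells whether the cell was enclosed in quotes in the CSV text.
bool IsNullString(const NullRules& rules, const std::string& cell, bool quoted) {
  if (!rules.strings_can_be_null) return false;  // all strings are valid values
  if (quoted && !rules.quoted_strings_can_be_null) return false;
  return std::find(rules.null_values.begin(), rules.null_values.end(), cell) !=
         rules.null_values.end();
}

// With the default flags, a null spelling in a string column stays a string.
void CheckDefaultFlags() {
  NullRules rules;
  rules.null_values.push_back("NA");
  assert(!IsNullString(rules, "NA", /*quoted=*/false));
}
```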
-
bool
auto_dict_encode
= false¶ Whether to try to automatically dict-encode string / binary data.
If true, then when type inference detects a string or binary column, it is dict-encoded up to
auto_dict_max_cardinality
distinct values (per chunk), after which it switches to regular encoding. This setting is ignored for non-inferred columns (those in
column_types
).
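The per-chunk fallback described above can be illustrated with a stand-alone model. EncodedChunk and EncodeChunk are hypothetical names, and the exact boundary (whether a chunk with exactly the maximum number of distinct values stays dict-encoded) is an assumption of this sketch, not something the text above specifies.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Result of encoding one chunk: dictionary plus indices, or plain values.
struct EncodedChunk {
  bool dict_encoded = true;
  std::vector<std::string> dictionary;  // distinct values, first-seen order
  std::vector<int32_t> indices;         // one index per input value
  std::vector<std::string> plain;       // filled only after falling back
};

EncodedChunk EncodeChunk(const std::vector<std::string>& values,
                         int32_t max_cardinality) {
  assert(max_cardinality >= 0);
  EncodedChunk out;
  std::unordered_map<std::string, int32_t> seen;
  for (const auto& value : values) {
    auto it = seen.find(value);
    if (it == seen.end()) {
      if (static_cast<int32_t>(out.dictionary.size()) >= max_cardinality) {
        // Too many distinct values: use regular encoding for this chunk.
        out = EncodedChunk{};
        out.dict_encoded = false;
        out.plain = values;
        return out;
      }
      it = seen.emplace(value, static_cast<int32_t>(out.dictionary.size())).first;
      out.dictionary.push_back(value);
    }
    out.indices.push_back(it->second);
  }
  return out;
}
```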
-
std::vector<std::string>
include_columns
¶ If non-empty, indicates the names of columns from the CSV file that should actually be read and converted (in the vector’s order).
Columns not in this vector will be ignored.
-
bool
include_missing_columns
= false¶ If false, columns in
include_columns
but not in the CSV file will error out. If true, columns in
include_columns
but not in the CSV file will produce a column of nulls (whose type is selected using
column_types
, or null by default). This option is ignored if
include_columns
is empty.
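The column selection rules for include_columns and include_missing_columns can be sketched as follows. This is a stand-alone model with hypothetical names (SelectedColumn, SelectColumns); it throws an exception where Arrow would report an error, and it does not model the choice of column type.

```cpp
#include <algorithm>
#include <cassert>
#include <stdexcept>
#include <string>
#include <vector>

struct SelectedColumn {
  std::string name;
  bool all_nulls;  // true when the column does not exist in the file
};

std::vector<SelectedColumn> SelectColumns(
    const std::vector<std::string>& file_columns,
    const std::vector<std::string>& include_columns,
    bool include_missing_columns) {
  std::vector<SelectedColumn> out;
  if (include_columns.empty()) {
    // No selection: every file column is read, in file order.
    for (const auto& name : file_columns) out.push_back({name, false});
    return out;
  }
  for (const auto& name : include_columns) {
    const bool present = std::find(file_columns.begin(), file_columns.end(),
                                   name) != file_columns.end();
    if (!present && !include_missing_columns) {
      throw std::invalid_argument("column not in CSV file: " + name);
    }
    out.push_back({name, !present});
  }
  // The output follows the order of include_columns, not of the file.
  assert(out.size() == include_columns.size());
  return out;
}
```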
-
std::vector<std::shared_ptr<TimestampParser>>
timestamp_parsers
¶ User-defined timestamp parsers, using the virtual parser interface in arrow/util/value_parsing.h.
More than one parser can be specified, and the CSV conversion logic will try parsing values starting from the beginning of this vector. If no parsers are specified, we use the default built-in ISO-8601 parser.
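The "try each parser in order" rule can be modelled without Arrow. ParseFn and ParseWithFallback are hypothetical names, and a plain std::function stands in for the TimestampParser interface; the sketch only shows the ordering and the fallback to a default parser when the vector is empty.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// A parser returns true and fills *out on success, false otherwise.
using ParseFn = std::function<bool(const std::string&, int64_t*)>;

bool ParseWithFallback(const std::vector<ParseFn>& parsers,
                       const ParseFn& default_parser,
                       const std::string& text, int64_t* out) {
  assert(out != nullptr);
  // No user-defined parsers: use the built-in default.
  if (parsers.empty()) return default_parser(text, out);
  // Otherwise the first parser that accepts the value wins.
  for (const auto& parse : parsers) {
    if (parse(text, out)) return true;
  }
  return false;
}
```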
Public Static Functions
-
static ConvertOptions
Defaults
()¶ Create conversion options with default values, including conventional values for null_values, true_values and false_values.
-
struct
arrow::csv
::
ParseOptions
¶ -
Public Members
-
bool
quoting
= true¶ Whether quoting is used.
-
char
quote_char
= '"'¶ Quoting character (if
quoting
is true)
-
bool
double_quote
= true¶ Whether a quote inside a value is double-quoted.
-
bool
escaping
= false¶ Whether escaping is used.
-
char
escape_char
= kDefaultEscapeChar¶ Escaping character (if
escaping
is true)
-
bool
newlines_in_values
= false¶ Whether values are allowed to contain CR (0x0d) and LF (0x0a) characters.
-
bool
ignore_empty_lines
= true¶ Whether empty lines are ignored.
If false, an empty line represents a single empty value (assuming a one-column CSV file).
Public Static Functions
-
static ParseOptions
Defaults
()¶ Create parsing options with default values.
-
struct
arrow::csv
::
ReadOptions
¶ -
Public Members
-
bool
use_threads
= true¶ Whether to use the global CPU thread pool.
-
int32_t
block_size
= 1 << 20¶ Block size we request from the IO layer.
This determines multi-threading granularity as well as the size of individual record batches. The minimum valid block size is 1.
-
int32_t
skip_rows
= 0¶ Number of header rows to skip (not including the row of column names, if any)
-
int32_t
skip_rows_after_names
= 0¶ Number of rows to skip after the column names are read, if any.
-
std::vector<std::string>
column_names
¶ Column names for the target table.
If empty, fall back on autogenerate_column_names.
-
bool
autogenerate_column_names
= false¶ Whether to autogenerate column names if
column_names
is empty. If true, column names will be of the form “f0”, “f1”… If false, column names will be read from the first CSV row after
skip_rows
.
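How column_names and autogenerate_column_names combine can be sketched as a stand-alone function. ResolveColumnNames is a hypothetical name, and the first row after skip_rows is passed in already split into cells.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

std::vector<std::string> ResolveColumnNames(
    const std::vector<std::string>& column_names,
    bool autogenerate_column_names,
    const std::vector<std::string>& first_row_after_skip) {
  // Explicit names always win.
  if (!column_names.empty()) return column_names;
  // Otherwise the first remaining row is the header, unless autogenerating.
  if (!autogenerate_column_names) return first_row_after_skip;
  // Autogenerated names: the first row is data and only gives the count.
  std::vector<std::string> names;
  for (std::size_t i = 0; i < first_row_after_skip.size(); ++i) {
    names.push_back("f" + std::to_string(i));
  }
  assert(names.size() == first_row_after_skip.size());
  return names;
}
```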
Public Static Functions
-
static ReadOptions
Defaults
()¶ Create read options with default values.
-
struct
arrow::csv
::
WriteOptions
¶ Experimental.
Public Members
-
bool
include_header
= true¶ Whether to write an initial header line with column names.
-
int32_t
batch_size
= 1024¶ Maximum number of rows processed at a time.
The CSV writer converts and writes data in batches of N rows. This number can impact performance.
-
io::IOContext
io_context
¶ IO context for writing.
Public Static Functions
-
static WriteOptions
Defaults
()¶ Create write options with default values.
-
class
arrow::csv
::
TableReader
¶ A class that reads an entire CSV file into an Arrow Table.
Public Functions
Public Static Functions
Create a TableReader instance.
Create a new CSV writer.
- Parameters
[in] sink – output stream to write to (does not take ownership)
[in] schema – the schema of the record batches to be written
[in] options – options for serialization
- Returns
Result<std::shared_ptr<RecordBatchWriter>>
Create a new CSV writer.
User is responsible for closing the actual OutputStream.
- Parameters
[in] sink – output stream to write to
[in] schema – the schema of the record batches to be written
[in] options – options for serialization
- Returns
Result<std::shared_ptr<RecordBatchWriter>>
-
Status
arrow::csv
::
WriteCSV
(const RecordBatch &batch, const WriteOptions &options, arrow::io::OutputStream *output)¶ Converts batch to CSV and writes the results to output.
Experimental
-
Status
arrow::csv
::
WriteCSV
(const Table &table, const WriteOptions &options, arrow::io::OutputStream *output)¶ Converts table to a CSV and writes the results to output.
Experimental
Line-separated JSON¶
-
enum
arrow::json
::
UnexpectedFieldBehavior
¶ Values:
-
enumerator
Ignore
¶ Unexpected JSON fields are ignored.
-
enumerator
Error
¶ Unexpected JSON fields error out.
-
enumerator
InferType
¶ Unexpected JSON fields are type-inferred and included in the output.
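The three behaviors can be compared with a stand-alone model. The enum below is a local copy for illustration, and ApplyUnexpectedFieldBehavior is a hypothetical helper that only tracks field names; it does not parse JSON or infer types.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Local copy of the enumerators, for illustration only.
enum class UnexpectedFieldBehavior { Ignore, Error, InferType };

// Returns false if the row must be rejected; otherwise `kept` lists the
// fields of the row that reach the output.
bool ApplyUnexpectedFieldBehavior(const std::set<std::string>& explicit_schema,
                                  const std::vector<std::string>& row_fields,
                                  UnexpectedFieldBehavior behavior,
                                  std::vector<std::string>* kept) {
  assert(kept != nullptr);
  kept->clear();
  for (const auto& field : row_fields) {
    const bool expected = explicit_schema.count(field) > 0;
    if (expected || behavior == UnexpectedFieldBehavior::InferType) {
      kept->push_back(field);
    } else if (behavior == UnexpectedFieldBehavior::Error) {
      return false;
    }
    // Ignore: the unexpected field is silently dropped.
  }
  return true;
}
```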
-
struct
arrow::json
::
ReadOptions
¶ Public Members
-
bool
use_threads
= true¶ Whether to use the global CPU thread pool.
-
int32_t
block_size
= 1 << 20¶ Block size we request from the IO layer; also determines the size of chunks when use_threads is true.
Public Static Functions
-
static ReadOptions
Defaults
()¶ Create read options with default values.
-
struct
arrow::json
::
ParseOptions
¶ Public Members
-
std::shared_ptr<Schema>
explicit_schema
¶ Optional explicit schema (disables type inference on those fields)
-
bool
newlines_in_values
= false¶ Whether objects may be printed across multiple lines (for example pretty-printed)
If true, parsing may be slower.
-
UnexpectedFieldBehavior
unexpected_field_behavior
= UnexpectedFieldBehavior::InferType¶ How JSON fields outside of explicit_schema (if given) are treated.
Public Static Functions
-
static ParseOptions
Defaults
()¶ Create parsing options with default values.
-
class
arrow::json
::
TableReader
¶ A class that reads an entire JSON file into an Arrow Table.
The file is expected to consist of individual line-separated JSON objects.
Public Functions
Public Static Functions
Create a TableReader instance.
Parquet reader¶
-
class
parquet
::
ReaderProperties
¶ Public Functions
-
inline bool
is_buffered_stream_enabled
() const¶ Buffered stream reading allows the user to control the memory usage of parquet readers.
This ensures that all
RandomAccessFile::ReadAt
calls are wrapped in a buffered reader that uses a fixed-size buffer (of size
buffer_size()
) instead of the full size of the ReadAt. The primary purpose of this control knob is resource control, not performance.
-
class
parquet
::
ArrowReaderProperties
¶ EXPERIMENTAL: Properties for configuring FileReader behavior.
Public Functions
-
inline void
set_pre_buffer
(bool pre_buffer)¶ Enable read coalescing.
When enabled, the Arrow reader will pre-buffer necessary regions of the file in-memory. This is intended to improve performance on high-latency filesystems (e.g. Amazon S3).
-
inline void
set_cache_options
(::arrow::io::CacheOptions options)¶ Set options for read coalescing.
This can be used to tune the implementation for characteristics of different filesystems.
-
inline void
set_io_context
(const ::arrow::io::IOContext &ctx)¶ Set execution context for read coalescing.
-
class
parquet
::
ParquetFileReader
¶ Public Functions
-
void
PreBuffer
(const std::vector<int> &row_groups, const std::vector<int> &column_indices, const ::arrow::io::IOContext &ctx, const ::arrow::io::CacheOptions &options)¶ Pre-buffer the specified column indices in all row groups.
Readers can optionally call this to cache the necessary slices of the file in-memory before deserialization. Arrow readers can automatically do this via an option. This is intended to increase performance when reading from high-latency filesystems (e.g. Amazon S3).
After calling this, creating readers for row groups/column indices that were not buffered may fail. Creating multiple readers for a subset of the buffered regions is acceptable. This may be called again to buffer a different set of row groups/columns.
If memory usage is a concern, note that data will remain buffered in memory until either PreBuffer() is called again, or the reader itself is destructed. Reading - and buffering - only one row group at a time may be useful.
This method may throw.
-
::arrow::Future
WhenBuffered
(const std::vector<int> &row_groups, const std::vector<int> &column_indices) const¶ Wait for the specified row groups and column indices to be pre-buffered.
After the returned Future completes, reading the specified row groups/columns will not block.
PreBuffer must be called first. This method does not throw.
-
struct
Contents
¶
-
class
parquet::arrow
::
FileReader
¶ Arrow read adapter class for deserializing Parquet files as Arrow row batches.
This interface caters for different use cases and thus provides different interfaces. In its simplest form, it serves a user who wants to read the whole Parquet file at once with the
FileReader::ReadTable
method. More advanced users who also want to implement parallelism on top of each single Parquet file should do this at the RowGroup level. For this, they can call
FileReader::RowGroup(i)->ReadTable
to receive only the specified RowGroup as a table. In the most advanced situation, where a consumer wants to independently read RowGroups in parallel and consume each column individually, they can call
FileReader::RowGroup(i)->Column(j)->Read
and receive an
arrow::Column
instance. The Parquet format supports an optional integer field_id which can be assigned to a field. Arrow will convert these field IDs to a metadata key named PARQUET:field_id on the appropriate field.
Public Functions
Return arrow schema for all the columns.
Read column as a whole into a chunked array.
The indicated column index is relative to the schema
-
virtual ::arrow::Status
GetRecordBatchReader
(const std::vector<int> &row_group_indices, std::unique_ptr<::arrow::RecordBatchReader> *out) = 0¶ Return a RecordBatchReader of row groups selected from row_group_indices.
Note that the ordering in row_group_indices matters. FileReaders must outlive their RecordBatchReaders.
- Returns
error Status if row_group_indices contains an invalid index
-
virtual ::arrow::Status
GetRecordBatchReader
(const std::vector<int> &row_group_indices, const std::vector<int> &column_indices, std::unique_ptr<::arrow::RecordBatchReader> *out) = 0¶ Return a RecordBatchReader of row groups selected from row_group_indices, whose columns are selected by column_indices.
Note that the ordering in row_group_indices and column_indices matters. FileReaders must outlive their RecordBatchReaders.
- Returns
error Status if either row_group_indices or column_indices contains an invalid index
-
virtual ::arrow::Result<std::function<::arrow::Future<std::shared_ptr<::arrow::RecordBatch>>()>>
GetRecordBatchGenerator
(std::shared_ptr<FileReader> reader, const std::vector<int> row_group_indices, const std::vector<int> column_indices, ::arrow::internal::Executor *cpu_executor = NULLPTR) = 0¶ Return a generator of record batches.
The FileReader must outlive the generator, so this requires that you pass in a shared_ptr.
- Returns
error Result if either row_group_indices or column_indices contains an invalid index
Read all columns into a Table.
Read the given columns into a Table.
The indicated column indices are relative to the schema
-
virtual ::arrow::Status
ScanContents
(std::vector<int> columns, const int32_t column_batch_size, int64_t *num_rows) = 0¶ Scan file contents with one thread, return number of rows.
-
virtual std::shared_ptr<RowGroupReader>
RowGroup
(int row_group_index) = 0¶ Return a reader for the RowGroup; this object must not outlive the FileReader.
-
virtual int
num_row_groups
() const = 0¶ The number of row groups in the file.
-
virtual void
set_use_threads
(bool use_threads) = 0¶ Set whether to use multiple threads during reads of multiple columns.
By default only one thread is used.
-
virtual void
set_batch_size
(int64_t batch_size) = 0¶ Set number of records to read per batch for the RecordBatchReader.
Public Static Functions
-
static ::arrow::Status
Make
(::arrow::MemoryPool *pool, std::unique_ptr<ParquetFileReader> reader, const ArrowReaderProperties &properties, std::unique_ptr<FileReader> *out)¶ Factory function to create a FileReader from a ParquetFileReader and properties.
-
static ::arrow::Status
Make
(::arrow::MemoryPool *pool, std::unique_ptr<ParquetFileReader> reader, std::unique_ptr<FileReader> *out)¶ Factory function to create a FileReader from a ParquetFileReader.
-
class
parquet::arrow
::
FileReaderBuilder
¶ Experimental helper class for bindings (like Python) that struggle either with std::move or C++ exceptions.
Public Functions
Create FileReaderBuilder from Arrow file and optional properties / metadata.
-
FileReaderBuilder *
memory_pool
(::arrow::MemoryPool *pool)¶ Set Arrow MemoryPool for memory allocation.
-
FileReaderBuilder *
properties
(const ArrowReaderProperties &arg_properties)¶ Set Arrow reader properties.
-
::arrow::Status
Build
(std::unique_ptr<FileReader> *out)¶ Build FileReader instance.
Build FileReader from Arrow file and MemoryPool.
Advanced settings are supported through the FileReaderBuilder class.
-
class
parquet
::
StreamReader
¶ A class for reading Parquet files using an output stream type API.
The values given must be of the correct type, i.e. the type must match the file schema exactly; otherwise a ParquetException will be thrown.
The user must explicitly advance to the next row using the EndRow() function or EndRow input manipulator.
Required and optional fields are supported:
Required fields are read using operator>>(T)
Optional fields are read with operator>>(arrow::util::optional<T>)
Note that operator>>(arrow::util::optional<T>) can be used to read required fields.
Similarly operator>>(T) can be used to read optional fields. However, if the value is not present then a ParquetException will be raised.
Currently there is no support for repeated fields.
Public Functions
-
void
EndRow
()¶ Terminate current row and advance to next one.
- Throws
ParquetException – if not all columns in the row were read or skipped.
-
int64_t
SkipColumns
(int64_t num_columns_to_skip)¶ Skip the data in the next columns.
If the number of columns exceeds the columns remaining on the current row then skipping is terminated; it does not continue skipping columns on the next row. Skipping of columns still requires the use of ‘EndRow’ even if all remaining columns were skipped.
- Returns
Number of columns actually skipped.
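The skipping rule can be illustrated with a stand-alone cursor model. RowCursor and the two free functions are hypothetical and do not touch Parquet data; in this sketch EndRow returns false where StreamReader would throw a ParquetException.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

struct RowCursor {
  int64_t num_columns;   // columns per row in the schema
  int64_t column_index;  // next column to read on the current row
};

// Skips at most the columns remaining on the current row and returns
// how many were actually skipped. Never moves on to the next row.
int64_t SkipColumns(RowCursor* cursor, int64_t num_columns_to_skip) {
  assert(cursor != nullptr && num_columns_to_skip >= 0);
  const int64_t remaining = cursor->num_columns - cursor->column_index;
  const int64_t skipped = std::min(num_columns_to_skip, remaining);
  cursor->column_index += skipped;
  return skipped;
}

// Advances to the next row only if every column was read or skipped.
bool EndRow(RowCursor* cursor) {
  assert(cursor != nullptr);
  if (cursor->column_index != cursor->num_columns) return false;
  cursor->column_index = 0;
  return true;
}
```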
-
int64_t
SkipRows
(int64_t num_rows_to_skip)¶ Skip the data in the next rows.
Skipping of rows is not allowed if reading of data for the current row is not finished. Skipping of rows will be terminated if the end of file is reached.
- Returns
Number of rows actually skipped.
Parquet writer¶
-
class
parquet
::
WriterProperties
¶ -
class
Builder
¶ Public Functions
-
inline Builder *
encoding
(Encoding::type encoding_type)¶ Define the encoding that is used when we don’t utilise dictionary encoding.
This applies either if dictionary encoding is disabled or if we fall back because the dictionary grew too large.
-
inline Builder *
encoding
(const std::string &path, Encoding::type encoding_type)¶ Define the encoding that is used when we don’t utilise dictionary encoding.
This applies either if dictionary encoding is disabled or if we fall back because the dictionary grew too large.
-
inline Builder *
compression_level
(int compression_level)¶ Specify the default compression level for the compressor in every column.
If a column does not have an explicitly specified compression level, the default one is used.
The provided compression level is compressor-specific; users need to familiarize themselves with the levels available for the selected compressor. If the compressor does not support different compression levels, calling this function has no effect. Parquet and Arrow do not validate the passed compression level. If the user selects no level, or passes the special std::numeric_limits<int>::min() value, Arrow selects the compression level.
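The fallback order for compression levels can be sketched as follows. LevelSettings and EffectiveLevel are hypothetical names, and `codec_default` stands in for whatever level Arrow would select; treating a per-column sentinel as "use the codec default" is an assumption of this sketch.

```cpp
#include <cassert>
#include <limits>
#include <string>
#include <unordered_map>

// The special value meaning "no level selected".
constexpr int kDefaultLevelSentinel = std::numeric_limits<int>::min();

struct LevelSettings {
  int default_level = kDefaultLevelSentinel;        // applies to every column
  std::unordered_map<std::string, int> per_column;  // keyed by column path
};

int EffectiveLevel(const LevelSettings& settings, const std::string& path,
                   int codec_default) {
  // The codec's own default must be a real level, not the sentinel.
  assert(codec_default != kDefaultLevelSentinel);
  const auto it = settings.per_column.find(path);
  const int level =
      (it != settings.per_column.end()) ? it->second : settings.default_level;
  return level == kDefaultLevelSentinel ? codec_default : level;
}
```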
-
inline Builder *
compression_level
(const std::string &path, int compression_level)¶ Specify a compression level for the compressor for the column described by path.
The provided compression level is compressor-specific; users need to familiarize themselves with the levels available for the selected compressor. If the compressor does not support different compression levels, calling this function has no effect. Parquet and Arrow do not validate the passed compression level. If the user selects no level, or passes the special std::numeric_limits<int>::min() value, Arrow selects the compression level.
-
class
parquet
::
ArrowWriterProperties
¶ Public Functions
-
inline bool
compliant_nested_types
() const¶ Enable nested type naming according to the parquet specification.
Older versions of arrow wrote out field names for nested lists based on the name of the field. According to the parquet specification they should always be “element”.
-
inline EngineVersion
engine_version
() const¶ The underlying engine version to use when writing Arrow data.
V2 is currently the latest; V1 is considered deprecated but is left in place in case bugs are detected in V2.
-
class
Builder
¶
-
class
parquet::arrow
::
FileWriter
¶ Iterative FileWriter class.
Start a new RowGroup or Chunk with NewRowGroup, then write the whole column chunk column by column.
If PARQUET:field_id is present as a metadata key on a field, and the corresponding value is a nonnegative integer, then it will be used as the field_id in the parquet file.
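The "nonnegative integer" condition can be sketched as a stand-alone check. FieldIdFromMetadata is a hypothetical helper, a std::map stands in for the field's key-value metadata, and returning -1 for "no usable field_id" as well as accepting digits only are assumptions of this sketch.

```cpp
#include <cassert>
#include <limits>
#include <map>
#include <string>

// Returns the field id, or -1 when the key is absent or its value is not
// a non-negative integer that fits in an int.
int FieldIdFromMetadata(const std::map<std::string, std::string>& metadata) {
  const auto it = metadata.find("PARQUET:field_id");
  if (it == metadata.end() || it->second.empty()) return -1;
  long long id = 0;
  for (const char ch : it->second) {
    if (ch < '0' || ch > '9') return -1;  // rejects signs, spaces, decimals
    id = id * 10 + (ch - '0');
    if (id > std::numeric_limits<int>::max()) return -1;  // does not fit
  }
  assert(id >= 0);
  return static_cast<int>(id);
}
```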
Write a Table to Parquet.
-
class
parquet
::
StreamWriter
¶ A class for writing Parquet files using an output stream type API.
The values given must be of the correct type, i.e. the type must match the file schema exactly; otherwise a ParquetException will be thrown.
The user must explicitly indicate the end of the row using the EndRow() function or EndRow output manipulator.
A maximum row group size can be configured; the default size is 512MB. Alternatively, the row group size can be set to zero and the user can create new row groups by calling the EndRowGroup() function or using the EndRowGroup output manipulator.
Required and optional fields are supported:
Required fields are written using operator<<(T)
Optional fields are written using operator<<(arrow::util::optional<T>).
Note that operator<<(T) can be used to write optional fields.
Similarly, operator<<(arrow::util::optional<T>) can be used to write required fields. However, if the optional parameter does not have a value (i.e. it is nullopt) then a ParquetException will be raised.
Currently there is no support for repeated fields.
Public Functions
-
StreamWriter &
operator<<
(bool v)¶ Output operators for required fields.
These can also be used for optional fields when a value must be set.
-
template<int
N
>
inline StreamWriter &operator<<
(const char (&v)[N])¶ Output operators for fixed length strings.
-
StreamWriter &
operator<<
(const char *v)¶ Output operators for variable length strings.
-
template<typename
T
>
inline StreamWriter &operator<<
(const optional<T> &v)¶ Output operator for optional fields.
-
int64_t
SkipColumns
(int num_columns_to_skip)¶ Skip the next N columns of optional data.
If there are fewer than N columns remaining then the excess columns are ignored.
- Throws
ParquetException – if there is an attempt to skip any required column.
- Returns
Number of columns actually skipped.
-
void
EndRow
()¶ Terminate the current row and advance to next one.
- Throws
ParquetException – if not all columns in the row were written or skipped.
-
void
EndRowGroup
()¶ Terminate the current row group and create a new one.
-
struct
FixedStringView
¶ Helper class to write fixed length strings.
This is useful as the standard string view (such as arrow::util::string_view) is for variable length data.