Compute Functions

Datum class

class Datum

Variant type for various Arrow C++ data structures.

Public Functions

Datum() = default

Empty datum, to be populated elsewhere.

int64_t TotalBufferSize() const

The sum of bytes in each buffer referenced by the datum Note: Scalars report a size of 0.

See also

arrow::util::TotalBufferSize for caveats

inline bool is_value() const

True if Datum contains a scalar or array-like data.

const std::shared_ptr<DataType> &type() const

The value type of the variant, if any.

Returns:

nullptr if no type

const std::shared_ptr<Schema> &schema() const

The schema of the variant, if any.

Returns:

nullptr if no schema

int64_t length() const

The value length of the variant, if any.

Returns:

kUnknownLength if no type

ArrayVector chunks() const

The array chunks of the variant, if any.

Returns:

empty if not arraylike

struct Empty

Abstract Function classes

void PrintTo(const FunctionOptions&, std::ostream*)
class FunctionOptionsType
#include <arrow/compute/function.h>

Extension point for defining options outside libarrow (but still within this project).

class FunctionOptions : public arrow::util::EqualityComparable<FunctionOptions>
#include <arrow/compute/function.h>

Base class for specifying options configuring a function’s behavior, such as error handling.

Subclassed by arrow::compute::ArithmeticOptions, arrow::compute::ArraySortOptions, arrow::compute::AssumeTimezoneOptions, arrow::compute::CastOptions, arrow::compute::CountOptions, arrow::compute::CumulativeSumOptions, arrow::compute::DayOfWeekOptions, arrow::compute::DictionaryEncodeOptions, arrow::compute::ElementWiseAggregateOptions, arrow::compute::ExtractRegexOptions, arrow::compute::FilterOptions, arrow::compute::IndexOptions, arrow::compute::JoinOptions, arrow::compute::ListSliceOptions, arrow::compute::MakeStructOptions, arrow::compute::MapLookupOptions, arrow::compute::MatchSubstringOptions, arrow::compute::ModeOptions, arrow::compute::NullOptions, arrow::compute::PadOptions, arrow::compute::PartitionNthOptions, arrow::compute::QuantileOptions, arrow::compute::RandomOptions, arrow::compute::RankOptions, arrow::compute::ReplaceSliceOptions, arrow::compute::ReplaceSubstringOptions, arrow::compute::RoundBinaryOptions, arrow::compute::RoundOptions, arrow::compute::RoundTemporalOptions, arrow::compute::RoundToMultipleOptions, arrow::compute::RunEndEncodeOptions, arrow::compute::ScalarAggregateOptions, arrow::compute::SelectKOptions, arrow::compute::SetLookupOptions, arrow::compute::SliceOptions, arrow::compute::SortOptions, arrow::compute::SplitOptions, arrow::compute::SplitPatternOptions, arrow::compute::StrftimeOptions, arrow::compute::StrptimeOptions, arrow::compute::StructFieldOptions, arrow::compute::TakeOptions, arrow::compute::TDigestOptions, arrow::compute::TrimOptions, arrow::compute::Utf8NormalizeOptions, arrow::compute::VarianceOptions, arrow::compute::WeekOptions

Public Functions

Result<std::shared_ptr<Buffer>> Serialize() const

Serialize an options struct to a buffer.

Public Static Functions

static Result<std::unique_ptr<FunctionOptions>> Deserialize(const std::string &type_name, const Buffer &buffer)

Deserialize an options struct from a buffer.

Note: this will only look for type_name in the default FunctionRegistry; to use a custom FunctionRegistry, look up the FunctionOptionsType, then call FunctionOptionsType::Deserialize().

struct Arity
#include <arrow/compute/function.h>

Contains the number of required arguments for the function.

Naming conventions taken from https://en.wikipedia.org/wiki/Arity.

Public Members

int num_args

The number of required arguments (or the minimum number for varargs functions).

bool is_varargs = false

If true, then the num_args is the minimum number of required arguments.

Public Static Functions

static inline Arity Nullary()

A function taking no arguments.

static inline Arity Unary()

A function taking 1 argument.

static inline Arity Binary()

A function taking 2 arguments.

static inline Arity Ternary()

A function taking 3 arguments.

static inline Arity VarArgs(int min_args = 0)

A function taking a variable number of arguments.

Parameters:

min_args[in] the minimum number of arguments required when invoking the function

struct FunctionDoc
#include <arrow/compute/function.h>

Public Members

std::string summary

A one-line summary of the function, using a verb.

For example, “Add two numeric arrays or scalars”.

std::string description

A detailed description of the function, meant to follow the summary.

std::vector<std::string> arg_names

Symbolic names (identifiers) for the function arguments.

Some bindings may use this to generate nicer function signatures.

std::string options_class

Name of the options class, if any.

bool options_required

Whether options are required for function execution.

If false, then either the function does not have an options class or there is a usable default options value.

class FunctionExecutor
#include <arrow/compute/function.h>

An executor of a function with a preconfigured kernel.

Public Functions

virtual Status Init(const FunctionOptions *options = NULLPTR, ExecContext *exec_ctx = NULLPTR) = 0

Initialize or re-initialize the preconfigured kernel.

This method may be called zero or more times. Depending on how the FunctionExecutor was obtained, it may already have been initialized.

virtual Result<Datum> Execute(const std::vector<Datum> &args, int64_t length = -1) = 0

Execute the preconfigured kernel with arguments that must fit it.

The method requires the arguments be castable to the preconfigured types.

Parameters:
  • args[in] Arguments to execute the function on

  • length[in] Length of arguments batch or -1 to default it. If the function has no parameters, this determines the batch length, defaulting to 0. Otherwise, if the function is scalar, this must equal the argument batch’s inferred length or be -1 to default to it. This is ignored for vector functions.

class Function
#include <arrow/compute/function.h>

Base class for compute functions.

Function implementations contain a collection of “kernels” which are implementations of the function for specific argument types. Selecting a viable kernel for executing a function is referred to as “dispatching”.

Subclassed by arrow::compute::detail::FunctionImpl< KernelType >, arrow::compute::MetaFunction, arrow::compute::detail::FunctionImpl< HashAggregateKernel >, arrow::compute::detail::FunctionImpl< ScalarAggregateKernel >, arrow::compute::detail::FunctionImpl< ScalarKernel >, arrow::compute::detail::FunctionImpl< VectorKernel >

Public Types

enum Kind

The kind of function, which indicates in what contexts it is valid for use.

Values:

enumerator SCALAR

A function that performs scalar data operations on whole arrays of data.

Can generally process Array or Scalar values. The size of the output will be the same as the size (or broadcasted size, in the case of mixing Array and Scalar inputs) of the input.

enumerator VECTOR

A function with array input and output whose behavior depends on the values of the entire arrays passed, rather than the value of each scalar value.

enumerator SCALAR_AGGREGATE

A function that computes scalar summary statistics from array input.

enumerator HASH_AGGREGATE

A function that computes grouped summary statistics from array input and an array of group identifiers.

enumerator META

A function that dispatches to other functions and does not contain its own kernels.

Public Functions

inline const std::string &name() const

The name of the kernel. The registry enforces uniqueness of names.

inline Function::Kind kind() const

The kind of kernel, which indicates in what contexts it is valid for use.

inline const Arity &arity() const

Contains the number of arguments the function requires, or if the function accepts variable numbers of arguments.

inline const FunctionDoc &doc() const

Return the function documentation.

virtual int num_kernels() const = 0

Returns the number of registered kernels for this function.

virtual Result<const Kernel*> DispatchExact(const std::vector<TypeHolder> &types) const

Return a kernel that can execute the function given the exact argument types (without implicit type casts).

NB: This function is overridden in CastFunction.

virtual Result<const Kernel*> DispatchBest(std::vector<TypeHolder> *values) const

Return a best-match kernel that can execute the function given the argument types, after implicit casts are applied.

Parameters:

values[inout] Argument types. An element may be modified to indicate that the returned kernel only approximately matches the input value descriptors; callers are responsible for casting inputs to the type required by the kernel.

virtual Result<std::shared_ptr<FunctionExecutor>> GetBestExecutor(std::vector<TypeHolder> inputs) const

Get a function executor with a best-matching kernel.

The returned executor will by default work with the default FunctionOptions and KernelContext. If you want to change that, call FunctionExecutor::Init.

virtual Result<Datum> Execute(const std::vector<Datum> &args, const FunctionOptions *options, ExecContext *ctx) const

Execute the function eagerly with the passed input arguments with kernel dispatch, batch iteration, and memory allocation details taken care of.

If the options pointer is null, then default_options() will be used.

This function can be overridden in subclasses.

inline const FunctionOptions *default_options() const

Returns the default options for this function.

Whatever option semantics a Function has, implementations must guarantee that default_options() is valid to pass to Execute as options.

class ScalarFunction : public arrow::compute::detail::FunctionImpl<ScalarKernel>
#include <arrow/compute/function.h>

A function that executes elementwise operations on arrays or scalars, and therefore whose results generally do not depend on the order of the values in the arguments.

Accepts and returns arrays that are all of the same size. These functions roughly correspond to the functions used in SQL expressions.

Public Functions

Status AddKernel(std::vector<InputType> in_types, OutputType out_type, ArrayKernelExec exec, KernelInit init = NULLPTR)

Add a kernel with given input/output types, no required state initialization, preallocation for fixed-width types, and default null handling (intersect validity bitmaps of inputs).

Status AddKernel(ScalarKernel kernel)

Add a kernel (function implementation).

Returns error if the kernel’s signature does not match the function’s arity.

class VectorFunction : public arrow::compute::detail::FunctionImpl<VectorKernel>
#include <arrow/compute/function.h>

A function that executes general array operations that may yield outputs of different sizes or have results that depend on the whole array contents.

These functions roughly correspond to the functions found in non-SQL array languages like APL and its derivatives.

Public Functions

Status AddKernel(std::vector<InputType> in_types, OutputType out_type, ArrayKernelExec exec, KernelInit init = NULLPTR)

Add a simple kernel with given input/output types, no required state initialization, no data preallocation, and no preallocation of the validity bitmap.

Status AddKernel(VectorKernel kernel)

Add a kernel (function implementation).

Returns error if the kernel’s signature does not match the function’s arity.

class ScalarAggregateFunction : public arrow::compute::detail::FunctionImpl<ScalarAggregateKernel>
#include <arrow/compute/function.h>

Public Functions

Status AddKernel(ScalarAggregateKernel kernel)

Add a kernel (function implementation).

Returns error if the kernel’s signature does not match the function’s arity.

class HashAggregateFunction : public arrow::compute::detail::FunctionImpl<HashAggregateKernel>
#include <arrow/compute/function.h>

Public Functions

Status AddKernel(HashAggregateKernel kernel)

Add a kernel (function implementation).

Returns error if the kernel’s signature does not match the function’s arity.

class MetaFunction : public arrow::compute::Function
#include <arrow/compute/function.h>

A function that dispatches to other functions.

Must implement MetaFunction::ExecuteImpl.

For Array, ChunkedArray, and Scalar Datum kinds, may rely on the execution of concrete Function types, but must handle other Datum kinds on its own.

Public Functions

inline virtual int num_kernels() const override

Returns the number of registered kernels for this function.

virtual Result<Datum> Execute(const std::vector<Datum> &args, const FunctionOptions *options, ExecContext *ctx) const override

Execute the function eagerly with the passed input arguments with kernel dispatch, batch iteration, and memory allocation details taken care of.

If the options pointer is null, then default_options() will be used.

This function can be overridden in subclasses.

Function execution

Warning

doxygengroup: Cannot find group “compute-functions-executor” in doxygen xml output for project “arrow_cpp” from directory: ../../cpp/apidoc/xml

Function registry

class FunctionRegistry

A mutable central function registry for built-in functions as well as user-defined functions.

Functions are implementations of arrow::compute::Function.

Generally, each function contains kernels which are implementations of a function for a specific argument signature. After looking up a function in the registry, one can either execute it eagerly with Function::Execute or use one of the function’s dispatch methods to pick a suitable kernel for lower-level function execution.

Public Functions

Status CanAddFunction(std::shared_ptr<Function> function, bool allow_overwrite = false)

Check whether a new function can be added to the registry.

Returns:

Status::KeyError if a function with the same name is already registered.

Status AddFunction(std::shared_ptr<Function> function, bool allow_overwrite = false)

Add a new function to the registry.

Returns:

Status::KeyError if a function with the same name is already registered.

Status CanAddAlias(const std::string &target_name, const std::string &source_name)

Check whether an alias can be added for the given function name.

Returns:

Status::KeyError if the function with the given name is not registered.

Status AddAlias(const std::string &target_name, const std::string &source_name)

Add alias for the given function name.

Returns:

Status::KeyError if the function with the given name is not registered.

Status CanAddFunctionOptionsType(const FunctionOptionsType *options_type, bool allow_overwrite = false)

Check whether a new function options type can be added to the registry.

Returns:

Status::KeyError if a function options type with the same name is already registered.

Status AddFunctionOptionsType(const FunctionOptionsType *options_type, bool allow_overwrite = false)

Add a new function options type to the registry.

Returns:

Status::KeyError if a function options type with the same name is already registered.

Result<std::shared_ptr<Function>> GetFunction(const std::string &name) const

Retrieve a function by name from the registry.

std::vector<std::string> GetFunctionNames() const

Return vector of all entry names in the registry.

Helpful for displaying a manifest of available functions.

Result<const FunctionOptionsType*> GetFunctionOptionsType(const std::string &name) const

Retrieve a function options type by name from the registry.

int num_functions() const

The number of currently registered functions.

Public Static Functions

static std::unique_ptr<FunctionRegistry> Make()

Construct a new registry.

Most users only need to use the global registry.

static std::unique_ptr<FunctionRegistry> Make(FunctionRegistry *parent)

Construct a new nested registry with the given parent.

Most users only need to use the global registry. The returned registry never changes its parent, even when an operation allows overwritting.

FunctionRegistry *arrow::compute::GetFunctionRegistry()

Return the process-global function registry.

Convenience functions

Result<Datum> CallFunction(const std::string &func_name, const std::vector<Datum> &args, const FunctionOptions *options, ExecContext *ctx = NULLPTR)

One-shot invoker for all types of functions.

Does kernel dispatch, argument checking, iteration of ChunkedArray inputs, and wrapping of outputs.

Result<Datum> CallFunction(const std::string &func_name, const std::vector<Datum> &args, ExecContext *ctx = NULLPTR)

Variant of CallFunction which uses a function’s default options.

NB: Some functions require FunctionOptions be provided.

Result<Datum> CallFunction(const std::string &func_name, const ExecBatch &batch, const FunctionOptions *options, ExecContext *ctx = NULLPTR)

One-shot invoker for all types of functions.

Does kernel dispatch, argument checking, iteration of ChunkedArray inputs, and wrapping of outputs.

Result<Datum> CallFunction(const std::string &func_name, const ExecBatch &batch, ExecContext *ctx = NULLPTR)

Variant of CallFunction which uses a function’s default options.

NB: Some functions require FunctionOptions be provided.

Concrete options classes

enum class RoundMode : int8_t

Rounding and tie-breaking modes for round compute functions.

Additional details and examples are provided in compute.rst.

Values:

enumerator DOWN

Round to nearest integer less than or equal in magnitude (aka “floor”)

enumerator UP

Round to nearest integer greater than or equal in magnitude (aka “ceil”)

enumerator TOWARDS_ZERO

Get the integral part without fractional digits (aka “trunc”)

enumerator TOWARDS_INFINITY

Round negative values with DOWN rule and positive values with UP rule (aka “away from zero”)

enumerator HALF_DOWN

Round ties with DOWN rule (also called “round half towards negative infinity”)

enumerator HALF_UP

Round ties with UP rule (also called “round half towards positive infinity”)

enumerator HALF_TOWARDS_ZERO

Round ties with TOWARDS_ZERO rule (also called “round half away from infinity”)

enumerator HALF_TOWARDS_INFINITY

Round ties with TOWARDS_INFINITY rule (also called “round half away from zero”)

enumerator HALF_TO_EVEN

Round ties to nearest even integer.

enumerator HALF_TO_ODD

Round ties to nearest odd integer.

enum class CalendarUnit : int8_t

Values:

enumerator NANOSECOND
enumerator MICROSECOND
enumerator MILLISECOND
enumerator SECOND
enumerator MINUTE
enumerator HOUR
enumerator DAY
enumerator WEEK
enumerator MONTH
enumerator QUARTER
enumerator YEAR
enum CompareOperator

Values:

enumerator EQUAL
enumerator NOT_EQUAL
enumerator GREATER
enumerator GREATER_EQUAL
enumerator LESS
enumerator LESS_EQUAL
class ScalarAggregateOptions : public arrow::compute::FunctionOptions
#include <arrow/compute/api_aggregate.h>

Control general scalar aggregate kernel behavior.

By default, null values are ignored (skip_nulls = true).

Public Functions

explicit ScalarAggregateOptions(bool skip_nulls = true, uint32_t min_count = 1)

Public Members

bool skip_nulls

If true (the default), null values are ignored.

Otherwise, if any value is null, emit null.

uint32_t min_count

If less than this many non-null values are observed, emit null.

Public Static Functions

static inline ScalarAggregateOptions Defaults()

Public Static Attributes

static constexpr const char kTypeName[] = "ScalarAggregateOptions"
class CountOptions : public arrow::compute::FunctionOptions
#include <arrow/compute/api_aggregate.h>

Control count aggregate kernel behavior.

By default, only non-null values are counted.

Public Types

enum CountMode

Values:

enumerator ONLY_VALID

Count only non-null values.

enumerator ONLY_NULL

Count only null values.

enumerator ALL

Count both non-null and null values.

Public Functions

explicit CountOptions(CountMode mode = CountMode::ONLY_VALID)

Public Members

CountMode mode

Public Static Functions

static inline CountOptions Defaults()

Public Static Attributes

static constexpr const char kTypeName[] = "CountOptions"
class ModeOptions : public arrow::compute::FunctionOptions
#include <arrow/compute/api_aggregate.h>

Control Mode kernel behavior.

Returns top-n common values and counts. By default, returns the most common value and count.

Public Functions

explicit ModeOptions(int64_t n = 1, bool skip_nulls = true, uint32_t min_count = 0)

Public Members

int64_t n = 1
bool skip_nulls

If true (the default), null values are ignored.

Otherwise, if any value is null, emit null.

uint32_t min_count

If less than this many non-null values are observed, emit null.

Public Static Functions

static inline ModeOptions Defaults()

Public Static Attributes

static constexpr const char kTypeName[] = "ModeOptions"
class VarianceOptions : public arrow::compute::FunctionOptions
#include <arrow/compute/api_aggregate.h>

Control Delta Degrees of Freedom (ddof) of Variance and Stddev kernel.

The divisor used in calculations is N - ddof, where N is the number of elements. By default, ddof is zero, and population variance or stddev is returned.

Public Functions

explicit VarianceOptions(int ddof = 0, bool skip_nulls = true, uint32_t min_count = 0)

Public Members

int ddof = 0
bool skip_nulls

If true (the default), null values are ignored.

Otherwise, if any value is null, emit null.

uint32_t min_count

If less than this many non-null values are observed, emit null.

Public Static Functions

static inline VarianceOptions Defaults()

Public Static Attributes

static constexpr const char kTypeName[] = "VarianceOptions"
class QuantileOptions : public arrow::compute::FunctionOptions
#include <arrow/compute/api_aggregate.h>

Control Quantile kernel behavior.

By default, returns the median value.

Public Types

enum Interpolation

Interpolation method to use when quantile lies between two data points.

Values:

enumerator LINEAR
enumerator LOWER
enumerator HIGHER
enumerator NEAREST
enumerator MIDPOINT

Public Functions

explicit QuantileOptions(double q = 0.5, enum Interpolation interpolation = LINEAR, bool skip_nulls = true, uint32_t min_count = 0)
explicit QuantileOptions(std::vector<double> q, enum Interpolation interpolation = LINEAR, bool skip_nulls = true, uint32_t min_count = 0)

Public Members

std::vector<double> q

quantile must be between 0 and 1 inclusive

enum Interpolation interpolation
bool skip_nulls

If true (the default), null values are ignored.

Otherwise, if any value is null, emit null.

uint32_t min_count

If less than this many non-null values are observed, emit null.

Public Static Functions

static inline QuantileOptions Defaults()

Public Static Attributes

static constexpr const char kTypeName[] = "QuantileOptions"
class TDigestOptions : public arrow::compute::FunctionOptions
#include <arrow/compute/api_aggregate.h>

Control TDigest approximate quantile kernel behavior.

By default, returns the median value.

Public Functions

explicit TDigestOptions(double q = 0.5, uint32_t delta = 100, uint32_t buffer_size = 500, bool skip_nulls = true, uint32_t min_count = 0)
explicit TDigestOptions(std::vector<double> q, uint32_t delta = 100, uint32_t buffer_size = 500, bool skip_nulls = true, uint32_t min_count = 0)

Public Members

std::vector<double> q

quantile must be between 0 and 1 inclusive

uint32_t delta

compression parameter, default 100

uint32_t buffer_size

input buffer size, default 500

bool skip_nulls

If true (the default), null values are ignored.

Otherwise, if any value is null, emit null.

uint32_t min_count

If less than this many non-null values are observed, emit null.

Public Static Functions

static inline TDigestOptions Defaults()

Public Static Attributes

static constexpr const char kTypeName[] = "TDigestOptions"
class IndexOptions : public arrow::compute::FunctionOptions
#include <arrow/compute/api_aggregate.h>

Control Index kernel behavior.

Public Functions

explicit IndexOptions(std::shared_ptr<Scalar> value)
IndexOptions()

Public Members

std::shared_ptr<Scalar> value

Public Static Attributes

static constexpr const char kTypeName[] = "IndexOptions"
struct Aggregate
#include <arrow/compute/api_aggregate.h>

Configure a grouped aggregation.

Public Functions

Aggregate() = default
inline Aggregate(std::string function, std::shared_ptr<FunctionOptions> options, std::vector<FieldRef> target, std::string name = "")
inline Aggregate(std::string function, std::shared_ptr<FunctionOptions> options, FieldRef target, std::string name = "")
inline Aggregate(std::string function, FieldRef target, std::string name)
inline Aggregate(std::string function, std::string name)

Public Members

std::string function

the name of the aggregation function

std::shared_ptr<FunctionOptions> options

options for the aggregation function

std::vector<FieldRef> target

zero or more fields to which aggregations will be applied

std::string name

optional output field name for aggregations

class ArithmeticOptions : public arrow::compute::FunctionOptions
#include <arrow/compute/api_scalar.h>

Public Functions

explicit ArithmeticOptions(bool check_overflow = false)

Public Members

bool check_overflow

Public Static Attributes

static constexpr const char kTypeName[] = "ArithmeticOptions"
class ElementWiseAggregateOptions : public arrow::compute::FunctionOptions
#include <arrow/compute/api_scalar.h>

Public Functions

explicit ElementWiseAggregateOptions(bool skip_nulls = true)

Public Members

bool skip_nulls

Public Static Functions

static inline ElementWiseAggregateOptions Defaults()

Public Static Attributes

static constexpr const char kTypeName[] = "ElementWiseAggregateOptions"
class RoundOptions : public arrow::compute::FunctionOptions
#include <arrow/compute/api_scalar.h>

Public Functions

explicit RoundOptions(int64_t ndigits = 0, RoundMode round_mode = RoundMode::HALF_TO_EVEN)

Public Members

int64_t ndigits

Rounding precision (number of digits to round to)

RoundMode round_mode

Rounding and tie-breaking mode.

Public Static Functions

static inline RoundOptions Defaults()

Public Static Attributes

static constexpr const char kTypeName[] = "RoundOptions"
class RoundBinaryOptions : public arrow::compute::FunctionOptions
#include <arrow/compute/api_scalar.h>

Public Functions

explicit RoundBinaryOptions(RoundMode round_mode = RoundMode::HALF_TO_EVEN)

Public Members

RoundMode round_mode

Rounding and tie-breaking mode.

Public Static Functions

static inline RoundBinaryOptions Defaults()

Public Static Attributes

static constexpr const char kTypeName[] = "RoundBinaryOptions"
class RoundTemporalOptions : public arrow::compute::FunctionOptions
#include <arrow/compute/api_scalar.h>

Public Functions

explicit RoundTemporalOptions(int multiple = 1, CalendarUnit unit = CalendarUnit::DAY, bool week_starts_monday = true, bool ceil_is_strictly_greater = false, bool calendar_based_origin = false)

Public Members

int multiple

Number of units to round to.

CalendarUnit unit

The unit used for rounding of time.

bool week_starts_monday

What day does the week start with (Monday=true, Sunday=false)

bool ceil_is_strictly_greater

Enable this flag to return a rounded value that is strictly greater than the input.

For example: ceiling 1970-01-01T00:00:00 to 3 hours would yield 1970-01-01T03:00:00 if set to true and 1970-01-01T00:00:00 if set to false. This applies for ceiling only.

bool calendar_based_origin

By default time is rounded to a multiple of units since 1970-01-01T00:00:00.

By setting calendar_based_origin to true, time will be rounded to a number of units since the last greater calendar unit. For example: rounding to a multiple of days since the beginning of the month or to hours since the beginning of the day. Exceptions: week and quarter are not used as greater units, therefore days will will be rounded to the beginning of the month not week. Greater unit of week is year. Note that ceiling and rounding might change sorting order of an array near greater unit change. For example rounding YYYY-mm-dd 23:00:00 to 5 hours will ceil and round to YYYY-mm-dd+1 01:00:00 and floor to YYYY-mm-dd 20:00:00. On the other hand YYYY-mm-dd+1 00:00:00 will ceil, round and floor to YYYY-mm-dd+1 00:00:00. This can break the order of an already ordered array.

Public Static Functions

static inline RoundTemporalOptions Defaults()

Public Static Attributes

static constexpr const char kTypeName[] = "RoundTemporalOptions"
class RoundToMultipleOptions : public arrow::compute::FunctionOptions
#include <arrow/compute/api_scalar.h>

Public Functions

explicit RoundToMultipleOptions(double multiple = 1.0, RoundMode round_mode = RoundMode::HALF_TO_EVEN)
explicit RoundToMultipleOptions(std::shared_ptr<Scalar> multiple, RoundMode round_mode = RoundMode::HALF_TO_EVEN)

Public Members

std::shared_ptr<Scalar> multiple

Rounding scale (multiple to round to).

Should be a positive numeric scalar of a type compatible with the argument to be rounded. The cast kernel is used to convert the rounding multiple to match the result type.

RoundMode round_mode

Rounding and tie-breaking mode.

Public Static Functions

static inline RoundToMultipleOptions Defaults()

Public Static Attributes

static constexpr const char kTypeName[] = "RoundToMultipleOptions"
class JoinOptions : public arrow::compute::FunctionOptions
#include <arrow/compute/api_scalar.h>

Options for var_args_join.

Public Types

enum NullHandlingBehavior

How to handle null values. (A null separator always results in a null output.)

Values:

enumerator EMIT_NULL

A null in any input results in a null in the output.

enumerator SKIP

Nulls in inputs are skipped.

enumerator REPLACE

Nulls in inputs are replaced with the replacement string.

Public Functions

explicit JoinOptions(NullHandlingBehavior null_handling = EMIT_NULL, std::string null_replacement = "")

Public Members

NullHandlingBehavior null_handling
std::string null_replacement

Public Static Functions

static inline JoinOptions Defaults()

Public Static Attributes

static constexpr const char kTypeName[] = "JoinOptions"
class MatchSubstringOptions : public arrow::compute::FunctionOptions
#include <arrow/compute/api_scalar.h>

Public Functions

explicit MatchSubstringOptions(std::string pattern, bool ignore_case = false)
MatchSubstringOptions()

Public Members

std::string pattern

The exact substring (or regex, depending on kernel) to look for inside input values.

bool ignore_case

Whether to perform a case-insensitive match.

Public Static Attributes

static constexpr const char kTypeName[] = "MatchSubstringOptions"
class SplitOptions : public arrow::compute::FunctionOptions
#include <arrow/compute/api_scalar.h>

Public Functions

explicit SplitOptions(int64_t max_splits = -1, bool reverse = false)

Public Members

int64_t max_splits

Maximum number of splits allowed, or unlimited when -1.

bool reverse

Start splitting from the end of the string (only relevant when max_splits != -1)

Public Static Attributes

static constexpr const char kTypeName[] = "SplitOptions"
class SplitPatternOptions : public arrow::compute::FunctionOptions
#include <arrow/compute/api_scalar.h>

Public Functions

explicit SplitPatternOptions(std::string pattern, int64_t max_splits = -1, bool reverse = false)
SplitPatternOptions()

Public Members

std::string pattern

The exact substring to split on.

int64_t max_splits

Maximum number of splits allowed, or unlimited when -1.

bool reverse

Start splitting from the end of the string (only relevant when max_splits != -1)

Public Static Attributes

static constexpr const char kTypeName[] = "SplitPatternOptions"
class ReplaceSliceOptions : public arrow::compute::FunctionOptions
#include <arrow/compute/api_scalar.h>

Public Functions

explicit ReplaceSliceOptions(int64_t start, int64_t stop, std::string replacement)
ReplaceSliceOptions()

Public Members

int64_t start

Index to start slicing at.

int64_t stop

Index to stop slicing at.

std::string replacement

String to replace the slice with.

Public Static Attributes

static constexpr const char kTypeName[] = "ReplaceSliceOptions"
class ReplaceSubstringOptions : public arrow::compute::FunctionOptions
#include <arrow/compute/api_scalar.h>

Public Functions

explicit ReplaceSubstringOptions(std::string pattern, std::string replacement, int64_t max_replacements = -1)
ReplaceSubstringOptions()

Public Members

std::string pattern

Pattern to match, literal, or regular expression depending on which kernel is used.

std::string replacement

String to replace the pattern with.

int64_t max_replacements

Max number of substrings to replace (-1 means unbounded)

Public Static Attributes

static constexpr const char kTypeName[] = "ReplaceSubstringOptions"
class ExtractRegexOptions : public arrow::compute::FunctionOptions
#include <arrow/compute/api_scalar.h>

Public Functions

explicit ExtractRegexOptions(std::string pattern)
ExtractRegexOptions()

Public Members

std::string pattern

Regular expression with named capture fields.

Public Static Attributes

static constexpr const char kTypeName[] = "ExtractRegexOptions"
class SetLookupOptions : public arrow::compute::FunctionOptions
#include <arrow/compute/api_scalar.h>

Options for IsIn and IndexIn functions.

Public Functions

explicit SetLookupOptions(Datum value_set, bool skip_nulls = false)
SetLookupOptions()

Public Members

Datum value_set

The set of values to look up input values into.

bool skip_nulls

Whether nulls in value_set count for lookup.

If true, any null in value_set is ignored and nulls in the input produce null (IndexIn) or false (IsIn) values in the output. If false, any null in value_set is successfully matched in the input.

Public Static Attributes

static constexpr const char kTypeName[] = "SetLookupOptions"
class StructFieldOptions : public arrow::compute::FunctionOptions
#include <arrow/compute/api_scalar.h>

Options for struct_field function.

Public Functions

explicit StructFieldOptions(std::vector<int> indices)
explicit StructFieldOptions(std::initializer_list<int>)
explicit StructFieldOptions(FieldRef field_ref)
StructFieldOptions()

Public Members

FieldRef field_ref

The FieldRef specifying what to extract from struct or union.

Public Static Attributes

static constexpr const char kTypeName[] = "StructFieldOptions"
class StrptimeOptions : public arrow::compute::FunctionOptions
#include <arrow/compute/api_scalar.h>

Public Functions

explicit StrptimeOptions(std::string format, TimeUnit::type unit, bool error_is_null = false)
StrptimeOptions()

Public Members

std::string format

The desired format string.

TimeUnit::type unit

The desired time resolution.

bool error_is_null

Return null on parsing errors if true or raise if false.

Public Static Attributes

static constexpr const char kTypeName[] = "StrptimeOptions"
class StrftimeOptions : public arrow::compute::FunctionOptions
#include <arrow/compute/api_scalar.h>

Public Functions

explicit StrftimeOptions(std::string format, std::string locale = "C")
StrftimeOptions()

Public Members

std::string format

The desired format string.

std::string locale

The desired output locale string.

Public Static Attributes

static constexpr const char kTypeName[] = "StrftimeOptions"
static constexpr const char *kDefaultFormat = "%Y-%m-%dT%H:%M:%S"
class PadOptions : public arrow::compute::FunctionOptions
#include <arrow/compute/api_scalar.h>

Public Functions

explicit PadOptions(int64_t width, std::string padding = " ")
PadOptions()

Public Members

int64_t width

The desired string length.

std::string padding

What to pad the string with. Should be one codepoint (Unicode)/byte (ASCII).

Public Static Attributes

static constexpr const char kTypeName[] = "PadOptions"
class TrimOptions : public arrow::compute::FunctionOptions
#include <arrow/compute/api_scalar.h>

Public Functions

explicit TrimOptions(std::string characters)
TrimOptions()

Public Members

std::string characters

The individual characters to be trimmed from the string.

Public Static Attributes

static constexpr const char kTypeName[] = "TrimOptions"
class SliceOptions : public arrow::compute::FunctionOptions
#include <arrow/compute/api_scalar.h>

Public Functions

explicit SliceOptions(int64_t start, int64_t stop = std::numeric_limits<int64_t>::max(), int64_t step = 1)
SliceOptions()

Public Members

int64_t start
int64_t stop
int64_t step

Public Static Attributes

static constexpr const char kTypeName[] = "SliceOptions"
class ListSliceOptions : public arrow::compute::FunctionOptions
#include <arrow/compute/api_scalar.h>

Public Functions

explicit ListSliceOptions(int64_t start, std::optional<int64_t> stop = std::nullopt, int64_t step = 1, std::optional<bool> return_fixed_size_list = std::nullopt)
ListSliceOptions()

Public Members

int64_t start

The start of list slicing.

std::optional<int64_t> stop

Optional stop of list slicing. If not set, then slice to end. (NotImplemented)

int64_t step

Slicing step.

std::optional<bool> return_fixed_size_list

Public Static Attributes

static constexpr const char kTypeName[] = "ListSliceOptions"
class NullOptions : public arrow::compute::FunctionOptions
#include <arrow/compute/api_scalar.h>

Public Functions

explicit NullOptions(bool nan_is_null = false)

Public Members

bool nan_is_null

Public Static Functions

static inline NullOptions Defaults()

Public Static Attributes

static constexpr const char kTypeName[] = "NullOptions"
struct CompareOptions
#include <arrow/compute/api_scalar.h>

Public Functions

inline explicit CompareOptions(CompareOperator op)
inline CompareOptions()

Public Members

enum CompareOperator op
class MakeStructOptions : public arrow::compute::FunctionOptions
#include <arrow/compute/api_scalar.h>

Public Functions

MakeStructOptions(std::vector<std::string> n, std::vector<bool> r, std::vector<std::shared_ptr<const KeyValueMetadata>> m)
explicit MakeStructOptions(std::vector<std::string> n)
MakeStructOptions()

Public Members

std::vector<std::string> field_names

Names for wrapped columns.

std::vector<bool> field_nullability

Nullability bits for wrapped columns.

std::vector<std::shared_ptr<const KeyValueMetadata>> field_metadata

Metadata attached to wrapped columns.

Public Static Attributes

static constexpr const char kTypeName[] = "MakeStructOptions"
struct DayOfWeekOptions : public arrow::compute::FunctionOptions
#include <arrow/compute/api_scalar.h>

Public Functions

explicit DayOfWeekOptions(bool count_from_zero = true, uint32_t week_start = 1)

Public Members

bool count_from_zero

Number days from 0 if true and from 1 if false.

uint32_t week_start

What day does the week start with (Monday=1, Sunday=7).

The numbering is unaffected by the count_from_zero parameter.

Public Static Functions

static inline DayOfWeekOptions Defaults()

Public Static Attributes

static constexpr const char kTypeName[] = "DayOfWeekOptions"
struct AssumeTimezoneOptions : public arrow::compute::FunctionOptions
#include <arrow/compute/api_scalar.h>

Used to control timestamp timezone conversion and handling ambiguous/nonexistent times.

Public Types

enum Ambiguous

How to interpret ambiguous local times that can be interpreted as multiple instants (normally two) due to DST shifts.

AMBIGUOUS_EARLIEST emits the earliest instant amongst possible interpretations. AMBIGUOUS_LATEST emits the latest instant amongst possible interpretations.

Values:

enumerator AMBIGUOUS_RAISE
enumerator AMBIGUOUS_EARLIEST
enumerator AMBIGUOUS_LATEST
enum Nonexistent

How to handle local times that do not exist due to DST shifts.

NONEXISTENT_EARLIEST emits the instant “just before” the DST shift instant in the given timestamp precision (for example, for a nanoseconds precision timestamp, this is one nanosecond before the DST shift instant). NONEXISTENT_LATEST emits the DST shift instant.

Values:

enumerator NONEXISTENT_RAISE
enumerator NONEXISTENT_EARLIEST
enumerator NONEXISTENT_LATEST

Public Functions

explicit AssumeTimezoneOptions(std::string timezone, Ambiguous ambiguous = AMBIGUOUS_RAISE, Nonexistent nonexistent = NONEXISTENT_RAISE)
AssumeTimezoneOptions()

Public Members

std::string timezone

Timezone to convert timestamps from.

Ambiguous ambiguous

How to interpret ambiguous local times (due to DST shifts)

Nonexistent nonexistent

How to interpret non-existent local times (due to DST shifts)

Public Static Attributes

static constexpr const char kTypeName[] = "AssumeTimezoneOptions"
struct WeekOptions : public arrow::compute::FunctionOptions
#include <arrow/compute/api_scalar.h>

Public Functions

explicit WeekOptions(bool week_starts_monday = true, bool count_from_zero = false, bool first_week_is_fully_in_year = false)

Public Members

bool week_starts_monday

What day does the week start with (Monday=true, Sunday=false)

bool count_from_zero

Dates from current year that fall into last ISO week of the previous year return 0 if true and 52 or 53 if false.

bool first_week_is_fully_in_year

Must the first week be fully in January (true), or is a week that begins on December 29, 30, or 31 considered to be the first week of the new year (false)?

Public Static Functions

static inline WeekOptions Defaults()
static inline WeekOptions ISODefaults()
static inline WeekOptions USDefaults()

Public Static Attributes

static constexpr const char kTypeName[] = "WeekOptions"
struct Utf8NormalizeOptions : public arrow::compute::FunctionOptions
#include <arrow/compute/api_scalar.h>

Public Types

enum Form

Values:

enumerator NFC
enumerator NFKC
enumerator NFD
enumerator NFKD

Public Functions

explicit Utf8NormalizeOptions(Form form = NFC)

Public Members

Form form

The Unicode normalization form to apply.

Public Static Functions

static inline Utf8NormalizeOptions Defaults()

Public Static Attributes

static constexpr const char kTypeName[] = "Utf8NormalizeOptions"
class RandomOptions : public arrow::compute::FunctionOptions
#include <arrow/compute/api_scalar.h>

Public Types

enum Initializer

Values:

enumerator SystemRandom
enumerator Seed

Public Functions

RandomOptions(Initializer initializer, uint64_t seed)
RandomOptions()

Public Members

Initializer initializer

The type of initialization for random number generation - system or provided seed.

uint64_t seed

The seed value used to initialize the random number generation.

Public Static Functions

static inline RandomOptions FromSystemRandom()
static inline RandomOptions FromSeed(uint64_t seed)
static inline RandomOptions Defaults()

Public Static Attributes

static constexpr const char kTypeName[] = "RandomOptions"
class MapLookupOptions : public arrow::compute::FunctionOptions
#include <arrow/compute/api_scalar.h>

Options for map_lookup function.

Public Types

enum Occurrence

Values:

enumerator FIRST

Return the first matching value.

enumerator LAST

Return the last matching value.

enumerator ALL

Return all matching values.

Public Functions

explicit MapLookupOptions(std::shared_ptr<Scalar> query_key, Occurrence occurrence)
MapLookupOptions()

Public Members

std::shared_ptr<Scalar> query_key

The key to lookup in the map.

Occurrence occurrence

Whether to return the first, last, or all matching values.

Public Static Attributes

static constexpr const char kTypeName[] = "MapLookupOptions"
class FilterOptions : public arrow::compute::FunctionOptions
#include <arrow/compute/api_vector.h>

Public Types

enum NullSelectionBehavior

Configure the action taken when a slot of the selection mask is null.

Values:

enumerator DROP

The corresponding filtered value will be removed in the output.

enumerator EMIT_NULL

The corresponding filtered value will be null in the output.

Public Functions

explicit FilterOptions(NullSelectionBehavior null_selection = DROP)

Public Members

NullSelectionBehavior null_selection_behavior = DROP

Public Static Functions

static inline FilterOptions Defaults()

Public Static Attributes

static constexpr const char kTypeName[] = "FilterOptions"
class TakeOptions : public arrow::compute::FunctionOptions
#include <arrow/compute/api_vector.h>

Public Functions

explicit TakeOptions(bool boundscheck = true)

Public Members

bool boundscheck = true

Public Static Functions

static inline TakeOptions BoundsCheck()
static inline TakeOptions NoBoundsCheck()
static inline TakeOptions Defaults()

Public Static Attributes

static constexpr const char kTypeName[] = "TakeOptions"
class DictionaryEncodeOptions : public arrow::compute::FunctionOptions
#include <arrow/compute/api_vector.h>

Options for the dictionary encode function.

Public Types

enum NullEncodingBehavior

Configure how null values will be encoded.

Values:

enumerator ENCODE

The null value will be added to the dictionary with a proper index.

enumerator MASK

The null value will be masked in the indices array.

Public Functions

explicit DictionaryEncodeOptions(NullEncodingBehavior null_encoding = MASK)

Public Members

NullEncodingBehavior null_encoding_behavior = MASK

Public Static Functions

static inline DictionaryEncodeOptions Defaults()

Public Static Attributes

static constexpr const char kTypeName[] = "DictionaryEncodeOptions"
class RunEndEncodeOptions : public arrow::compute::FunctionOptions
#include <arrow/compute/api_vector.h>

Options for the run-end encode function.

Public Functions

explicit RunEndEncodeOptions(std::shared_ptr<DataType> run_end_type = int32())

Public Members

std::shared_ptr<DataType> run_end_type

Public Static Functions

static inline RunEndEncodeOptions Defaults()

Public Static Attributes

static constexpr const char kTypeName[] = "RunEndEncodeOptions"
class ArraySortOptions : public arrow::compute::FunctionOptions
#include <arrow/compute/api_vector.h>

Public Functions

explicit ArraySortOptions(SortOrder order = SortOrder::Ascending, NullPlacement null_placement = NullPlacement::AtEnd)

Public Members

SortOrder order

Sorting order.

NullPlacement null_placement

Whether nulls and NaNs are placed at the start or at the end.

Public Static Functions

static inline ArraySortOptions Defaults()

Public Static Attributes

static constexpr const char kTypeName[] = "ArraySortOptions"
class SortOptions : public arrow::compute::FunctionOptions
#include <arrow/compute/api_vector.h>

Public Functions

explicit SortOptions(std::vector<SortKey> sort_keys = {}, NullPlacement null_placement = NullPlacement::AtEnd)
explicit SortOptions(const Ordering &ordering)
inline Ordering AsOrdering() &&

Convenience constructor to create an ordering from SortOptions.

Note: Both classes contain the exact same information. However, sort_options should only be used in a “function options” context while Ordering is used more generally.

inline Ordering AsOrdering() const &

Public Members

std::vector<SortKey> sort_keys

Column key(s) to order by and how to order by these sort keys.

NullPlacement null_placement

Whether nulls and NaNs are placed at the start or at the end.

Public Static Functions

static inline SortOptions Defaults()

Public Static Attributes

static constexpr const char kTypeName[] = "SortOptions"
class SelectKOptions : public arrow::compute::FunctionOptions
#include <arrow/compute/api_vector.h>

SelectK options.

Public Functions

explicit SelectKOptions(int64_t k = -1, std::vector<SortKey> sort_keys = {})

Public Members

int64_t k

The number of k elements to keep.

std::vector<SortKey> sort_keys

Column key(s) to order by and how to order by these sort keys.

Public Static Functions

static inline SelectKOptions Defaults()
static inline SelectKOptions TopKDefault(int64_t k, std::vector<std::string> key_names = {})
static inline SelectKOptions BottomKDefault(int64_t k, std::vector<std::string> key_names = {})

Public Static Attributes

static constexpr const char kTypeName[] = "SelectKOptions"
class RankOptions : public arrow::compute::FunctionOptions
#include <arrow/compute/api_vector.h>

Rank options.

Public Types

enum Tiebreaker

Configure how ties between equal values are handled.

Values:

enumerator Min

Ties get the smallest possible rank in sorted order.

enumerator Max

Ties get the largest possible rank in sorted order.

enumerator First

Ranks are assigned in order of when ties appear in the input.

This ensures the ranks are a stable permutation of the input.

enumerator Dense

The ranks span a dense [1, M] interval where M is the number of distinct values in the input.

Public Functions

explicit RankOptions(std::vector<SortKey> sort_keys = {}, NullPlacement null_placement = NullPlacement::AtEnd, Tiebreaker tiebreaker = RankOptions::First)
inline explicit RankOptions(SortOrder order, NullPlacement null_placement = NullPlacement::AtEnd, Tiebreaker tiebreaker = RankOptions::First)

Convenience constructor for array inputs.

Public Members

std::vector<SortKey> sort_keys

Column key(s) to order by and how to order by these sort keys.

NullPlacement null_placement

Whether nulls and NaNs are placed at the start or at the end.

Tiebreaker tiebreaker

Tiebreaker for dealing with equal values in ranks.

Public Static Functions

static inline RankOptions Defaults()

Public Static Attributes

static constexpr const char kTypeName[] = "RankOptions"
class PartitionNthOptions : public arrow::compute::FunctionOptions
#include <arrow/compute/api_vector.h>

Partitioning options for NthToIndices.

Public Functions

explicit PartitionNthOptions(int64_t pivot, NullPlacement null_placement = NullPlacement::AtEnd)
inline PartitionNthOptions()

Public Members

int64_t pivot

The index into the equivalent sorted array of the partition pivot element.

NullPlacement null_placement

Whether nulls and NaNs are partitioned at the start or at the end.

Public Static Attributes

static constexpr const char kTypeName[] = "PartitionNthOptions"
class CumulativeSumOptions : public arrow::compute::FunctionOptions
#include <arrow/compute/api_vector.h>

Options for cumulative sum function.

Public Functions

explicit CumulativeSumOptions(double start = 0, bool skip_nulls = false, bool check_overflow = false)
explicit CumulativeSumOptions(std::shared_ptr<Scalar> start, bool skip_nulls = false, bool check_overflow = false)

Public Members

std::shared_ptr<Scalar> start

Optional starting value for cumulative operation computation.

bool skip_nulls = false

If true, nulls in the input are ignored and produce a corresponding null output.

When false, the first null encountered is propagated through the remaining output.

bool check_overflow = false

When true, returns an Invalid Status when overflow is detected.

Public Static Functions

static inline CumulativeSumOptions Defaults()

Public Static Attributes

static constexpr const char kTypeName[] = "CumulativeSumOptions"
class CastOptions : public arrow::compute::FunctionOptions
#include <arrow/compute/cast.h>

Public Functions

explicit CastOptions(bool safe = true)

Public Members

TypeHolder to_type
bool allow_int_overflow
bool allow_time_truncate
bool allow_time_overflow
bool allow_decimal_truncate
bool allow_float_truncate
bool allow_invalid_utf8

Public Static Functions

static inline CastOptions Safe(TypeHolder to_type = {})
static inline CastOptions Unsafe(TypeHolder to_type = {})

Public Static Attributes

static constexpr const char kTypeName[] = "CastOptions"

Streaming Execution

Streaming Execution Operators

Warning

doxygenenum: Cannot find enum “arrow::compute::JoinType” in doxygen xml output for project “arrow_cpp” from directory: ../../cpp/apidoc/xml

Warning

doxygenenum: Cannot find enum “arrow::compute::JoinKeyCmp” in doxygen xml output for project “arrow_cpp” from directory: ../../cpp/apidoc/xml

Warning

doxygengroup: Cannot find group “execnode-options” in doxygen xml output for project “arrow_cpp” from directory: ../../cpp/apidoc/xml

enum class UnalignedBufferHandling

How to handle unaligned buffers.

Values:

enumerator kWarn
enumerator kIgnore
enumerator kReallocate
enumerator kError
constexpr int kDefaultBackgroundMaxQ = 32
constexpr int kDefaultBackgroundQRestart = 16
ARROW_ACERO_EXPORT ExecFactoryRegistry * default_exec_factory_registry ()

The default registry, which includes built-in factories.

inline Result<ExecNode*> MakeExecNode(const std::string &factory_name, ExecPlan *plan, std::vector<ExecNode*> inputs, const ExecNodeOptions &options, ExecFactoryRegistry *registry = default_exec_factory_registry())

Construct an ExecNode using the named factory.

UnalignedBufferHandling GetDefaultUnalignedBufferHandling()

get the default behavior of unaligned buffer handling

This is configurable via the ACERO_ALIGNMENT_HANDLING environment variable which can be set to “warn”, “ignore”, “reallocate”, or “error”. If the environment variable is not set, or is set to an invalid value, this will return kWarn

ARROW_ACERO_EXPORT Result< std::shared_ptr< Schema > > DeclarationToSchema (const Declaration &declaration, FunctionRegistry *function_registry=NULLPTR)

Calculate the output schema of a declaration.

This does not actually execute the plan. This operation may fail if the declaration represents an invalid plan (e.g. a project node with multiple inputs)

Parameters:
  • declaration – A declaration describing an execution plan

  • function_registry – The function registry to use for function execution. If null then the default function registry will be used.

Returns:

the schema that batches would have after going through the execution plan

ARROW_ACERO_EXPORT Result< std::string > DeclarationToString (const Declaration &declaration, FunctionRegistry *function_registry=NULLPTR)

Create a string representation of a plan.

This representation is for debug purposes only.

Conversion to a string may fail if the declaration represents an invalid plan.

Use Substrait for complete serialization of plans

Parameters:
  • declaration – A declaration describing an execution plan

  • function_registry – The function registry to use for function execution. If null then the default function registry will be used.

Returns:

a string representation of the plan suitable for debugging output

ARROW_ACERO_EXPORT Result< std::shared_ptr< Table > > DeclarationToTable (Declaration declaration, bool use_threads=true, MemoryPool *memory_pool=default_memory_pool(), FunctionRegistry *function_registry=NULLPTR)

Utility method to run a declaration and collect the results into a table.

This method will add a sink node to the declaration to collect results into a table. It will then create an ExecPlan from the declaration, start the exec plan, block until the plan has finished, and return the created table.

Parameters:
  • declaration – A declaration describing the plan to run

  • use_threads – If use_threads is false then all CPU work will be done on the calling thread. I/O tasks will still happen on the I/O executor and may be multi-threaded (but should not use significant CPU resources).

  • memory_pool – The memory pool to use for allocations made while running the plan.

  • function_registry – The function registry to use for function execution. If null then the default function registry will be used.

ARROW_ACERO_EXPORT Result< std::shared_ptr< Table > > DeclarationToTable (Declaration declaration, QueryOptions query_options)
ARROW_ACERO_EXPORT Future< std::shared_ptr< Table > > DeclarationToTableAsync (Declaration declaration, bool use_threads=true, MemoryPool *memory_pool=default_memory_pool(), FunctionRegistry *function_registry=NULLPTR)

Asynchronous version of.

Parameters:
  • declaration – A declaration describing the plan to run

  • use_threads – The behavior of use_threads is slightly different than the synchronous version since we cannot run synchronously on the calling thread. Instead, if use_threads=false then a new thread pool will be created with a single thread and this will be used for all compute work.

  • memory_pool – The memory pool to use for allocations made while running the plan.

  • function_registry – The function registry to use for function execution. If null then the default function registry will be used.

ARROW_ACERO_EXPORT Future< std::shared_ptr< Table > > DeclarationToTableAsync (Declaration declaration, ExecContext custom_exec_context)

Overload of.

The executor must be specified (cannot be null) and must be kept alive until the returned future finishes.

See also

DeclarationToTableAsync accepting a custom exec context

ARROW_ACERO_EXPORT Result< BatchesWithCommonSchema > DeclarationToExecBatches (Declaration declaration, bool use_threads=true, MemoryPool *memory_pool=default_memory_pool(), FunctionRegistry *function_registry=NULLPTR)

Utility method to run a declaration and collect the results into ExecBatch vector.

See also

DeclarationToTable for details on threading & execution

ARROW_ACERO_EXPORT Result< BatchesWithCommonSchema > DeclarationToExecBatches (Declaration declaration, QueryOptions query_options)
ARROW_ACERO_EXPORT Future< BatchesWithCommonSchema > DeclarationToExecBatchesAsync (Declaration declaration, bool use_threads=true, MemoryPool *memory_pool=default_memory_pool(), FunctionRegistry *function_registry=NULLPTR)

Asynchronous version of.

See also

DeclarationToTableAsync for details on threading & execution

ARROW_ACERO_EXPORT Future< BatchesWithCommonSchema > DeclarationToExecBatchesAsync (Declaration declaration, ExecContext custom_exec_context)

Overload of.

See also

DeclarationToExecBatchesAsync accepting a custom exec context

See also

DeclarationToTableAsync for details on threading & execution

ARROW_ACERO_EXPORT Result< std::vector< std::shared_ptr< RecordBatch > > > DeclarationToBatches (Declaration declaration, bool use_threads=true, MemoryPool *memory_pool=default_memory_pool(), FunctionRegistry *function_registry=NULLPTR)

Utility method to run a declaration and collect the results into a vector.

See also

DeclarationToTable for details on threading & execution

ARROW_ACERO_EXPORT Result< std::vector< std::shared_ptr< RecordBatch > > > DeclarationToBatches (Declaration declaration, QueryOptions query_options)
ARROW_ACERO_EXPORT Future< std::vector< std::shared_ptr< RecordBatch > > > DeclarationToBatchesAsync (Declaration declaration, bool use_threads=true, MemoryPool *memory_pool=default_memory_pool(), FunctionRegistry *function_registry=NULLPTR)

Asynchronous version of.

See also

DeclarationToTableAsync for details on threading & execution

ARROW_ACERO_EXPORT Future< std::vector< std::shared_ptr< RecordBatch > > > DeclarationToBatchesAsync (Declaration declaration, ExecContext exec_context)

Overload of.

See also

DeclarationToBatchesAsync accepting a custom exec context

See also

DeclarationToTableAsync for details on threading & execution

ARROW_ACERO_EXPORT Result< std::unique_ptr< RecordBatchReader > > DeclarationToReader (Declaration declaration, bool use_threads=true, MemoryPool *memory_pool=default_memory_pool(), FunctionRegistry *function_registry=NULLPTR)

Utility method to run a declaration and return results as a RecordBatchReader.

If an exec context is not provided then a default exec context will be used based on the value of use_threads. If use_threads is false then the CPU exeuctor will be a serial executor and all CPU work will be done on the calling thread. I/O tasks will still happen on the I/O executor and may be multi-threaded.

If use_threads is false then all CPU work will happen during the calls to RecordBatchReader::Next and no CPU work will happen in the background. If use_threads is true then CPU work will happen on the CPU thread pool and tasks may run in between calls to RecordBatchReader::Next. If the returned reader is not consumed quickly enough then the plan will eventually pause as the backpressure queue fills up.

If a custom exec context is provided then the value of use_threads will be ignored.

ARROW_ACERO_EXPORT Result< std::unique_ptr< RecordBatchReader > > DeclarationToReader (Declaration declaration, QueryOptions query_options)
ARROW_ACERO_EXPORT Status DeclarationToStatus (Declaration declaration, bool use_threads=true, MemoryPool *memory_pool=default_memory_pool(), FunctionRegistry *function_registry=NULLPTR)

Utility method to run a declaration and ignore results.

This can be useful when the data are consumed as part of the plan itself, for example, when the plan ends with a write node.

See also

DeclarationToTable for details on threading & execution

ARROW_ACERO_EXPORT Status DeclarationToStatus (Declaration declaration, QueryOptions query_options)
ARROW_ACERO_EXPORT Future DeclarationToStatusAsync (Declaration declaration, bool use_threads=true, MemoryPool *memory_pool=default_memory_pool(), FunctionRegistry *function_registry=NULLPTR)

Asynchronous version of.

This can be useful when the data are consumed as part of the plan itself, for example, when the plan ends with a write node.

See also

DeclarationToTableAsync for details on threading & execution

ARROW_ACERO_EXPORT Future DeclarationToStatusAsync (Declaration declaration, ExecContext exec_context)

Overload of.

See also

DeclarationToStatusAsync accepting a custom exec context

See also

DeclarationToTableAsync for details on threading & execution

ARROW_ACERO_EXPORT std::shared_ptr< RecordBatchReader > MakeGeneratorReader (std::shared_ptr< Schema >, std::function< Future< std::optional< ExecBatch >>()>, MemoryPool *)

Wrap an ExecBatch generator in a RecordBatchReader.

The RecordBatchReader does not impose any ordering on emitted batches.

ARROW_ACERO_EXPORT Result< std::function< Future< std::optional< ExecBatch > >)> > MakeReaderGenerator (std::shared_ptr< RecordBatchReader > reader, arrow::internal::Executor *io_executor, int max_q=kDefaultBackgroundMaxQ, int q_restart=kDefaultBackgroundQRestart)

Make a generator of RecordBatchReaders.

Useful as a source node for an Exec plan

inline bool operator==(const ExecBatch &l, const ExecBatch &r)
inline bool operator!=(const ExecBatch &l, const ExecBatch &r)
void PrintTo(const ExecBatch&, std::ostream*)
class ExecPlan : public std::enable_shared_from_this<ExecPlan>
#include <arrow/acero/exec_plan.h>

Public Types

using NodeVector = std::vector<ExecNode*>

Public Functions

virtual ~ExecPlan() = default
QueryContext *query_context()
const NodeVector &nodes() const

retrieve the nodes in the plan

ExecNode *AddNode(std::unique_ptr<ExecNode> node)
template<typename Node, typename ...Args>
inline Node *EmplaceNode(Args&&... args)
Status Validate()
void StartProducing()

Start producing on all nodes.

Nodes are started in reverse topological order, such that any node is started before all of its inputs.

void StopProducing()

Stop producing on all nodes.

Triggers all sources to stop producing new data. In order to cleanly stop the plan will continue to run any tasks that are already in progress. The caller should still wait for finished to complete before destroying the plan.

Future finished()

A future which will be marked finished when all tasks have finished.

bool HasMetadata() const

Return whether the plan has non-empty metadata.

std::shared_ptr<const KeyValueMetadata> metadata() const

Return the plan’s attached metadata.

std::string ToString() const

Public Static Functions

static Result<std::shared_ptr<ExecPlan>> Make(QueryOptions options, ExecContext exec_context = *threaded_exec_context(), std::shared_ptr<const KeyValueMetadata> metadata = NULLPTR)

Make an empty exec plan.

static Result<std::shared_ptr<ExecPlan>> Make(ExecContext exec_context = *threaded_exec_context(), std::shared_ptr<const KeyValueMetadata> metadata = NULLPTR)
static Result<std::shared_ptr<ExecPlan>> Make(QueryOptions options, ExecContext *exec_context, std::shared_ptr<const KeyValueMetadata> metadata = NULLPTR)
static Result<std::shared_ptr<ExecPlan>> Make(ExecContext *exec_context, std::shared_ptr<const KeyValueMetadata> metadata = NULLPTR)

Public Static Attributes

static const uint32_t kMaxBatchSize = 1 << 15
class ExecNode
#include <arrow/acero/exec_plan.h>

Subclassed by arrow::acero::MapNode

Public Types

using NodeVector = std::vector<ExecNode*>

Public Functions

virtual ~ExecNode() = default
virtual const char *kind_name() const = 0
inline int num_inputs() const
inline const NodeVector &inputs() const

This node’s predecessors in the exec plan.

inline bool is_sink() const

True if the plan has no output schema (is a sink)

inline const std::vector<std::string> &input_labels() const

Labels identifying the function of each input.

inline const ExecNode *output() const

This node’s successor in the exec plan.

inline const std::shared_ptr<Schema> &output_schema() const

The datatypes for batches produced by this node.

inline ExecPlan *plan()

This node’s exec plan.

inline const std::string &label() const

An optional label, for display and debugging.

There is no guarantee that this value is non-empty or unique.

inline void SetLabel(std::string label)
virtual Status Validate() const
virtual const Ordering &ordering() const

the ordering of the output batches

This does not guarantee the batches will be emitted by this node in order. Instead it guarantees that the batches will have their ExecBatch::index property set in a way that respects this ordering.

In other words, given the ordering {{“x”, SortOrder::Ascending}} we know that all values of x in a batch with index N will be less than or equal to all values of x in a batch with index N+k (assuming k > 0). Furthermore, we also know that values will be sorted within a batch. Any row N will have a value of x that is less than the value for any row N+k.

Note that an ordering can be both Ordering::Unordered and Ordering::Implicit. A node’s output should be marked Ordering::Unordered if the order is non-deterministic. For example, a hash-join has no predictable output order.

If the ordering is Ordering::Implicit then there is a meaningful order but that odering is not represented by any column in the data. The most common case for this is when reading data from an in-memory table. The data has an implicit “row order” which is not neccesarily represented in the data set.

A filter or project node will not modify the ordering. Nothing needs to be done other than ensure the index assigned to output batches is the same as the input batch that was mapped.

Other nodes may introduce order. For example, an order-by node will emit a brand new ordering independent of the input ordering.

Finally, as described above, such as a hash-join or aggregation may may destroy ordering (although these nodes could also choose to establish a new ordering based on the hash keys).

Some nodes will require an ordering. For example, a fetch node or an asof join node will only function if the input data is ordered (for fetch it is enough to be implicitly ordered. For an asof join the ordering must be explicit and compatible with the on key.)

Nodes that maintain ordering should be careful to avoid introducing gaps in the batch index. This may require emitting empty batches in order to maintain continuity.

virtual Status InputReceived(ExecNode *input, ExecBatch batch) = 0

Upstream API: These functions are called by input nodes that want to inform this node about an updated condition (a new input batch or an impending end of stream).

Implementation rules:

A node will typically perform some kind of operation on the batch and then call InputReceived on its outputs with the result.

Other nodes may need to accumulate some number of inputs before any output can be produced. These nodes will add the batch to some kind of in-memory accumulation queue and return.

virtual Status InputFinished(ExecNode *input, int total_batches) = 0

Mark the inputs finished after the given number of batches.

This may be called before all inputs are received. This simply fixes the total number of incoming batches for an input, so that the ExecNode knows when it has received all input, regardless of order.

virtual Status Init()

Perform any needed initialization.

This hook performs any actions in between creation of ExecPlan and the call to StartProducing. An example could be Bloom filter pushdown. The order of ExecNodes that executes this method is undefined, but the calls are made synchronously.

At this point a node can rely on all inputs & outputs (and the input schemas) being well defined.

virtual Status StartProducing() = 0

Lifecycle API:

  • start / stop to initiate and terminate production

  • pause / resume to apply backpressure

Implementation rules:

StopProducing may be called due to an error, by the user (e.g. cancel), or because a node has all the data it needs (e.g. limit, top-k on sorted data). This means the method may be called multiple times and we have the following additional rules

  • StopProducing() must be idempotent

  • StopProducing() must be forwarded to inputs (this is needed for the limit/top-k case because we may not be stopping the entire plan)

Start producing

This must only be called once.

This is typically called automatically by ExecPlan::StartProducing().

virtual void PauseProducing(ExecNode *output, int32_t counter) = 0

Pause producing temporarily.

This call is a hint that an output node is currently not willing to receive data.

This may be called any number of times. However, the node is still free to produce data (which may be difficult to prevent anyway if data is produced using multiple threads).

Parameters:
  • output – Pointer to the output that is full

  • counter – Counter used to sequence calls to pause/resume

virtual void ResumeProducing(ExecNode *output, int32_t counter) = 0

Resume producing after a temporary pause.

This call is a hint that an output node is willing to receive data again.

This may be called any number of times.

Parameters:
  • output – Pointer to the output that is now free

  • counter – Counter used to sequence calls to pause/resume

Status StopProducing()

Stop producing new data.

If this node is a source then the source should stop generating data as quickly as possible. If this node is not a source then there is typically nothing that needs to be done although a node may choose to start ignoring incoming data.

This method will be called when an error occurs in the plan This method may also be called by the user if they wish to end a plan early Finally, this method may be called if a node determines it no longer needs any more input (for example, a limit node).

This method may be called multiple times.

This is not a pause. There will be no way to start the source again after this has been called.

std::string ToString(int indent = 0) const
class ExecFactoryRegistry
#include <arrow/acero/exec_plan.h>

An extensible registry for factories of ExecNodes.

Public Types

using Factory = std::function<Result<ExecNode*>(ExecPlan*, std::vector<ExecNode*>, const ExecNodeOptions&)>

Public Functions

virtual ~ExecFactoryRegistry() = default
virtual Result<Factory> GetFactory(const std::string &factory_name) = 0

Get the named factory from this registry.

will raise if factory_name is not found

virtual Status AddFactory(std::string factory_name, Factory factory) = 0

Add a factory to this registry with the provided name.

will raise if factory_name is already in the registry

struct Declaration
#include <arrow/acero/exec_plan.h>

Helper class for declaring sets of ExecNodes efficiently.

A Declaration represents an unconstructed ExecNode (and potentially more since its inputs may also be Declarations). The node can be constructed and added to a plan with Declaration::AddToPlan, which will recursively construct any inputs as necessary.

Public Types

using Input = std::variant<ExecNode*, Declaration>

Public Functions

inline Declaration()
inline Declaration(std::string factory_name, std::vector<Input> inputs, std::shared_ptr<ExecNodeOptions> options, std::string label)
template<typename Options>
inline Declaration(std::string factory_name, std::vector<Input> inputs, Options options, std::string label)
template<typename Options>
inline Declaration(std::string factory_name, std::vector<Input> inputs, Options options)
template<typename Options>
inline Declaration(std::string factory_name, Options options)
template<typename Options>
inline Declaration(std::string factory_name, Options options, std::string label)
Result<ExecNode*> AddToPlan(ExecPlan *plan, ExecFactoryRegistry *registry = default_exec_factory_registry()) const
bool IsValid(ExecFactoryRegistry *registry = default_exec_factory_registry()) const

Public Members

std::string factory_name
std::vector<Input> inputs
std::shared_ptr<ExecNodeOptions> options
std::string label

Public Static Functions

static Declaration Sequence(std::vector<Declaration> decls)

Convenience factory for the common case of a simple sequence of nodes.

Each of decls will be appended to the inputs of the subsequent declaration, and the final modified declaration will be returned.

Without this convenience factory, constructing a sequence would require explicit, difficult-to-read nesting:

Declaration{"n3",
              {
                  Declaration{"n2",
                              {
                                  Declaration{"n1",
                                              {
                                                  Declaration{"n0", N0Opts{}},
                                              },
                                              N1Opts{}},
                              },
                              N2Opts{}},
              },
              N3Opts{}};

An equivalent Declaration can be constructed more tersely using Sequence:

Declaration::Sequence({
    {"n0", N0Opts{}},
    {"n1", N1Opts{}},
    {"n2", N2Opts{}},
    {"n3", N3Opts{}},
});

struct QueryOptions
#include <arrow/acero/exec_plan.h>

plan-wide options that can be specified when executing an execution plan

Public Members

bool use_legacy_batching = false

Should the plan use a legacy batching strategy.

This is currently in place only to support the Scanner::ToTable method. This method relies on batch indices from the scanner remaining consistent. This is impractical in the ExecPlan which might slice batches as needed (e.g. for a join)

However, it still works for simple plans and this is the only way we have at the moment for maintaining implicit order.

std::optional<bool> sequence_output = std::nullopt

If the output has a meaningful order then sequence the output of the plan.

The default behavior (std::nullopt) will sequence output batches if there is a meaningful ordering in the final node and will emit batches immediately otherwise.

If explicitly set to true then plan execution will fail if there is no meaningful ordering. This can be useful to valdiate a query that should be emitting ordered results.

If explicitly set to false then batches will be emit immediately even if there is a meaningful ordering. This could cause batches to be emit out of order but may offer a small decrease to latency.

bool use_threads = true

should the plan use multiple background threads for CPU-intensive work

If this is false then all CPU work will be done on the calling thread. I/O tasks will still happen on the I/O executor and may be multi-threaded (but should not use significant CPU resources).

Will be ignored if custom_cpu_executor is set

::arrow::internal::Executor *custom_cpu_executor = NULLPTR

custom executor to use for CPU-intensive work

Must be null or remain valid for the duration of the plan. If this is null then a default thread pool will be chosen whose behavior will be controlled by the use_threads option.

MemoryPool *memory_pool = default_memory_pool()

a memory pool to use for allocations

Must remain valid for the duration of the plan.

FunctionRegistry *function_registry = GetFunctionRegistry()

a function registry to use for the plan

Must remain valid for the duration of the plan.

std::vector<std::string> field_names

the names of the output columns

If this is empty then names will be generated based on the input columns

If set then the number of names must equal the number of output columns

std::optional<UnalignedBufferHandling> unaligned_buffer_handling

Policy for unaligned buffers in source data.

Various compute functions and acero internals will type pun array buffers from uint8_t* to some kind of value type (e.g. we might cast to int32_t* to add two int32 arrays)

If the buffer is poorly aligned (e.g. an int32 array is not aligned on a 4-byte boundary) then this is technically undefined behavior in C++. However, most modern compilers and CPUs are fairly tolerant of this behavior and nothing bad (beyond a small hit to performance) is likely to happen.

Note that this only applies to source buffers. All buffers allocated internally by Acero will be suitably aligned.

If this field is set to kWarn then Acero will check if any buffers are unaligned and, if they are, will emit a warning.

If this field is set to kReallocate then Acero will allocate a new, suitably aligned buffer and copy the contents from the old buffer into this new buffer.

If this field is set to kError then Acero will gracefully abort the plan instead.

If this field is set to kIgnore then Acero will not even check if the buffers are unaligned.

If this field is not set then it will be treated as kWarn unless overridden by the ACERO_ALIGNMENT_HANDLING environment variable

struct BatchesWithCommonSchema
#include <arrow/acero/exec_plan.h>

a collection of exec batches with a common schema

Public Members

std::vector<ExecBatch> batches
std::shared_ptr<Schema> schema
struct ExecBatch
#include <arrow/compute/exec.h>

Public Functions

ExecBatch() = default
inline ExecBatch(std::vector<Datum> values, int64_t length)
explicit ExecBatch(const RecordBatch &batch)
Result<std::shared_ptr<RecordBatch>> ToRecordBatch(std::shared_ptr<Schema> schema, MemoryPool *pool = default_memory_pool()) const
int64_t TotalBufferSize() const

The sum of bytes in each buffer referenced by the batch.

Note: Scalars are not counted Note: Some values may referenced only part of a buffer, for example, an array with an offset. The actual data visible to this batch will be smaller than the total buffer size in this case.

template<typename index_type>
inline const Datum &operator[](index_type i) const

Return the value at the i-th index.

bool Equals(const ExecBatch &other) const
inline int num_values() const

A convenience for the number of values / arguments.

ExecBatch Slice(int64_t offset, int64_t length) const
Result<ExecBatch> SelectValues(const std::vector<int> &ids) const
inline std::vector<TypeHolder> GetTypes() const

A convenience for returning the types from the batch.

std::string ToString() const

Public Members

std::vector<Datum> values

The values representing positional arguments to be passed to a kernel’s exec function for processing.

std::shared_ptr<SelectionVector> selection_vector

A deferred filter represented as an array of indices into the values.

For example, the filter [true, true, false, true] would be represented as the selection vector [0, 1, 3]. When the selection vector is set, ExecBatch::length is equal to the length of this array.

Expression guarantee = literal(true)

A predicate Expression guaranteed to evaluate to true for all rows in this batch.

int64_t length = 0

The semantic length of the ExecBatch.

When the values are all scalars, the length should be set to 1 for non-aggregate kernels, otherwise the length is taken from the array values, except when there is a selection vector. When there is a selection vector set, the length of the batch is the length of the selection. Aggregate kernels can have an ExecBatch formed by projecting just the partition columns from a batch in which case, it would have scalar rows with length greater than 1.

If the array values are of length 0 then the length is 0 regardless of whether any values are Scalar.

int64_t index = kUnsequencedIndex

index of this batch in a sorted stream of batches

This index must be strictly monotonic starting at 0 without gaps or it can be set to kUnsequencedIndex if there is no meaningful order

Public Static Functions

static Result<int64_t> InferLength(const std::vector<Datum> &values)

Infer the ExecBatch length from values.

static Result<ExecBatch> Make(std::vector<Datum> values, int64_t length = -1)

Creates an ExecBatch with length-validation.

If any value is given, then all values must have a common length. If the given length is negative, then the length of the ExecBatch is set to this common length, or to 1 if no values are given. Otherwise, the given length must equal the common length, if any value is given.

struct ExecValue
#include <arrow/compute/exec.h>

Public Functions

inline ExecValue(Scalar *scalar)
inline ExecValue(ArraySpan array)
inline ExecValue(const ArrayData &array)
ExecValue() = default
ExecValue(const ExecValue &other) = default
ExecValue &operator=(const ExecValue &other) = default
ExecValue(ExecValue &&other) = default
ExecValue &operator=(ExecValue &&other) = default
inline int64_t length() const
inline bool is_array() const
inline bool is_scalar() const
inline void SetArray(const ArrayData &array)
inline void SetScalar(const Scalar *scalar)
template<typename ExactType>
inline const ExactType &scalar_as() const
inline int64_t null_count() const

XXX: here temporarily for compatibility with datum, see e.g.

MakeStructExec in scalar_nested.cc

inline const DataType *type() const

Public Members

ArraySpan array = {}
const Scalar *scalar = NULLPTR
struct ExecResult
#include <arrow/compute/exec.h>

Public Functions

inline int64_t length() const
inline const DataType *type() const
inline const ArraySpan *array_span() const
inline ArraySpan *array_span_mutable()
inline bool is_array_span() const
inline const std::shared_ptr<ArrayData> &array_data() const
inline bool is_array_data() const

Public Members

std::variant<ArraySpan, std::shared_ptr<ArrayData>> value
struct ExecSpan
#include <arrow/compute/exec.h>

A “lightweight” column batch object which contains no std::shared_ptr objects and does not have any memory ownership semantics.

Can represent a view onto an “owning” ExecBatch.

Public Functions

ExecSpan() = default
ExecSpan(const ExecSpan &other) = default
ExecSpan &operator=(const ExecSpan &other) = default
ExecSpan(ExecSpan &&other) = default
ExecSpan &operator=(ExecSpan &&other) = default
inline explicit ExecSpan(std::vector<ExecValue> values, int64_t length)
inline explicit ExecSpan(const ExecBatch &batch)
template<typename index_type>
inline const ExecValue &operator[](index_type i) const

Return the value at the i-th index.

inline int num_values() const

A convenience for the number of values / arguments.

inline std::vector<TypeHolder> GetTypes() const
inline ExecBatch ToExecBatch() const

Public Members

int64_t length = 0
std::vector<ExecValue> values

Execution Plan Expressions

inline bool operator==(const Expression &l, const Expression &r)
inline bool operator!=(const Expression &l, const Expression &r)
void PrintTo(const Expression&, std::ostream*)
Expression literal(Datum lit)
template<typename Arg>
Expression literal(Arg &&arg)
Expression field_ref(FieldRef ref)
Expression call(std::string function, std::vector<Expression> arguments, std::shared_ptr<FunctionOptions> options = NULLPTR)
template<typename Options, typename = typename std::enable_if<std::is_base_of<FunctionOptions, Options>::value>::type>
Expression call(std::string function, std::vector<Expression> arguments, Options options)
std::vector<FieldRef> FieldsInExpression(const Expression&)

Assemble a list of all fields referenced by an Expression at any depth.

bool ExpressionHasFieldRefs(const Expression&)

Check if the expression references any fields.

Result<KnownFieldValues> ExtractKnownFieldValues(const Expression &guaranteed_true_predicate)

Assemble a mapping from field references to known values.

This derives known values from “equal” and “is_null” Expressions referencing a field and a literal.

class Expression
#include <arrow/compute/expression.h>

An unbound expression which maps a single Datum to another Datum.

An expression is one of

  • A literal Datum.

  • A reference to a single (potentially nested) field of the input Datum.

  • A call to a compute function, with arguments specified by other Expressions.

Public Functions

Result<Expression> Bind(const TypeHolder &in, ExecContext* = NULLPTR) const

Bind this expression to the given input type, looking up Kernels and field types.

Some expression simplification may be performed and implicit casts will be inserted. Any state necessary for execution will be initialized and returned.

bool IsBound() const

Return true if all an expression’s field references have explicit types and all of its functions’ kernels are looked up.

bool IsScalarExpression() const

Return true if this expression is composed only of Scalar literals, field references, and calls to ScalarFunctions.

bool IsNullLiteral() const

Return true if this expression is literal and entirely null.

bool IsSatisfiable() const

Return true if this expression could evaluate to true.

Will return true for any unbound, non-boolean, or unsimplified Expressions

const Call *call() const

Access a Call or return nullptr if this expression is not a call.

const Datum *literal() const

Access a Datum or return nullptr if this expression is not a literal.

const FieldRef *field_ref() const

Access a FieldRef or return nullptr if this expression is not a field_ref.

const DataType *type() const

The type to which this expression will evaluate.

struct Call
#include <arrow/compute/expression.h>
struct Hash
#include <arrow/compute/expression.h>
struct Parameter
#include <arrow/compute/expression.h>
Expression project(std::vector<Expression> values, std::vector<std::string> names)
Expression equal(Expression lhs, Expression rhs)
Expression not_equal(Expression lhs, Expression rhs)
Expression less(Expression lhs, Expression rhs)
Expression less_equal(Expression lhs, Expression rhs)
Expression greater(Expression lhs, Expression rhs)
Expression greater_equal(Expression lhs, Expression rhs)
Expression is_null(Expression lhs, bool nan_is_null = false)
Expression is_valid(Expression lhs)
Expression and_(Expression lhs, Expression rhs)
Expression and_(const std::vector<Expression>&)
Expression or_(Expression lhs, Expression rhs)
Expression or_(const std::vector<Expression>&)
Expression not_(Expression operand)
Result<Expression> Canonicalize(Expression, ExecContext* = NULLPTR)

Weak canonicalization which establishes guarantees for subsequent passes.

Even equivalent Expressions may result in different canonicalized expressions. TODO this could be a strong canonicalization

Result<Expression> FoldConstants(Expression)

Simplify Expressions based on literal arguments (for example, add(null, x) will always be null so replace the call with a null literal).

Includes early evaluation of all calls whose arguments are entirely literal.

Result<Expression> ReplaceFieldsWithKnownValues(const KnownFieldValues &known_values, Expression)

Simplify Expressions by replacing with known values of the fields which it references.

Result<Expression> SimplifyWithGuarantee(Expression, const Expression &guaranteed_true_predicate)

Simplify an expression by replacing subexpressions based on a guarantee: a boolean expression which is guaranteed to evaluate to true.

For example, this is used to remove redundant function calls from a filter expression or to replace a reference to a constant-value field with a literal.

Result<Expression> RemoveNamedRefs(Expression expression)

Replace all named field refs (e.g.

“x” or “x.y”) with field paths (e.g. [0] or [1,3])

This isn’t usually needed and does not offer any simplification by itself. However, it can be useful to normalize an expression to paths to make it simpler to work with.