Using Acero with Substrait
To use Acero you need to create an execution plan. This is the model that describes the computation you want to apply to your data. Acero has its own internal representation for execution plans, but most users should not interact with it directly because doing so couples their code to Acero.
Substrait is an open standard for execution plans. Acero implements the Substrait "consumer" interface. This means that Acero can accept a Substrait plan and fulfill it, loading the requested data and applying the desired computation. By using Substrait plans, users can switch to a different execution engine at a later time.
Substrait Conformance
Substrait defines a broad set of operators and functions for many different situations, and it is unlikely that Acero will ever support all of them. The following sections list the features that are currently implemented in Acero and any caveats that apply.
Plans
A plan should have a single top-level relation.
The consumer is currently based on version 0.20.0 of Substrait. Features added in later versions are not supported.
Due to a breaking change in 0.20.0, any Substrait plan older than 0.20.0 will be rejected.
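The plan-level rules above can be checked before a plan is handed to Acero. The following is a minimal sketch, not part of Acero: it assumes the plan has been decoded from protobuf JSON into a Python dict with camelCase field names (`version`, `majorNumber`, `minorNumber`, `patchNumber`, `relations`), and the helper names `plan_version` and `preflight` are hypothetical.

```python
# Oldest Substrait version the consumer accepts, per the text above.
MIN_SUPPORTED = (0, 20, 0)


def plan_version(plan: dict) -> tuple:
    """Return (major, minor, patch) from a JSON-decoded Substrait plan."""
    version = plan.get("version", {})
    # protobuf JSON omits zero-valued fields, so each part defaults to 0.
    return (
        int(version.get("majorNumber", 0)),
        int(version.get("minorNumber", 0)),
        int(version.get("patchNumber", 0)),
    )


def preflight(plan: dict) -> list:
    """Return a list of plan-level problems; an empty list means none found."""
    problems = []
    if len(plan.get("relations", [])) != 1:
        problems.append("plan should have a single top-level relation")
    if plan_version(plan) < MIN_SUPPORTED:
        problems.append("plan is older than 0.20.0 and will be rejected")
    return problems
```

A plan that reports a version newer than 0.20.0 passes this check, but any features it uses that were added after 0.20.0 are still unsupported.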
Extensions
If a plan contains any extension type variations it will be rejected.
Advanced extensions can be provided by supplying a custom implementation of `arrow::engine::ExtensionProvider`.
Relations (in general)
Any relation not explicitly listed below is not supported and will cause the plan to be rejected.
Read Relations
The `projection` property is not supported and plans containing this property will be rejected.
The `VirtualTable` and `ExtensionTable` read types are not supported. Plans containing these types will be rejected.
Only the parquet and arrow file formats are currently supported.
All URIs must use the `file` scheme.
`partition_index`, `start`, and `length` are not supported. Plans containing non-default values for these properties will be rejected.
The Substrait spec requires that a `filter` be completely satisfied by a read relation. However, Acero only uses a read filter for pushdown projection and it may not be fully satisfied. Users should generally attach an additional filter relation with the same filter expression after the read relation.
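The last recommendation, repeating the read filter in a separate filter relation, can be sketched as a small transformation on a JSON-decoded plan fragment. This is an illustration only: the field names (`read`, `filter`, `input`, `condition`, `localFiles`, `uriFile`) assume the protobuf JSON mapping of the Substrait messages, the file path is made up, and `with_explicit_filter` is a hypothetical helper, not an Acero or Substrait API.

```python
import copy


def with_explicit_filter(read_rel: dict) -> dict:
    """Wrap a read relation in a filter relation that repeats its filter.

    If the read relation has no filter, it is returned unchanged.
    """
    condition = read_rel.get("read", {}).get("filter")
    if condition is None:
        return read_rel
    return {
        "filter": {
            "input": read_rel,
            # Same expression as the read filter, applied again after the read.
            "condition": copy.deepcopy(condition),
        }
    }


# Example read relation (illustrative values): one local parquet file and a
# filter that is a direct reference to column 0.
example_read = {
    "read": {
        "filter": {
            "selection": {
                "directReference": {"structField": {"field": 0}},
                "rootReference": {},
            }
        },
        "localFiles": {
            "items": [{"uriFile": "file:///tmp/example.parquet", "parquet": {}}]
        },
    }
}

wrapped = with_explicit_filter(example_read)
```

The read relation keeps its own filter so that Acero can still use it for pushdown, while the outer filter relation guarantees the expression is fully applied.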
Filter Relations
No known caveats.
Project Relations
No known caveats.
Join Relations
The join type `JOIN_TYPE_SINGLE` is not supported and plans containing it will be rejected.
The join expression must be a call to either the `equal` or the `is_not_distinct_from` function. Both arguments to the call must be direct references. Only a single join key is supported.
The `post_join_filter` property is not supported and will be ignored.
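A planner can test a join expression against these constraints before emitting the plan. The sketch below works on a JSON-decoded expression and is not how Acero performs the check. It assumes camelCase protobuf JSON field names (`scalarFunction`, `functionReference`, `arguments`, `value`, `selection`, `directReference`), and `function_names` is a hypothetical mapping from function anchor to function name that the caller builds from the plan's extension declarations.

```python
SUPPORTED_JOIN_FUNCTIONS = {"equal", "is_not_distinct_from"}


def is_direct_reference(expr: dict) -> bool:
    """True if the expression is a field selection using a direct reference."""
    return "directReference" in expr.get("selection", {})


def is_supported_join_expression(expr: dict, function_names: dict) -> bool:
    """Check the join expression shape described above.

    The expression must be a call to `equal` or `is_not_distinct_from`
    whose two arguments are both direct references.
    """
    call = expr.get("scalarFunction")
    if call is None:
        return False
    # protobuf JSON omits zero values, so a missing reference means anchor 0.
    name = function_names.get(call.get("functionReference", 0))
    if name not in SUPPORTED_JOIN_FUNCTIONS:
        return False
    args = [arg.get("value", {}) for arg in call.get("arguments", [])]
    return len(args) == 2 and all(is_direct_reference(a) for a in args)


def field_ref(index: int) -> dict:
    """Build a direct reference to a top-level field (illustrative helper)."""
    return {
        "selection": {
            "directReference": {"structField": {"field": index}},
            "rootReference": {},
        }
    }
```

Because only a single join key is supported, a conjunction of several `equal` calls fails this check, just as it would be rejected by the consumer.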
Aggregate Relations
At most one grouping set is supported.
Each grouping expression must be a direct reference.
Each measure's arguments must be direct references.
A measure may not have a filter.
A measure may not have sorts.
A measure's invocation must be `AGGREGATION_INVOCATION_ALL` or `AGGREGATION_INVOCATION_UNSPECIFIED`.
A measure's phase must be `AGGREGATION_PHASE_INITIAL_TO_RESULT`.
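The aggregate rules above translate directly into a checklist. The following sketch collects violations from a JSON-decoded aggregate relation. It is an illustration under assumptions, not Acero's own validation: the field names (`groupings`, `groupingExpressions`, `measures`, `measure`, `filter`, `sorts`, `invocation`, `phase`, `arguments`) assume the protobuf JSON mapping of the Substrait messages, and `aggregate_problems` is a hypothetical helper.

```python
ALLOWED_INVOCATIONS = {
    "AGGREGATION_INVOCATION_ALL",
    "AGGREGATION_INVOCATION_UNSPECIFIED",
}
REQUIRED_PHASE = "AGGREGATION_PHASE_INITIAL_TO_RESULT"


def is_direct_reference(expr: dict) -> bool:
    """True if the expression is a field selection using a direct reference."""
    return "directReference" in expr.get("selection", {})


def aggregate_problems(aggregate: dict) -> list:
    """Return the rules from the list above that this aggregate violates."""
    problems = []
    groupings = aggregate.get("groupings", [])
    if len(groupings) > 1:
        problems.append("more than one grouping set")
    for grouping in groupings:
        for expr in grouping.get("groupingExpressions", []):
            if not is_direct_reference(expr):
                problems.append("grouping expression is not a direct reference")
    for item in aggregate.get("measures", []):
        if "filter" in item:
            problems.append("measure has a filter")
        function = item.get("measure", {})
        if function.get("sorts"):
            problems.append("measure has sorts")
        # An omitted enum in protobuf JSON means the UNSPECIFIED value.
        invocation = function.get("invocation", "AGGREGATION_INVOCATION_UNSPECIFIED")
        if invocation not in ALLOWED_INVOCATIONS:
            problems.append("unsupported invocation: " + invocation)
        if function.get("phase") != REQUIRED_PHASE:
            problems.append("phase must be " + REQUIRED_PHASE)
        for arg in function.get("arguments", []):
            if not is_direct_reference(arg.get("value", {})):
                problems.append("measure argument is not a direct reference")
    return problems
```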
Expressions (general)
Various places in the Substrait spec allow expressions to be used outside of a filter or project relation, for example in a join expression or an aggregate grouping set. Acero typically expects these expressions to be direct references. Planners should extract the implicit projection into a formal project relation before delivering the plan to Acero.
Literals
A literal with non-default nullability will cause a plan to be rejected.
Types
Acero does not have full support for non-nullable types and may allow input to have nulls without rejecting it.
The table below shows the mapping between Arrow types and Substrait type classes that are currently supported:

| Substrait Type | Arrow Type | Caveat |
|---|---|---|
| boolean | boolean | |
| i8 | int8 | |
| i16 | int16 | |
| i32 | int32 | |
| i64 | int64 | |
| fp32 | float32 | |
| fp64 | float64 | |
| string | string | |
| binary | binary | |
| timestamp | timestamp<MICRO,""> | |
| timestamp_tz | timestamp<MICRO,"UTC"> | |
| date | date32<DAY> | |
| time | time64<MICRO> | |
| interval_year | Not currently supported | |
| interval_day | Not currently supported | |
| uuid | Not currently supported | |
| FIXEDCHAR<L> | Not currently supported | |
| VARCHAR<L> | Not currently supported | |
| FIXEDBINARY<L> | fixed_size_binary<L> | |
| DECIMAL<P,S> | decimal128<P,S> | |
| STRUCT<T1…TN> | struct<T1…TN> | Arrow struct fields will have no name (empty string) |
| NSTRUCT<N:T1…N:Tn> | Not currently supported | |
| LIST<T> | list<T> | |
| MAP<K,V> | map<K,V> | K must not be nullable |
Functions
The following functions have caveats or are not supported at all. Note that this is not a comprehensive list. Functions are being added to Substrait at a rapid pace and new functions may be missing.
Acero does not support the SATURATE option for overflow.
Acero does not support kernels that take more than two arguments for the functions `and`, `or`, and `xor`.
Substrait has not yet clearly identified the form that URIs should take for standard functions. Acero expects URIs that point to the `main` branch of the Substrait GitHub repository. In other words, for the file `functions_arithmetic.yaml` Acero expects `https://github.com/substrait-io/substrait/blob/main/extensions/functions_arithmetic.yaml`.
Acero has functions that are not yet a part of Substrait (or may never be added as official functions). To invoke these functions you can use the special URI `urn:arrow:substrait_simple_extension_function`. If this URI is encountered then Acero will match only on function name and will ignore any function options.
Alternatively, the URI can be left completely empty and Acero will match based only on function name. This fallback mechanism is non-standard and should be considered deprecated in favor of the special URI above.
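The two URI conventions above can be sketched as follows. The URI prefix and the special URN are taken from the text. The helper names, the anchor values, and the function name `some_arrow_function` are hypothetical. The declaration layout (`extensionUris`, `extensionUriAnchor`, `extensions`, `extensionFunction`, `extensionUriReference`, `functionAnchor`) assumes the protobuf JSON mapping of a Substrait plan.

```python
# Prefix Acero expects for standard Substrait function files (from the text).
SUBSTRAIT_EXTENSIONS_BASE = (
    "https://github.com/substrait-io/substrait/blob/main/extensions/"
)
# Special URI for Acero functions that are not part of Substrait.
ARROW_SIMPLE_EXTENSION_URI = "urn:arrow:substrait_simple_extension_function"


def standard_function_uri(yaml_file: str) -> str:
    """Build the URI Acero expects for a standard function file."""
    return SUBSTRAIT_EXTENSIONS_BASE + yaml_file


def declare_function(uri: str, name: str, uri_anchor: int = 1,
                     function_anchor: int = 1) -> dict:
    """Build the plan-level declarations for one extension function."""
    return {
        "extensionUris": [{"extensionUriAnchor": uri_anchor, "uri": uri}],
        "extensions": [
            {
                "extensionFunction": {
                    "extensionUriReference": uri_anchor,
                    "functionAnchor": function_anchor,
                    "name": name,
                }
            }
        ],
    }


# A standard function, declared against the main-branch URI.
standard = declare_function(
    standard_function_uri("functions_arithmetic.yaml"), "add"
)
# An Acero-only function (hypothetical name), matched on name alone.
custom = declare_function(ARROW_SIMPLE_EXTENSION_URI, "some_arrow_function")
```

Declaring Acero-only functions with the special URI, rather than an empty URI, avoids relying on the deprecated fallback.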