Acero User’s Guide#

This page describes how to use Acero. It’s recommended that you read the overview first and familiarize yourself with the basic concepts.

Using Acero#

The basic workflow for Acero is this:

First, create a graph of Declaration objects describing the plan
Call one of the DeclarationToXyz methods to execute the Declaration.
1. A new ExecPlan is created from the graph of Declarations. Each Declaration will correspond to one ExecNode in the plan. In addition, a sink node will be added, depending on which DeclarationToXyz method was used.
2. The ExecPlan is executed. Typically this happens as part of the DeclarationToXyz call but in DeclarationToReader the reader is returned before the plan is finished executing.
3. Once the plan is finished it is destroyed

Creating a Plan#

Using Substrait#

Substrait is the preferred mechanism for creating a plan (graph of Declaration). There are a few reasons for this:

Substrait producers spend a lot of time and energy in creating user-friendly APIs for producing complex execution plans in a simple way. For example, the pivot_wider operation can be achieved using a complex series of aggregate nodes. Rather than create all of those aggregate nodes by hand a producer will give you a much simpler API.
If you are using Substrait then you can easily switch out to any other Substrait-consuming engine should you at some point find that it serves your needs better than Acero.
We hope that tools will eventually emerge for Substrait-based optimizers and planners. By using Substrait you will be making it much easier to use these tools in the future.

You could create the Substrait plan yourself but you’ll probably have a much easier time finding an existing Substrait producer. For example, you could use ibis-substrait to easily create Substrait plans from python expressions. There are a few different tools that are able to create Substrait plans from SQL. Eventually, we hope that C++ based Substrait producers will emerge. However, we are not aware of any at this time.

Detailed instructions on creating an execution plan from Substrait can be found in the Substrait page

Programmatic Plan Creation#

Creating an execution plan programmatically is simpler than creating a plan from Substrait, though loses some of the flexibility and future-proofing guarantees. The simplest way to create a Declaration is to simply instantiate one. You will need the name of the declaration, a vector of inputs, and an options object. For example:

/// \brief An example showing a project node
///
/// Scan-Project-Table
/// This example shows how a Scan operation can be used to load the data
/// into the execution plan, how a project operation can be applied on the
/// data stream and how the output is collected into a table
arrow::Status ScanProjectSinkExample() {
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::dataset::Dataset> dataset, GetDataset());

  auto options = std::make_shared<arrow::dataset::ScanOptions>();
  // projection
  cp::Expression a_times_2 = cp::call("multiply", {cp::field_ref("a"), cp::literal(2)});
  options->projection = cp::project({}, {});

  auto scan_node_options = arrow::dataset::ScanNodeOptions{dataset, options};

  ac::Declaration scan{"scan", std::move(scan_node_options)};
  ac::Declaration project{
      "project", {std::move(scan)}, ac::ProjectNodeOptions({a_times_2})};

  return ExecutePlanAndCollectAsTable(std::move(project));
}

The above code creates a scan declaration (which has no inputs) and a project declaration (using the scan as input). This is simple enough but we can make it slightly easier. If you are creating a linear sequence of declarations (like in the above example) then you can also use the Declaration::Sequence() function.

  // Inputs do not have to be passed to the project node when using Sequence
  ac::Declaration plan =
      ac::Declaration::Sequence({{"scan", std::move(scan_node_options)},
                                 {"project", ac::ProjectNodeOptions({a_times_2})}});

There are many more examples of programmatic plan creation later in this document.

Executing a Plan#

There are a number of different methods that can be used to execute a declaration. Each one provides the data in a slightly different form. Since all of these methods start with DeclarationTo... this guide will often refer to these methods as the DeclarationToXyz methods.

DeclarationToTable#

The DeclarationToTable() method will accumulate all of the results into a single arrow::Table. This is perhaps the simplest way to collect results from Acero. The main disadvantage to this approach is that it requires accumulating all results into memory.

Note

Acero processes large datasets in small chunks. This is described in more detail in the developer’s guide. As a result, you may be surprised to find that a table collected with DeclarationToTable is chunked differently than your input. For example, your input might be a large table with a single chunk with 2 million rows. Your output table might then have 64 chunks with 32Ki rows each. There is a current request to specify the chunk size for the output in GH-15155.

DeclarationToReader#

The DeclarationToReader() method allows you to iteratively consume the results. It will create an arrow::RecordBatchReader which you can read from at your leisure. If you do not read from the reader quickly enough then backpressure will be applied and the execution plan will pause. Closing the reader will cancel the running execution plan and the reader’s destructor will wait for the execution plan to finish whatever it is doing and so it may block.

DeclarationToStatus#

The DeclarationToStatus() method is useful if you want to run the plan but do not actually want to consume the results. For example, this is useful when benchmarking or when the plan has side effects such as a dataset write node. If the plan generates any results then they will be immediately discarded.

Running a Plan Directly#

If one of the DeclarationToXyz methods is not sufficient for some reason then it is possible to run a plan directly. This should only be needed if you are doing something unique. For example, if you have created a custom sink node or if you need a plan that has multiple outputs.

Note

In academic literature and many existing systems there is a general assumption that an execution plan has at most one output. There are some things in Acero, such as the DeclarationToXyz methods, which will expect this. However, there is nothing in the design that strictly prevents having multiple sink nodes.

Detailed instructions on how to do this are out of scope for this guide but the rough steps are:

Create a new ExecPlan object.
Add sink nodes to your graph of Declaration objects (this is the only type you will need to create declarations for sink nodes)
Use Declaration::AddToPlan() to add your declaration to your plan (if you have more than one output then you will not be able to use this method and will need to add your nodes one at a time)
Validate the plan with ExecPlan::Validate()
Start the plan with ExecPlan::StartProducing()
Wait for the future returned by ExecPlan::finished() to complete.

Providing Input#

Input data for an exec plan can come from a variety of sources. It is often read from files stored on some kind of filesystem. It is also common for input to come from in-memory data. In-memory data is typical, for example, in a pandas-like frontend. Input could also come from network streams like a Flight request. Acero can support all of these cases and can even support unique and custom situations not mentioned here.

There are pre-defined source nodes that cover the most common input scenarios. These are listed below. However, if your source data is unique then you will need to use the generic source node. This node expects you to provide an asycnhronous stream of batches and is covered in more detail here.

Available `ExecNode` Implementations#

The following tables quickly summarize the available operators.

Sources#

These nodes can be used as sources of data

Source Nodes#
Factory Name	Options	Brief Description
`source`	`SourceNodeOptions`	A generic source node that wraps an asynchronous stream of data (example)
`table_source`	`TableSourceNodeOptions`	Generates data from an `arrow::Table` (example)
`record_batch_source`	`RecordBatchSourceNodeOptions`	Generates data from an iterator of `arrow::RecordBatch`
`record_batch_reader_source`	`RecordBatchReaderSourceNodeOptions`	Generates data from an `arrow::RecordBatchReader`
`exec_batch_source`	`ExecBatchSourceNodeOptions`	Generates data from an iterator of `arrow::compute::ExecBatch`
`array_vector_source`	`ArrayVectorSourceNodeOptions`	Generates data from an iterator of vectors of `arrow::Array`
`scan`	`arrow::dataset::ScanNodeOptions`	Generates data from an `arrow::dataset::Dataset` (requires the datasets module) (example)

Compute Nodes#

These nodes perform computations on data and may transform or reshape the data

Compute Nodes#
Factory Name	Options	Brief Description
`filter`	`FilterNodeOptions`	Removes rows that do not match a given filter expression (example)
`project`	`ProjectNodeOptions`	Creates new columns by evaluating compute expressions. Can also drop and reorder columns (example)
`aggregate`	`AggregateNodeOptions`	Calculates summary statistics across the entire input stream or on groups of data (example)
`pivot_longer`	`PivotLongerNodeOptions`	Reshapes data by converting some columns into additional rows

Arrangement Nodes#

These nodes reorder, combine, or slice streams of data

Arrangement Nodes#
Factory Name	Options	Brief Description
`hash_join`	`HashJoinNodeOptions`	Joins two inputs based on common columns (example)
`asofjoin`	`AsofJoinNodeOptions`	Joins multiple inputs to the first input based on a common ordered column (often time)
`union`	N/A	Merges two inputs with identical schemas (example)
`order_by`	`OrderByNodeOptions`	Reorders a stream
`fetch`	`FetchNodeOptions`	Slices a range of rows from a stream

Sink Nodes#

These nodes terminate a plan. Users do not typically create sink nodes as they are selected based on the DeclarationToXyz method used to consume the plan. However, this list may be useful for those developing new sink nodes or using Acero in advanced ways.

Sink Nodes#
Factory Name	Options	Brief Description
`sink`	`SinkNodeOptions`	Collects batches into a FIFO queue with optional backpressure
`write`	`arrow::dataset::WriteNodeOptions`	Writes batches to a filesystem (example)
`consuming_sink`	`ConsumingSinkNodeOptions`	Consumes batches using a user provided callback function
`table_sink`	`TableSinkNodeOptions`	Collects batches into an `arrow::Table`
`order_by_sink`	`OrderBySinkNodeOptions`	Deprecated
`select_k_sink`	`SelectKSinkNodeOptions`	Deprecated

Examples#

The rest of this document contains example execution plans. Each example highlights the behavior of a specific execution node.

`source`#

A source operation can be considered as an entry point to create a streaming execution plan. SourceNodeOptions are used to create the source operation. The source operation is the most generic and flexible type of source currently available but it can be quite tricky to configure. First you should review the other source node types to ensure there isn’t a simpler choice.

The source node requires some kind of function that can be called to poll for more data. This function should take no arguments and should return an arrow::Future<std::optional<arrow::ExecBatch>>. This function might be reading a file, iterating through an in memory structure, or receiving data from a network connection. The arrow library refers to these functions as arrow::AsyncGenerator and there are a number of utilities for working with these functions. For this example we use a vector of record batches that we’ve already stored in memory. In addition, the schema of the data must be known up front. Acero must know the schema of the data at each stage of the execution graph before any processing has begun. This means we must supply the schema for a source node separately from the data itself.

Here we define a struct to hold the data generator definition. This includes in-memory batches, schema and a function that serves as a data generator :

struct BatchesWithSchema {
  std::vector<cp::ExecBatch> batches;
  std::shared_ptr<arrow::Schema> schema;
  // This method uses internal arrow utilities to
  // convert a vector of record batches to an AsyncGenerator of optional batches
  arrow::AsyncGenerator<std::optional<cp::ExecBatch>> gen() const {
    auto opt_batches = ::arrow::internal::MapVector(
        [](cp::ExecBatch batch) { return std::make_optional(std::move(batch)); },
        batches);
    arrow::AsyncGenerator<std::optional<cp::ExecBatch>> gen;
    gen = arrow::MakeVectorGenerator(std::move(opt_batches));
    return gen;
  }
};

Generating sample batches for computation:

arrow::Result<BatchesWithSchema> MakeBasicBatches() {
  BatchesWithSchema out;
  auto field_vector = {arrow::field("a", arrow::int32()),
                       arrow::field("b", arrow::boolean())};
  ARROW_ASSIGN_OR_RAISE(auto b1_int, GetArrayDataSample<arrow::Int32Type>({0, 4}));
  ARROW_ASSIGN_OR_RAISE(auto b2_int, GetArrayDataSample<arrow::Int32Type>({5, 6, 7}));
  ARROW_ASSIGN_OR_RAISE(auto b3_int, GetArrayDataSample<arrow::Int32Type>({8, 9, 10}));

  ARROW_ASSIGN_OR_RAISE(auto b1_bool,
                        GetArrayDataSample<arrow::BooleanType>({false, true}));
  ARROW_ASSIGN_OR_RAISE(auto b2_bool,
                        GetArrayDataSample<arrow::BooleanType>({true, false, true}));
  ARROW_ASSIGN_OR_RAISE(auto b3_bool,
                        GetArrayDataSample<arrow::BooleanType>({false, true, false}));

  ARROW_ASSIGN_OR_RAISE(auto b1,
                        GetExecBatchFromVectors(field_vector, {b1_int, b1_bool}));
  ARROW_ASSIGN_OR_RAISE(auto b2,
                        GetExecBatchFromVectors(field_vector, {b2_int, b2_bool}));
  ARROW_ASSIGN_OR_RAISE(auto b3,
                        GetExecBatchFromVectors(field_vector, {b3_int, b3_bool}));

  out.batches = {b1, b2, b3};
  out.schema = arrow::schema(field_vector);
  return out;
}

Example of using source (usage of sink is explained in detail in sink):

/// \brief An example demonstrating a source and sink node
///
/// Source-Table Example
/// This example shows how a custom source can be used
/// in an execution plan. This includes source node using pregenerated
/// data and collecting it into a table.
///
/// This sort of custom source is often not needed.  In most cases you can
/// use a scan (for a dataset source) or a source like table_source, array_vector_source,
/// exec_batch_source, or record_batch_source (for in-memory data)
arrow::Status SourceSinkExample() {
  ARROW_ASSIGN_OR_RAISE(auto basic_data, MakeBasicBatches());

  auto source_node_options = ac::SourceNodeOptions{basic_data.schema, basic_data.gen()};

  ac::Declaration source{"source", std::move(source_node_options)};

  return ExecutePlanAndCollectAsTable(std::move(source));
}

`table_source`#

In the previous example, source node, a source node was used to input the data. But when developing an application, if the data is already in memory as a table, it is much easier, and more performant to use TableSourceNodeOptions. Here the input data can be passed as a std::shared_ptr<arrow::Table> along with a max_batch_size. The max_batch_size is to break up large record batches so that they can be processed in parallel. It is important to note that the table batches will not get merged to form larger batches when the source table has a smaller batch size.

Example of using table_source

/// \brief An example showing a table source node
///
/// TableSource-Table Example
/// This example shows how a table_source can be used
/// in an execution plan. This includes a table source node
/// receiving data from a table.  This plan simply collects the
/// data back into a table but nodes could be added that modify
/// or transform the data as well (as is shown in later examples)
arrow::Status TableSourceSinkExample() {
  ARROW_ASSIGN_OR_RAISE(auto table, GetTable());

  arrow::AsyncGenerator<std::optional<cp::ExecBatch>> sink_gen;
  int max_batch_size = 2;
  auto table_source_options = ac::TableSourceNodeOptions{table, max_batch_size};

  ac::Declaration source{"table_source", std::move(table_source_options)};

  return ExecutePlanAndCollectAsTable(std::move(source));
}

`filter`#

filter operation, as the name suggests, provides an option to define data filtering criteria. It selects rows where the given expression evaluates to true. Filters can be written using arrow::compute::Expression, and the expression should have a return type of boolean. For example, if we wish to keep rows where the value of column b is greater than 3, then we can use the following expression.

Filter example:

/// \brief An example showing a filter node
///
/// Source-Filter-Table
/// This example shows how a filter can be used in an execution plan,
/// to filter data from a source. The output from the exeuction plan
/// is collected into a table.
arrow::Status ScanFilterSinkExample() {
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::dataset::Dataset> dataset, GetDataset());

  auto options = std::make_shared<arrow::dataset::ScanOptions>();
  // specify the filter.  This filter removes all rows where the
  // value of the "a" column is greater than 3.
  cp::Expression filter_expr = cp::greater(cp::field_ref("a"), cp::literal(3));
  // set filter for scanner : on-disk / push-down filtering.
  // This step can be skipped if you are not reading from disk.
  options->filter = filter_expr;
  // empty projection
  options->projection = cp::project({}, {});

  // construct the scan node
  std::cout << "Initialized Scanning Options" << std::endl;

  auto scan_node_options = arrow::dataset::ScanNodeOptions{dataset, options};
  std::cout << "Scan node options created" << std::endl;

  ac::Declaration scan{"scan", std::move(scan_node_options)};

  // pipe the scan node into the filter node
  // Need to set the filter in scan node options and filter node options.
  // At scan node it is used for on-disk / push-down filtering.
  // At filter node it is used for in-memory filtering.
  ac::Declaration filter{
      "filter", {std::move(scan)}, ac::FilterNodeOptions(std::move(filter_expr))};

  return ExecutePlanAndCollectAsTable(std::move(filter));
}

`project`#

project operation rearranges, deletes, transforms, and creates columns. Each output column is computed by evaluating an expression against the source record batch. These must be scalar expressions (expressions consisting of scalar literals, field references and scalar functions, i.e. elementwise functions that return one value for each input row independent of the value of all other rows). This is exposed via ProjectNodeOptions which requires, an arrow::compute::Expression and name for each of the output columns (if names are not provided, the string representations of exprs will be used).

Project example:

/// \brief An example showing a project node
///
/// Scan-Project-Table
/// This example shows how a Scan operation can be used to load the data
/// into the execution plan, how a project operation can be applied on the
/// data stream and how the output is collected into a table
arrow::Status ScanProjectSinkExample() {
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::dataset::Dataset> dataset, GetDataset());

  auto options = std::make_shared<arrow::dataset::ScanOptions>();
  // projection
  cp::Expression a_times_2 = cp::call("multiply", {cp::field_ref("a"), cp::literal(2)});
  options->projection = cp::project({}, {});

  auto scan_node_options = arrow::dataset::ScanNodeOptions{dataset, options};

  ac::Declaration scan{"scan", std::move(scan_node_options)};
  ac::Declaration project{
      "project", {std::move(scan)}, ac::ProjectNodeOptions({a_times_2})};

  return ExecutePlanAndCollectAsTable(std::move(project));
}

`aggregate`#

The aggregate node computes various types of aggregates over data.

Arrow supports two types of aggregates: “scalar” aggregates, and “hash” aggregates. Scalar aggregates reduce an array or scalar input to a single scalar output (e.g. computing the mean of a column). Hash aggregates act like GROUP BY in SQL and first partition data based on one or more key columns, then reduce the data in each partition. The aggregate node supports both types of computation, and can compute any number of aggregations at once.

AggregateNodeOptions is used to define the aggregation criteria. It takes a list of aggregation functions and their options; a list of target fields to aggregate, one per function; and a list of names for the output fields, one per function. Optionally, it takes a list of columns that are used to partition the data, in the case of a hash aggregation. The aggregation functions can be selected from this list of aggregation functions.

Note

This node is a “pipeline breaker” and will fully materialize the dataset in memory. In the future, spillover mechanisms will be added which should alleviate this constraint.

The aggregation can provide results as a group or scalar. For instances, an operation like hash_count provides the counts per each unique record as a grouped result while an operation like sum provides a single record.

Scalar Aggregation example:

/// \brief An example showing an aggregation node to aggregate an entire table
///
/// Source-Aggregation-Table
/// This example shows how an aggregation operation can be applied on a
/// execution plan resulting in a scalar output. The source node loads the
/// data and the aggregation (counting unique types in column 'a')
/// is applied on this data. The output is collected into a table (that will
/// have exactly one row)
arrow::Status SourceScalarAggregateSinkExample() {
  ARROW_ASSIGN_OR_RAISE(auto basic_data, MakeBasicBatches());

  auto source_node_options = ac::SourceNodeOptions{basic_data.schema, basic_data.gen()};

  ac::Declaration source{"source", std::move(source_node_options)};
  auto aggregate_options =
      ac::AggregateNodeOptions{/*aggregates=*/{{"sum", nullptr, "a", "sum(a)"}}};
  ac::Declaration aggregate{
      "aggregate", {std::move(source)}, std::move(aggregate_options)};

  return ExecutePlanAndCollectAsTable(std::move(aggregate));
}

Group Aggregation example:

/// \brief An example showing an aggregation node to perform a group-by operation
///
/// Source-Aggregation-Table
/// This example shows how an aggregation operation can be applied on a
/// execution plan resulting in grouped output. The source node loads the
/// data and the aggregation (counting unique types in column 'a') is
/// applied on this data. The output is collected into a table that will contain
/// one row for each unique combination of group keys.
arrow::Status SourceGroupAggregateSinkExample() {
  ARROW_ASSIGN_OR_RAISE(auto basic_data, MakeBasicBatches());

  arrow::AsyncGenerator<std::optional<cp::ExecBatch>> sink_gen;

  auto source_node_options = ac::SourceNodeOptions{basic_data.schema, basic_data.gen()};

  ac::Declaration source{"source", std::move(source_node_options)};
  auto options = std::make_shared<cp::CountOptions>(cp::CountOptions::ONLY_VALID);
  auto aggregate_options =
      ac::AggregateNodeOptions{/*aggregates=*/{{"hash_count", options, "a", "count(a)"}},
                               /*keys=*/{"b"}};
  ac::Declaration aggregate{
      "aggregate", {std::move(source)}, std::move(aggregate_options)};

  return ExecutePlanAndCollectAsTable(std::move(aggregate));
}

`sink`#

sink operation provides output and is the final node of a streaming execution definition. SinkNodeOptions interface is used to pass the required options. Similar to the source operator the sink operator exposes the output with a function that returns a record batch future each time it is called. It is expected the caller will repeatedly call this function until the generator function is exhausted (returns std::optional::nullopt). If this function is not called often enough then record batches will accumulate in memory. An execution plan should only have one “terminal” node (one sink node). An ExecPlan can terminate early due to cancellation or an error, before the output is fully consumed. However, the plan can be safely destroyed independently of the sink, which will hold the unconsumed batches by exec_plan->finished().

As a part of the Source Example, the Sink operation is also included;

/// \brief An example demonstrating a source and sink node
///
/// Source-Table Example
/// This example shows how a custom source can be used
/// in an execution plan. This includes source node using pregenerated
/// data and collecting it into a table.
///
/// This sort of custom source is often not needed.  In most cases you can
/// use a scan (for a dataset source) or a source like table_source, array_vector_source,
/// exec_batch_source, or record_batch_source (for in-memory data)
arrow::Status SourceSinkExample() {
  ARROW_ASSIGN_OR_RAISE(auto basic_data, MakeBasicBatches());

  auto source_node_options = ac::SourceNodeOptions{basic_data.schema, basic_data.gen()};

  ac::Declaration source{"source", std::move(source_node_options)};

  return ExecutePlanAndCollectAsTable(std::move(source));
}

`consuming_sink`#

consuming_sink operator is a sink operation containing consuming operation within the execution plan (i.e. the exec plan should not complete until the consumption has completed). Unlike the sink node this node takes in a callback function that is expected to consume the batch. Once this callback has finished the execution plan will no longer hold any reference to the batch. The consuming function may be called before a previous invocation has completed. If the consuming function does not run quickly enough then many concurrent executions could pile up, blocking the CPU thread pool. The execution plan will not be marked finished until all consuming function callbacks have been completed. Once all batches have been delivered the execution plan will wait for the finish future to complete before marking the execution plan finished. This allows for workflows where the consumption function converts batches into async tasks (this is currently done internally for the dataset write node).

Example:

// define a Custom SinkNodeConsumer
std::atomic<uint32_t> batches_seen{0};
arrow::Future<> finish = arrow::Future<>::Make();
struct CustomSinkNodeConsumer : public cp::SinkNodeConsumer {

    CustomSinkNodeConsumer(std::atomic<uint32_t> *batches_seen, arrow::Future<>finish):
    batches_seen(batches_seen), finish(std::move(finish)) {}
    // Consumption logic can be written here
    arrow::Status Consume(cp::ExecBatch batch) override {
    // data can be consumed in the expected way
    // transfer to another system or just do some work
    // and write to disk
    (*batches_seen)++;
    return arrow::Status::OK();
    }

    arrow::Future<> Finish() override { return finish; }

    std::atomic<uint32_t> *batches_seen;
    arrow::Future<> finish;

};

std::shared_ptr<CustomSinkNodeConsumer> consumer =
        std::make_shared<CustomSinkNodeConsumer>(&batches_seen, finish);

arrow::acero::ExecNode *consuming_sink;

ARROW_ASSIGN_OR_RAISE(consuming_sink, MakeExecNode("consuming_sink", plan.get(),
    {source}, cp::ConsumingSinkNodeOptions(consumer)));

Consuming-Sink example:

/// \brief An example showing a consuming sink node
///
/// Source-Consuming-Sink
/// This example shows how the data can be consumed within the execution plan
/// by using a ConsumingSink node. There is no data output from this execution plan.
arrow::Status SourceConsumingSinkExample() {
  ARROW_ASSIGN_OR_RAISE(auto basic_data, MakeBasicBatches());

  auto source_node_options = ac::SourceNodeOptions{basic_data.schema, basic_data.gen()};

  ac::Declaration source{"source", std::move(source_node_options)};

  std::atomic<uint32_t> batches_seen{0};
  arrow::Future<> finish = arrow::Future<>::Make();
  struct CustomSinkNodeConsumer : public ac::SinkNodeConsumer {
    CustomSinkNodeConsumer(std::atomic<uint32_t>* batches_seen, arrow::Future<> finish)
        : batches_seen(batches_seen), finish(std::move(finish)) {}

    arrow::Status Init(const std::shared_ptr<arrow::Schema>& schema,
                       ac::BackpressureControl* backpressure_control,
                       ac::ExecPlan* plan) override {
      // This will be called as the plan is started (before the first call to Consume)
      // and provides the schema of the data coming into the node, controls for pausing /
      // resuming input, and a pointer to the plan itself which can be used to access
      // other utilities such as the thread indexer or async task scheduler.
      return arrow::Status::OK();
    }

    arrow::Status Consume(cp::ExecBatch batch) override {
      (*batches_seen)++;
      return arrow::Status::OK();
    }

    arrow::Future<> Finish() override {
      // Here you can perform whatever (possibly async) cleanup is needed, e.g. closing
      // output file handles and flushing remaining work
      return arrow::Future<>::MakeFinished();
    }

    std::atomic<uint32_t>* batches_seen;
    arrow::Future<> finish;
  };
  std::shared_ptr<CustomSinkNodeConsumer> consumer =
      std::make_shared<CustomSinkNodeConsumer>(&batches_seen, finish);

  ac::Declaration consuming_sink{"consuming_sink",
                                 {std::move(source)},
                                 ac::ConsumingSinkNodeOptions(std::move(consumer))};

  // Since we are consuming the data within the plan there is no output and we simply
  // run the plan to completion instead of collecting into a table.
  ARROW_RETURN_NOT_OK(ac::DeclarationToStatus(std::move(consuming_sink)));

  std::cout << "The consuming sink node saw " << batches_seen.load() << " batches"
            << std::endl;
  return arrow::Status::OK();
}

`order_by_sink`#

order_by_sink operation is an extension to the sink operation. This operation provides the ability to guarantee the ordering of the stream by providing the OrderBySinkNodeOptions. Here the arrow::compute::SortOptions are provided to define which columns are used for sorting and whether to sort by ascending or descending values.

Note

This node is a “pipeline breaker” and will fully materialize the dataset in memory. In the future, spillover mechanisms will be added which should alleviate this constraint.

Order-By-Sink example:

arrow::Status ExecutePlanAndCollectAsTableWithCustomSink(
    std::shared_ptr<ac::ExecPlan> plan, std::shared_ptr<arrow::Schema> schema,
    arrow::AsyncGenerator<std::optional<cp::ExecBatch>> sink_gen) {
  // translate sink_gen (async) to sink_reader (sync)
  std::shared_ptr<arrow::RecordBatchReader> sink_reader =
      ac::MakeGeneratorReader(schema, std::move(sink_gen), arrow::default_memory_pool());

  // validate the ExecPlan
  ARROW_RETURN_NOT_OK(plan->Validate());
  std::cout << "ExecPlan created : " << plan->ToString() << std::endl;
  // start the ExecPlan
  plan->StartProducing();

  // collect sink_reader into a Table
  std::shared_ptr<arrow::Table> response_table;

  ARROW_ASSIGN_OR_RAISE(response_table,
                        arrow::Table::FromRecordBatchReader(sink_reader.get()));

  std::cout << "Results : " << response_table->ToString() << std::endl;

  // stop producing
  plan->StopProducing();
  // plan mark finished
  auto future = plan->finished();
  return future.status();
}

/// \brief An example showing an order-by node
///
/// Source-OrderBy-Sink
/// In this example, the data enters through the source node
/// and the data is ordered in the sink node. The order can be
/// ASCENDING or DESCENDING and it is configurable. The output
/// is obtained as a table from the sink node.
arrow::Status SourceOrderBySinkExample() {
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<ac::ExecPlan> plan,
                        ac::ExecPlan::Make(*cp::threaded_exec_context()));

  ARROW_ASSIGN_OR_RAISE(auto basic_data, MakeSortTestBasicBatches());

  arrow::AsyncGenerator<std::optional<cp::ExecBatch>> sink_gen;

  auto source_node_options = ac::SourceNodeOptions{basic_data.schema, basic_data.gen()};
  ARROW_ASSIGN_OR_RAISE(ac::ExecNode * source,
                        ac::MakeExecNode("source", plan.get(), {}, source_node_options));

  ARROW_RETURN_NOT_OK(ac::MakeExecNode(
      "order_by_sink", plan.get(), {source},
      ac::OrderBySinkNodeOptions{
          cp::SortOptions{{cp::SortKey{"a", cp::SortOrder::Descending}}}, &sink_gen}));

  return ExecutePlanAndCollectAsTableWithCustomSink(plan, basic_data.schema, sink_gen);
}

`select_k_sink`#

select_k_sink option enables selecting the top/bottom K elements, similar to a SQL ORDER BY ... LIMIT K clause. SelectKOptions which is a defined by using OrderBySinkNode definition. This option returns a sink node that receives inputs and then compute top_k/bottom_k.

Note

This node is a “pipeline breaker” and will fully materialize the input in memory. In the future, spillover mechanisms will be added which should alleviate this constraint.

SelectK example:

/// \brief An example showing a select-k node
///
/// Source-KSelect
/// This example shows how K number of elements can be selected
/// either from the top or bottom. The output node is a modified
/// sink node where output can be obtained as a table.
arrow::Status SourceKSelectExample() {
  ARROW_ASSIGN_OR_RAISE(auto input, MakeGroupableBatches());
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<ac::ExecPlan> plan,
                        ac::ExecPlan::Make(*cp::threaded_exec_context()));
  arrow::AsyncGenerator<std::optional<cp::ExecBatch>> sink_gen;

  ARROW_ASSIGN_OR_RAISE(
      ac::ExecNode * source,
      ac::MakeExecNode("source", plan.get(), {},
                       ac::SourceNodeOptions{input.schema, input.gen()}));

  cp::SelectKOptions options = cp::SelectKOptions::TopKDefault(/*k=*/2, {"i32"});

  ARROW_RETURN_NOT_OK(ac::MakeExecNode("select_k_sink", plan.get(), {source},
                                       ac::SelectKSinkNodeOptions{options, &sink_gen}));

  auto schema = arrow::schema(
      {arrow::field("i32", arrow::int32()), arrow::field("str", arrow::utf8())});

  return ExecutePlanAndCollectAsTableWithCustomSink(plan, schema, sink_gen);
}

`table_sink`#

The table_sink node provides the ability to receive the output as an in-memory table. This is simpler to use than the other sink nodes provided by the streaming execution engine but it only makes sense when the output fits comfortably in memory. The node is created using TableSinkNodeOptions.

Example of using table_sink

/// \brief An example showing a table sink node
///
/// TableSink Example
/// This example shows how a table_sink can be used
/// in an execution plan. This includes a source node
/// receiving data as batches and the table sink node
/// which emits the output as a table.
arrow::Status TableSinkExample() {
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<ac::ExecPlan> plan,
                        ac::ExecPlan::Make(*cp::threaded_exec_context()));

  ARROW_ASSIGN_OR_RAISE(auto basic_data, MakeBasicBatches());

  auto source_node_options = ac::SourceNodeOptions{basic_data.schema, basic_data.gen()};

  ARROW_ASSIGN_OR_RAISE(ac::ExecNode * source,
                        ac::MakeExecNode("source", plan.get(), {}, source_node_options));

  std::shared_ptr<arrow::Table> output_table;
  auto table_sink_options = ac::TableSinkNodeOptions{&output_table};

  ARROW_RETURN_NOT_OK(
      ac::MakeExecNode("table_sink", plan.get(), {source}, table_sink_options));
  // validate the ExecPlan
  ARROW_RETURN_NOT_OK(plan->Validate());
  std::cout << "ExecPlan created : " << plan->ToString() << std::endl;
  // start the ExecPlan
  plan->StartProducing();

  // Wait for the plan to finish
  auto finished = plan->finished();
  RETURN_NOT_OK(finished.status());
  std::cout << "Results : " << output_table->ToString() << std::endl;
  return arrow::Status::OK();
}

`scan`#

scan is an operation used to load and process datasets. It should be preferred over the more generic source node when your input is a dataset. The behavior is defined using arrow::dataset::ScanNodeOptions. More information on datasets and the various scan options can be found in Tabular Datasets.

This node is capable of applying pushdown filters to the file readers which reduce the amount of data that needs to be read. This means you may supply the same filter expression to the scan node that you also supply to the FilterNode because the filtering is done in two different places.

Scan example:

/// \brief An example demonstrating a scan and sink node
///
/// Scan-Table
/// This example shows how scan operation can be applied on a dataset.
/// There are operations that can be applied on the scan (project, filter)
/// and the input data can be processed. The output is obtained as a table
arrow::Status ScanSinkExample() {
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::dataset::Dataset> dataset, GetDataset());

  auto options = std::make_shared<arrow::dataset::ScanOptions>();
  options->projection = cp::project({}, {});  // create empty projection

  // construct the scan node
  auto scan_node_options = arrow::dataset::ScanNodeOptions{dataset, options};

  ac::Declaration scan{"scan", std::move(scan_node_options)};

  return ExecutePlanAndCollectAsTable(std::move(scan));
}

`write`#

The write node saves query results as a dataset of files in a format like Parquet, Feather, CSV, etc. using the Tabular Datasets functionality in Arrow. The write options are provided via the arrow::dataset::WriteNodeOptions which in turn contains arrow::dataset::FileSystemDatasetWriteOptions. arrow::dataset::FileSystemDatasetWriteOptions provides control over the written dataset, including options like the output directory, file naming scheme, and so on.

Write example:

/// \brief An example showing a write node
/// \param file_path The destination to write to
///
/// Scan-Filter-Write
/// This example shows how scan node can be used to load the data
/// and after processing how it can be written to disk.
arrow::Status ScanFilterWriteExample(const std::string& file_path) {
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::dataset::Dataset> dataset, GetDataset());

  auto options = std::make_shared<arrow::dataset::ScanOptions>();
  // empty projection
  options->projection = cp::project({}, {});

  auto scan_node_options = arrow::dataset::ScanNodeOptions{dataset, options};

  ac::Declaration scan{"scan", std::move(scan_node_options)};

  arrow::AsyncGenerator<std::optional<cp::ExecBatch>> sink_gen;

  std::string root_path = "";
  std::string uri = "file://" + file_path;
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::fs::FileSystem> filesystem,
                        arrow::fs::FileSystemFromUri(uri, &root_path));

  auto base_path = root_path + "/parquet_dataset";
  // Uncomment the following line, if run repeatedly
  // ARROW_RETURN_NOT_OK(filesystem->DeleteDirContents(base_path));
  ARROW_RETURN_NOT_OK(filesystem->CreateDir(base_path));

  // The partition schema determines which fields are part of the partitioning.
  auto partition_schema = arrow::schema({arrow::field("a", arrow::int32())});
  // We'll use Hive-style partitioning,
  // which creates directories with "key=value" pairs.

  auto partitioning =
      std::make_shared<arrow::dataset::HivePartitioning>(partition_schema);
  // We'll write Parquet files.
  auto format = std::make_shared<arrow::dataset::ParquetFileFormat>();

  arrow::dataset::FileSystemDatasetWriteOptions write_options;
  write_options.file_write_options = format->DefaultWriteOptions();
  write_options.filesystem = filesystem;
  write_options.base_dir = base_path;
  write_options.partitioning = partitioning;
  write_options.basename_template = "part{i}.parquet";

  arrow::dataset::WriteNodeOptions write_node_options{write_options};

  ac::Declaration write{"write", {std::move(scan)}, std::move(write_node_options)};

  // Since the write node has no output we simply run the plan to completion and the
  // data should be written
  ARROW_RETURN_NOT_OK(ac::DeclarationToStatus(std::move(write)));

  std::cout << "Dataset written to " << base_path << std::endl;
  return arrow::Status::OK();
}

`union`#

union merges multiple data streams with the same schema into one, similar to a SQL UNION ALL clause.

The following example demonstrates how this can be achieved using two data sources.

Union example:

/// \brief An example showing a union node
///
/// Source-Union-Table
/// This example shows how a union operation can be applied on two
/// data sources. The output is collected into a table.
arrow::Status SourceUnionSinkExample() {
  ARROW_ASSIGN_OR_RAISE(auto basic_data, MakeBasicBatches());

  ac::Declaration lhs{"source",
                      ac::SourceNodeOptions{basic_data.schema, basic_data.gen()}};
  lhs.label = "lhs";
  ac::Declaration rhs{"source",
                      ac::SourceNodeOptions{basic_data.schema, basic_data.gen()}};
  rhs.label = "rhs";
  ac::Declaration union_plan{
      "union", {std::move(lhs), std::move(rhs)}, ac::ExecNodeOptions{}};

  return ExecutePlanAndCollectAsTable(std::move(union_plan));
}

`hash_join`#

hash_join operation provides the relational algebra operation, join using hash-based algorithm. HashJoinNodeOptions contains the options required in defining a join. The hash_join supports left/right/full semi/anti/outerjoins. Also the join-key (i.e. the column(s) to join on), and suffixes (i.e a suffix term like “_x” which can be appended as a suffix for column names duplicated in both left and right relations.) can be set via the the join options. Read more on hash-joins.

Hash-Join example:

/// \brief An example showing a hash join node
///
/// Source-HashJoin-Table
/// This example shows how source node gets the data and how a self-join
/// is applied on the data. The join options are configurable. The output
/// is collected into a table.
arrow::Status SourceHashJoinSinkExample() {
  ARROW_ASSIGN_OR_RAISE(auto input, MakeGroupableBatches());

  ac::Declaration left{"source", ac::SourceNodeOptions{input.schema, input.gen()}};
  ac::Declaration right{"source", ac::SourceNodeOptions{input.schema, input.gen()}};

  ac::HashJoinNodeOptions join_opts{
      ac::JoinType::INNER,
      /*left_keys=*/{"str"},
      /*right_keys=*/{"str"}, cp::literal(true), "l_", "r_"};

  ac::Declaration hashjoin{
      "hashjoin", {std::move(left), std::move(right)}, std::move(join_opts)};

  return ExecutePlanAndCollectAsTable(std::move(hashjoin));
}

Summary#

There are examples of these nodes which can be found in cpp/examples/arrow/execution_plan_documentation_examples.cc in the Arrow source.

Complete Example:

#include <arrow/array.h>
#include <arrow/builder.h>

#include <arrow/acero/exec_plan.h>
#include <arrow/compute/api.h>
#include <arrow/compute/api_vector.h>
#include <arrow/compute/cast.h>

#include <arrow/csv/api.h>

#include <arrow/dataset/dataset.h>
#include <arrow/dataset/file_base.h>
#include <arrow/dataset/file_parquet.h>
#include <arrow/dataset/plan.h>
#include <arrow/dataset/scanner.h>

#include <arrow/io/interfaces.h>
#include <arrow/io/memory.h>

#include <arrow/result.h>
#include <arrow/status.h>
#include <arrow/table.h>

#include <arrow/ipc/api.h>

#include <arrow/util/future.h>
#include <arrow/util/range.h>
#include <arrow/util/thread_pool.h>
#include <arrow/util/vector.h>

#include <iostream>
#include <memory>
#include <utility>

// Demonstrate various operators in Arrow Streaming Execution Engine

namespace cp = ::arrow::compute;
namespace ac = ::arrow::acero;

constexpr char kSep[] = "******";

void PrintBlock(const std::string& msg) {
  std::cout << "\n\t" << kSep << " " << msg << " " << kSep << "\n" << std::endl;
}

template <typename TYPE,
          typename = typename std::enable_if<arrow::is_number_type<TYPE>::value |
                                             arrow::is_boolean_type<TYPE>::value |
                                             arrow::is_temporal_type<TYPE>::value>::type>
arrow::Result<std::shared_ptr<arrow::Array>> GetArrayDataSample(
    const std::vector<typename TYPE::c_type>& values) {
  using ArrowBuilderType = typename arrow::TypeTraits<TYPE>::BuilderType;
  ArrowBuilderType builder;
  ARROW_RETURN_NOT_OK(builder.Reserve(values.size()));
  ARROW_RETURN_NOT_OK(builder.AppendValues(values));
  return builder.Finish();
}

template <class TYPE>
arrow::Result<std::shared_ptr<arrow::Array>> GetBinaryArrayDataSample(
    const std::vector<std::string>& values) {
  using ArrowBuilderType = typename arrow::TypeTraits<TYPE>::BuilderType;
  ArrowBuilderType builder;
  ARROW_RETURN_NOT_OK(builder.Reserve(values.size()));
  ARROW_RETURN_NOT_OK(builder.AppendValues(values));
  return builder.Finish();
}

arrow::Result<std::shared_ptr<arrow::RecordBatch>> GetSampleRecordBatch(
    const arrow::ArrayVector array_vector, const arrow::FieldVector& field_vector) {
  std::shared_ptr<arrow::RecordBatch> record_batch;
  ARROW_ASSIGN_OR_RAISE(auto struct_result,
                        arrow::StructArray::Make(array_vector, field_vector));
  return record_batch->FromStructArray(struct_result);
}

/// \brief Create a sample table
/// The table's contents will be:
/// a,b
/// 1,null
/// 2,true
/// null,true
/// 3,false
/// null,true
/// 4,false
/// 5,null
/// 6,false
/// 7,false
/// 8,true
/// \return The created table

arrow::Result<std::shared_ptr<arrow::Table>> GetTable() {
  auto null_long = std::numeric_limits<int64_t>::quiet_NaN();
  ARROW_ASSIGN_OR_RAISE(auto int64_array,
                        GetArrayDataSample<arrow::Int64Type>(
                            {1, 2, null_long, 3, null_long, 4, 5, 6, 7, 8}));

  arrow::BooleanBuilder boolean_builder;
  std::shared_ptr<arrow::BooleanArray> bool_array;

  std::vector<uint8_t> bool_values = {false, true,  true,  false, true,
                                      false, false, false, false, true};
  std::vector<bool> is_valid = {false, true,  true, true, true,
                                true,  false, true, true, true};

  ARROW_RETURN_NOT_OK(boolean_builder.Reserve(10));

  ARROW_RETURN_NOT_OK(boolean_builder.AppendValues(bool_values, is_valid));

  ARROW_RETURN_NOT_OK(boolean_builder.Finish(&bool_array));

  auto record_batch =
      arrow::RecordBatch::Make(arrow::schema({arrow::field("a", arrow::int64()),
                                              arrow::field("b", arrow::boolean())}),
                               10, {int64_array, bool_array});
  ARROW_ASSIGN_OR_RAISE(auto table, arrow::Table::FromRecordBatches({record_batch}));
  return table;
}

/// \brief Create a sample dataset
/// \return An in-memory dataset based on GetTable()
arrow::Result<std::shared_ptr<arrow::dataset::Dataset>> GetDataset() {
  ARROW_ASSIGN_OR_RAISE(auto table, GetTable());
  auto ds = std::make_shared<arrow::dataset::InMemoryDataset>(table);
  return ds;
}

arrow::Result<cp::ExecBatch> GetExecBatchFromVectors(
    const arrow::FieldVector& field_vector, const arrow::ArrayVector& array_vector) {
  std::shared_ptr<arrow::RecordBatch> record_batch;
  ARROW_ASSIGN_OR_RAISE(auto res_batch, GetSampleRecordBatch(array_vector, field_vector));
  cp::ExecBatch batch{*res_batch};
  return batch;
}

// (Doc section: BatchesWithSchema Definition)
struct BatchesWithSchema {
  std::vector<cp::ExecBatch> batches;
  std::shared_ptr<arrow::Schema> schema;
  // This method uses internal arrow utilities to
  // convert a vector of record batches to an AsyncGenerator of optional batches
  arrow::AsyncGenerator<std::optional<cp::ExecBatch>> gen() const {
    auto opt_batches = ::arrow::internal::MapVector(
        [](cp::ExecBatch batch) { return std::make_optional(std::move(batch)); },
        batches);
    arrow::AsyncGenerator<std::optional<cp::ExecBatch>> gen;
    gen = arrow::MakeVectorGenerator(std::move(opt_batches));
    return gen;
  }
};
// (Doc section: BatchesWithSchema Definition)

// (Doc section: MakeBasicBatches Definition)
arrow::Result<BatchesWithSchema> MakeBasicBatches() {
  BatchesWithSchema out;
  auto field_vector = {arrow::field("a", arrow::int32()),
                       arrow::field("b", arrow::boolean())};
  ARROW_ASSIGN_OR_RAISE(auto b1_int, GetArrayDataSample<arrow::Int32Type>({0, 4}));
  ARROW_ASSIGN_OR_RAISE(auto b2_int, GetArrayDataSample<arrow::Int32Type>({5, 6, 7}));
  ARROW_ASSIGN_OR_RAISE(auto b3_int, GetArrayDataSample<arrow::Int32Type>({8, 9, 10}));

  ARROW_ASSIGN_OR_RAISE(auto b1_bool,
                        GetArrayDataSample<arrow::BooleanType>({false, true}));
  ARROW_ASSIGN_OR_RAISE(auto b2_bool,
                        GetArrayDataSample<arrow::BooleanType>({true, false, true}));
  ARROW_ASSIGN_OR_RAISE(auto b3_bool,
                        GetArrayDataSample<arrow::BooleanType>({false, true, false}));

  ARROW_ASSIGN_OR_RAISE(auto b1,
                        GetExecBatchFromVectors(field_vector, {b1_int, b1_bool}));
  ARROW_ASSIGN_OR_RAISE(auto b2,
                        GetExecBatchFromVectors(field_vector, {b2_int, b2_bool}));
  ARROW_ASSIGN_OR_RAISE(auto b3,
                        GetExecBatchFromVectors(field_vector, {b3_int, b3_bool}));

  out.batches = {b1, b2, b3};
  out.schema = arrow::schema(field_vector);
  return out;
}
// (Doc section: MakeBasicBatches Definition)

arrow::Result<BatchesWithSchema> MakeSortTestBasicBatches() {
  BatchesWithSchema out;
  auto field = arrow::field("a", arrow::int32());
  ARROW_ASSIGN_OR_RAISE(auto b1_int, GetArrayDataSample<arrow::Int32Type>({1, 3, 0, 2}));
  ARROW_ASSIGN_OR_RAISE(auto b2_int,
                        GetArrayDataSample<arrow::Int32Type>({121, 101, 120, 12}));
  ARROW_ASSIGN_OR_RAISE(auto b3_int,
                        GetArrayDataSample<arrow::Int32Type>({10, 110, 210, 121}));
  ARROW_ASSIGN_OR_RAISE(auto b4_int,
                        GetArrayDataSample<arrow::Int32Type>({51, 101, 2, 34}));
  ARROW_ASSIGN_OR_RAISE(auto b5_int,
                        GetArrayDataSample<arrow::Int32Type>({11, 31, 1, 12}));
  ARROW_ASSIGN_OR_RAISE(auto b6_int,
                        GetArrayDataSample<arrow::Int32Type>({12, 101, 120, 12}));
  ARROW_ASSIGN_OR_RAISE(auto b7_int,
                        GetArrayDataSample<arrow::Int32Type>({0, 110, 210, 11}));
  ARROW_ASSIGN_OR_RAISE(auto b8_int,
                        GetArrayDataSample<arrow::Int32Type>({51, 10, 2, 3}));

  ARROW_ASSIGN_OR_RAISE(auto b1, GetExecBatchFromVectors({field}, {b1_int}));
  ARROW_ASSIGN_OR_RAISE(auto b2, GetExecBatchFromVectors({field}, {b2_int}));
  ARROW_ASSIGN_OR_RAISE(auto b3,
                        GetExecBatchFromVectors({field, field}, {b3_int, b8_int}));
  ARROW_ASSIGN_OR_RAISE(auto b4,
                        GetExecBatchFromVectors({field, field, field, field},
                                                {b4_int, b5_int, b6_int, b7_int}));
  out.batches = {b1, b2, b3, b4};
  out.schema = arrow::schema({field});
  return out;
}

arrow::Result<BatchesWithSchema> MakeGroupableBatches(int multiplicity = 1) {
  BatchesWithSchema out;
  auto fields = {arrow::field("i32", arrow::int32()), arrow::field("str", arrow::utf8())};
  ARROW_ASSIGN_OR_RAISE(auto b1_int, GetArrayDataSample<arrow::Int32Type>({12, 7, 3}));
  ARROW_ASSIGN_OR_RAISE(auto b2_int, GetArrayDataSample<arrow::Int32Type>({-2, -1, 3}));
  ARROW_ASSIGN_OR_RAISE(auto b3_int, GetArrayDataSample<arrow::Int32Type>({5, 3, -8}));
  ARROW_ASSIGN_OR_RAISE(auto b1_str, GetBinaryArrayDataSample<arrow::StringType>(
                                         {"alpha", "beta", "alpha"}));
  ARROW_ASSIGN_OR_RAISE(auto b2_str, GetBinaryArrayDataSample<arrow::StringType>(
                                         {"alpha", "gamma", "alpha"}));
  ARROW_ASSIGN_OR_RAISE(auto b3_str, GetBinaryArrayDataSample<arrow::StringType>(
                                         {"gamma", "beta", "alpha"}));
  ARROW_ASSIGN_OR_RAISE(auto b1, GetExecBatchFromVectors(fields, {b1_int, b1_str}));
  ARROW_ASSIGN_OR_RAISE(auto b2, GetExecBatchFromVectors(fields, {b2_int, b2_str}));
  ARROW_ASSIGN_OR_RAISE(auto b3, GetExecBatchFromVectors(fields, {b3_int, b3_str}));
  out.batches = {b1, b2, b3};

  size_t batch_count = out.batches.size();
  for (int repeat = 1; repeat < multiplicity; ++repeat) {
    for (size_t i = 0; i < batch_count; ++i) {
      out.batches.push_back(out.batches[i]);
    }
  }

  out.schema = arrow::schema(fields);
  return out;
}

arrow::Status ExecutePlanAndCollectAsTable(ac::Declaration plan) {
  // collect sink_reader into a Table
  std::shared_ptr<arrow::Table> response_table;
  ARROW_ASSIGN_OR_RAISE(response_table, ac::DeclarationToTable(std::move(plan)));

  std::cout << "Results : " << response_table->ToString() << std::endl;

  return arrow::Status::OK();
}

// (Doc section: Scan Example)

/// \brief An example demonstrating a scan and sink node
///
/// Scan-Table
/// This example shows how scan operation can be applied on a dataset.
/// There are operations that can be applied on the scan (project, filter)
/// and the input data can be processed. The output is obtained as a table
arrow::Status ScanSinkExample() {
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::dataset::Dataset> dataset, GetDataset());

  auto options = std::make_shared<arrow::dataset::ScanOptions>();
  options->projection = cp::project({}, {});  // create empty projection

  // construct the scan node
  auto scan_node_options = arrow::dataset::ScanNodeOptions{dataset, options};

  ac::Declaration scan{"scan", std::move(scan_node_options)};

  return ExecutePlanAndCollectAsTable(std::move(scan));
}
// (Doc section: Scan Example)

// (Doc section: Source Example)

/// \brief An example demonstrating a source and sink node
///
/// Source-Table Example
/// This example shows how a custom source can be used
/// in an execution plan. This includes source node using pregenerated
/// data and collecting it into a table.
///
/// This sort of custom source is often not needed.  In most cases you can
/// use a scan (for a dataset source) or a source like table_source, array_vector_source,
/// exec_batch_source, or record_batch_source (for in-memory data)
arrow::Status SourceSinkExample() {
  ARROW_ASSIGN_OR_RAISE(auto basic_data, MakeBasicBatches());

  auto source_node_options = ac::SourceNodeOptions{basic_data.schema, basic_data.gen()};

  ac::Declaration source{"source", std::move(source_node_options)};

  return ExecutePlanAndCollectAsTable(std::move(source));
}
// (Doc section: Source Example)

// (Doc section: Table Source Example)

/// \brief An example showing a table source node
///
/// TableSource-Table Example
/// This example shows how a table_source can be used
/// in an execution plan. This includes a table source node
/// receiving data from a table.  This plan simply collects the
/// data back into a table but nodes could be added that modify
/// or transform the data as well (as is shown in later examples)
arrow::Status TableSourceSinkExample() {
  ARROW_ASSIGN_OR_RAISE(auto table, GetTable());

  arrow::AsyncGenerator<std::optional<cp::ExecBatch>> sink_gen;
  int max_batch_size = 2;
  auto table_source_options = ac::TableSourceNodeOptions{table, max_batch_size};

  ac::Declaration source{"table_source", std::move(table_source_options)};

  return ExecutePlanAndCollectAsTable(std::move(source));
}
// (Doc section: Table Source Example)

// (Doc section: Filter Example)

/// \brief An example showing a filter node
///
/// Source-Filter-Table
/// This example shows how a filter can be used in an execution plan,
/// to filter data from a source. The output from the exeuction plan
/// is collected into a table.
arrow::Status ScanFilterSinkExample() {
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::dataset::Dataset> dataset, GetDataset());

  auto options = std::make_shared<arrow::dataset::ScanOptions>();
  // specify the filter.  This filter removes all rows where the
  // value of the "a" column is greater than 3.
  cp::Expression filter_expr = cp::greater(cp::field_ref("a"), cp::literal(3));
  // set filter for scanner : on-disk / push-down filtering.
  // This step can be skipped if you are not reading from disk.
  options->filter = filter_expr;
  // empty projection
  options->projection = cp::project({}, {});

  // construct the scan node
  std::cout << "Initialized Scanning Options" << std::endl;

  auto scan_node_options = arrow::dataset::ScanNodeOptions{dataset, options};
  std::cout << "Scan node options created" << std::endl;

  ac::Declaration scan{"scan", std::move(scan_node_options)};

  // pipe the scan node into the filter node
  // Need to set the filter in scan node options and filter node options.
  // At scan node it is used for on-disk / push-down filtering.
  // At filter node it is used for in-memory filtering.
  ac::Declaration filter{
      "filter", {std::move(scan)}, ac::FilterNodeOptions(std::move(filter_expr))};

  return ExecutePlanAndCollectAsTable(std::move(filter));
}

// (Doc section: Filter Example)

// (Doc section: Project Example)

/// \brief An example showing a project node
///
/// Scan-Project-Table
/// This example shows how a Scan operation can be used to load the data
/// into the execution plan, how a project operation can be applied on the
/// data stream and how the output is collected into a table
arrow::Status ScanProjectSinkExample() {
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::dataset::Dataset> dataset, GetDataset());

  auto options = std::make_shared<arrow::dataset::ScanOptions>();
  // projection
  cp::Expression a_times_2 = cp::call("multiply", {cp::field_ref("a"), cp::literal(2)});
  options->projection = cp::project({}, {});

  auto scan_node_options = arrow::dataset::ScanNodeOptions{dataset, options};

  ac::Declaration scan{"scan", std::move(scan_node_options)};
  ac::Declaration project{
      "project", {std::move(scan)}, ac::ProjectNodeOptions({a_times_2})};

  return ExecutePlanAndCollectAsTable(std::move(project));
}

// (Doc section: Project Example)

// This is a variation of ScanProjectSinkExample introducing how to use the
// Declaration::Sequence function
arrow::Status ScanProjectSequenceSinkExample() {
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::dataset::Dataset> dataset, GetDataset());

  auto options = std::make_shared<arrow::dataset::ScanOptions>();
  // projection
  cp::Expression a_times_2 = cp::call("multiply", {cp::field_ref("a"), cp::literal(2)});
  options->projection = cp::project({}, {});

  auto scan_node_options = arrow::dataset::ScanNodeOptions{dataset, options};

  // (Doc section: Project Sequence Example)
  // Inputs do not have to be passed to the project node when using Sequence
  ac::Declaration plan =
      ac::Declaration::Sequence({{"scan", std::move(scan_node_options)},
                                 {"project", ac::ProjectNodeOptions({a_times_2})}});
  // (Doc section: Project Sequence Example)

  return ExecutePlanAndCollectAsTable(std::move(plan));
}

// (Doc section: Scalar Aggregate Example)

/// \brief An example showing an aggregation node to aggregate an entire table
///
/// Source-Aggregation-Table
/// This example shows how an aggregation operation can be applied on a
/// execution plan resulting in a scalar output. The source node loads the
/// data and the aggregation (counting unique types in column 'a')
/// is applied on this data. The output is collected into a table (that will
/// have exactly one row)
arrow::Status SourceScalarAggregateSinkExample() {
  ARROW_ASSIGN_OR_RAISE(auto basic_data, MakeBasicBatches());

  auto source_node_options = ac::SourceNodeOptions{basic_data.schema, basic_data.gen()};

  ac::Declaration source{"source", std::move(source_node_options)};
  auto aggregate_options =
      ac::AggregateNodeOptions{/*aggregates=*/{{"sum", nullptr, "a", "sum(a)"}}};
  ac::Declaration aggregate{
      "aggregate", {std::move(source)}, std::move(aggregate_options)};

  return ExecutePlanAndCollectAsTable(std::move(aggregate));
}
// (Doc section: Scalar Aggregate Example)

// (Doc section: Group Aggregate Example)

/// \brief An example showing an aggregation node to perform a group-by operation
///
/// Source-Aggregation-Table
/// This example shows how an aggregation operation can be applied on a
/// execution plan resulting in grouped output. The source node loads the
/// data and the aggregation (counting unique types in column 'a') is
/// applied on this data. The output is collected into a table that will contain
/// one row for each unique combination of group keys.
arrow::Status SourceGroupAggregateSinkExample() {
  ARROW_ASSIGN_OR_RAISE(auto basic_data, MakeBasicBatches());

  arrow::AsyncGenerator<std::optional<cp::ExecBatch>> sink_gen;

  auto source_node_options = ac::SourceNodeOptions{basic_data.schema, basic_data.gen()};

  ac::Declaration source{"source", std::move(source_node_options)};
  auto options = std::make_shared<cp::CountOptions>(cp::CountOptions::ONLY_VALID);
  auto aggregate_options =
      ac::AggregateNodeOptions{/*aggregates=*/{{"hash_count", options, "a", "count(a)"}},
                               /*keys=*/{"b"}};
  ac::Declaration aggregate{
      "aggregate", {std::move(source)}, std::move(aggregate_options)};

  return ExecutePlanAndCollectAsTable(std::move(aggregate));
}
// (Doc section: Group Aggregate Example)

// (Doc section: ConsumingSink Example)

/// \brief An example showing a consuming sink node
///
/// Source-Consuming-Sink
/// This example shows how the data can be consumed within the execution plan
/// by using a ConsumingSink node. There is no data output from this execution plan.
arrow::Status SourceConsumingSinkExample() {
  ARROW_ASSIGN_OR_RAISE(auto basic_data, MakeBasicBatches());

  auto source_node_options = ac::SourceNodeOptions{basic_data.schema, basic_data.gen()};

  ac::Declaration source{"source", std::move(source_node_options)};

  std::atomic<uint32_t> batches_seen{0};
  arrow::Future<> finish = arrow::Future<>::Make();
  struct CustomSinkNodeConsumer : public ac::SinkNodeConsumer {
    CustomSinkNodeConsumer(std::atomic<uint32_t>* batches_seen, arrow::Future<> finish)
        : batches_seen(batches_seen), finish(std::move(finish)) {}

    arrow::Status Init(const std::shared_ptr<arrow::Schema>& schema,
                       ac::BackpressureControl* backpressure_control,
                       ac::ExecPlan* plan) override {
      // This will be called as the plan is started (before the first call to Consume)
      // and provides the schema of the data coming into the node, controls for pausing /
      // resuming input, and a pointer to the plan itself which can be used to access
      // other utilities such as the thread indexer or async task scheduler.
      return arrow::Status::OK();
    }

    arrow::Status Consume(cp::ExecBatch batch) override {
      (*batches_seen)++;
      return arrow::Status::OK();
    }

    arrow::Future<> Finish() override {
      // Here you can perform whatever (possibly async) cleanup is needed, e.g. closing
      // output file handles and flushing remaining work
      return arrow::Future<>::MakeFinished();
    }

    std::atomic<uint32_t>* batches_seen;
    arrow::Future<> finish;
  };
  std::shared_ptr<CustomSinkNodeConsumer> consumer =
      std::make_shared<CustomSinkNodeConsumer>(&batches_seen, finish);

  ac::Declaration consuming_sink{"consuming_sink",
                                 {std::move(source)},
                                 ac::ConsumingSinkNodeOptions(std::move(consumer))};

  // Since we are consuming the data within the plan there is no output and we simply
  // run the plan to completion instead of collecting into a table.
  ARROW_RETURN_NOT_OK(ac::DeclarationToStatus(std::move(consuming_sink)));

  std::cout << "The consuming sink node saw " << batches_seen.load() << " batches"
            << std::endl;
  return arrow::Status::OK();
}
// (Doc section: ConsumingSink Example)

// (Doc section: OrderBySink Example)

arrow::Status ExecutePlanAndCollectAsTableWithCustomSink(
    std::shared_ptr<ac::ExecPlan> plan, std::shared_ptr<arrow::Schema> schema,
    arrow::AsyncGenerator<std::optional<cp::ExecBatch>> sink_gen) {
  // translate sink_gen (async) to sink_reader (sync)
  std::shared_ptr<arrow::RecordBatchReader> sink_reader =
      ac::MakeGeneratorReader(schema, std::move(sink_gen), arrow::default_memory_pool());

  // validate the ExecPlan
  ARROW_RETURN_NOT_OK(plan->Validate());
  std::cout << "ExecPlan created : " << plan->ToString() << std::endl;
  // start the ExecPlan
  plan->StartProducing();

  // collect sink_reader into a Table
  std::shared_ptr<arrow::Table> response_table;

  ARROW_ASSIGN_OR_RAISE(response_table,
                        arrow::Table::FromRecordBatchReader(sink_reader.get()));

  std::cout << "Results : " << response_table->ToString() << std::endl;

  // stop producing
  plan->StopProducing();
  // plan mark finished
  auto future = plan->finished();
  return future.status();
}

/// \brief An example showing an order-by node
///
/// Source-OrderBy-Sink
/// In this example, the data enters through the source node
/// and the data is ordered in the sink node. The order can be
/// ASCENDING or DESCENDING and it is configurable. The output
/// is obtained as a table from the sink node.
arrow::Status SourceOrderBySinkExample() {
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<ac::ExecPlan> plan,
                        ac::ExecPlan::Make(*cp::threaded_exec_context()));

  ARROW_ASSIGN_OR_RAISE(auto basic_data, MakeSortTestBasicBatches());

  arrow::AsyncGenerator<std::optional<cp::ExecBatch>> sink_gen;

  auto source_node_options = ac::SourceNodeOptions{basic_data.schema, basic_data.gen()};
  ARROW_ASSIGN_OR_RAISE(ac::ExecNode * source,
                        ac::MakeExecNode("source", plan.get(), {}, source_node_options));

  ARROW_RETURN_NOT_OK(ac::MakeExecNode(
      "order_by_sink", plan.get(), {source},
      ac::OrderBySinkNodeOptions{
          cp::SortOptions{{cp::SortKey{"a", cp::SortOrder::Descending}}}, &sink_gen}));

  return ExecutePlanAndCollectAsTableWithCustomSink(plan, basic_data.schema, sink_gen);
}

// (Doc section: OrderBySink Example)

// (Doc section: HashJoin Example)

/// \brief An example showing a hash join node
///
/// Source-HashJoin-Table
/// This example shows how source node gets the data and how a self-join
/// is applied on the data. The join options are configurable. The output
/// is collected into a table.
arrow::Status SourceHashJoinSinkExample() {
  ARROW_ASSIGN_OR_RAISE(auto input, MakeGroupableBatches());

  ac::Declaration left{"source", ac::SourceNodeOptions{input.schema, input.gen()}};
  ac::Declaration right{"source", ac::SourceNodeOptions{input.schema, input.gen()}};

  ac::HashJoinNodeOptions join_opts{
      ac::JoinType::INNER,
      /*left_keys=*/{"str"},
      /*right_keys=*/{"str"}, cp::literal(true), "l_", "r_"};

  ac::Declaration hashjoin{
      "hashjoin", {std::move(left), std::move(right)}, std::move(join_opts)};

  return ExecutePlanAndCollectAsTable(std::move(hashjoin));
}

// (Doc section: HashJoin Example)

// (Doc section: KSelect Example)

/// \brief An example showing a select-k node
///
/// Source-KSelect
/// This example shows how K number of elements can be selected
/// either from the top or bottom. The output node is a modified
/// sink node where output can be obtained as a table.
arrow::Status SourceKSelectExample() {
  ARROW_ASSIGN_OR_RAISE(auto input, MakeGroupableBatches());
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<ac::ExecPlan> plan,
                        ac::ExecPlan::Make(*cp::threaded_exec_context()));
  arrow::AsyncGenerator<std::optional<cp::ExecBatch>> sink_gen;

  ARROW_ASSIGN_OR_RAISE(
      ac::ExecNode * source,
      ac::MakeExecNode("source", plan.get(), {},
                       ac::SourceNodeOptions{input.schema, input.gen()}));

  cp::SelectKOptions options = cp::SelectKOptions::TopKDefault(/*k=*/2, {"i32"});

  ARROW_RETURN_NOT_OK(ac::MakeExecNode("select_k_sink", plan.get(), {source},
                                       ac::SelectKSinkNodeOptions{options, &sink_gen}));

  auto schema = arrow::schema(
      {arrow::field("i32", arrow::int32()), arrow::field("str", arrow::utf8())});

  return ExecutePlanAndCollectAsTableWithCustomSink(plan, schema, sink_gen);
}

// (Doc section: KSelect Example)

// (Doc section: Write Example)

/// \brief An example showing a write node
/// \param file_path The destination to write to
///
/// Scan-Filter-Write
/// This example shows how scan node can be used to load the data
/// and after processing how it can be written to disk.
arrow::Status ScanFilterWriteExample(const std::string& file_path) {
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::dataset::Dataset> dataset, GetDataset());

  auto options = std::make_shared<arrow::dataset::ScanOptions>();
  // empty projection
  options->projection = cp::project({}, {});

  auto scan_node_options = arrow::dataset::ScanNodeOptions{dataset, options};

  ac::Declaration scan{"scan", std::move(scan_node_options)};

  arrow::AsyncGenerator<std::optional<cp::ExecBatch>> sink_gen;

  std::string root_path = "";
  std::string uri = "file://" + file_path;
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::fs::FileSystem> filesystem,
                        arrow::fs::FileSystemFromUri(uri, &root_path));

  auto base_path = root_path + "/parquet_dataset";
  // Uncomment the following line, if run repeatedly
  // ARROW_RETURN_NOT_OK(filesystem->DeleteDirContents(base_path));
  ARROW_RETURN_NOT_OK(filesystem->CreateDir(base_path));

  // The partition schema determines which fields are part of the partitioning.
  auto partition_schema = arrow::schema({arrow::field("a", arrow::int32())});
  // We'll use Hive-style partitioning,
  // which creates directories with "key=value" pairs.

  auto partitioning =
      std::make_shared<arrow::dataset::HivePartitioning>(partition_schema);
  // We'll write Parquet files.
  auto format = std::make_shared<arrow::dataset::ParquetFileFormat>();

  arrow::dataset::FileSystemDatasetWriteOptions write_options;
  write_options.file_write_options = format->DefaultWriteOptions();
  write_options.filesystem = filesystem;
  write_options.base_dir = base_path;
  write_options.partitioning = partitioning;
  write_options.basename_template = "part{i}.parquet";

  arrow::dataset::WriteNodeOptions write_node_options{write_options};

  ac::Declaration write{"write", {std::move(scan)}, std::move(write_node_options)};

  // Since the write node has no output we simply run the plan to completion and the
  // data should be written
  ARROW_RETURN_NOT_OK(ac::DeclarationToStatus(std::move(write)));

  std::cout << "Dataset written to " << base_path << std::endl;
  return arrow::Status::OK();
}

// (Doc section: Write Example)

// (Doc section: Union Example)

/// \brief An example showing a union node
///
/// Source-Union-Table
/// This example shows how a union operation can be applied on two
/// data sources. The output is collected into a table.
arrow::Status SourceUnionSinkExample() {
  ARROW_ASSIGN_OR_RAISE(auto basic_data, MakeBasicBatches());

  ac::Declaration lhs{"source",
                      ac::SourceNodeOptions{basic_data.schema, basic_data.gen()}};
  lhs.label = "lhs";
  ac::Declaration rhs{"source",
                      ac::SourceNodeOptions{basic_data.schema, basic_data.gen()}};
  rhs.label = "rhs";
  ac::Declaration union_plan{
      "union", {std::move(lhs), std::move(rhs)}, ac::ExecNodeOptions{}};

  return ExecutePlanAndCollectAsTable(std::move(union_plan));
}

// (Doc section: Union Example)

// (Doc section: Table Sink Example)

/// \brief An example showing a table sink node
///
/// TableSink Example
/// This example shows how a table_sink can be used
/// in an execution plan. This includes a source node
/// receiving data as batches and the table sink node
/// which emits the output as a table.
arrow::Status TableSinkExample() {
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<ac::ExecPlan> plan,
                        ac::ExecPlan::Make(*cp::threaded_exec_context()));

  ARROW_ASSIGN_OR_RAISE(auto basic_data, MakeBasicBatches());

  auto source_node_options = ac::SourceNodeOptions{basic_data.schema, basic_data.gen()};

  ARROW_ASSIGN_OR_RAISE(ac::ExecNode * source,
                        ac::MakeExecNode("source", plan.get(), {}, source_node_options));

  std::shared_ptr<arrow::Table> output_table;
  auto table_sink_options = ac::TableSinkNodeOptions{&output_table};

  ARROW_RETURN_NOT_OK(
      ac::MakeExecNode("table_sink", plan.get(), {source}, table_sink_options));
  // validate the ExecPlan
  ARROW_RETURN_NOT_OK(plan->Validate());
  std::cout << "ExecPlan created : " << plan->ToString() << std::endl;
  // start the ExecPlan
  plan->StartProducing();

  // Wait for the plan to finish
  auto finished = plan->finished();
  RETURN_NOT_OK(finished.status());
  std::cout << "Results : " << output_table->ToString() << std::endl;
  return arrow::Status::OK();
}

// (Doc section: Table Sink Example)

// (Doc section: RecordBatchReaderSource Example)

/// \brief An example showing the usage of a RecordBatchReader as the data source.
///
/// RecordBatchReaderSourceSink Example
/// This example shows how a record_batch_reader_source can be used
/// in an execution plan. This includes the source node
/// receiving data from a TableRecordBatchReader.

arrow::Status RecordBatchReaderSourceSinkExample() {
  ARROW_ASSIGN_OR_RAISE(auto table, GetTable());
  std::shared_ptr<arrow::RecordBatchReader> reader =
      std::make_shared<arrow::TableBatchReader>(table);
  ac::Declaration reader_source{"record_batch_reader_source",
                                ac::RecordBatchReaderSourceNodeOptions{reader}};
  return ExecutePlanAndCollectAsTable(std::move(reader_source));
}

// (Doc section: RecordBatchReaderSource Example)

enum ExampleMode {
  SOURCE_SINK = 0,
  TABLE_SOURCE_SINK = 1,
  SCAN = 2,
  FILTER = 3,
  PROJECT = 4,
  SCALAR_AGGREGATION = 5,
  GROUP_AGGREGATION = 6,
  CONSUMING_SINK = 7,
  ORDER_BY_SINK = 8,
  HASHJOIN = 9,
  KSELECT = 10,
  WRITE = 11,
  UNION = 12,
  TABLE_SOURCE_TABLE_SINK = 13,
  RECORD_BATCH_READER_SOURCE = 14,
  PROJECT_SEQUENCE = 15
};

int main(int argc, char** argv) {
  if (argc < 3) {
    // Fake success for CI purposes.
    return EXIT_SUCCESS;
  }

  std::string base_save_path = argv[1];
  int mode = std::atoi(argv[2]);
  arrow::Status status;
  // ensure arrow::dataset node factories are in the registry
  arrow::dataset::internal::Initialize();
  switch (mode) {
    case SOURCE_SINK:
      PrintBlock("Source Sink Example");
      status = SourceSinkExample();
      break;
    case TABLE_SOURCE_SINK:
      PrintBlock("Table Source Sink Example");
      status = TableSourceSinkExample();
      break;
    case SCAN:
      PrintBlock("Scan Example");
      status = ScanSinkExample();
      break;
    case FILTER:
      PrintBlock("Filter Example");
      status = ScanFilterSinkExample();
      break;
    case PROJECT:
      PrintBlock("Project Example");
      status = ScanProjectSinkExample();
      break;
    case PROJECT_SEQUENCE:
      PrintBlock("Project Example (using Declaration::Sequence)");
      status = ScanProjectSequenceSinkExample();
      break;
    case GROUP_AGGREGATION:
      PrintBlock("Aggregate Example");
      status = SourceGroupAggregateSinkExample();
      break;
    case SCALAR_AGGREGATION:
      PrintBlock("Aggregate Example");
      status = SourceScalarAggregateSinkExample();
      break;
    case CONSUMING_SINK:
      PrintBlock("Consuming-Sink Example");
      status = SourceConsumingSinkExample();
      break;
    case ORDER_BY_SINK:
      PrintBlock("OrderBy Example");
      status = SourceOrderBySinkExample();
      break;
    case HASHJOIN:
      PrintBlock("HashJoin Example");
      status = SourceHashJoinSinkExample();
      break;
    case KSELECT:
      PrintBlock("KSelect Example");
      status = SourceKSelectExample();
      break;
    case WRITE:
      PrintBlock("Write Example");
      status = ScanFilterWriteExample(base_save_path);
      break;
    case UNION:
      PrintBlock("Union Example");
      status = SourceUnionSinkExample();
      break;
    case TABLE_SOURCE_TABLE_SINK:
      PrintBlock("TableSink Example");
      status = TableSinkExample();
      break;
    case RECORD_BATCH_READER_SOURCE:
      PrintBlock("RecordBatchReaderSource Example");
      status = RecordBatchReaderSourceSinkExample();
      break;
    default:
      break;
  }

  if (status.ok()) {
    return EXIT_SUCCESS;
  } else {
    std::cout << "Error occurred: " << status.message() << std::endl;
    return EXIT_FAILURE;
  }
}

Acero User’s Guide#

Using Acero#

Creating a Plan#

Using Substrait#

Programmatic Plan Creation#

Executing a Plan#

DeclarationToTable#

DeclarationToReader#

DeclarationToStatus#

Running a Plan Directly#

Providing Input#

Available ExecNode Implementations#

Sources#

Compute Nodes#

Arrangement Nodes#

Sink Nodes#

Examples#

source#

table_source#

filter#

project#

aggregate#

sink#

consuming_sink#

order_by_sink#

select_k_sink#

table_sink#

scan#

write#

union#

hash_join#

Summary#

Available `ExecNode` Implementations#

`source`#

`table_source`#

`filter`#

`project`#

`aggregate`#

`sink`#

`consuming_sink`#

`order_by_sink`#

`select_k_sink`#

`table_sink`#

`scan`#

`write`#

`union`#

`hash_join`#