Introducing Apache Arrow DataFusion Contrib

Published 21 Mar 2022
By The Apache Arrow PMC


Apache Arrow DataFusion is an extensible query execution framework, written in Rust, that uses Apache Arrow as its in-memory format.

When you want to extend your Rust project with SQL support, a DataFrame API, or the ability to read and process Parquet, JSON, Avro, or CSV data, DataFusion is definitely worth checking out. DataFusion’s pluggable design makes it particularly easy to build extensions at various points.

DataFusion’s SQL, DataFrame, and manual PlanBuilder APIs give users access to a sophisticated query optimizer and execution engine capable of fast, resource-efficient, parallel execution that takes full advantage of today’s multicore hardware. Being written in Rust means DataFusion can offer both the safety of dynamic languages and the resource efficiency of a compiled language.

The DataFusion team is pleased to announce the creation of the DataFusion-Contrib GitHub organization to support and accelerate other projects. While the core DataFusion library remains under Apache governance, the contrib organization provides a more flexible testing ground for new DataFusion features and a home for DataFusion extensions. With this announcement, we are pleased to introduce the following inaugural DataFusion-Contrib repositories.


datafusion-python

This project provides Python bindings to the core Rust implementation of DataFusion, allowing users to:

  • Work with familiar SQL or DataFrame APIs to run queries in a safe, multi-threaded environment, returning results in Python
  • Create User Defined Functions and User Defined Aggregate Functions for complex operations
  • Pay no overhead copying between Python and the underlying Rust execution engine (by way of Apache Arrow arrays)

Upcoming enhancements

The team is focusing on exposing more features from the underlying Rust implementation of DataFusion and improving documentation.

How to install

From pip

pip install datafusion



datafusion-objectstore-s3

This crate provides an ObjectStore implementation for querying data stored in S3 or S3-compatible storage, making it almost as easy to query data that lives on S3 as data that lives in local files.

  • Create an S3FileSystem and register it as part of a DataFusion ExecutionContext
  • Register files or directories stored on S3 with ctx.register_listing_table

Upcoming enhancements

The current priority is adding Python bindings for S3FileSystem. After that, there will be async improvements as DataFusion adopts more async functionality, and we are looking into supporting S3 Select.

How to Install

Add the following to the Cargo.toml of your Rust project that uses DataFusion:

datafusion-objectstore-s3 = "0.1.0"


datafusion-substrait

Substrait is an emerging standard that provides a cross-language serialization format for relational algebra (e.g. expressions and query plans).

This crate provides a Substrait producer and consumer for DataFusion. A producer converts a DataFusion logical plan into a Substrait protobuf and a consumer does the reverse.

Examples of how to use this crate can be found in its repository.

Potential Use Cases

  • Replace custom DataFusion protobuf serialization
  • Make it easier to pass query plans over FFI boundaries, such as from Python to Rust
  • Allow Apache Calcite query plans to be executed in DataFusion
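The round-trip contract (a consumer applied to a producer's output reproduces the original plan) can be illustrated with a toy stand-in that uses JSON in place of the real Substrait protobuf; the plan shape and function names below are purely illustrative and are not this crate's API:

```python
import json

# Toy logical plan: nested dicts, here a projection over a table scan.
plan = {
    "op": "projection",
    "exprs": ["a", "b"],
    "input": {"op": "scan", "table": "t"},
}

def produce(plan):
    """Toy 'producer': serialize a plan to a language-neutral byte string."""
    return json.dumps(plan, sort_keys=True).encode("utf-8")

def consume(payload):
    """Toy 'consumer': rebuild the plan from the serialized form."""
    return json.loads(payload.decode("utf-8"))

assert consume(produce(plan)) == plan  # the round trip preserves the plan
```

The real crate does the same thing with Substrait protobuf messages and DataFusion logical plans, which is what makes the cross-language and FFI use cases above possible.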


datafusion-bigtable

This crate implements Bigtable as a data source and physical executor for DataFusion queries. It currently supports UTF-8 strings and 64-bit big-endian signed integers in Bigtable. From a SQL perspective, it supports both simple and composite row keys with the =, IN, and BETWEEN operators, as well as projection pushdown. The physical execution of queries is handled by this crate, while any subsequent aggregations, group-bys, or joins are handled by DataFusion.

Upcoming Enhancements

  • Predicate pushdown
    • Value range
    • Value Regex
    • Timestamp range
  • Multithreaded execution
  • Partition-aware execution
  • Production readiness

How to Install

Add the following to the Cargo.toml of your Rust project that uses DataFusion:

datafusion-bigtable = "0.1.0"


datafusion-objectstore-hdfs

This crate introduces HadoopFileSystem as a remote ObjectStore, which enables querying files stored in HDFS. HDFS access is provided by the fs-hdfs library.


This crate provides an e-graph-based optimization framework for DataFusion, built on the Rust egg library. An e-graph is a data structure that powers the equality saturation optimization technique.

As context, the optimizer framework within DataFusion is currently under review, with the objective of implementing a more strategic long-term solution that is more efficient and simpler to develop.

Some of the benefits of using egg within DataFusion are:

  • Implements optimized algorithms that are hard to match with manually written optimization passes
  • Makes it easy and less verbose to add optimization rules
  • Plugin framework to add more complex optimizations
  • egg does not depend on rule order and can reach a higher level of optimization by applying multiple rules at the same time until the process converges
  • Allows for cost-based optimizations
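As a toy flavor of the idea (real egg keeps every equivalent form of an expression in an e-graph and extracts the best one afterwards, rather than rewriting destructively), here is a small rule set applied repeatedly until the expression stops changing; all names are illustrative:

```python
# Toy simplifier: expressions are (op, left, right) tuples, e.g.
# ("+", ("*", "x", 0), "y"). Rules re-fire until a fixpoint is reached,
# so the final result does not depend on the order the rules are listed.

def simplify(expr):
    """One bottom-up pass of algebraic rewrite rules."""
    if not isinstance(expr, tuple):
        return expr
    op, a, b = expr
    a, b = simplify(a), simplify(b)
    if op == "*" and (a == 0 or b == 0):
        return 0          # x * 0 -> 0
    if op == "*" and b == 1:
        return a          # x * 1 -> x
    if op == "*" and a == 1:
        return b          # 1 * x -> x
    if op == "+" and b == 0:
        return a          # x + 0 -> x
    if op == "+" and a == 0:
        return b          # 0 + x -> x
    return (op, a, b)

def saturate(expr):
    """Re-apply the rules until the expression converges."""
    prev = None
    while expr != prev:
        prev, expr = expr, simplify(expr)
    return expr

print(saturate(("+", ("*", "x", 0), "y")))  # the whole tree collapses to "y"
```

Equality saturation generalizes this: instead of committing to one rewrite at each step, it records all equivalent forms and then picks the cheapest, which is what enables the cost-based optimizations mentioned above.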

This is an exciting new area for DataFusion with lots of opportunity for community involvement!


datafusion-tui

DataFusion-tui, aka dft, provides a feature-rich terminal application for using DataFusion. It draws inspiration and several features from datafusion-cli; in contrast to datafusion-cli, the objective of this tool is to provide a light SQL IDE experience for querying data with DataFusion. Currently implemented features include:

  • Tab Management to provide clean and structured organization of DataFusion queries, results, ExecutionContext information, and logs
    • SQL Editor
      • Text editor for writing SQL queries
    • Query History
      • History of executed queries, their execution time, and the number of returned rows
    • ExecutionContext information
      • Expose information on which physical optimizers are used and which ExecutionConfig settings are set
    • Logs
      • Logs from dft, DataFusion, and any dependent libraries
  • Support for custom ObjectStores
    • S3
  • Preload DDL from ~/.datafusionrc to make a local “database” available at startup
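For example, a ~/.datafusionrc might contain DDL such as the following; the table name and file path are purely illustrative:

```sql
-- Loaded when dft starts, so the table is immediately queryable
CREATE EXTERNAL TABLE taxi
STORED AS PARQUET
LOCATION 'data/taxi.parquet';
```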

Upcoming Enhancements

  • SQL Editor
    • Command to write query results to file
    • Multiple SQL editor tabs
  • Expose more information from ExecutionContext
  • A help tab that provides information on functions
  • Query custom TableProviders such as DeltaTable or BigTable


DataFusion-Stream is a new testing ground for creating a StreamProvider in DataFusion that will enable querying streaming data sources such as Apache Kafka. The implementation of this feature is currently being designed and is under active review. Once the design is finalized, the trait and attendant data structures will be added back to the core DataFusion crate.


datafusion-java

This project created an initial set of Java bindings to DataFusion. The project is currently in maintenance mode and is looking for maintainers to drive future development.

How to Get Involved

If you are interested in contributing to DataFusion and learning about state-of-the-art query processing, we would love to have you join us on the journey! You can help by trying out DataFusion on some of your own data and projects and letting us know how it goes, or by contributing a PR with documentation, tests, or code. A list of open issues suitable for beginners is available in the issue tracker.

The best way to find out about creating new extensions within DataFusion-Contrib is reaching out on the #arrow-rust channel of the Apache Software Foundation Slack workspace.

You can also check out our new Communication Doc for more ways to engage with the community.

Links for each DataFusion-Contrib repository are provided above if you would like to contribute to those.