Quarterly Roadmap

A quarterly roadmap will be published to give the DataFusion community visibility into the priorities of the project's contributors. This roadmap is not binding.

2023 Q4

  • Improve data output capability (COPY, INSERT and DataFrame) #6569

  • Implementation of ARRAY types and related functions #6980

  • Write an industrial paper about DataFusion for SIGMOD #6782

2022 Q2

DataFusion Core

  • IO Improvements

    • Reading, registering, and writing more file formats from both DataFrame API and SQL

    • Additional options for IO including partitioning and metadata support

  • Work Scheduling

    • Improve predictability, observability and performance of IO and CPU-bound work

    • Develop a more explicit story for managing parallelism during plan execution

  • Memory Management

    • Add more operators for memory-limited execution

  • Performance

    • Incorporate row-format into operators such as aggregate

    • Add row-format benchmarks

    • Explore JIT-compiling complex expressions

    • Explore LLVM for JIT, with inline Rust functions as the primary goal

    • Improve performance of Sort and Merge using row format / JIT expressions

  • Documentation

    • General improvements to DataFusion website

    • Publish design documents

  • Streaming

    • Create StreamProvider trait

Ballista

  • Make production ready

    • Shuffle file cleanup

    • Fill functional gaps between DataFusion and Ballista

    • Improve task scheduling and data exchange efficiency

    • Better error handling

      • Task failure

      • Executor lost

      • Scheduler restart

    • Improve monitoring and logging

    • Auto-scaling support

  • Support for multi-scheduler deployments, initially for resiliency and fault tolerance but ultimately to support sharding for scalability and more efficient caching

  • Executor deployment grouping based on resource allocation

Extensions (datafusion-contrib)

DataFusion-Python

  • Add missing functionality to DataFrame and SessionContext

  • Improve documentation

DataFusion-S3

  • Create Python bindings to use with datafusion-python

DataFusion-Tui

  • Create multiple SQL editors

  • Expose more Context and query metadata

  • Support new data sources

    • BigTable, HDFS, HTTP APIs

DataFusion-BigTable

  • Create Python bindings to use with datafusion-python

  • Timestamp range predicate pushdown

  • Multi-threaded, partition-aware execution

  • Production-ready Rust SDK

DataFusion-Streams

  • Create experimental implementation of StreamProvider trait