A New Development Workflow for Arrow's Rust Implementation
04 May 2021
By Ruan Pearce-Authers (ruanpa)
The Apache Arrow Rust community is excited to announce that its migration to a new development workflow is now complete! If you’re considering Rust as a language for working with columnar data, read on and see how your use case might benefit from our new and improved project setup.
In recent months, members of the community have been working closely with Arrow’s Project Management Committee and other contributors to expand the set of available workflows for Arrow implementations. The goal was to define a new development process that ultimately:
- Enables a faster release cadence that adheres to SemVer where appropriate
- Encourages maximum participation from the wider community with unified tooling
- Ensures that we continue to uphold the tenets of The Apache Way
If you’re just here for the highlights, the major outcomes of these discussions are as follows:
- The Rust projects have moved to separate repositories, outside the main Arrow monorepo
- The Rust community will use GitHub Issues for tracking feature development and issues, replacing the Jira instance maintained by the Apache Software Foundation (ASF)
- DataFusion and Ballista will follow a new release cycle, independent of the main Arrow releases
But why, as a community, have we decided to change our processes? Let’s take a slightly more in-depth look at the Rust implementation’s needs.
The Rust implementation of Arrow actually consists of several distinct projects, or in Rust parlance, “crates”. In addition to the core crates, which include arrow and parquet, we also maintain:
- DataFusion: an extensible in-memory query execution engine using Arrow as its format
- Ballista: a distributed compute platform, powered by Apache Arrow and DataFusion
Whilst these projects are all closely related, with many shared contributors, they’re very much at different stages in their respective lifecycles. The core Arrow crate, as an implementation of a spec, has strict compatibility requirements with other versions of Arrow, and this is tested via rigorous cross-language integration tests.
However, at the other end of the spectrum, DataFusion and Ballista are still nascent projects in their own right that undergo frequent backwards-incompatible changes. In the old workflow, DataFusion was released in lockstep with Arrow. DataFusion users often need newly contributed features or bugfixes on a tighter schedule than Arrow releases allow, so many people in the community simply resorted to depending on our GitHub repository directly, rather than on properly versioned builds from crates.io, Rust’s package registry.
Ultimately, the decision was made to split the Rust crates into two separate repositories: arrow-rs for the core Arrow functionality, and arrow-datafusion for DataFusion and Ballista. There’s still work to be done on determining the exact release workflows for the latter, but this leaves us in a much better position to meet the broader Rust community’s expectations of crate versioning and stability.
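To make those versioning expectations a little more concrete, here is a minimal sketch of the compatibility rule that Cargo applies to default ("caret") version requirements: an update is treated as compatible only if it leaves the left-most non-zero component of the version unchanged. The `Version` type and `caret_compatible` function are hypothetical names invented for this illustration, not part of any Arrow crate or of Cargo itself, and the sketch ignores pre-release and build metadata.

```rust
/// A bare MAJOR.MINOR.PATCH version (hypothetical helper type for this sketch;
/// pre-release tags and build metadata are deliberately not supported).
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct Version {
    major: u64,
    minor: u64,
    patch: u64,
}

impl Version {
    /// Parses exactly three dot-separated integers; returns None otherwise.
    fn parse(s: &str) -> Option<Version> {
        let mut parts = s.split('.');
        let major: u64 = parts.next()?.parse().ok()?;
        let minor: u64 = parts.next()?.parse().ok()?;
        let patch: u64 = parts.next()?.parse().ok()?;
        if parts.next().is_some() {
            return None; // more than three components
        }
        Some(Version { major, minor, patch })
    }
}

/// Returns true if `candidate` would satisfy a caret requirement on `required`,
/// i.e. it is at least `required` and keeps the left-most non-zero component.
fn caret_compatible(required: Version, candidate: Version) -> bool {
    if candidate < required {
        return false; // never resolve to something older than requested
    }
    if required.major != 0 {
        // 1.0 and later: any release with the same major version is compatible.
        candidate.major == required.major
    } else if required.minor != 0 {
        // 0.x releases: a minor bump counts as a breaking change.
        candidate.major == 0 && candidate.minor == required.minor
    } else {
        // 0.0.x releases: every release counts as a breaking change.
        candidate == required
    }
}
```

Under this rule, a crate that is still at 0.x can publish breaking changes by bumping its minor version, and a 1.0-or-later crate by bumping its major version. In both cases, downstream users who depend on a published version only pick up the new release when they opt in.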
All Apache projects are built on volunteer contribution; it’s a core principle of both the ASF and open-source software development more broadly. One point of friction in the previous workflow, for the Rust community in particular, was the requirement to log issues in Arrow’s Jira project. This step required would-be contributors to first register an account, and then receive a permissions grant to manage tickets.
To streamline this process for new community members, we’ve decided to migrate to GitHub Issues for tracking both new development work and known bugs, and we have bootstrapped our new repositories by importing their respective tickets from Jira. Creating issues to track non-trivial proposed features and enhancements is still required; this creates an opportunity for community review and helps ensure that feedback is delivered as early in the process as possible. We hope that this strikes a better balance between organization and accessibility for prospective contributors.
To further improve the onboarding flow for new Arrow contributors, we have started the process of labeling select issues as “good first issue” in arrow-rs and arrow-datafusion. These issues are small in scope but still serve as valuable contributions to the project, and help new community members to get familiar with our development workflows and tools.
As a final note: nothing here is intended as prescriptive advice. As a community, we’ve decided that these processes are the best fit for the current status of our projects, but this may change over time! There is, after all, no silver bullet for software engineering.