Adopting Apache Arrow at CloudQuery
04 May 2023
By Yevgeny Pats
This post is a collaboration with CloudQuery and cross-posted on the CloudQuery blog.
CloudQuery is an open source high performance ELT framework written in Go. We previously discussed some of the architecture and design decisions that we took to build a performant ELT framework. A type system is a key component for creating a performant and scalable ELT framework where sources and destinations are decoupled. In this blog post we will go through why we decided to adopt Apache Arrow as our type system and replace our in-house implementation.
What is a Type System?
Let’s quickly recap what a type system is and why an ELT framework needs one. At a very high level, an ELT framework extracts data from some source and moves it to some destination with a specific schema.
API ---> [Source Plugin] -----> [Destination Plugin] -----> [Destination Plugin] gRPC
Sources and destinations are decoupled and communicate via gRPC. This is crucial to allowing the addition of new destinations and updating old destinations without requiring updates to source plugin code (which otherwise would introduce an unmaintainable architecture).
This is where a type system comes in. Source plugins extract information from APIs in the most performant way possible, defining a schema and then transforming the result from the API (JSON or any other format) to a well-defined type system. The destination plugin can then easily create the schema for its database and transform the incoming data to the destination types. So to recap, the source plugin sends mainly two things to a destination: 1) the schema 2) the records that fit the defined schema. In Arrow terminology, these are a schema and a record batch.
Before Arrow, we used our own type system that supported more than 14 types. This served us well, but we started to hit limitations in various use-cases. For example, in database to database replication, we needed to support many more types, including nested types. Also, performance-wise, lots of the time spent in an ELT process is around converting data from one format to another, so we wanted to take a step back and see if we can avoid this famous XKCD (by building yet another format):
This is where Arrow comes in. Apache Arrow defines a language-independent columnar format for flat and hierarchical data, and brings the following advantages:
- Performance: Arrow adoption is rising especially in columnar based databases (DuckDB, ClickHouse, BigQuery) and file formats (Parquet) which makes it easier to write CloudQuery destination or source plugins for databases that already support Arrow as well as much more efficient as we remove the need for additional serialization and transformation step. Moreover, just the performance of sending Arrow format from source plugin to destination is already more performant and memory efficient, given its “zero-copy” nature and not needing serialization/deserialization.
- Rich Data Types: Arrow supports more than 35 types including composite types (i.e. lists, structs and maps of all the available types) and ability to extend the type system with custom types. Also, there is already built-in mapping from/to the arrow type system and the parquet type system (including nested types) which already supported in many of the arrow libraries as explained here.
Adopting Apache Arrow as the CloudQuery in-memory type system enables us to gain better performance, data interoperability and developer experience. Some plugins that are going to gain an immediate boost of rich type systems are our database-to-database replication plugins such as PostgreSQL CDC source plugin (and all database destinations) that are going to get support for all available types including nested ones.
We are excited about this step and joining the growing Arrow community. We already contributed more than 30 upstream pull requests that were quickly reviewed by the Arrow maintainers, thank you!