Apache Arrow Overview

Apache Arrow is a software development platform for building high performance applications that process and transport large data sets. It is designed to both improve the performance of analytical algorithms and the efficiency of moving data from one system or programming language to another.

A critical component of Apache Arrow is its in-memory columnar format, a standardized, language-agnostic specification for representing structured, table-like datasets in-memory. This data format has a rich data type system (included nested and user-defined data types) designed to support the needs of analytic database systems, data frame libraries, and more.

Columnar is Fast

The Apache Arrow format allows computational routines and execution engines to maximize their efficiency when scanning and iterating large chunks of data. In particular, the contiguous columnar layout enables vectorization using the latest SIMD (Single Instruction, Multiple Data) operations included in modern processors.

SIMD
common data layer common data layer

Standardization Saves

Without a standard columnar data format, every database and language has to implement its own internal data format. This generates a lot of waste. Moving data from one system to another involves costly serialization and deserialization. In addition, common algorithms must often be rewritten for each data format.

Arrow's in-memory columnar data format is an out-of-the-box solution to these problems. Systems that use or support Arrow can transfer data between them at little-to-no cost. Moreover, they don't need to implement custom connectors for every other system. On top of these savings, a standardized memory format facilitates reuse of libraries of algorithms, even across languages.

Arrow Libraries

The Arrow project contains libraries that enable you to work with data in the Arrow columnar format in many languages. The C++, C#, Go, Java, JavaScript, Julia, Rust, and Swift libraries contain distinct implementations of the Arrow format. These libraries are integration-tested against each other to ensure their fidelity to the format. In addition, Arrow libraries for C (GLib), MATLAB, Python, R, and Ruby are built on top of the C++ library.

These official libraries enable third-party projects to work with Arrow data without having to implement the Arrow columnar format themselves. They also contain many software components that assist with systems problems related to getting data in and out of remote storage systems and moving Arrow-formatted data over network interfaces, among other use cases.