Apache Arrow 0.16.0 Release

Published 12 Feb 2020
By The Apache Arrow PMC (pmc)

The Apache Arrow team is pleased to announce the 0.16.0 release. This covers about 4 months of development work and includes 735 resolved issues from 99 distinct contributors. See the Install Page to learn how to get the libraries for your platform.

The release notes below are not exhaustive and only expose selected highlights of the release. Many other bugfixes and improvements have been made: we refer you to the complete changelog.

New committers

Since the 0.15.0 release, we’ve added two new committers:

Thank you for all your contributions!

Columnar Format Notes

We still have work to do to complete comprehensive columnar format integration testing between the Java and C++ libraries. Once this work is completed, we intend to make a 1.0.0 release with forward and backward compatibility guarantees.

We clarified some ambiguities about dictionary encoding in the specification. Work is ongoing to implement the clarified behavior in the Arrow libraries.

Arrow Flight RPC notes

Flight development work has recently focused on robustness and stability. If you are not yet familiar with Flight, read the introductory blog post from October.

We are also discussing adding a “bidirectional” RPC that would allow request-response workflows, in which both client and server send data streams, to be performed in a single RPC request.

C++ notes

Some work has been done to make the default build configuration of Arrow C++ as lean as possible. The Arrow C++ core can now be built without any external dependencies other than a new enough C++ compiler (gcc 4.9 or higher). Notably, Boost is no longer required. We invested effort to vendor some small essential dependencies: Flatbuffers, double-conversion, and uriparser. Many optional features requiring external libraries, like compression and GLog integration, are now disabled by default. Several subcomponents of the C++ project like the filesystem API, CSV, compute, dataset and JSON layers, as well as command-line utilities, are now disabled by default. The only toolchain dependency enabled by default is jemalloc, the recommended memory allocator, but this can also be disabled if desired. For illustration, see the example minimal build script and Dockerfile.

When enabled, the default jemalloc configuration has been tweaked to return memory more aggressively to the OS (ARROW-6910, ARROW-6994). We welcome feedback from users about our memory allocation configuration and performance in applications.

The array validation facilities have been vastly expanded and now exist in two flavors: the Validate method does a light-weight validation that’s O(1) in array size, while the potentially O(N) method ValidateFull does thorough data validation (ARROW-6157).

The IO APIs now use Result<T> when returning both a Status and result value, rather than taking a pointer-out function parameter (ARROW-7235).

C++: CSV

An option has been added to attempt automatic dictionary encoding of string columns while reading a CSV file, until a cardinality limit is reached. When it succeeds, reading can be faster and the resulting Arrow data is much more memory-efficient (ARROW-3408).

The CSV APIs now use Result<T> when returning both a Status and result value, rather than taking a pointer-out function parameter (ARROW-7236).

C++: Datasets

The 0.16 release introduces the Datasets API to the C++ library, along with bindings in Python and R. This API allows you to treat multiple files as a single logical dataset entity and make efficient selection queries against it. This release includes support for the Parquet and Arrow IPC file formats. Factory objects allow you to discover files in a directory recursively, inspect the schemas in the files, and perform some basic schema unification. You may specify how file path segments map to partitions, and there is support for auto-detecting some partition information, including Hive-style partitioning. The Datasets API includes a filter expression syntax as well as column selection. These are evaluated with predicate pushdown, and for Parquet, evaluation is pushed down to row groups.

C++: Filesystem layer

An HDFS implementation of the FileSystem class is available (ARROW-6720). We plan to deprecate the prior bespoke C++ HDFS class in favor of the standardized filesystem API.

The filesystem APIs now use Result<T> when returning both a Status and result value, rather than taking a pointer-out function parameter (ARROW-7161).

C++: IPC

The Arrow IPC reader is being fuzzed continuously by the OSS-Fuzz infrastructure, to detect undesirable behavior on invalid or malicious input. Several issues have already been found and fixed.

C++: Parquet

Modular encryption is now supported (PARQUET-1300).

A performance regression when reading a file with a large number of columns has been fixed (ARROW-6876, ARROW-7059), as well as several bugs (PARQUET-1766, ARROW-6895).

C++: Tensors

CSC sparse matrices are supported (ARROW-4225).

The Tensor APIs now use Result<T> when returning both a Status and result value, rather than taking a pointer-out function parameter (ARROW-7420).

C# Notes

There were a number of C# bug fixes this release. Note that the C# library is not yet being tested in CI against the other native Arrow implementations (integration tests). We are looking for more contributors for the C# project to help with this and other new feature development.

Java notes

  • Added prose documentation describing how to work with the Java libraries.
  • Some additional algorithms have been added to the “contrib” algorithms package, such as multithreaded searching of ValueVectors.
  • The memory modules have been refactored so that non-Netty allocators can now be used.
  • A new utility for populating ValueVectors more concisely for testing was introduced.
  • The “contrib” Avro adapter now supports converting Avro logical types to the corresponding Arrow types.
  • Various bug fixes across all packages.

Python notes

pyarrow 0.16 will be the last release to support Python 2.7.

Python now has bindings for the datasets API (ARROW-6341) as well as the S3 (ARROW-6655) and HDFS (ARROW-7310) filesystem implementations.

The Duration (ARROW-5855) and Fixed Size List (ARROW-7261) types are exposed in Python.

Sparse tensors can be converted to dense tensors (ARROW-6624). They are also interoperable with the pydata/sparse and scipy.sparse libraries (ARROW-4223, ARROW-4224).

Pandas extension arrays now are able to roundtrip through Arrow conversion (ARROW-2428).

A memory leak when converting Arrow data to Pandas “object” data has been fixed (ARROW-6874).

Arrow is now tested against Python 3.8, and we now build manylinux2014 wheels for Python 3 (ARROW-7344).

R notes

This release includes a dplyr interface to Arrow Datasets, which lets you work efficiently with large, multi-file datasets as a single entity. See vignette("dataset", package = "arrow") for more.

Another major area of work in this release was to improve the installation experience on Linux. A source package installation (as from CRAN) will now handle its C++ dependencies automatically, with no system dependencies beyond what R requires. For common Linux distributions and versions, installation will retrieve a prebuilt static C++ library for inclusion in the package; where this binary is not available, the package executes a bundled script that should build the Arrow C++ library. See vignette("install", package = "arrow") for details.

For more on what’s in the 0.16 R package, see the R changelog.

Ruby and C GLib notes

Ruby and C GLib continue to follow the features in the C++ project.

Ruby includes the following improvements.

  • Improve CSV save performance (ARROW-7474).
  • Add support for saving/loading TSV (ARROW-7454).
  • Add Arrow::Schema#build_expression to improve building Gandiva::Expression (ARROW-6619).

C GLib

C GLib includes the following changes.

  • Add support for LargeList, LargeBinary, and LargeString (ARROW-6285, ARROW-6286).
  • Add filter and take API for GArrowTable, GArrowChunkedArray, and GArrowRecordBatch (ARROW-7110, ARROW-7111).
  • Add garrow_table_combine_chunks() API (ARROW-7369).

Rust notes

Support for Arrow data types has been improved, with the following array types now supported (ARROW-3690):

  • Fixed Size List and Fixed Size Binary arrays
  • A String array for UTF-8 strings, keeping the Binary array for general binary data
  • Duration and Interval arrays

Initial work on Arrow IPC support has been completed, with readers and writers for streams and files implemented (ARROW-5180).

Rust: DataFusion

Query execution has been reimplemented with an extensible physical query plan. This allows other projects to add other plans, such as for distributed computing or for specific database servers (ARROW-5227).

Added support for writing query results to CSV (ARROW-6274).

The new Parquet -> Arrow reader is now used to read Parquet files (ARROW-6700).

Various other query improvements have been implemented, especially on grouping and aggregate queries (ARROW-6689).

Rust: Parquet

The Arrow reader integration has been completed, allowing Parquet files to be read into Arrow memory (ARROW-4059).

Development notes

Arrow has moved away from Travis-CI and is now using Github Actions for PR-based continuous integration. This new CI configuration relies heavily on docker-compose, making it easier for developers to reproduce builds locally, thanks to tremendous work by Krisztián Szűcs (ARROW-7101).

Community Discussions Ongoing

There are a number of active discussions ongoing on the developer mailing list. We look forward to hearing from the community there.

  • Mandatory fields in IPC format: to ease input validation, it is being proposed to mark some fields in our Flatbuffers schema “required”. Those fields are already semantically required, but are not considered so by the generated Flatbuffers verifier. Before accepting this proposal, we need to ensure that it does not break binary compatibility with existing valid data.
  • The C Data Interface has not yet been formally adopted, though the community has reached consensus to move forward after addressing various design questions and concerns.
  • Guidelines for the use of “unsafe” in the Rust implementation are being discussed.

Introducing Apache Arrow Flight: A Framework for Fast Data Transport

Published 13 Oct 2019
By Wes McKinney (wesm)

Translations: 日本語

Over the last 18 months, the Apache Arrow community has been busy designing and implementing Flight, a new general-purpose client-server framework to simplify high performance transport of large datasets over network interfaces.

Flight initially is focused on optimized transport of the Arrow columnar format (i.e. “Arrow record batches”) over gRPC, Google’s popular HTTP/2-based general-purpose RPC library and framework. While we have focused on integration with gRPC, as a development framework Flight is not intended to be exclusive to gRPC.

One of the biggest features that sets Flight apart from other data transport frameworks is parallel transfers, allowing data to be streamed to or from a cluster of servers simultaneously. This enables developers to more easily create scalable data services that can serve a growing client base.

In the 0.15.0 Apache Arrow release, we have ready-to-use Flight implementations in C++ (with Python bindings) and Java. These libraries are suitable for beta users who are comfortable with API or protocol changes while we continue to refine some low-level details in the Flight internals.


Many people have experienced the pain associated with accessing large datasets over a network. There are many different transfer protocols and tools for reading datasets from remote data services, such as ODBC and JDBC. Over the last 10 years, file-based data warehousing in formats like CSV, Avro, and Parquet has become popular, but this also presents challenges as raw data must be transferred to local hosts before being deserialized.

The work we have done since the beginning of Apache Arrow holds exciting promise for accelerating data transport in a number of ways. The Arrow columnar format has key features that can help us:

  • It is an “on-the-wire” representation of tabular data that does not require deserialization on receipt
  • Its natural mode is that of “streaming batches”: larger datasets are transported a batch of rows at a time (called “record batches” in Arrow parlance). In this post we will talk about “data streams”: sequences of Arrow record batches using the project’s binary protocol
  • The format is language-independent and now has library support in 11 languages and counting.

Implementations of standard protocols like ODBC generally implement their own custom on-wire binary protocols that must be marshalled to and from each library’s public interface. The performance of ODBC or JDBC libraries varies greatly from case to case.

Our design goal for Flight is to create a new protocol for data services that uses the Arrow columnar format as both the over-the-wire data representation as well as the public API presented to developers. In doing so, we reduce or remove the serialization costs associated with data transport and increase the overall efficiency of distributed data systems. Additionally, two systems that are already using Apache Arrow for other purposes can communicate data to each other with extreme efficiency.

Flight Basics

The Arrow Flight libraries provide a development framework for implementing a service that can send and receive data streams. A Flight server supports several basic kinds of requests:

  • Handshake: a simple request to determine whether the client is authorized and, in some cases, to establish an implementation-defined session token to use for future requests
  • ListFlights: return a list of available data streams
  • GetSchema: return the schema for a data stream
  • GetFlightInfo: return an “access plan” for a dataset of interest, possibly requiring consuming multiple data streams. This request can accept custom serialized commands containing, for example, your specific application parameters.
  • DoGet: send a data stream to a client
  • DoPut: receive a data stream from a client
  • DoAction: perform an implementation-specific action and return any results, i.e. a generalized function call
  • ListActions: return a list of available action types

We take advantage of gRPC’s elegant “bidirectional” streaming support (built on top of HTTP/2 streaming) to allow clients and servers to send data and metadata to each other simultaneously while requests are being served.

A simple Flight setup might consist of a single server to which clients connect and make DoGet requests.

Flight Simple Architecture

Optimizing Data Throughput over gRPC

While using a general-purpose messaging library like gRPC has numerous specific benefits beyond the obvious ones (taking advantage of all the engineering that Google has done on the problem), some work was needed to improve the performance of transporting large datasets. Many kinds of gRPC users only deal with relatively small messages, for example.

The best-supported way to use gRPC is to define services in a Protocol Buffers (aka “Protobuf”) .proto file. A Protobuf plugin for gRPC generates gRPC service stubs that you can use to implement your applications. RPC commands and data messages are serialized using the Protobuf wire format. Because we use “vanilla gRPC and Protocol Buffers”, gRPC clients that are ignorant of the Arrow columnar format can still interact with Flight services and handle the Arrow data opaquely.

The main data-related Protobuf type in Flight is called FlightData. Reading and writing Protobuf messages in general is not free, so we implemented some low-level optimizations in gRPC in both C++ and Java to do the following:

  • Generate the Protobuf wire format for FlightData, including the Arrow record batch being sent, without going through any intermediate memory copying or serialization steps.
  • Reconstruct an Arrow record batch from the Protobuf representation of FlightData without any memory copying or deserialization. In fact, we intercept the encoded data payloads without allowing the Protocol Buffers library to touch them.

In a sense we are “having our cake and eating it, too”. Flight implementations having these optimizations will have better performance, while naive gRPC clients can still talk to the Flight service and use a Protobuf library to deserialize FlightData (albeit with some performance penalty).

As far as absolute speed, in our C++ data throughput benchmarks, we are seeing end-to-end TCP throughput in excess of 2-3GB/s on localhost without TLS enabled. This benchmark shows a transfer of ~12 gigabytes of data in about 4 seconds:

$ ./arrow-flight-benchmark --records_per_stream 100000000
Bytes read: 12800000000
Nanos: 3900466413
Speed: 3129.63 MB/s

From this we can conclude that the machinery of Flight and gRPC adds relatively little overhead, and it suggests that many real-world applications of Flight will be bottlenecked on network bandwidth.

Horizontal Scalability: Parallel and Partitioned Data Access

Many distributed database-type systems make use of an architectural pattern where the results of client requests are routed through a “coordinator” and sent to the client. Aside from the obvious efficiency issues of transporting a dataset multiple times on its way to a client, it also presents a scalability problem for getting access to very large datasets.

We wanted Flight to enable systems to create horizontally scalable data services without having to deal with such bottlenecks. A client request to a dataset using the GetFlightInfo RPC returns a list of endpoints, each of which contains a server location and a ticket to send that server in a DoGet request to obtain a part of the full dataset. To get access to the entire dataset, all of the endpoints must be consumed. While Flight streams are not necessarily ordered, we provide for application-defined metadata which can be used to serialize ordering information.

This multiple-endpoint pattern has a number of benefits:

  • Endpoints can be read by clients in parallel.
  • The service that serves the GetFlightInfo “planning” request can delegate work to sibling services to take advantage of data locality or simply to help with load balancing.
  • Nodes in a distributed cluster can take on different roles. For example, a subset of nodes might be responsible for planning queries while other nodes exclusively fulfill data stream (DoGet or DoPut) requests.

Here is an example diagram of a multi-node architecture with split service roles:

Flight Complex Architecture

Actions: Extending Flight with application business logic

While the GetFlightInfo request supports sending opaque serialized commands when requesting a dataset, a client may need to ask a server to perform other kinds of operations. For example, a client may request that a particular dataset be “pinned” in memory so that subsequent requests from other clients are served faster.

A Flight service can thus optionally define “actions” which are carried out by the DoAction RPC. An action request contains the name of the action being performed and optional serialized data containing further needed information. The result of an action is a gRPC stream of opaque binary results.

Some example actions:

  • Metadata discovery, beyond the capabilities provided by the built-in ListFlights RPC
  • Setting session-specific parameters and settings

Note that it is not required for a server to implement any actions, and actions need not return results.

Encryption and Authentication

Flight supports encryption out of the box using gRPC’s built-in TLS/OpenSSL capabilities.

For authentication, there are extensible authentication handlers for the client and server that permit simple authentication schemes (like user and password) as well as more involved authentication such as Kerberos. The Flight protocol comes with a built-in BasicAuth so that user/password authentication can be implemented out of the box without custom development.

Middleware and Tracing

gRPC has the concept of “interceptors”, which have allowed us to support developer-defined “middleware” that can provide instrumentation of, or telemetry for, incoming and outgoing requests. One framework for such instrumentation is OpenTracing.

Note that middleware functionality is one of the newest areas of the project and is only currently available in the project’s master branch.

gRPC, but not only gRPC

We specify server locations for DoGet requests using RFC 3986 compliant URIs. For example, TLS-secured gRPC may be specified like grpc+tls://$HOST:$PORT.

While we think that using gRPC for the “command” layer of Flight servers makes sense, we may wish to support data transport layers other than TCP such as RDMA. While some design and development work is required to make this possible, the idea is that gRPC could be used to coordinate get and put transfers which may be carried out on protocols other than TCP.

Getting Started and What’s Next

Documentation for Flight users is a work in progress, but the libraries themselves are mature enough for beta users that are tolerant of some minor API or protocol changes over the coming year.

One of the easiest ways to experiment with Flight is using the Python API, since custom servers and clients can be defined entirely in Python without any compilation required. You can see an example Flight client and server in Python in the Arrow codebase.

In real-world use, Dremio has developed an Arrow Flight-based connector which has been shown to deliver 20-50x better performance over ODBC. For Apache Spark users, Arrow contributor Ryan Murray has created a data source implementation to connect to Flight-enabled endpoints.

As far as “what’s next” in Flight, support for non-gRPC (or non-TCP) data transport may be an interesting direction of research and development work. A lot of the Flight work from here will be creating user-facing Flight-enabled services. Since Flight is a development framework, we expect that user-facing APIs will utilize a layer of API veneer that hides many general Flight details and details related to a particular application of Flight in a custom data service.

Apache Arrow 0.15.0 Release

Published 06 Oct 2019
By The Apache Arrow PMC (pmc)

The Apache Arrow team is pleased to announce the 0.15.0 release. This covers about 3 months of development work and includes 687 resolved issues from 80 distinct contributors. See the Install Page to learn how to get the libraries for your platform. The complete changelog is also available.

About a third of issues closed (240) were classified as bug fixes, so this release brings many stability, memory use, and performance improvements over 0.14.x. We will discuss some of the language-specific improvements and new features below.

New committers

Since the 0.14.0 release, we’ve added four new committers:

In addition, Sebastien Binet and Micah Kornfield have joined the PMC.

Thank you for all your contributions!

Columnar Format Notes

The format gains new datatypes: LargeList (ARROW-4810), LargeBinary, and LargeString (ARROW-750). LargeList is similar to List but with 64-bit offsets instead of 32-bit. The same relationship holds for LargeBinary and LargeString with respect to Binary and String.

Since the last major release, we have also made a significant overhaul of the columnar format documentation to be clearer and easier to follow for implementation creators.

Upcoming Columnar Format Stability and Library / Format Version Split

The Arrow community has decided to make a 1.0.0 release of the project marking formal stability of the columnar format and binary protocol, including explicit forward and backward compatibility guarantees. You can read about these guarantees in the new documentation page about versioning.

Starting with 1.0.0, we will give the columnar format and libraries separate version numbers. This will allow the library versions to evolve without creating confusion or uncertainty about whether the Arrow columnar format remains stable or not.

Columnar “Streaming Protocol” Change since 0.14.0

Since 0.14.0 we have modified the IPC “encapsulated message” format to insert 4 bytes of additional data in the message preamble to ensure that the Flatbuffers metadata starts on an aligned offset. By default, IPC streams generated by 0.15.0 and later will not be readable by library versions 0.14.1 and prior. Implementations have offered options to write messages using the now “legacy” message format.

For users who cannot upgrade to version 0.15.0 in all parts of their system, such as Apache Spark users, we recommend one of the two routes:

  • If using pyarrow, set the environment variable ARROW_PRE_0_15_IPC_FORMAT=1 when using 0.15.0 and sending data to an old library
  • Wait to upgrade all components simultaneously

We do not anticipate making this kind of change again in the near future and would not have made such a non-forward-compatible change unless we deemed it very important.

Arrow Flight notes

A GetFlightSchema method is added to the Flight RPC protocol (ARROW-6094). As the name suggests, it returns the schema for a given Flight descriptor on the server. This is useful for cases where the Flight locations are not immediately available, depending on the server implementation.

Flight implementations for C++ and Java now implement half-closed semantics for DoPut (ARROW-6063). The client can close the writing end of the stream to signal that it has finished sending the Flight data, but still receive the batch-specific response and its associated metadata.

C++ notes

C++ now supports the LargeList, LargeBinary and LargeString datatypes.

The Status class gains the ability to carry additional subsystem-specific data, in the form of an opaque StatusDetail interface (ARROW-4036). This makes it possible, for example, to store not only an exception message coming from Python but the actual Python exception object, so that it can be raised again if the Status is propagated back to Python. It also enables consumers of a Status to inspect subsystem-specific errors, such as a finer-grained Flight error code.

DataType and Schema equality are significantly faster (ARROW-6038).

The Column class is completely removed, as it did not have a strong enough motivation for existing between ChunkedArray and RecordBatch / Table (ARROW-5893).

C++: Parquet

The 0.15 release includes many improvements to the Apache Parquet C++ internals, resulting in greatly improved read and write performance. We described the work and published some benchmarks in a recent blog post.

C++: CSV reader

The CSV reader is now more flexible in terms of how column names are chosen (ARROW-6231) and column selection (ARROW-5977).

C++: Memory Allocation Layer

Arrow now has the option to allocate memory using the mimalloc memory allocator. jemalloc is still preferred for best performance, but mimalloc is a reasonable alternative to the system allocator on Windows where jemalloc is not currently supported.

Also, we now expose explicit global functions to get a MemoryPool for each of the jemalloc allocator, mimalloc allocator and system allocator (ARROW-6292).

The vendored jemalloc version is bumped from 4.5.x to 5.2.x (ARROW-6549). Performance characteristics may differ on memory allocation-heavy workloads, though we did not notice any significant regression on our suite of micro-benchmarks (and a multi-threaded benchmark of reading a CSV file showed a 25% speedup).

C++: Filesystem layer

A FileSystem implementation to access Amazon S3-compatible filesystems is now available. It depends on the AWS SDK for C++.

C++: I/O layer

Significant improvements were made to the Arrow I/O stack.

  • ARROW-6180: Add RandomAccessFile::GetStream that returns an InputStream over a fixed subset of the file.
  • ARROW-6381: Improve performance of small writes with BufferOutputStream
  • ARROW-2490: Streamline concurrency semantics of InputStream implementations, and add debug checks for race conditions between non-thread-safe InputStream operations.
  • ARROW-6527: Add an OutputStream::Write overload that takes an owned Buffer rather than a raw memory area. This allows OutputStream implementations to safely implement delayed writing without having to copy the data.

C++: Tensors

This release brings three improvements to Tensor and SparseTensor.

  • Add a Tensor::Value template function for element access
  • Add EqualOptions support to Tensor::Equals, allowing control over how two floating-point tensors are compared
  • Add support for smaller bit-width indices in SparseTensor

C# Notes

We have fixed some bugs causing incompatibilities between C# and other Arrow implementations.

Java notes

  • Added an initial version of an Avro adapter
  • To improve JDBC adapter performance, refactored the data-consumption logic and implemented an iterator API to avoid loading all data into a single vector
  • Implemented subField encoding for complex types; subField encoding is now available for List and Struct vectors
  • Implemented visitor API for vector/range/type/approx equals compare
  • Performed a lot of optimization and refactoring for DictionaryEncoder, supporting all data types and avoiding memory copy via hash table and visitor API
  • Introduced Error Prone into code base to catch more potential errors earlier
  • Fixed the bug where dictionary entries were required in IPC streams even when empty; readers can now also read interleaved messages

Python notes

The FileSystem API, implemented in C++, is now available in Python (ARROW-5494).

The API for extension types has been streamlined, and defining custom extension types in Python is now more powerful (ARROW-5610).

Sparse tensors are now available in Python (ARROW-4453).

A potential crash when handling Python dates and datetimes was fixed (ARROW-6597).

Based on a mailing list discussion, we are looking for help with maintaining our Python wheels. Community members have found that the wheels take up a great deal of maintenance time, so if you or your organization depend on pip install pyarrow working, we would appreciate your assistance.

Ruby and C GLib notes

Ruby and C GLib continues to follow the features in the C++ project. Ruby includes the following backward incompatible changes.

  • Remove Arrow::Struct and use Hash instead.
  • Add Arrow::Time for Arrow::Time{32,64}DataType value.
  • Arrow::Decimal128Array#get_value returns BigDecimal.

Ruby improves the performance of Arrow#values.

Rust notes

A number of core Arrow improvements were made to the Rust library.

  • Add explicit SIMD vectorization for the divide kernel
  • Add a feature to disable SIMD
  • Use “if cfg!” pattern
  • Optimizations to BooleanBufferBuilder::append_slice
  • Implemented Debug trait for List/Struct/BinaryArray

Improvements related to Rust Parquet and DataFusion are detailed next.

Rust Parquet

  • Implement Arrow record reader
  • Add a converter that transforms the record reader’s output into Arrow primitive arrays

Rust DataFusion

  • Preview of new query execution engine using an extensible trait-based physical execution plan that supports parallel execution using threads
  • ExecutionContext now has a register_parquet convenience method for registering Parquet data sources
  • Fixed bug in type coercion optimizer rule
  • TableProvider.scan() now returns a thread-safe BatchIterator
  • Remove use of bare trait objects (switched to using dyn syntax)
  • Adds casting from unsigned to signed integer data types

R notes

A major development since the 0.14 release was the arrival of the arrow R package on CRAN. We wrote about this in August on the Arrow blog. In addition to the package availability on CRAN, we also published package documentation on the Arrow website.

The 0.15 R package includes many of the enhancements in the C++ library release, such as the Parquet performance improvements and the FileSystem API. In addition, there are a number of upgrades that make it easier to read and write data, specify types and schema, and interact with Arrow tables and record batches in R.

For more on what’s in the 0.15 R package, see the changelog.

Community Discussions Ongoing

There are a number of active discussions ongoing on the developer mailing list. We look forward to hearing from the community there.

Faster C++ Apache Parquet performance on dictionary-encoded string data coming in Apache Arrow 0.15

Published 05 Sep 2019
By Wes McKinney (wesm)

We have been implementing a series of optimizations in the Apache Parquet C++ internals to improve read and write efficiency (both performance and memory use) for Arrow columnar binary and string data, with new “native” support for Arrow’s dictionary types. This should have a big impact on users of the C++, MATLAB, Python, R, and Ruby interfaces to Parquet files.

This post reviews work that was done and shows benchmarks comparing Arrow 0.12.1 with the current development version (to be released soon as Arrow 0.15.0).

Summary of work

One of the largest and most complex optimizations involves encoding and decoding Parquet files’ internal dictionary-encoded data streams to and from Arrow’s in-memory dictionary-encoded DictionaryArray representation. Dictionary encoding is a compression strategy in Parquet, and there is no formal “dictionary” or “categorical” type. I will go into more detail about this below.

Some of the particular JIRA issues related to this work include:

  • Vectorize comparators for computing statistics (PARQUET-1523)
  • Read binary data directly into dictionary builder (ARROW-3769)
  • Write Parquet’s dictionary indices directly into dictionary builder (ARROW-3772)
  • Write dense (non-dictionary) Arrow arrays directly into Parquet data encoders (ARROW-6152)
  • Direct writing of arrow::DictionaryArray to Parquet column writers (ARROW-3246)
  • Support changing dictionaries (ARROW-3144)
  • Internal IO optimizations and improved raw BYTE_ARRAY encoding performance (ARROW-4398)

One of the challenges of developing the Parquet C++ library is that we maintain low-level read and write APIs that do not involve the Arrow columnar data structures. So we have had to take care to implement Arrow-related optimizations without impacting non-Arrow Parquet users, a group that includes database systems like ClickHouse and Vertica.

Background: how Parquet files do dictionary encoding

Many direct and indirect users of Apache Arrow use dictionary encoding to improve performance and memory use on binary or string data types that include many repeated values. MATLAB or pandas users will know this as the Categorical type (see MATLAB docs or pandas docs) while in R such encoding is known as factor. In the Arrow C++ library and various bindings we have the DictionaryArray object for representing such data in memory.

For example, an array such as

['apple', 'orange', 'apple', NULL, 'orange', 'orange']

has dictionary-encoded form

dictionary: ['apple', 'orange']
indices: [0, 1, 0, NULL, 1, 1]

The Parquet format uses dictionary encoding to compress data, and it is used for all Parquet data types, not just binary or string data. Parquet further uses bit-packing and run-length encoding (RLE) to compress the dictionary indices, so if you had data like

['apple', 'apple', 'apple', 'apple', 'apple', 'apple', 'orange']

the indices would be encoded like

[rle-run=(6, 0),
 rle-run=(1, 1)]

The full details of the rle-bitpacking encoding are found in the Parquet specification.
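
As a toy illustration of the run-length idea only (not Parquet's actual bit-packed encoding), each run is a (count, value) pair that expands back into repeated indices:

```python
def decode_rle(runs):
    """Expand (count, value) run pairs back into a flat list of indices."""
    out = []
    for count, value in runs:
        out.extend([value] * count)
    return out

print(decode_rle([(6, 0), (1, 1)]))  # [0, 0, 0, 0, 0, 0, 1]
```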

When writing a Parquet file, most implementations will use dictionary encoding to compress a column until the dictionary itself reaches a certain size threshold, usually around 1 megabyte. At this point, the column writer will “fall back” to PLAIN encoding where values are written end-to-end in “data pages” and then usually compressed with Snappy or Gzip. See the following rough diagram:

Internal ColumnChunk structure

Faster reading and writing of dictionary-encoded data

When reading a Parquet file, the dictionary-encoded portions are usually materialized to their non-dictionary-encoded form, causing binary or string values to be duplicated in memory. So an obvious (but not trivial) optimization is to skip this “dense” materialization. There are several issues to deal with:

  • A Parquet file often contains multiple ColumnChunks for each semantic column, and the dictionary values may be different in each ColumnChunk
  • We must gracefully handle the “fall back” portion which is not dictionary-encoded

We pursued several avenues to help with this:

  • Allowing each DictionaryArray to have a different dictionary (before, the dictionary was part of the DictionaryType, which caused problems)
  • We enabled the Parquet dictionary indices to be directly written into an Arrow DictionaryBuilder without rehashing the data
  • When decoding a ColumnChunk, we first append the dictionary values and indices into an Arrow DictionaryBuilder, and when we encounter the “fall back” portion we use a hash table to convert those values to dictionary-encoded form
  • We override the “fall back” logic when writing a ColumnChunk from a DictionaryArray so that reading such data back is more efficient

All of these things together have produced some excellent performance results that we will detail below.

The other class of optimizations we implemented was removing an abstraction layer between the low-level Parquet column data encoder and decoder classes and the Arrow columnar data structures. This involves:

  • Adding ColumnWriter::WriteArrow and Encoder::Put methods that accept arrow::Array objects directly
  • Adding ByteArrayDecoder::DecodeArrow method to decode binary data directly into an arrow::BinaryBuilder.

While the performance improvements from this work are less dramatic than for dictionary-encoded data, they are still meaningful in real-world applications.

Performance Benchmarks

We ran some benchmarks comparing Arrow 0.12.1 with the current master branch. We construct two kinds of Arrow tables with 10 columns each:

  • “Low cardinality” and “high cardinality” variants. The “low cardinality” case has 1,000 unique string values of 32 bytes each. The “high cardinality” case has 100,000 unique values
  • “Dense” (non-dictionary) and “Dictionary” variants

See the full benchmark script.

We show both single-threaded and multithreaded read performance. The test machine is an Intel i9-9960X using gcc 8.3.0 (on Ubuntu 18.04) with 16 physical cores and 32 virtual cores. All time measurements are reported in seconds, but we are most interested in showing the relative performance.

First, the writing benchmarks:

Parquet write benchmarks

Writing DictionaryArray is dramatically faster due to the optimizations described above. We have achieved a small improvement in writing dense (non-dictionary) binary arrays.

Then, the reading benchmarks:

Parquet read benchmarks

Here, similarly, reading DictionaryArray directly is many times faster.

These benchmarks show that parallel reads of dense binary data may be slightly slower, though single-threaded reads are now faster. We may want to do some profiling to see what we can do to bring read performance back in line. Optimizing the dense read path has not been as much of a priority as the dictionary read path in this work.

Memory Use Improvements

In addition to faster performance, reading columns as dictionary-encoded can yield significantly less memory use.

In the dict-random case above, we found that the master branch uses 405 MB of RAM at peak while loading a 152 MB dataset. In v0.12.1, loading the same Parquet file without the accelerated dictionary support uses 1.94 GB of peak memory while the resulting non-dictionary table occupies 1.01 GB.

Note that we had a memory overuse bug in versions 0.14.0 and 0.14.1 fixed in ARROW-6060, so if you are hitting this bug you will want to upgrade to 0.15.0 as soon as it comes out.


There are still many Parquet-related optimizations that we may pursue in the future, but the ones here can be very helpful to people working with string-heavy datasets, both in performance and memory use. If you’d like to discuss this development work, we’d be glad to hear from you on our developer mailing list.

Apache Arrow R Package On CRAN

Published 08 Aug 2019
By Neal Richardson (npr)

We are very excited to announce that the arrow R package is now available on CRAN.

Apache Arrow is a cross-language development platform for in-memory data that specifies a standardized columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware. The arrow package provides an R interface to the Arrow C++ library, including support for working with Parquet and Feather files, as well as lower-level access to Arrow memory and messages.

You can install the package from CRAN with

install.packages("arrow")

On macOS and Windows, installing a binary package from CRAN will generally handle Arrow’s C++ dependencies for you. However, the macOS CRAN binaries are unfortunately incomplete for this version, so to install 0.14.1, you’ll first need to use Homebrew to get the Arrow C++ library (brew install apache-arrow), and then from R you can install.packages("arrow", type = "source").

Windows binaries are not yet available on CRAN but should be published soon.

On Linux, you’ll need to first install the C++ library. See the Arrow project installation page to find pre-compiled binary packages for some common Linux distributions, including Debian, Ubuntu, and CentOS. You’ll need to install libparquet-dev on Debian and Ubuntu, or parquet-devel on CentOS. This will also automatically install the Arrow C++ library as a dependency. Other Linux distributions must install the C++ library from source.

If you install the arrow R package from source and the C++ library is not found, the R package functions will notify you that Arrow is not available. Call

install_arrow()

for version- and platform-specific guidance on installing the Arrow C++ library.

Parquet files

This package introduces basic read and write support for the Apache Parquet columnar data file format. Prior to its availability, options for accessing Parquet data in R were limited; the most common recommendation was to use Apache Spark. The arrow package greatly simplifies this access and lets you go from a Parquet file to a data.frame and back easily, without having to set up a database.

df <- read_parquet("path/to/file.parquet")

This function, along with the other readers in the package, takes an optional col_select argument, inspired by the vroom package. This argument lets you use the “tidyselect” helper functions, as you can do in dplyr::select(), to specify that you only want to keep certain columns. By narrowing your selection at read time, you can load a data.frame with less memory overhead.

For example, suppose you had written the iris dataset to Parquet. You could read a data.frame with only the columns c("Sepal.Length", "Sepal.Width") by doing

df <- read_parquet("iris.parquet", col_select = starts_with("Sepal"))

Just as you can read, you can write Parquet files:

write_parquet(df, "path/to/different_file.parquet")

Note that this read and write support for Parquet files in R is in its early stages of development. The Python Arrow library (pyarrow) still has much richer support for Parquet files, including working with multi-file datasets. We intend to reach feature equivalency between the R and Python packages in the future.

Feather files

This package also includes a faster and more robust implementation of the Feather file format, providing read_feather() and write_feather(). Feather was one of the initial applications of Apache Arrow for Python and R, providing an efficient, common file format for language-agnostic data frame storage, along with implementations in R and Python.

As Arrow progressed, development of Feather moved to the apache/arrow project, and for the last two years, the Python implementation of Feather has just been a wrapper around pyarrow. This meant that as Arrow progressed and bugs were fixed, the Python version of Feather got the improvements but sadly R did not.

With the arrow package, the R implementation of Feather catches up and now depends on the same underlying C++ library as the Python version does. This should result in more reliable and consistent behavior across the two languages, as well as improved performance.

We encourage all R users of feather to switch to using arrow::read_feather() and arrow::write_feather().

Note that both Feather and Parquet are columnar data formats that allow sharing data frames across R, Pandas, and other tools. When should you use Feather and when should you use Parquet? Parquet balances space-efficiency with deserialization costs, making it an ideal choice for remote storage systems like HDFS or Amazon S3. Feather is designed for fast local reads, particularly with solid-state drives, and is not intended for use with remote storage systems. Feather files can be memory-mapped and accessed as Arrow columnar data in-memory without any deserialization while Parquet files always must be decompressed and decoded. See the Arrow project FAQ for more.

Other capabilities

In addition to these readers and writers, the arrow package has wrappers for other readers in the C++ library; see ?read_csv_arrow and ?read_json_arrow. These readers are being developed to optimize for the memory layout of the Arrow columnar format and are not intended as a direct replacement for existing R CSV readers (base::read.csv, readr::read_csv, data.table::fread) that return an R data.frame.

It also provides many lower-level bindings to the C++ library, which enable you to access and manipulate Arrow objects. You can use these to build connectors to other applications and services that use Arrow. One example is Spark: the sparklyr package has support for using Arrow to move data to and from Spark, yielding significant performance gains.


In addition to the work on wiring the R package up to the Arrow Parquet C++ library, a lot of effort went into building and packaging Arrow for R users, ensuring its ease of installation across platforms. We’d like to thank the support of Jeroen Ooms, Javier Luraschi, JJ Allaire, Davis Vaughan, the CRAN team, and many others in the Apache Arrow community for helping us get to this point.

Apache Arrow 0.14.0 Release

Published 02 Jul 2019
By The Apache Arrow PMC (pmc)

The Apache Arrow team is pleased to announce the 0.14.0 release. This covers 3 months of development work and includes 602 resolved issues from 75 distinct contributors. See the Install Page to learn how to get the libraries for your platform. The complete changelog is also available.

This post will give some brief highlights in the project since the 0.13.0 release from April.

New committers

Since the 0.13.0 release, the following have been added:

Thank you for all your contributions!

Upcoming 1.0.0 Format Stability Release

We are planning for our next major release to move from 0.14.0 to 1.0.0. The major version number will indicate stability of the Arrow columnar format and binary protocol. While the format has already been stable since December 2017, we believe it is a good idea to make this stability official and to indicate that it is safe to persist serialized Arrow data in applications. This means that applications will be able to safely upgrade to new Arrow versions without having to worry about backwards incompatibilities. We will write in a future blog post about the stability guarantees we intend to provide to help application developers plan accordingly.


We added support for the following platforms:

  • Debian GNU/Linux buster
  • Ubuntu 19.04

We dropped support for Ubuntu 14.04.

Development Infrastructure and Tooling

As the project has grown larger and more diverse, we are increasingly outgrowing what we can test in public continuous integration services like Travis CI and Appveyor. In addition, we share these resources with the entire Apache Software Foundation, and given the high volume of pull requests into Apache Arrow, maintainers are frequently waiting many hours for the green light to merge patches.

The complexity of our testing is driven by the number of different components and programming languages as well as increasingly long compilation and test execution times as individual libraries grow larger. The 50 minute time limit of public CI services is simply too limited to comprehensively test the project. Additionally, the CI host machines are constrained in their features and memory limits, preventing us from testing features that are only relevant on large amounts of data (10GB or more) or functionality that requires a CUDA-enabled GPU.

Organizations that contribute to Apache Arrow are working on physical build infrastructure and tools to improve build times and build scalability. One such new tool is ursabot, a GitHub-enabled bot that can be used to trigger builds either on physical hardware or in the cloud. It can also be used to trigger benchmark timing comparisons. If you are contributing to the project, you may see Ursabot being employed to trigger tests in pull requests.

To help assist with migrating away from Travis CI, we are also working to make as many of our builds reproducible with Docker and not reliant on Travis CI-specific configuration details. This will also help contributors reproduce build failures locally without having to wait for Travis CI.

Columnar Format Notes

  • User-defined “extension” types have been formalized in the Arrow format, enabling library users to embed custom data types in the Arrow columnar format. Initial support is available in C++, Java, and Python.
  • A new Duration logical type was added to represent absolute lengths of time.

Arrow Flight notes

Flight now supports many of the features of a complete RPC framework.

  • Authentication APIs are now supported across all languages (ARROW-5137)
  • Encrypted communication using OpenSSL is supported (ARROW-5643, ARROW-5529)
  • Clients can specify timeouts on remote calls (ARROW-5136)
  • On the protocol level, endpoints are now identified with URIs, to support an open-ended number of potential transports (including TLS and Unix sockets, and perhaps even non-gRPC-based transports in the future) (ARROW-4651)
  • Application-defined metadata can be sent alongside data (ARROW-4626, ARROW-4627).

Windows is now a supported platform for Flight in C++ and Python (ARROW-3294), and Python wheels including Flight are shipped for all platforms (ARROW-3150, ARROW-5656). C++, Python, and Java have been brought to parity, now that actions can return streaming results in Java (ARROW-5254).

C++ notes

There were 188 resolved issues related to the C++ implementation, so we summarize only some of the work here.

General platform improvements

  • A FileSystem abstraction (ARROW-767) has been added, which paves the way for a future Arrow Datasets library allowing access to sharded data on arbitrary storage systems, including remote or cloud storage. A first draft of the Datasets API was committed in ARROW-5512. Right now, this comes with no implementation, but we expect to slowly build it up in the coming weeks or months. Early feedback is welcome on this API.
  • The dictionary API has been reworked in ARROW-3144. The dictionary values used to be tied to the DictionaryType instance, which ended up too inflexible. Since dictionary-encoding is more often an optimization than a semantic property of the data, we decided to move the dictionary values to the ArrayData structure, making it natural for dictionary-encoded arrays to share the same DataType instance, regardless of the encoding details.
  • The FixedSizeList and Map types have been implemented, including in integration tests. The Map type is akin to a List of Struct(key, value) entries, but making it explicit that the underlying data has key-value mapping semantics. Also, map entries are always non-null.
  • A Result<T> class has been introduced in ARROW-4800. The aim is to allow returning an error as well as a function’s logical result without resorting to pointer-out arguments.
  • The Parquet C++ library has been refactored to use common Arrow IO classes for improved C++ platform interoperability.

Line-delimited JSON reader

A multithreaded line-delimited JSON reader (powered internally by RapidJSON) is now available for use (also in Python and R via bindings). This will likely be expanded to support more kinds of JSON storage in the future.

New computational kernels

A number of new computational kernels have been developed:

  • Compare filter for logical comparisons yielding boolean arrays
  • Filter kernel for selecting elements of an input array according to a boolean selection array.
  • Take kernel, which selects elements by integer index, has been expanded to support nested types

C# Notes

The native C# implementation has continued to mature since 0.13. This release includes a number of performance, memory use, and usability improvements.

Go notes

Go’s support for the Arrow columnar format continues to expand. Go now supports reading and writing the Arrow columnar binary protocol, and it has also been added to the cross language integration tests. There are now four languages (C++, Go, Java, and JavaScript) included in our integration tests to verify cross-language interoperability.

Java notes

  • Support for referencing arbitrary memory using ArrowBuf has been implemented, paving the way for memory map support in Java
  • A number of performance improvements around vector value access were added (see ARROW-5264, ARROW-5290).
  • The Map type has been implemented in Java and integration tested with C++
  • Several microbenchmarks have been added and improved, including a significant speed-up of zeroing out buffers.
  • A new algorithms package has been started to contain reference implementations of common algorithms. The initial contribution is for Array/Vector sorting.

JavaScript Notes

A new incremental array builder API is available.


MATLAB notes

Version 0.14.0 features improved Feather file support in the MEX bindings.

Python notes

  • We fixed a problem that caused the Python wheels to be much larger in 0.13.0 than they were in 0.12.0. Since the introduction of LLVM into our build toolchain, however, the wheels remain significantly bigger than before. We are interested in approaches to enable pyarrow to be installed in pieces with pip or conda rather than monolithically.
  • It is now possible to define ExtensionTypes with a Python implementation (ARROW-840). Those ExtensionTypes can survive a roundtrip through C++ and serialization.
  • The Flight improvements highlighted above (see C++ notes) are all available from Python. Furthermore, Flight is now bundled in our binary wheels and conda packages for Linux, Windows and macOS (ARROW-3150, ARROW-5656).
  • We will build “manylinux2010” binary wheels for Linux systems, in addition to “manylinux1” wheels (ARROW-2461). Manylinux2010 is a newer standard for more recent systems, with less limiting toolchain constraints. Installing manylinux2010 wheels requires an up-to-date version of pip.
  • Various bug fixes for CSV reading in Python and C++ including the ability to parse Decimal(x, y) columns.

Parquet improvements

  • Column statistics for logical types like unicode strings, unsigned integers, and timestamps are cast to compatible Python types (see ARROW-4139)
  • It’s now possible to configure “data page” sizes when writing a file from Python

Ruby and C GLib notes

The GLib and Ruby bindings have been tracking features in the C++ project. This release includes bindings for Gandiva, JSON reader, and other C++ features.

Rust notes

There is ongoing work in Rust happening on Parquet file support, computational kernels, and the DataFusion query engine. See the full changelog for details.

R notes

We have been working on build and packaging for R so that community members can hopefully release the project to CRAN in the near future. Feature development for R has continued to follow the upstream C++ project.

Community Discussions Ongoing

There are a number of active discussions ongoing on the developer mailing list. We look forward to hearing from the community there:

Apache Arrow 0.13.0 Release

Published 02 Apr 2019
By Wes McKinney (wesm)

The Apache Arrow team is pleased to announce the 0.13.0 release. This covers more than 2 months of development work and includes 550 resolved issues from 81 distinct contributors.

See the Install Page to learn how to get the libraries for your platform. The complete changelog is also available.

While it’s a large release, this post will give some brief highlights in the project since the 0.12.0 release from January.

New committers and PMC member

The Arrow team is growing! Since the 0.12.0 release we have increased the size of our committer and PMC rosters.

Thank you for all your contributions!

Rust DataFusion Query Engine donation

Since the last release, we received a donation of DataFusion, a Rust-native query engine for the Arrow columnar format, whose development had previously been led by Andy Grove. Read more about DataFusion in our February blog post.

This is an exciting development for the Rust community, and we look forward to developing more analytical query processing within the Apache Arrow project.

Arrow Flight gRPC progress

Over the last couple months, we have made significant progress on Arrow Flight, an Arrow-native data messaging framework. We have integration tests to check C++ and Java compatibility, and we have added Python bindings for the C++ library. We will write a future blog post to go into more detail about how Flight works.

C++ notes

There were 231 issues relating to C++ in this release, far too many to summarize in a blog post. Some notable items include:

  • An experimental ExtensionType was developed for creating user-defined data types that can be embedded in the Arrow binary protocol. This is not yet finalized, but feedback would be welcome.
  • We have undertaken a significant reworking of our CMake build system for C++ to make the third party dependencies more configurable. Among other things, this eases work on packaging for Linux distributions. Read more about this in the C++ developer documentation.
  • Laying more groundwork for an Arrow-native in-memory query engine
  • We began building a reader for line-delimited JSON files
  • Gandiva can now be compiled on Windows with Visual Studio

C# Notes

C# .NET development has picked up since the initial code donation last fall. 11 issues were resolved this release cycle.

The Arrow C# package is now available via NuGet.

Go notes

8 Go-related issues were resolved. A notable feature is the addition of a CSV file writer.

Java notes

26 Java issues were resolved. Outside of Flight-related work, some notable items include:

  • Migration to Java 8 date and time APIs from Joda
  • Array type support in JDBC adapter

JavaScript Notes

The recent JavaScript 0.4.1 release is the last JavaScript-only release of Apache Arrow. Starting with 0.13, the JavaScript implementation is included in mainline Arrow releases! The version number of the released JavaScript packages will now be in sync with the mainline version number.

Python notes

86 Python-related issues were resolved. Some highlights include:

  • The Gandiva LLVM expression compiler is now available in the Python wheels through the pyarrow.gandiva module.
  • Flight RPC bindings
  • Improved pandas serialization performance with RangeIndex
  • pyarrow can be used without pandas installed

Note that Apache Arrow will continue to support Python 2.7 until January 2020.

Ruby and C GLib notes

36 C/GLib- and Ruby-related issues were resolved. The work continues to follow the upstream work in the C++ project.

  • Arrow::RecordBatch#raw_records was added. It can convert a record batch to a Ruby array 10x-200x faster than the same conversion by a pure-Ruby implementation.

Rust notes

69 Rust-related issues were resolved. Many of these relate to ongoing work in the DataFusion query engine. Some notable items include:

  • Date/time support
  • SIMD for arithmetic operations
  • Writing CSV and reading line-delimited JSON
  • Parquet data source support for DataFusion
  • Prototype DataFrame-style API for DataFusion
  • Continued evolution of Parquet file reader

R development progress

The Arrow R developers have expanded the scope of the R language bindings and additionally worked on packaging support to be able to submit the package to CRAN in the near future. 23 issues were resolved for this release.

We wrote in January about ongoing work to accelerate R work on Apache Spark using Arrow.

Community Discussions Ongoing

There are a number of active discussions ongoing on the developer mailing list. We look forward to hearing from the community there:

  • Benchmarking: we are working to create tools for tracking all of our benchmark results on a commit-by-commit basis in a centralized database schema so that we can monitor for performance regressions over time. We hope to develop a publicly viewable benchmark result dashboard.
  • C++ Datasets: development of a unified API for reading and writing datasets stored in various common formats like Parquet, JSON, and CSV.
  • C++ Query Engine: architecture of a parallel Arrow-native query engine for C++
  • Arrow Flight Evolution: adding features to support different real-world data messaging use cases
  • Arrow Columnar Format evolution: we are discussing a new “duration” or “time interval” type and some other additions to the Arrow columnar format.

Reducing Python String Memory Use in Apache Arrow 0.12

Published 05 Feb 2019
By Wes McKinney (wesm)

Python users who upgrade to recently released pyarrow 0.12 may find that their applications use significantly less memory when converting Arrow string data to pandas format. This includes using pyarrow.parquet.read_table and pandas.read_parquet. This article details some of what is going on under the hood, and why Python applications dealing with large amounts of strings are prone to memory use problems.

Why Python strings can use a lot of memory

Let’s start with some possibly surprising facts. I’m going to create an empty bytes object and an empty str (unicode) object in Python 3.7:

In [1]: val = b''

In [2]: unicode_val = u''

The sys.getsizeof function accurately reports the number of bytes used by built-in Python objects. You might be surprised to find that:

In [4]: import sys
In [5]: sys.getsizeof(val)
Out[5]: 33

In [6]: sys.getsizeof(unicode_val)
Out[6]: 49

Since strings in Python are nul-terminated, we can infer that a bytes object has 32 bytes of overhead while unicode has 48 bytes. One must also account for PyObject* pointer references to the objects, so the actual overhead is 40 and 56 bytes, respectively. With large strings and text, this overhead may not matter much, but when you have a lot of small strings, such as those arising from reading a CSV or Apache Parquet file, they can take up an unexpected amount of memory. pandas represents strings in NumPy arrays of PyObject* pointers, so the total memory used by a unique unicode string is

8 (PyObject*) + 48 (Python C struct) + string_length + 1

Suppose that we read a CSV file with

  • 1 column
  • 1 million rows
  • Each value in the column is a string with 10 characters

On disk this file would take approximately 10MB. Read into memory, however, it could take up over 60MB, as a 10 character string object takes up 67 bytes in a pandas.Series.
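
The arithmetic can be checked with sys.getsizeof (numbers shown in comments are for 64-bit CPython 3.7):

```python
import sys

val = 'x' * 10                  # a 10-character ASCII string
str_bytes = sys.getsizeof(val)  # 59 on CPython 3.7: 49 bytes of overhead + 10 chars
total = 8 + str_bytes           # plus the 8-byte PyObject* pointer -> 67 bytes
```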

How Apache Arrow represents strings

While a Python unicode string can have 57 bytes of overhead, a string in the Arrow columnar format has only 4 (32 bits) or 4.125 (33 bits) bytes of overhead. 32-bit integer offsets encode the position and size of each string value in a contiguous chunk of memory:

Apache Arrow string memory layout
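To make the offsets idea concrete, here is a pure-Python sketch of the layout (not the pyarrow API; `encode_strings` is a hypothetical helper): encoding a few strings this way yields one contiguous data buffer plus an int32 offsets buffer.

```python
import struct

def encode_strings(values):
    # Mimic Arrow's variable-size binary layout: the bytes for value i live
    # at data[offsets[i]:offsets[i + 1]] in one contiguous buffer.
    data = bytearray()
    offsets = [0]
    for v in values:
        data.extend(v.encode("utf-8"))
        offsets.append(len(data))
    return struct.pack(f"{len(offsets)}i", *offsets), bytes(data)

offsets_buf, data_buf = encode_strings(["foo", "barbaz", "qux"])
print(data_buf)                          # b'foobarbazqux' -- characters side by side
print(struct.unpack("4i", offsets_buf))  # (0, 3, 9, 12) -- 4 bytes per value, plus one
```

Per value, the only cost beyond the characters themselves is one 4-byte offset (plus one validity bit when nulls are tracked).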

When you call table.to_pandas() or array.to_pandas() with pyarrow, we have to convert this compact string representation back to pandas’s Python-based strings. This can use a huge amount of memory when we have a large number of small strings, which is a common occurrence when working with web analytics data, as such data compresses to a compact size when stored in the Parquet columnar file format.

Note that the Arrow string memory format has other benefits beyond memory use. It is also much more efficient for analytics due to the guarantee of data locality; all strings are next to each other in memory. In the case of pandas and Python strings, the string data can be located anywhere in the process heap. Arrow PMC member Uwe Korn did some work to extend pandas with Arrow string arrays for improved performance and memory use.

Reducing pandas memory use when converting from Arrow

For many years, the pandas.read_csv function has relied on a trick to limit the amount of string memory allocated. Because pandas uses arrays of PyObject* pointers to refer to objects in the Python heap, we can avoid creating multiple strings with the same value, instead reusing existing objects and incrementing their reference counts.

Schematically, we have the following:

pandas string memory optimization

In pyarrow 0.12, we have implemented this optimization when calling to_pandas. It requires using a hash table to deduplicate the Arrow string data as it’s being converted to pandas. Hashing data is not free, but counterintuitively it can be faster, in addition to being vastly more memory efficient, in the common case in analytics where we have table columns with many instances of the same string values.
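The same idea can be sketched in pure Python with an ordinary dict serving as the hash table (`deduplicate` is a hypothetical helper; the real implementation lives in pyarrow’s C++ conversion layer):

```python
def deduplicate(raw_values):
    # Reuse one Python object per distinct value instead of materializing a
    # fresh string for every cell, as pyarrow's deduplication does in C++.
    seen = {}
    return [seen.setdefault(v, v) for v in raw_values]

# Build strings at runtime so CPython does not intern them for us
raw = [str(i % 2) * 10 for i in range(6)]
assert raw[0] is not raw[2]    # equal values, but distinct objects

dedup = deduplicate(raw)
assert dedup[0] is dedup[2]    # one shared object per distinct value
```

With only two distinct values, the deduplicated column holds two string objects instead of six; the savings scale with the ratio of total to unique values.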

Memory and Performance Benchmarks

We can use the memory_profiler Python package to easily get process memory usage within a running Python application.

import memory_profiler
def mem():
    return memory_profiler.memory_usage()[0]

In a new application I have:

In [7]: mem()
Out[7]: 86.21875

I will generate approximately 1 gigabyte of string data represented as Python strings with length 10. The pandas.util.testing module has a handy rands function for generating random strings. Here is the data generation function:

from pandas.util.testing import rands
def generate_strings(length, nunique, string_length=10):
    unique_values = [rands(string_length) for i in range(nunique)]
    values = unique_values * (length // nunique)
    return values

This generates a certain number of unique strings, then duplicates them to yield the desired number of total strings. So I’m going to create 100 million strings with only 10000 unique values:

In [8]: values = generate_strings(100000000, 10000)

In [9]: mem()
Out[9]: 852.140625

100 million PyObject* values at 8 bytes each come to about 763 MiB, so this increase of roughly 766 MiB is consistent with what we know so far. Now I’m going to convert this to Arrow format:

In [10]: import pyarrow as pa

In [11]: arr = pa.array(values)

In [12]: mem()
Out[12]: 2276.9609375

Since pyarrow exactly accounts for all of its memory allocations, we also check that

In [13]: pa.total_allocated_bytes()
Out[13]: 1416777280

Since each string takes about 14 bytes (10 bytes plus 4 bytes of overhead), this is what we expect.
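The Arrow-side arithmetic can be checked the same way. This is a sketch: actual allocations also include an extra trailing offset, a validity bitmap, and padding, which is why pa.total_allocated_bytes() reports slightly more.

```python
n = 100_000_000
string_length = 10

data_bytes = string_length * n   # the characters themselves, stored contiguously
offset_bytes = 4 * n             # one int32 offset per value

total = data_bytes + offset_bytes
print(total)        # 1400000000, close to the 1416777280 reported above
print(total / n)    # 14.0 bytes per value
```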

Now, converting arr back to pandas is where things get tricky. The minimum amount of memory that pandas can use is a little under 800 MB, as above, since we need 100 million PyObject* values, which are 8 bytes each.

In [14]: arr_as_pandas = arr.to_pandas()

In [15]: mem()
Out[15]: 3041.78125

Doing the math, we used 765 MB, which seems right. We can disable the string deduplication logic by passing deduplicate_objects=False to to_pandas:

In [16]: arr_as_pandas_no_dedup = arr.to_pandas(deduplicate_objects=False)

In [17]: mem()
Out[17]: 10006.95703125

Without object deduplication, we use 6965 megabytes, or an average of 73 bytes per value. This is a little bit higher than the theoretical size of 67 bytes computed above.

One of the more surprising results is that the new behavior is about twice as fast:

In [18]: %time arr_as_pandas_time = arr.to_pandas()
CPU times: user 2.94 s, sys: 213 ms, total: 3.15 s
Wall time: 3.14 s

In [19]: %time arr_as_pandas_no_dedup_time = arr.to_pandas(deduplicate_objects=False)
CPU times: user 4.19 s, sys: 2.04 s, total: 6.23 s
Wall time: 6.21 s

The reason for this is that creating so many Python objects is more expensive than hashing the 10 byte values and looking them up in a hash table.

Note that when you convert Arrow data with mostly unique values back to pandas, the memory use benefits here won’t have as much of an impact.


In Apache Arrow, our goal is to develop computational tools to operate natively on the cache- and SIMD-friendly efficient Arrow columnar format. In the meantime, though, we recognize that users have legacy applications using the native memory layout of pandas or other analytics tools. We will do our best to provide fast and memory-efficient interoperability with pandas and other popular libraries.

DataFusion: A Rust-native Query Engine for Apache Arrow

Published 04 Feb 2019
By Andy Grove (agrove)

We are excited to announce that DataFusion has been donated to the Apache Arrow project. DataFusion is an in-memory query engine for the Rust implementation of Apache Arrow.

Although DataFusion was started two years ago, it was recently re-implemented to be Arrow-native. It currently has limited capabilities, but it does support SQL queries against iterators of RecordBatch and has support for CSV files. There are plans to add support for Parquet files.

SQL support is limited to projection (SELECT), selection (WHERE), and simple aggregates (MIN, MAX, SUM) with an optional GROUP BY clause.

Supported expressions are identifiers, literals, simple math operations (+, -, *, /), binary expressions (AND, OR), equality and comparison operators (=, !=, <, <=, >=, >), and CAST(expr AS type).


The following example demonstrates running a simple aggregate SQL query against a CSV file.

// create execution context
let mut ctx = ExecutionContext::new();

// define schema for data source (csv file)
let schema = Arc::new(Schema::new(vec![
    Field::new("c1", DataType::Utf8, false),
    Field::new("c2", DataType::UInt32, false),
    Field::new("c3", DataType::Int8, false),
    Field::new("c4", DataType::Int16, false),
    Field::new("c5", DataType::Int32, false),
    Field::new("c6", DataType::Int64, false),
    Field::new("c7", DataType::UInt8, false),
    Field::new("c8", DataType::UInt16, false),
    Field::new("c9", DataType::UInt32, false),
    Field::new("c10", DataType::UInt64, false),
    Field::new("c11", DataType::Float32, false),
    Field::new("c12", DataType::Float64, false),
    Field::new("c13", DataType::Utf8, false),
]));

// register csv file with the execution context
let csv_datasource =
    CsvDataSource::new("test/data/aggregate_test_100.csv", schema.clone(), 1024);
ctx.register_datasource("aggregate_test_100", Rc::new(RefCell::new(csv_datasource)));

let sql = "SELECT c1, MIN(c12), MAX(c12) FROM aggregate_test_100 WHERE c11 > 0.1 AND c11 < 0.9 GROUP BY c1";

// execute the query
let relation = ctx.sql(&sql).unwrap();
let mut results = relation.borrow_mut();

// iterate over the results
while let Some(batch) = results.next().unwrap() {
    println!(
        "RecordBatch has {} rows and {} columns",
        batch.num_rows(),
        batch.num_columns()
    );

    let c1 = batch
        .column(0)
        .as_any()
        .downcast_ref::<BinaryArray>()
        .unwrap();

    let min = batch
        .column(1)
        .as_any()
        .downcast_ref::<Float64Array>()
        .unwrap();

    let max = batch
        .column(2)
        .as_any()
        .downcast_ref::<Float64Array>()
        .unwrap();

    for i in 0..batch.num_rows() {
        let c1_value: String = String::from_utf8(c1.value(i).to_vec()).unwrap();
        println!("{}, Min: {}, Max: {}", c1_value, min.value(i), max.value(i));
    }
}

The roadmap for DataFusion will depend on interest from the Rust community, but here are some of the short term items that are planned:

  • Extending test coverage of the existing functionality
  • Adding support for Parquet data sources
  • Implementing more SQL features such as JOIN, ORDER BY and LIMIT
  • Implementing a DataFrame API as an alternative to SQL
  • Adding support for partitioning and parallel query execution using Rust’s async and await functionality
  • Creating a Docker image to make it easy to use DataFusion as a standalone query tool for interactive and batch queries

Contributors Welcome!

If you are excited about being able to use Rust for data science and would like to contribute to this work then there are many ways to get involved. The simplest way to get started is to try out DataFusion against your own data sources and file bug reports for any issues that you find. You could also check out the current list of issues and have a go at fixing one. You can also join the user mailing list to ask questions.

Speeding up R and Apache Spark using Apache Arrow

Published 25 Jan 2019
By Javier Luraschi

Javier Luraschi is a software engineer at RStudio

Support for Apache Arrow in Apache Spark with R is currently under active development in the sparklyr and SparkR projects. This post explores early, yet promising, performance improvements achieved when using R with Apache Spark, Arrow and sparklyr.


Since this work is under active development, install sparklyr and arrow from GitHub as follows:

devtools::install_github("apache/arrow", subdir = "r", ref = "apache-arrow-0.12.0")
devtools::install_github("rstudio/sparklyr", ref = "apache-arrow-0.12.0")

In this benchmark, we will use dplyr, but similar improvements can be expected from using DBI, or Spark DataFrames in sparklyr. The local Spark connection and dataframe with 10M numeric rows was initialized as follows:


sc <- spark_connect(master = "local", config = list("sparklyr.shell.driver-memory" = "6g"))
data <- data.frame(y = runif(10^7, 0, 1))


Currently, copying data to Spark using sparklyr is performed by persisting data on-disk from R and reading it back from Spark. This was meant to be used for small datasets since there are better tools to transfer data into distributed storage systems. Nevertheless, many users have requested support to transfer more data at fast speeds into Spark.

Using arrow with sparklyr, we can transfer data directly from R to Spark without having to serialize this data in R or persist it on disk.

The following example copies 10M rows from R into Spark using sparklyr, with and without arrow; there is close to a 16x improvement using arrow.

This benchmark uses the microbenchmark R package, which runs code multiple times, provides stats on total execution time and plots each execution time to understand the distribution over each iteration.

microbenchmark::microbenchmark(
  setup = library(arrow),
  arrow_on = {
    sparklyr_df <<- copy_to(sc, data, overwrite = T)
    count(sparklyr_df) %>% collect()
  },
  arrow_off = {
    if ("arrow" %in% .packages()) detach("package:arrow")
    sparklyr_df <<- copy_to(sc, data, overwrite = T)
    count(sparklyr_df) %>% collect()
  },
  times = 10
) %T>% print() %>% ggplot2::autoplot()
 Unit: seconds
      expr       min        lq       mean    median         uq       max neval
  arrow_on  3.011515  4.250025   7.257739  7.273011   8.974331  14.23325    10
 arrow_off 50.051947 68.523081 119.946947 71.898908 138.743419 390.44028    10
Copying data with R into Spark with and without Arrow


Similarly, arrow with sparklyr can now avoid deserializing data in R while collecting data from Spark into R. These improvements are not as significant as those for copying data, since sparklyr already collects data in columnar format.

The following benchmark collects 10M rows from Spark into R and shows that arrow can bring 3x improvements.

microbenchmark::microbenchmark(
  setup = library(arrow),
  arrow_on = {
    collected <- collect(sparklyr_df)
  },
  arrow_off = {
    if ("arrow" %in% .packages()) detach("package:arrow")
    collected <- collect(sparklyr_df)
  },
  times = 10
) %T>% print() %>% ggplot2::autoplot()
Unit: seconds
      expr      min        lq      mean    median        uq       max neval
  arrow_on 4.520593  5.609812  6.154509  5.928099  6.217447  9.432221    10
 arrow_off 7.882841 13.358113 16.670708 16.127704 21.051382 24.373331    10
Collecting data with R from Spark with and without Arrow


Today, custom transformations of data using R functions are performed in sparklyr by moving data in row format from Spark into an R process through a socket connection. Transferring data in row format is inefficient since multiple data types need to be deserialized for each row; the data then gets converted to columnar format (R was originally designed to use columnar data). Once R finishes this computation, the data is converted back to row format, serialized row by row, and sent back to Spark over the socket connection.

Adding support for arrow to sparklyr makes Spark perform the row-format to column-format conversion in parallel in Spark. Data is then transferred through the socket, but no custom serialization takes place. All the R process needs to do is copy this data from the socket into its heap, transform it, and copy it back to the socket connection.

The following example transforms 100K rows with and without arrow enabled; arrow makes transformations with R functions close to 41x faster.

microbenchmark::microbenchmark(
  setup = library(arrow),
  arrow_on = {
    sample_n(sparklyr_df, 10^5) %>% spark_apply(~ .x / 2) %>% count()
  },
  arrow_off = {
    if ("arrow" %in% .packages()) detach("package:arrow")
    sample_n(sparklyr_df, 10^5) %>% spark_apply(~ .x / 2) %>% count()
  },
  times = 10
) %T>% print() %>% ggplot2::autoplot()
Unit: seconds
      expr        min         lq       mean     median         uq        max neval
  arrow_on   3.881293   4.038376   5.136604   4.772739   5.759082   7.873711    10
 arrow_off 178.605733 183.654887 213.296238 227.182018 233.601885 238.877341    10
Transforming data with R in Spark with and without Arrow

Additional benchmarks and fine-tuning parameters can be found under sparklyr /rstudio/sparklyr/pull/1611 and SparkR /apache/spark/pull/22954. We look forward to bringing this feature to the Spark, Arrow and R communities.

Apache Arrow 0.12.0 Release

Published 21 Jan 2019
By Wes McKinney (wesm)

The Apache Arrow team is pleased to announce the 0.12.0 release. This is the largest release yet in the project, covering 3 months of development work and includes 614 resolved issues from 77 distinct contributors.

See the Install Page to learn how to get the libraries for your platform. The complete changelog is also available.

It’s a huge release, but we’ll give some brief highlights and news from the project to help guide you to the parts that may be of interest.

New committers and PMC member

The Arrow team is growing! Since the 0.11.0 release we have added 3 new committers:

We are also pleased to announce that Krisztián Szűcs has been promoted from committer to PMC (Project Management Committee) member.

Thank you for all your contributions!

Code donations

Since the last release, we have received 3 code donations into the Apache project.

We are excited to continue to grow the Apache Arrow development community.

Combined project-level documentation

Since the last release, we have merged the Python and C++ documentation to create a combined project-wide documentation site. There is now some prose documentation about many parts of the C++ library. We intend to keep adding documentation for other parts of Apache Arrow to this site.


We have started providing official APT and Yum repositories for the C++ and GLib (C) packages. See the install document for details.

C++ notes

Much of the C++ development work the last 3 months concerned internal code refactoring and performance improvements. Some user-visible highlights of note:

  • Experimental support for in-memory sparse tensors (or ndarrays), with support for zero-copy IPC
  • Support for building on Alpine Linux
  • Significantly improved hash table utilities, with improved hash table performance in many parts of the library
  • IO library improvements for both read and write buffering
  • A fast trie implementation for string searching
  • Many improvements to the parallel CSV reader in performance and features. See the changelog

Since the LLVM-based Gandiva expression compiler was donated to Apache Arrow during the last release cycle, development there has been moving along. We expect to have Windows support for Gandiva and to ship this in downstream packages (like Python) in the 0.13 release time frame.

Go notes

The Arrow Go development team has been expanding. The Go library has gained support for many missing features from the columnar format as well as semantic constructs like chunked arrays and tables that are used heavily in the C++ project.

GLib and Ruby notes

Development of the GLib-based C bindings and corresponding Ruby interfaces have advanced in lock-step with the C++, Python, and R libraries. In this release, there are many new features in C and Ruby:

  • Compressed file read/write support
  • Support for using the C++ parallel CSV reader
  • Feather file support
  • Gandiva bindings
  • Plasma bindings

Python notes

We fixed a ton of bugs and made many improvements throughout the Python project. Some highlights from the Python side include:

  • Python 3.7 support: wheels and conda packages are now available for Python 3.7
  • Substantially improved memory use when converting string types to pandas format, including when reading Parquet files. Parquet users should notice significantly lower memory use in common use cases
  • Support for reading and writing compressed files; this can be used for CSV files, IPC, or any other form of IO
  • The new pyarrow.input_stream and pyarrow.output_stream functions support read and write buffering. This is analogous to BufferedIOBase from the Python standard library, but the internals are implemented natively in C++.
  • Gandiva (LLVM expression compiler) bindings, though not yet available in pip/conda. Look for this in 0.13.0.
  • Many improvements to Arrow CUDA integration, including interoperability with Numba

R notes

The R library made huge progress in 0.12, with work led by new committer Romain Francois. The R project’s features are not far behind the Python library, and we are hoping to be able to make the R library available to CRAN users for use with Apache Spark or for reading and writing Parquet files over the next quarter.

Users of the feather R library will see significant speed increases in many cases when reading Feather files with the new Arrow R library.

Rust notes

Rust development has been active over the last 3 months; see the changelog for details.

A native Rust implementation of Apache Parquet was just donated to the project, and the community intends to provide a similar level of functionality for reading and writing Parquet files using the Arrow in-memory columnar format as an intermediary.

Upcoming Roadmap, Outlook for 2019

Apache Arrow has become a large, diverse open source project. It is now being used in dozens of downstream open source and commercial projects. Work will be proceeding in many areas in 2019:

  • Development of in-memory query execution engines (e.g. in C++, Rust)
  • Expanded support for reading and writing the Apache Parquet format, and other common data formats like Apache Avro, CSV, JSON, and Apache ORC.
  • New Flight RPC system for fast messaging of Arrow datasets
  • Expanded support in existing programming languages
  • New programming language bindings or native implementations

It promises to be an exciting 2019. We look forward to having you involved in the development community.

Gandiva: A LLVM-based Analytical Expression Compiler for Apache Arrow

Published 05 Dec 2018
By Jacques Nadeau (jacques)

Today we’re happy to announce that the Gandiva Initiative for Apache Arrow, an LLVM-based execution kernel, is now part of the Apache Arrow project. Gandiva was kindly donated by Dremio, where it was originally developed and open-sourced. Gandiva extends Arrow’s capabilities to provide high performance analytical execution and is composed of two main components:

  • A runtime expression compiler leveraging LLVM

  • A high performance execution environment

Gandiva works as follows: applications submit an expression tree to the compiler, built in a language agnostic protobuf-based expression representation. From there, Gandiva then compiles the expression tree to native code for the current runtime environment and hardware. Once compiled, the Gandiva execution kernel then consumes and produces Arrow columnar batches. The generated code is highly optimized for parallel processing on modern CPUs. For example, on AVX-128 processors Gandiva can process 8 pairs of 2 byte values in a single vectorized operation, and on AVX-512 processors Gandiva can process 4x as many values in a single operation. Gandiva is built from the ground up to understand Arrow’s in-memory representation and optimize processing against it.

While Gandiva is just starting within the Arrow community, it already supports hundreds of expressions, ranging from math functions to case statements. Gandiva was built as a standalone C++ library on top of the core Apache Arrow codebase and was donated with C++ and Java APIs for expression construction and execution, supporting projection and filtering operations. The Arrow community is already looking to expand Gandiva’s capabilities. This will include incorporating more operations and supporting many new language bindings. As an example, multiple community members are already actively building new language bindings that allow use of Gandiva within Python and Ruby.

While young within the Arrow community, Gandiva is already shipped and used in production by many Dremio customers as part of Dremio’s execution engine. Experiments have demonstrated 70x performance improvement on many SQL queries. We expect to see similar performance gains for many other projects that leverage Arrow.

The Arrow community is working to ship the first formal Apache Arrow release that includes Gandiva, and we hope this will be available within the next couple months. This should make it much easier for the broader analytics and data science development communities to leverage runtime code generation for high-performance data processing in a variety of contexts and projects.

We started the Arrow project a couple of years ago with the objective of creating an industry-standard columnar in-memory data representation for analytics. Within this short period of time, Apache Arrow has been adopted by dozens of both open source and commercial software products. Some key examples include technologies such as Apache Spark, Pandas, Nvidia RAPIDS, Dremio, and InfluxDB. This success has driven Arrow to now be downloaded more than 1 million times per month. Over 200 developers have already contributed to Apache Arrow. If you’re interested in contributing to Gandiva or any other part of the Apache Arrow project, feel free to reach out on the mailing list and join us!

For additional technical details on Gandiva, you can check out some of the following resources:

Apache Arrow 0.11.0 Release

Published 09 Oct 2018
By Wes McKinney (wesm)

The Apache Arrow team is pleased to announce the 0.11.0 release. It is the product of 2 months of development and includes 287 resolved issues.

See the Install Page to learn how to get the libraries for your platform. The complete changelog is also available.

We discuss some highlights from the release and other project news in this post.

Arrow Flight RPC and Messaging Framework

We are developing a new Arrow-native RPC framework, Arrow Flight, based on gRPC for high performance Arrow-based messaging. Through low-level extensions to gRPC’s internal memory management, we are able to avoid expensive parsing when receiving datasets over the wire, unlocking unprecedented levels of performance in moving datasets from one machine to another. We will be writing more about Flight on the Arrow blog in the future.

Prototype implementations are available in Java and C++, and we will be focused in the coming months on hardening the Flight RPC framework for enterprise-grade production use cases.

Parquet and Arrow C++ communities joining forces

After discussion over the last year, the Apache Arrow and Apache Parquet C++ communities decided to merge the Parquet C++ codebase into the Arrow C++ codebase and work together in a “monorepo” structure. This should result in better developer productivity in core Parquet work as well as in Arrow integration.

Before this codebase merge, we had a circular dependency between the Arrow and Parquet codebases, since the Parquet C++ library is used in the Arrow Python library.

Gandiva LLVM Expression Compiler donation

Dremio Corporation has donated the Gandiva LLVM expression compiler to Apache Arrow. We will be working on cross-platform builds, packaging, and language bindings (e.g. in Python) for Gandiva in the upcoming 0.12 release and beyond. We will write more about Gandiva in the future.

Parquet C GLib Bindings Donation

PMC member Kouhei Sutou has donated GLib bindings for the Parquet C++ libraries, which are designed to work together with the existing Arrow GLib bindings.

C++ CSV Reader Project

We have begun developing a general purpose multithreaded CSV file parser in C++. The purpose of this library is to parse and convert comma-separated text files into Arrow columnar record batches as efficiently as possible. The prototype version features Python bindings, and it can be used from any language that can call into the C++ libraries (including C, R, and Ruby).

New MATLAB bindings

The MathWorks has contributed an initial MEX file binding to the Arrow C++ libraries. Initially, it is possible to read Arrow-based Feather files in MATLAB. We are looking forward to seeing more developments for MATLAB users.

R Library in Development

The community has begun implementing R language bindings and interoperability with the Arrow C++ libraries. This will include support for zero-copy shared memory IPC and other tools needed to improve R integration with Apache Spark and more.

Support for CUDA-based GPUs in Python

This release includes Python bindings to the Arrow CUDA integration C++ library. This work is targeting interoperability with Numba and the GPU Open Analytics Initiative.

Upcoming Roadmap

In the coming months, we will continue to make progress on many fronts, with Gandiva packaging, expanded language support (especially in R), and improved data access (e.g. CSV, Parquet files) in focus.

Apache Arrow 0.10.0 Release

Published 07 Aug 2018
By Wes McKinney (wesm)

The Apache Arrow team is pleased to announce the 0.10.0 release. It is the product of over 4 months of development and includes 470 resolved issues. It is the largest release so far in the project’s history. 90 individuals contributed to this release.

See the Install Page to learn how to get the libraries for your platform. The complete changelog is also available.

We discuss some highlights from the release and other project news in this post.

Official Binary Packages and Packaging Automation

One of the largest projects in this release cycle was automating our build and packaging tooling to be able to easily and reproducibly create a comprehensive set of binary artifacts which have been approved and released by the Arrow PMC. We developed a tool called Crossbow which uses Appveyor and Travis CI to build each of the different supported packages on all 3 platforms (Linux, macOS, and Windows). As a result of our efforts, we should be able to make more frequent Arrow releases. This work was led by Phillip Cloud, Kouhei Sutou, and Krisztián Szűcs. Bravo!

New Programming Languages: Go, Ruby, Rust

This release also adds 3 new programming languages to the project: Go, Ruby, and Rust. Together with C, C++, Java, JavaScript, and Python, we now have some level of support for 8 programming languages.

Upcoming Roadmap

In the coming months, we will be working to move Apache Arrow closer to a 1.0.0 release. We will continue to grow new features, improve performance and stability, and expand support for currently supported and new programming languages.

Faster, scalable memory allocations in Apache Arrow with jemalloc

Published 20 Jul 2018
By Uwe Korn (uwe)

With the release of the 0.9 version of Apache Arrow, we have switched our default allocator for array buffers from the system allocator to jemalloc on macOS and Linux. This applies to the C++/GLib/Python implementations of Arrow. Changing the default allocator is normally done to avoid problems that occur with many small, frequent (de)allocations. In contrast, in Arrow we normally deal with large in-memory datasets. While jemalloc provides good strategies for avoiding RAM fragmentation for allocations smaller than a memory page (4KB), it also provides functionality that improves performance on allocations that span several memory pages.

Outside of Apache Arrow, jemalloc powers the infrastructure of Facebook (this is also where most of its development happens). It is also used as the default allocator in Rust, and it helps Redis reduce memory fragmentation on Linux.

One allocation specialty that we require in Arrow is that memory should be 64-byte aligned. This is so that we can get the most performance out of SIMD instruction sets like AVX. While most modern SIMD instructions also work on unaligned memory, their performance is much better on aligned memory. To get the best performance for our analytical applications, we want all memory to be allocated such that SIMD performance is maximized.

For aligned allocations, the C and POSIX APIs only provide the aligned_alloc(size_t alignment, size_t size) and posix_memalign(void **ptr, size_t alignment, size_t size) functions to allocate aligned memory. Neither of them caters for expansion of an existing allocation. While the realloc function can often expand allocations without moving them physically, it does not guarantee that the alignment is kept in the case where the allocation is moved.

In the case when Arrow was built without jemalloc being enabled, this resulted in copying the data on each new expansion of an allocation. To reduce the number of memory copies, we use jemalloc’s *allocx()-APIs to create, modify and free aligned allocations. One of the typical tasks where this gives us a major speedup is on the incremental construction of an Arrow table that consists of several columns. We often don’t know the size of the table in advance and need to expand our allocations as the data is loaded.

To incrementally build a vector using memory expansion of a factor of 2, we would use the following C-code with the standard POSIX APIs:

size_t size = 128 * 1024;
void* ptr = aligned_alloc(64, size);
for (int i = 0; i < 10; i++) {
  size_t new_size = size * 2;
  void* ptr2 = aligned_alloc(64, new_size);
  memcpy(ptr2, ptr, size);
  free(ptr);
  ptr = ptr2;
  size = new_size;
}
free(ptr);
With jemalloc’s special APIs, we are able to omit the explicit call to memcpy. In the case where a memory expansion cannot be done in-place, it is still called by the allocator but not needed on all occasions. This simplifies our user code to:

size_t size = 128 * 1024;
void* ptr = mallocx(size, MALLOCX_ALIGN(64));
for (int i = 0; i < 10; i++) {
  size *= 2;
  ptr = rallocx(ptr, size, MALLOCX_ALIGN(64));
}
dallocx(ptr, MALLOCX_ALIGN(64));

To see the real world benefits of using jemalloc, we look at the benchmarks in Arrow C++. There we have modeled a typical use case of incrementally building up an array of primitive values. For the build-up of the array, we don’t know the number of elements in the final array so we need to continuously expand the memory region in which the data is stored. The code for this benchmark is part of the builder-benchmark in the Arrow C++ sources as BuildPrimitiveArrayNoNulls.

Runtimes without jemalloc:

BM_BuildPrimitiveArrayNoNulls/repeats:3                 636726 us   804.114MB/s
BM_BuildPrimitiveArrayNoNulls/repeats:3                 621345 us   824.019MB/s
BM_BuildPrimitiveArrayNoNulls/repeats:3                 625008 us    819.19MB/s
BM_BuildPrimitiveArrayNoNulls/repeats:3_mean            627693 us   815.774MB/s
BM_BuildPrimitiveArrayNoNulls/repeats:3_median          625008 us    819.19MB/s
BM_BuildPrimitiveArrayNoNulls/repeats:3_stddev            8034 us   10.3829MB/s

Runtimes with jemalloc:

BM_BuildPrimitiveArrayNoNulls/repeats:3                 630881 us   811.563MB/s
BM_BuildPrimitiveArrayNoNulls/repeats:3                 352891 us   1.41687GB/s
BM_BuildPrimitiveArrayNoNulls/repeats:3                 351039 us   1.42434GB/s
BM_BuildPrimitiveArrayNoNulls/repeats:3_mean            444937 us   1.21125GB/s
BM_BuildPrimitiveArrayNoNulls/repeats:3_median          352891 us   1.41687GB/s
BM_BuildPrimitiveArrayNoNulls/repeats:3_stddev          161035 us   371.335MB/s

The benchmark was run three times for each configuration to see the performance differences. The first run in each configuration yielded the same performance, but in all subsequent runs the version using jemalloc was about twice as fast. In these cases, the memory region used for constructing the array could be expanded in place without moving the data around. This was possible because memory pages assigned to the process were unused but had not been reclaimed by the operating system. Without jemalloc, we cannot make use of them, simply because the default allocator has no API for aligned reallocation.
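The arithmetic behind this speedup is easy to sketch: with doubling growth and no in-place expansion, nearly as many bytes are copied as the final buffer holds. The following Python sketch is illustrative only, not the Arrow implementation:

```python
# Sketch: bytes memcpy'd when doubling a 128 KiB buffer 10 times without
# in-place expansion; every growth step copies the entire old buffer.
size = 128 * 1024
copied = 0
for _ in range(10):
    copied += size  # the old contents are copied into the new allocation
    size *= 2

# copied == 128 KiB * (2**10 - 1): almost one full copy of the final
# 128 MiB buffer. In-place expansion (e.g. via rallocx) can avoid all of it.
```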

A Native Go Library for Apache Arrow

Published 22 Mar 2018
By The Apache Arrow PMC (pmc)

Since launching in early 2016, Apache Arrow has been growing fast. We have made nine major releases through the efforts of over 120 distinct contributors. The project’s scope has also expanded. We began by focusing on the development of the standardized in-memory columnar data format, which now serves as a pillar of the project. Since then, we have been growing into a more general cross-language platform for in-memory data analysis through new additions to the project like the Plasma shared memory object store. A primary goal of the project is to enable data system developers to process and move data fast.

So far, we officially have developed native Arrow implementations in C++, Java, and JavaScript. We have created binding layers for the C++ libraries in C (using the GLib libraries) and Python. We have also seen efforts to develop interfaces to the Arrow C++ libraries in Go, Lua, Ruby, and Rust. While binding layers serve many purposes, there can be benefits to native implementations, and so we’ve been keen to see future work on native implementations in growing systems languages like Go and Rust.

This past October, engineers Stuart Carnie, Nathaniel Cook, and Chris Goller, employees of InfluxData, began developing a native Go language implementation of the Apache Arrow in-memory columnar format for use in Go-based database systems like InfluxDB. We are excited to announce that InfluxData has donated this native Go implementation to the Apache Arrow project, where it will continue to be developed. This work features low-level integration with the Go runtime and native support for SIMD instruction sets. We are looking forward to working more closely with the Go community on solving in-memory analytics and data interoperability problems.

Apache Arrow implementations and bindings

One of the mantras in The Apache Software Foundation is “Community over Code”. By building an open and collaborative development community across many programming language ecosystems, we will be able to develop better and longer-lived solutions to the systems problems faced by data developers.

We are excited for what the future holds for the Apache Arrow project. Adding first-class support for a popular systems programming language like Go is an important step along the way. We welcome others from the Go community to get involved in the project. We also welcome others who wish to explore building Arrow support for other programming languages not yet represented. Learn more and join the mailing list.

Apache Arrow 0.9.0 Release

Published 22 Mar 2018
By Wes McKinney (wesm)

The Apache Arrow team is pleased to announce the 0.9.0 release. It is the product of over 3 months of development and includes 260 resolved JIRAs.

While we made some backwards-incompatible columnar binary format changes in last December’s 0.8.0 release, the 0.9.0 release is backwards-compatible with 0.8.0. We will be working toward a 1.0.0 release this year, which will mark longer-term binary stability for the Arrow columnar format and metadata.

See the Install Page to learn how to get the libraries for your platform. The complete changelog is also available.

We discuss some highlights from the release and other project news in this post. Compared with previous releases, which pushed more on new and expanded features, this release focused more on bug fixes, compatibility, and stability.

New Arrow committers and PMC members

Since the last release, we have added 2 new Arrow committers: Brian Hulette and Robert Nishihara. Additionally, Phillip Cloud and Philipp Moritz have been promoted from committer to PMC member. Congratulations and thank you for your contributions!

Plasma Object Store Improvements

The Plasma Object Store now supports managing interprocess shared memory on CUDA-enabled GPUs. We are excited to see more GPU-related functionality develop in Apache Arrow, as this has become a key computing environment for scalable machine learning.

Python Improvements

Antoine Pitrou has joined the Python development efforts and helped significantly this release with interoperability with built-in CPython data structures and NumPy structured data types.

  • New experimental support for reading Apache ORC files
  • pyarrow.array now accepts lists of tuples or Python dicts for creating Arrow struct type arrays.
  • NumPy structured dtypes (which are row/record-oriented) can be directly converted to Arrow struct (column-oriented) arrays
  • Python 3.6 pathlib objects for file paths are now accepted in many file APIs, including for Parquet files
  • Arrow integer arrays with nulls can now be converted to NumPy object arrays with None values
  • New pyarrow.foreign_buffer API for interacting with memory blocks located at particular memory addresses
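To illustrate the struct-type conversion mentioned above: pyarrow.array essentially pivots row-shaped inputs into per-field columns. The pure-Python sketch below shows the idea only; it is not the pyarrow internals:

```python
# Row-shaped input, as pyarrow.array accepts for struct types...
rows = [{'x': 1, 'y': 2.0}, {'x': 3, 'y': 4.0}]

# ...pivoted into the column-oriented layout a struct array stores:
# one child array per field.
columns = {name: [row[name] for row in rows] for name in rows[0]}
# columns == {'x': [1, 3], 'y': [2.0, 4.0]}
```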

Java Improvements

Java now fully supports the FixedSizeBinary data type.

JavaScript Improvements

The JavaScript library has been significantly refactored and expanded. We are making separate Apache releases (most recently JS-0.3.1) for JavaScript, which are being published to NPM.

Upcoming Roadmap

In the coming months, we will be working to move Apache Arrow closer to a 1.0.0 release. We will also be discussing plans to develop native Arrow-based computational libraries within the project.

Apache Arrow 0.8.0 Release

Published 18 Dec 2017
By Wes McKinney (wesm)

The Apache Arrow team is pleased to announce the 0.8.0 release. It is the product of 10 weeks of development and includes 286 resolved JIRAs with many new features and bug fixes to the various language implementations. This is the largest release since 0.3.0 earlier this year.

As part of work towards stabilizing the Arrow format and making a 1.0.0 release sometime in 2018, we made a series of backwards-incompatible changes to the serialized Arrow metadata that require Arrow readers and writers (0.7.1 and earlier) to upgrade in order to be compatible with 0.8.0 and higher. We expect future backwards-incompatible changes to be rare going forward.

See the Install Page to learn how to get the libraries for your platform. The complete changelog is also available.

We discuss some highlights from the release and other project news in this post.

Projects “Powered By” Apache Arrow

A growing ecosystem of projects are using Arrow to solve in-memory analytics and data interchange problems. We have added a new Powered By page to the Arrow website where we can acknowledge open source projects and companies which are using Arrow. If you would like to add your project to the list as an Arrow user, please let us know.

New Arrow committers

Since the last release, we have added 5 new Apache committers:

  • Phillip Cloud, who has mainly contributed to C++ and Python
  • Bryan Cutler, who has mainly contributed to Java and Spark integration
  • Li Jin, who has mainly contributed to Java and Spark integration
  • Paul Taylor, who has mainly contributed to JavaScript
  • Siddharth Teotia, who has mainly contributed to Java

Welcome to the Arrow team, and thank you for your contributions!

Improved Java vector API, performance improvements

Siddharth Teotia led efforts to revamp the Java vector API to make things simpler and faster. As part of this, we removed the dichotomy between nullable and non-nullable vectors.

See Sidd’s blog post for more about these changes.

Decimal support in C++, Python, consistency with Java

Phillip Cloud led efforts this release to harden details about exact decimal values in the Arrow specification and ensure a consistent implementation across Java, C++, and Python.

Arrow now supports decimals represented internally as a 128-bit little-endian integer, with a set precision and scale (as defined in many SQL-based systems). As part of this work, we needed to change Java’s internal representation from big- to little-endian.

We are now integration testing decimals between Java, C++, and Python, which will facilitate Arrow adoption in Apache Spark and other systems that use both Java and Python.

Decimal data can now be read and written by the Apache Parquet C++ library, including via pyarrow.

In the future, we may implement support for smaller-precision decimals represented by 32- or 64-bit integers.
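As a concrete illustration of the representation described above, a decimal value with a given scale is stored as its unscaled integer in a 16-byte little-endian buffer. This Python sketch (with hypothetical variable names, not the Arrow C++ code) shows the round trip:

```python
from decimal import Decimal

# decimal(precision=5, scale=2): 123.45 is stored as the unscaled integer 12345.
value = Decimal("123.45")
scale = 2
unscaled = int(value.scaleb(scale))                 # 12345
raw = unscaled.to_bytes(16, "little", signed=True)  # 128-bit little-endian

# Round trip: decode the buffer and reapply the scale.
decoded = Decimal(int.from_bytes(raw, "little", signed=True)).scaleb(-scale)
```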

C++ improvements: expanded kernels library and more

In C++, we have continued developing the new arrow::compute submodule consisting of native computation functions for Arrow data. New contributor Licht Takeuchi helped expand the supported types for type casting in compute::Cast. We have also implemented new kernels Unique and DictionaryEncode for computing the distinct elements of an array and dictionary encoding (conversion to categorical), respectively.
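For readers unfamiliar with dictionary encoding, the DictionaryEncode kernel conceptually produces a dictionary of distinct values plus integer codes into it, along these lines (a Python sketch of the idea, not the C++ kernel):

```python
# Dictionary-encode a sequence: distinct values + integer codes into them.
values = ["a", "b", "a", "c", "b"]
dictionary, index, codes = [], {}, []
for v in values:
    if v not in index:           # first occurrence: assign the next code
        index[v] = len(dictionary)
        dictionary.append(v)
    codes.append(index[v])
# dictionary == ["a", "b", "c"]; codes == [0, 1, 0, 2, 1]
```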

We expect the C++ computation “kernel” library to be a major expansion area for the project over the next year and beyond. Here, we can also implement SIMD- and GPU-accelerated versions of basic in-memory analytics functionality.

As a minor breaking API change in C++, we have made the RecordBatch and Table APIs “virtual” or abstract interfaces, to enable different implementations of a record batch or table which conform to the standard interface. This will help enable features like lazy IO or column loading.

There was significant work improving the C++ library generally and supporting work happening in Python and C. See the change log for full details.

GLib C improvements: Meson build, GPU support

Development of the GLib-based C bindings has generally tracked work happening in the C++ library. These bindings are being used to develop data science tools for Ruby users and elsewhere.

The C bindings now support the Meson build system in addition to autotools, which enables them to be built on Windows.

The Arrow GPU extension library is now also supported in the C bindings.

JavaScript: first independent release on NPM

Brian Hulette and Paul Taylor have been continuing to drive efforts on the TypeScript-based JavaScript implementation.

Since the last release, we made a first JavaScript-only Apache release, version 0.2.0, which is now available on NPM. We decided to make separate JavaScript releases to enable the JS library to release more frequently than the rest of the project.

Python improvements

In addition to some of the new features mentioned above, we have made a variety of usability and performance improvements for integrations with pandas, NumPy, Dask, and other Python projects which may make use of pyarrow, the Arrow Python library.

Some of these improvements include:

  • Component-based serialization for more flexible and memory-efficient transport of large or complex Python objects
  • Substantially improved serialization performance for pandas objects when using pyarrow.serialize and pyarrow.deserialize. This includes a special pyarrow.pandas_serialization_context which further accelerates certain internal details of pandas serialization
  • Support for zero-copy reads of pandas.DataFrame using pyarrow.deserialize, for objects that do not contain Python objects
  • Multithreaded conversions from pandas.DataFrame to pyarrow.Table (we already supported multithreaded conversions from Arrow back to pandas)
  • More efficient conversion from 1-dimensional NumPy arrays to Arrow format
  • New generic buffer compression and decompression APIs pyarrow.compress and pyarrow.decompress
  • Enhanced Parquet cross-compatibility with fastparquet and improved Dask support
  • Python support for accessing Parquet row group column statistics

Upcoming Roadmap

The 0.8.0 release includes some API and format changes, but upcoming releases will focus on completing and stabilizing critical functionality to move the project closer to a 1.0.0 release.

With the ecosystem of projects using Arrow expanding rapidly, we will be working to improve and expand the libraries in support of downstream use cases.

We continue to look for more JavaScript, Julia, R, Rust, and other programming language developers to join the project and expand the available implementations and bindings to more languages.

Improvements to Java Vector API in Apache Arrow 0.8.0

Published 18 Dec 2017
By Siddharth Teotia

This post gives insight into the major improvements in the Java implementation of vectors. We undertook this work over the 10 weeks since the previous Arrow release.

Design Goals

  1. Improved maintainability and extensibility
  2. Improved heap memory usage
  3. No performance overhead on hot code paths


Improved maintainability and extensibility

We use templates in several places for compile time Java code generation for different vector classes, readers, writers etc. Templates are helpful as the developers don’t have to write a lot of duplicate code.

However, we realized that over a period of time some specific Java templates became extremely complex with giant if-else blocks, poor code indentation and documentation. All this impacted the ability to easily extend these templates for adding new functionality or improving the existing infrastructure.

So we evaluated the usage of templates for compile-time code generation and decided not to use complex templates in some places, instead writing a small amount of duplicate code that is elegant, well documented and extensible.

Improved heap usage

We did extensive memory analysis downstream in Dremio, where Arrow is used heavily for in-memory query execution on columnar data. The general conclusion was that Arrow’s Java vector classes have non-negligible heap overhead and the volume of objects was too high. There were places in the code where we were creating objects unnecessarily and using structures that could be substituted with better alternatives.

No performance overhead on hot code paths

Java vectors used delegation and abstraction heavily throughout the object hierarchy. The performance critical get/set methods of vectors went through a chain of function calls back and forth between different objects before doing meaningful work. We also evaluated the usage of branches in vector APIs and reimplemented some of them by avoiding branches completely.

We took inspiration from how the Java memory code in ArrowBuf works. For all the performance critical methods, ArrowBuf bypasses all the netty object hierarchy, grabs the target virtual address and directly interacts with the memory.

There were cases where branches could be avoided altogether.

In case of nullable vectors, we were doing multiple checks to confirm if the value at a given position in the vector is null or not.
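Conceptually, that null check reduces to a single bit test against the validity bitmap, roughly as below (a Python sketch with hypothetical names; the actual Java code operates directly on ArrowBuf memory):

```python
def is_null(validity: bytes, index: int) -> bool:
    # Bit i of the bitmap is 1 when slot i holds a valid (non-null) value.
    return (validity[index >> 3] >> (index & 7)) & 1 == 0

validity = bytes([0b00000101])  # slots 0 and 2 valid, slot 1 null
flags = [is_null(validity, i) for i in range(3)]
# flags == [False, True, False]
```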

Our implementation approach

  • For scalars, the inheritance tree was simplified by writing different abstract base classes for fixed and variable width scalars.
  • The base classes contained all the common functionality across different types.
  • The individual subclasses implemented type specific APIs for fixed and variable width scalar vectors.
  • For the performance critical methods, all the work is done either in the vector class or corresponding ArrowBuf. There is no delegation to any internal object.
  • The mutator and accessor based access to vector APIs is removed. These objects led to unnecessary heap overhead and complicated the use of APIs.
  • Both scalar and complex vectors directly interact with underlying buffers that manage the offsets, data and validity. Earlier we were creating different inner vectors for each vector and delegating all the functionality to inner vectors. This introduced a lot of bugs in memory management, excessive heap overhead and performance penalty due to chain of delegations.
  • We reduced the number of vector classes by removing non-nullable vectors. In the new implementation, all vectors in Java are nullable in nature.

Fast Python Serialization with Ray and Apache Arrow

Published 15 Oct 2017
By Philipp Moritz, Robert Nishihara

This was originally posted on the Ray blog. Philipp Moritz and Robert Nishihara are graduate students at UC Berkeley.

This post elaborates on the integration between Ray and Apache Arrow. The main problem this addresses is data serialization.

From Wikipedia, serialization is

… the process of translating data structures or object state into a format that can be stored … or transmitted … and reconstructed later (possibly in a different computer environment).

Why is any translation necessary? Well, when you create a Python object, it may have pointers to other Python objects, and these objects are all allocated in different regions of memory, and all of this has to make sense when unpacked by another process on another machine.

Serialization and deserialization are bottlenecks in parallel and distributed computing, especially in machine learning applications with large objects and large quantities of data.

Design Goals

As Ray is optimized for machine learning and AI applications, we have focused a lot on serialization and data handling, with the following design goals:

  1. It should be very efficient with large numerical data (this includes NumPy arrays and Pandas DataFrames, as well as objects that recursively contain Numpy arrays and Pandas DataFrames).
  2. It should be about as fast as Pickle for general Python types.
  3. It should be compatible with shared memory, allowing multiple processes to use the same data without copying it.
  4. Deserialization should be extremely fast (when possible, it should not require reading the entire serialized object).
  5. It should be language independent (eventually we’d like to enable Python workers to use objects created by workers in Java or other languages and vice versa).

Our Approach and Alternatives

The go-to serialization approach in Python is the pickle module. Pickle is very general, especially if you use variants like cloudpickle. However, it does not satisfy requirements 1, 3, 4, or 5. Alternatives like json satisfy 5, but not 1-4.

Our Approach: To satisfy requirements 1-5, we chose to use the Apache Arrow format as our underlying data representation. In collaboration with the Apache Arrow team, we built libraries for mapping general Python objects to and from the Arrow format. Some properties of this approach:

  • The data layout is language independent (requirement 5).
  • Offsets into a serialized data blob can be computed in constant time without reading the full object (requirements 1 and 4).
  • Arrow supports zero-copy reads, so objects can naturally be stored in shared memory and used by multiple processes (requirements 1 and 3).
  • We can naturally fall back to pickle for anything we can’t handle well (requirement 2).
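The constant-time offset property above can be sketched as follows: given the offsets recorded in the metadata, locating element i is pure arithmetic, with no scan of the data blob (an illustrative Python sketch, not the Arrow implementation):

```python
# Variable-length elements packed into one blob, with absolute offsets.
data = list("abcdefghi")
offsets = [0, 2, 5, 9]  # element i spans data[offsets[i]:offsets[i+1]]

def element(i):
    # O(1) slicing: no need to walk earlier elements to find element i.
    return data[offsets[i]:offsets[i + 1]]
# element(1) == ['c', 'd', 'e']
```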

Alternatives to Arrow: We could have built on top of Protocol Buffers, but Protocol Buffers really aren’t designed for numerical data, and that approach wouldn’t satisfy 1, 3, or 4. Building on top of Flatbuffers actually could be made to work, but it would have required implementing a lot of the facilities that Arrow already has, and we preferred a columnar data layout more optimized for big data.


Here we show some performance improvements over Python’s pickle module. The experiments were done using pickle.HIGHEST_PROTOCOL. Code for generating these plots is included at the end of the post.

With NumPy arrays: In machine learning and AI applications, data (e.g., images, neural network weights, text documents) are typically represented as data structures containing NumPy arrays. When using NumPy arrays, the speedups are impressive.

The fact that the Ray bars for deserialization are barely visible is not a mistake. This is a consequence of the support for zero-copy reads (the savings largely come from the lack of memory movement).

Note that the biggest wins are with deserialization. The speedups here are multiple orders of magnitude and get better as the NumPy arrays get larger (thanks to design goals 1, 3, and 4). Making deserialization fast is important for two reasons. First, an object may be serialized once and then deserialized many times (e.g., an object that is broadcast to all workers). Second, a common pattern is for many objects to be serialized in parallel and then aggregated and deserialized one at a time on a single worker, making deserialization the bottleneck.

Without NumPy arrays: When using regular Python objects, for which we cannot take advantage of shared memory, the results are comparable to pickle.

These are just a few examples of interesting Python objects. The most important case is the case where NumPy arrays are nested within other objects. Note that our serialization library works with very general Python types including custom Python classes and deeply nested objects.


The serialization library can be used directly through pyarrow as follows. More documentation is available here.

x = [(1, 2), 'hello', 3, 4, np.array([5.0, 6.0])]
serialized_x = pyarrow.serialize(x).to_buffer()
deserialized_x = pyarrow.deserialize(serialized_x)

It can be used directly through the Ray API as follows.

x = [(1, 2), 'hello', 3, 4, np.array([5.0, 6.0])]
x_id = ray.put(x)
deserialized_x = ray.get(x_id)

Data Representation

We use Apache Arrow as the underlying language-independent data layout. Objects are stored in two parts: a schema and a data blob. At a high level, the data blob is roughly a flattened concatenation of all of the data values recursively contained in the object, and the schema defines the types and nesting structure of the data blob.

Technical Details: Python sequences (e.g., dictionaries, lists, tuples, sets) are encoded as Arrow UnionArrays of other types (e.g., bools, ints, strings, bytes, floats, doubles, date64s, tensors (i.e., NumPy arrays), lists, tuples, dicts and sets). Nested sequences are encoded using Arrow ListArrays. All tensors are collected and appended to the end of the serialized object, and the UnionArray contains references to these tensors.

To give a concrete example, consider the following object.

[(1, 2), 'hello', 3, 4, np.array([5.0, 6.0])]

It would be represented in Arrow with the following structure.

UnionArray(type_ids=[tuple, string, int, int, ndarray],
           tuples=ListArray(offsets=[0, 2],
                            UnionArray(type_ids=[int, int],
                                       ints=[1, 2])),
           ints=[3, 4],
           ndarrays=[<offset of numpy array>])

Arrow uses Flatbuffers to encode serialized schemas. Using only the schema, we can compute the offsets of each value in the data blob without scanning through the data blob (unlike Pickle, this is what enables fast deserialization). This means that we can avoid copying or otherwise converting large arrays and other values during deserialization. Tensors are appended at the end of the UnionArray and can be efficiently shared and accessed using shared memory.

Note that the actual object would be laid out in memory as shown below.

The layout of a Python object in the heap. Each box is allocated in a different memory region, and arrows between boxes represent pointers.

The Arrow serialized representation would be as follows.

The memory layout of the Arrow-serialized object.

Getting Involved

We welcome contributions, especially in the following areas.

  • Use the C++ and Java implementations of Arrow to implement versions of this for C++ and Java.
  • Implement support for more Python types and better test coverage.

Reproducing the Figures Above

For reference, the figures can be reproduced with the following code. Benchmarking ray.put and ray.get instead of pyarrow.serialize and pyarrow.deserialize gives similar figures. The plots were generated at this commit.

import pickle
import pyarrow
import matplotlib.pyplot as plt
import numpy as np
import timeit

def benchmark_object(obj, number=10):
    # Time serialization and deserialization for pickle.
    pickle_serialize = timeit.timeit(
        lambda: pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL),
        number=number)
    serialized_obj = pickle.dumps(obj, pickle.HIGHEST_PROTOCOL)
    pickle_deserialize = timeit.timeit(lambda: pickle.loads(serialized_obj),
                                       number=number)

    # Time serialization and deserialization for Ray.
    ray_serialize = timeit.timeit(
        lambda: pyarrow.serialize(obj).to_buffer(), number=number)
    serialized_obj = pyarrow.serialize(obj).to_buffer()
    ray_deserialize = timeit.timeit(
        lambda: pyarrow.deserialize(serialized_obj), number=number)

    return [[pickle_serialize, pickle_deserialize],
            [ray_serialize, ray_deserialize]]

def plot(pickle_times, ray_times, title, i):
    fig, ax = plt.subplots()
    fig.set_size_inches(3.8, 2.7)

    bar_width = 0.35
    index = np.arange(2)
    opacity = 0.6
    plt.bar(index, pickle_times, bar_width,
            alpha=opacity, color='r', label='Pickle')
    plt.bar(index + bar_width, ray_times, bar_width,
            alpha=opacity, color='c', label='Ray')

    plt.title(title, fontweight='bold')
    plt.ylabel('Time (seconds)', fontsize=10)
    labels = ['serialization', 'deserialization']
    plt.xticks(index + bar_width / 2, labels, fontsize=10)
    plt.legend(fontsize=10, bbox_to_anchor=(1, 1))
    plt.savefig('plot-' + str(i) + '.png', format='png')

test_objects = [
    [np.random.randn(50000) for i in range(100)],
    {'weight-' + str(i): np.random.randn(50000) for i in range(100)},
    {i: set(['string1' + str(i), 'string2' + str(i)]) for i in range(100000)},
    [str(i) for i in range(200000)]
]

titles = [
    'List of large numpy arrays',
    'Dictionary of large numpy arrays',
    'Large dictionary of small sets',
    'Large list of strings'
]

for i in range(len(test_objects)):
    plot(*benchmark_object(test_objects[i]), titles[i], i)

Apache Arrow 0.7.0 Release

Published 19 Sep 2017
By Wes McKinney (wesm)

The Apache Arrow team is pleased to announce the 0.7.0 release. It includes 133 resolved JIRAs with many new features and bug fixes to the various language implementations. The Arrow memory format remains stable since the 0.3.x release.

See the Install Page to learn how to get the libraries for your platform. The complete changelog is also available.

We include some highlights from the release in this post.

New PMC Member: Kouhei Sutou

Since the last release we have added Kou to the Arrow Project Management Committee. He is also a PMC member of Apache Subversion, and a major contributor to many other open source projects.

As an active member of the Ruby community in Japan, Kou has been developing the GLib-based C bindings for Arrow with associated Ruby wrappers, to enable Ruby users to benefit from the work that’s happening in Apache Arrow.

We are excited to be collaborating with the Ruby community on shared infrastructure for in-memory analytics and data science.

Expanded JavaScript (TypeScript) Implementation

Paul Taylor from the Falcor and ReactiveX projects has worked to expand the JavaScript implementation (which is written in TypeScript), using the latest in modern JavaScript build and packaging technology. We are looking forward to building out the JS implementation and bringing it up to full functionality with the C++ and Java implementations.

We are looking for more JavaScript developers to join the project and work together to make Arrow for JS work well with many kinds of front end use cases, like real time data visualization.

Type casting for C++ and Python

As part of longer-term efforts to build an Arrow-native in-memory analytics library, we implemented a variety of type conversion functions. These functions are essential in ETL tasks when conforming one table schema to another. These are similar to the astype function in NumPy.

In [17]: import pyarrow as pa

In [18]: arr = pa.array([True, False, None, True])

In [19]: arr
<pyarrow.lib.BooleanArray object at 0x7ff6fb069b88>

In [20]: arr.cast(pa.int32())
<pyarrow.lib.Int32Array object at 0x7ff6fb0383b8>

Over time these will expand to support as many input-and-output type combinations as possible, with optimized conversions.

New Arrow GPU (CUDA) Extension Library for C++

To help with GPU-related projects using Arrow, like the GPU Open Analytics Initiative, we have started a C++ add-on library to simplify Arrow memory management on CUDA-enabled graphics cards. We would like to expand this to include a library of reusable CUDA kernel functions for GPU analytics on Arrow columnar memory.

For example, we could write a record batch from CPU memory to GPU device memory like so (some error checking omitted):

#include <arrow/api.h>
#include <arrow/gpu/cuda_api.h>

using namespace arrow;

gpu::CudaDeviceManager* manager;
std::shared_ptr<gpu::CudaContext> context;

manager->GetContext(kGpuNumber, &context);

std::shared_ptr<RecordBatch> batch = GetCpuData();

std::shared_ptr<gpu::CudaBuffer> device_serialized;
gpu::SerializeRecordBatch(*batch, context.get(), &device_serialized);

We can then “read” the GPU record batch, but the returned arrow::RecordBatch internally will contain GPU device pointers that you can use for CUDA kernel calls:

std::shared_ptr<RecordBatch> device_batch;
gpu::ReadRecordBatch(batch->schema(), device_serialized,
                     default_memory_pool(), &device_batch);

// Now run some CUDA kernels on device_batch

Decimal Integration Tests

Phillip Cloud has been working on decimal support in C++ to enable Parquet read/write support in C++ and Python, and also end-to-end testing against the Arrow Java libraries.

In the upcoming releases, we hope to complete the remaining data types that need end-to-end testing between Java and C++:

  • Fixed size lists (variable-size lists already implemented)
  • Fixed-size binary
  • Unions
  • Maps
  • Time intervals

Other Notable Python Changes

Some highlights of Python development outside of bug fixes and general API improvements include:

  • Simplified APIs for putting and getting arbitrary Python objects in Plasma
  • High-speed, memory efficient object serialization. This is important enough that we will likely write a dedicated blog post about it.
  • New flavor='spark' option to pyarrow.parquet.write_table to enable easy writing of Parquet files maximized for Spark compatibility
  • parquet.write_to_dataset function with support for partitioned writes
  • Improved support for Dask filesystems
  • Improved Python usability for IPC: read and write schemas and record batches more easily. See the API docs for more about these.

The Road Ahead

Upcoming Arrow releases will continue to expand the project to cover more use cases. In addition to completing end-to-end testing for all the major data types, some of us will be shifting attention to building Arrow-native in-memory analytics libraries.

We are looking for more JavaScript, R, and other programming language developers to join the project and expand the available implementations and bindings to more languages.

Apache Arrow 0.6.0 Release

Published 16 Aug 2017
By Wes McKinney (wesm)

The Apache Arrow team is pleased to announce the 0.6.0 release. It includes 90 resolved JIRAs with the new Plasma shared memory object store, and improvements and bug fixes to the various language implementations. The Arrow memory format remains stable since the 0.3.x release.

See the Install Page to learn how to get the libraries for your platform. The complete changelog is also available.

Plasma Shared Memory Object Store

This release includes the Plasma Store, which you can read more about in the linked blog post. This system was originally developed as part of the Ray Project at the UC Berkeley RISELab. We recognized that Plasma would be highly valuable to the Arrow community as a tool for shared memory management and zero-copy deserialization. Additionally, we believe we will be able to develop a stronger software stack through sharing of IO and buffer management code.

The Plasma store is a server application which runs as a separate process. A reference C++ client, with Python bindings, is made available in this release. Clients can be developed in Java or other languages in the future to enable simple sharing of complex datasets through shared memory.

Arrow Format Addition: Map type

We added a Map logical type to represent ordered and unordered maps in-memory. This corresponds to the MAP logical type annotation in the Parquet format (where maps are represented as repeated structs).

Map is represented as a list of structs. It is the first example of a logical type whose physical representation is a nested type. We have not yet created Map container implementations in any of the libraries, but this can be done in a future release.

As an example, the Python data:

data = [{'a': 1, 'bb': 2, 'cc': 3}, {'dddd': 4}]

Could be represented in an Arrow Map<String, Int32> as:

Map<String, Int32> = List<Struct<keys: String, values: Int32>>
  is_valid: [true, true]
  offsets: [0, 3, 4]
  values: Struct<keys: String, values: Int32>
      - keys: String
          is_valid: [true, true, true, true]
          offsets: [0, 1, 3, 5, 9]
          data: abbccdddd
      - values: Int32
          is_valid: [true, true, true, true]
          data: [1, 2, 3, 4]
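To make the physical layout concrete, here is a plain-Python sketch (not an Arrow API) that reconstructs the original dicts from the flattened offsets and buffers shown above:

```python
# Reconstruct the logical Map values from the physical Arrow layout above.
# Plain-Python illustration only, not part of any Arrow library.

list_offsets = [0, 3, 4]        # offsets of the outer List
key_offsets = [0, 1, 3, 5, 9]   # offsets of the String "keys" child
key_data = "abbccdddd"          # data buffer of the String "keys" child
values = [1, 2, 3, 4]           # Int32 "values" child

# Slice the key data buffer into individual strings.
keys = [key_data[key_offsets[i]:key_offsets[i + 1]]
        for i in range(len(key_offsets) - 1)]

# Group (key, value) pairs into one dict per list slot.
maps = [dict(zip(keys[start:stop], values[start:stop]))
        for start, stop in zip(list_offsets, list_offsets[1:])]

print(maps)  # [{'a': 1, 'bb': 2, 'cc': 3}, {'dddd': 4}]
```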

Python Changes

Some highlights of Python development outside of bug fixes and general API improvements include:

  • New strings_to_categorical=True option when calling Table.to_pandas will yield pandas Categorical types from Arrow binary and string columns
  • Expanded Hadoop Filesystem (HDFS) functionality to improve compatibility with Dask and other HDFS-aware Python libraries.
  • s3fs and other Dask-oriented filesystems can now be used with pyarrow.parquet.ParquetDataset
  • More graceful handling of pandas’s nanosecond timestamps when writing to Parquet format. You can now pass coerce_timestamps='ms' to cast to milliseconds, or 'us' for microseconds.

Toward Arrow 1.0.0 and Beyond

We are still discussing the roadmap to 1.0.0 release on the developer mailing list. The focus of the 1.0.0 release will likely be memory format stability and hardening integration tests across the remaining data types implemented in Java and C++. Please join the discussion there.

Plasma In-Memory Object Store

Published 08 Aug 2017
By Philipp Moritz and Robert Nishihara

Philipp Moritz and Robert Nishihara are graduate students at UC Berkeley.

Plasma: A High-Performance Shared-Memory Object Store

Motivating Plasma

This blog post presents Plasma, an in-memory object store that is being developed as part of Apache Arrow. Plasma holds immutable objects in shared memory so that they can be accessed efficiently by many clients across process boundaries. In light of the trend toward larger and larger multicore machines, Plasma enables critical performance optimizations in the big data regime.

Plasma was initially developed as part of Ray, and has recently been moved to Apache Arrow in the hopes that it will be broadly useful.

One of the goals of Apache Arrow is to serve as a common data layer enabling zero-copy data exchange between multiple frameworks. A key component of this vision is the use of off-heap memory management (via Plasma) for storing and sharing Arrow-serialized objects between applications.

Expensive serialization and deserialization as well as data copying are a common performance bottleneck in distributed computing. For example, a Python-based execution framework that wishes to distribute computation across multiple Python “worker” processes and then aggregate the results in a single “driver” process may choose to serialize data using the built-in pickle library. Assuming one Python process per core, each worker process would have to copy and deserialize the data, resulting in excessive memory usage. The driver process would then have to deserialize results from each of the workers, resulting in a bottleneck.

Using Plasma plus Arrow, the data being operated on would be placed in the Plasma store once, and all of the workers would read the data without copying or deserializing it (the workers would map the relevant region of memory into their own address spaces). The workers would then put the results of their computation back into the Plasma store, which the driver could then read and aggregate without copying or deserializing the data.

The Plasma API

Below we illustrate a subset of the API. The C++ API is documented more fully here, and the Python API is documented here.

Object IDs: Each object is associated with a string of bytes.

Creating an object: Objects are stored in Plasma in two stages. First, the object store creates the object by allocating a buffer for it. At this point, the client can write to the buffer and construct the object within the allocated buffer. When the client is done, the client seals the buffer making the object immutable and making it available to other Plasma clients.

# Create an object.
object_id = pyarrow.plasma.ObjectID(20 * b'a')
object_size = 1000
buffer = memoryview(client.create(object_id, object_size))

# Write to the buffer.
for i in range(1000):
    buffer[i] = 0

# Seal the object, making it immutable and available to other clients.
client.seal(object_id)

Getting an object: After an object has been sealed, any client who knows the object ID can get the object.

# Get the object from the store. This blocks until the object has been sealed.
object_id = pyarrow.plasma.ObjectID(20 * b'a')
[buff] = client.get([object_id])
buffer = memoryview(buff)

If the object has not been sealed yet, then the call to client.get will block until the object has been sealed.

A sorting application

To illustrate the benefits of Plasma, we demonstrate an 11x speedup (on a machine with 20 physical cores) for sorting a large pandas DataFrame (one billion entries). The baseline is the built-in pandas sort function, which sorts the DataFrame in 477 seconds. To leverage multiple cores, we implement the following standard distributed sorting scheme.

  • We assume that the data is partitioned across K pandas DataFrames and that each one already lives in the Plasma store.
  • We subsample the data, sort the subsampled data, and use the result to define L non-overlapping buckets.
  • For each of the K data partitions and each of the L buckets, we find the subset of the data partition that falls in the bucket, and we sort that subset.
  • For each of the L buckets, we gather all of the K sorted subsets that fall in that bucket.
  • For each of the L buckets, we merge the corresponding K sorted subsets.
  • We turn each bucket into a pandas DataFrame and place it in the Plasma store.

Using this scheme, we can sort the DataFrame (the data starts and ends in the Plasma store), in 44 seconds, giving an 11x speedup over the baseline.
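The bucket-sort scheme above can be sketched in plain Python. Lists stand in for the partitioned DataFrames and the Plasma store, and all names here are illustrative rather than the actual benchmark code:

```python
import random
from bisect import bisect_right
from heapq import merge

def distributed_sort(partitions, num_buckets, sample_size=100):
    """Sample-based bucket sort over K partitions, as outlined above."""
    # 1. Subsample the data and derive num_buckets - 1 splitters that
    #    define non-overlapping, ordered buckets.
    sample = sorted(random.sample(
        [x for part in partitions for x in part],
        min(sample_size, sum(map(len, partitions)))))
    step = max(1, len(sample) // num_buckets)
    splitters = sample[step::step][:num_buckets - 1]

    # 2. For each partition and bucket, collect and sort the subset of
    #    the partition that falls in that bucket (done by the workers).
    sorted_subsets = [[[] for _ in range(len(splitters) + 1)]
                      for _ in partitions]
    for k, part in enumerate(partitions):
        for x in part:
            sorted_subsets[k][bisect_right(splitters, x)].append(x)
        for subset in sorted_subsets[k]:
            subset.sort()

    # 3-5. For each bucket, gather the K sorted subsets and merge them,
    #      then concatenate the buckets in order.
    result = []
    for b in range(len(splitters) + 1):
        result.extend(merge(*(sorted_subsets[k][b]
                              for k in range(len(partitions)))))
    return result

parts = [[5, 1, 9], [7, 3], [2, 8, 4, 6]]
print(distributed_sort(parts, num_buckets=3))  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```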


The Plasma store runs as a separate process. It is written in C++ and is designed as a single-threaded event loop based on the Redis event loop library. The Plasma client library can be linked into applications. Clients communicate with the Plasma store via messages serialized using Google Flatbuffers.

Call for contributions

Plasma is a work in progress, and the API is currently unstable. Today Plasma is primarily used in Ray as an in-memory cache for Arrow serialized objects. We are looking for a broader set of use cases to help refine Plasma’s API. In addition, we are looking for contributions in a variety of areas including improving performance and building other language bindings. Please let us know if you are interested in getting involved with the project.

Speeding up PySpark with Apache Arrow

Published 26 Jul 2017
By BryanCutler

Bryan Cutler is a software engineer at IBM’s Spark Technology Center (STC)

Beginning with Apache Spark version 2.3, Apache Arrow will be a supported dependency and begin to offer increased performance with columnar data transfer. If you are a Spark user that prefers to work in Python and Pandas, this is a cause to be excited over! The initial work is limited to collecting a Spark DataFrame with toPandas(), which I will discuss below, however there are many additional improvements that are currently underway.

Optimizing Spark Conversion to Pandas

The previous way of converting a Spark DataFrame to Pandas with DataFrame.toPandas() in PySpark was painfully inefficient. Basically, it worked by first collecting all rows to the Spark driver. Next, each row would get serialized into Python’s pickle format and sent to a Python worker process. This child process unpickles each row into a huge list of tuples. Finally, a Pandas DataFrame is created from the list using pandas.DataFrame.from_records().

This all might seem like standard procedure, but it suffers from two glaring issues: 1) even using cPickle, Python serialization is a slow process, and 2) creating a pandas.DataFrame using from_records must slowly iterate over the list of pure Python data and convert each value to Pandas format. See here for a detailed analysis.

Here is where Arrow really shines to help optimize these steps: 1) Once the data is in Arrow memory format, there is no need to serialize/pickle anymore as Arrow data can be sent directly to the Python process, 2) When the Arrow data is received in Python, then pyarrow can utilize zero-copy methods to create a pandas.DataFrame from entire chunks of data at once instead of processing individual scalar values. Additionally, the conversion to Arrow data can be done on the JVM and pushed back for the Spark executors to perform in parallel, drastically reducing the load on the driver.

As of the merging of SPARK-13534, the use of Arrow when calling toPandas() needs to be enabled by setting the SQLConf “spark.sql.execution.arrow.enabled” to “true”. Let’s look at a simple usage example.

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.3.0-SNAPSHOT

Using Python version 2.7.13 (default, Dec 20 2016 23:09:15)
SparkSession available as 'spark'.

In [1]: from pyspark.sql.functions import rand
   ...: df = spark.range(1 << 22).toDF("id").withColumn("x", rand())
   ...: df.printSchema()
root
 |-- id: long (nullable = false)
 |-- x: double (nullable = false)

In [2]: %time pdf = df.toPandas()
CPU times: user 17.4 s, sys: 792 ms, total: 18.1 s
Wall time: 20.7 s

In [3]: spark.conf.set("spark.sql.execution.arrow.enabled", "true")

In [4]: %time pdf = df.toPandas()
CPU times: user 40 ms, sys: 32 ms, total: 72 ms                                 
Wall time: 737 ms

In [5]: pdf.describe()
                 id             x
count  4.194304e+06  4.194304e+06
mean   2.097152e+06  4.998996e-01
std    1.210791e+06  2.887247e-01
min    0.000000e+00  8.291929e-07
25%    1.048576e+06  2.498116e-01
50%    2.097152e+06  4.999210e-01
75%    3.145727e+06  7.498380e-01
max    4.194303e+06  9.999996e-01

This example was run locally on my laptop using Spark defaults, so the times shown should not be taken as precise. Even so, it is clear there is a huge performance boost: Arrow took something that was excruciatingly slow and sped it up to be barely noticeable.

Notes on Usage

Here are some things to keep in mind before making use of this new feature. At the time of writing, pyarrow is not installed automatically with pyspark and needs to be installed manually; see the installation instructions. It is planned to add pyarrow as a pyspark dependency so that pip install pyspark will also install pyarrow.

Currently, the controlling SQLConf is disabled by default. This can be enabled programmatically as in the example above or by adding the line “spark.sql.execution.arrow.enabled=true” to SPARK_HOME/conf/spark-defaults.conf.

Also, not all Spark data types are currently supported; support is limited to primitive types for now. Expanded type support is in the works and is also expected to be in the Spark 2.3 release.

Future Improvements

As mentioned, this was just a first step in using Arrow to make life easier for Spark Python users. A few exciting initiatives in the works are to allow for vectorized UDF evaluation (SPARK-21190, SPARK-21404), and the ability to apply a function on grouped data using a Pandas DataFrame (SPARK-20396). Just as Arrow helped in converting a Spark DataFrame to Pandas, it can also work in the other direction when creating a Spark DataFrame from an existing Pandas DataFrame (SPARK-20791). Stay tuned for more!


Reaching this first milestone was a group effort from both the Apache Arrow and Spark communities. Thanks to the hard work of Wes McKinney, Li Jin, Holden Karau, Reynold Xin, Wenchen Fan, Shane Knapp and many others that helped push this effort forward.

Apache Arrow 0.5.0 Release

Published 25 Jul 2017
By Wes McKinney (wesm)

The Apache Arrow team is pleased to announce the 0.5.0 release. It includes 130 resolved JIRAs with some new features, expanded integration testing between implementations, and bug fixes. The Arrow memory format remains stable since the 0.3.x and 0.4.x releases.

See the Install Page to learn how to get the libraries for your platform. The complete changelog is also available.

Expanded Integration Testing

In this release, we added compatibility tests for dictionary-encoded data between Java and C++. This enables the distinct values (the dictionary) in a vector to be transmitted as part of an Arrow schema while the record batches contain integers which correspond to the dictionary.

So we might have:

data (string): ['foo', 'bar', 'foo', 'bar']

In dictionary-encoded form, this could be represented as:

indices (int8): [0, 1, 0, 1]
dictionary (string): ['foo', 'bar']
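As a minimal illustration of the encoding itself, here is a plain-Python sketch (not the Arrow implementation):

```python
def dictionary_encode(values):
    """Return (indices, dictionary) such that
    [dictionary[i] for i in indices] == values."""
    dictionary = []
    positions = {}   # value -> its index in the dictionary
    indices = []
    for v in values:
        if v not in positions:
            positions[v] = len(dictionary)
            dictionary.append(v)
        indices.append(positions[v])
    return indices, dictionary

indices, dictionary = dictionary_encode(['foo', 'bar', 'foo', 'bar'])
print(indices)     # [0, 1, 0, 1]
print(dictionary)  # ['foo', 'bar']
```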

In upcoming releases, we plan to complete integration testing for the remaining data types (including some more complicated types like unions and decimals) on the road to a 1.0.0 release in the future.

C++ Activity

We completed a number of significant pieces of work in the C++ part of Apache Arrow.

Using jemalloc as default memory allocator

We decided to use jemalloc as the default memory allocator unless it is explicitly disabled. This memory allocator has significant performance advantages in Arrow workloads over the default malloc implementation. We will publish a blog post going into more detail about this and why you might care.

Sharing more C++ code with Apache Parquet

We imported the compression library interfaces and dictionary encoding algorithms from the Apache Parquet C++ library. The Parquet library now depends on this code in Arrow, and we will be able to use it more easily for data compression in Arrow use cases.

As part of incorporating Parquet’s dictionary encoding utilities, we have developed an arrow::DictionaryBuilder class to enable building dictionary-encoded arrays iteratively. This can help save memory and yield better performance when interacting with databases, Parquet files, or other sources which may have columns having many duplicates.

Support for LZ4 and ZSTD compressors

We added LZ4 and ZSTD compression library support. In ARROW-300 and other planned work, we intend to add some compression features for data sent via RPC.

Python Activity

We fixed many bugs which were affecting Parquet and Feather users and fixed several other rough edges with normal Arrow use. We also added some additional Arrow type conversions: structs, lists embedded in pandas objects, and Arrow time types (which deserialize to the datetime.time type).

In upcoming releases we plan to continue to improve Dask support and performance for distributed processing of Apache Parquet files with pyarrow.

The Road Ahead

We have much work ahead of us to build out Arrow integrations in other data systems to improve their processing performance and interoperability with other systems.

We are discussing the roadmap to a future 1.0.0 release on the developer mailing list. Please join the discussion there.

Connecting Relational Databases to the Apache Arrow World with turbodbc

Published 16 Jun 2017
By Michael König (MathMagique)

Michael König is the lead developer of the turbodbc project

The Apache Arrow project set out to become the universal data layer for column-oriented data processing systems without incurring serialization costs or compromising on performance on a more general level. While relational databases still lag behind in Apache Arrow adoption, the Python database module turbodbc brings Apache Arrow support to these databases using a much older, more specialized data exchange layer: ODBC.

ODBC is a database interface that offers developers the option to transfer data either in row-wise or column-wise fashion. Previous Python ODBC modules typically use the row-wise approach, and often trade repeated database roundtrips for simplified buffer handling. This makes them less suited for data-intensive applications, particularly when interfacing with modern columnar analytical databases.

In contrast, turbodbc was designed to leverage columnar data processing from day one. Naturally, this implies using the columnar portion of the ODBC API. Equally important, however, is to find new ways of providing columnar data to Python users that exceed the capabilities of the row-wise API mandated by Python’s PEP 249. Turbodbc has adopted Apache Arrow for this very task with the recently released version 2.0.0:

>>> from turbodbc import connect
>>> connection = connect(dsn="My columnar database")
>>> cursor = connection.cursor()
>>> cursor.execute("SELECT some_integers, some_strings FROM my_table")
>>> cursor.fetchallarrow()
some_integers: int64
some_strings: string

With this new addition, the data flow for a result set of a typical SELECT query is like this:

  • The database prepares the result set and exposes it to the ODBC driver using either row-wise or column-wise storage.
  • Turbodbc has the ODBC driver write chunks of the result set into columnar buffers.
  • These buffers are exposed to turbodbc’s Apache Arrow frontend. This frontend will create an Arrow table and fill in the buffered values.
  • The previous steps are repeated until the entire result set is retrieved.

Data flow from relational databases to Python with turbodbc and the Apache Arrow frontend

In practice, it is possible to achieve the following ideal situation: A 64-bit integer column is stored as one contiguous block of memory in a columnar database. A huge chunk of 64-bit integers is transferred over the network and the ODBC driver directly writes it to a turbodbc buffer of 64-bit integers. The Arrow frontend accumulates these values by copying the entire 64-bit buffer into a free portion of an Arrow table’s 64-bit integer column.

Moving data from the database to an Arrow table, and thus providing it to the Python user, can be as simple as copying memory blocks around, megabytes equivalent to hundreds of thousands of rows at a time. The absence of serialization and conversion logic renders the process extremely efficient.
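The block-copy idea can be illustrated with Python's standard library array module (a sketch only, not turbodbc code): each incoming buffer of 64-bit integers is appended to the column as one raw-bytes copy, never value by value.

```python
from array import array

# Pretend these are successive 64-bit integer ('q') buffers filled by
# the ODBC driver, one per fetched chunk of the result set.
chunks = [array('q', range(0, 5)), array('q', range(5, 9))]

# Accumulate them into one contiguous column using block copies:
# frombytes() copies the raw chunk buffer in a single memcpy-like step.
column = array('q')
for chunk in chunks:
    column.frombytes(chunk.tobytes())

print(column.tolist())  # [0, 1, 2, 3, 4, 5, 6, 7, 8]
```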

Once the data is stored in an Arrow table, Python users can continue to do some actual work. They can convert it into a Pandas DataFrame for data analysis (using a quick table.to_pandas()), pass it on to other data processing systems such as Apache Spark or Apache Impala (incubating), or store it in the Apache Parquet file format. This way, non-Python systems are efficiently connected with relational databases.

In the future, turbodbc’s Arrow support will be extended to use more sophisticated features such as dictionary-encoded string fields. We also plan to pick smaller than 64-bit data types where possible. Last but not least, Arrow support will be extended to cover the reverse direction of data flow, so that Python users can quickly insert Arrow tables into relational databases.

If you would like to learn more about turbodbc, check out the GitHub project and the project documentation. If you want to learn more about how turbodbc implements the nitty-gritty details, check out parts one and two of the “Making of turbodbc” series at Blue Yonder’s technology blog.

Apache Arrow 0.4.1 Release

Published 14 Jun 2017
By Wes McKinney (wesm)

The Apache Arrow team is pleased to announce the 0.4.1 release of the project. This is a bug fix release that addresses a regression with Decimal types in the Java implementation introduced in 0.4.0 (see ARROW-1091). There were a total of 31 resolved JIRAs.

See the Install Page to learn how to get the libraries for your platform.

Python Wheel Installers for Windows

Max Risuhin contributed fixes to enable binary wheel installers to be generated for Python 3.5 and 3.6. Thus, 0.4.1 is the first Arrow release for which PyArrow, including bundled Apache Parquet support, can be installed with either conda or pip across the 3 major platforms: Linux, macOS, and Windows. Use one of:

pip install pyarrow
conda install pyarrow -c conda-forge

Turbodbc 2.0.0 with Apache Arrow Support

Turbodbc, a fast C++ ODBC interface with Python bindings, released version 2.0.0 including reading SQL result sets as Arrow record batches. The team used the PyArrow C++ API introduced in version 0.4.0 to construct pyarrow.Table objects inside the turbodbc library. Learn more in their documentation and install with one of:

pip install turbodbc
conda install turbodbc -c conda-forge

Apache Arrow 0.4.0 Release

Published 23 May 2017
By Wes McKinney (wesm)

The Apache Arrow team is pleased to announce the 0.4.0 release of the project. While it comes only 17 days after the 0.3.0 release, it includes 77 resolved JIRAs with some important new features and bug fixes.

See the Install Page to learn how to get the libraries for your platform.

Expanded JavaScript Implementation

The TypeScript Arrow implementation has undergone some work since 0.3.0 and can now read a substantial portion of the Arrow streaming binary format. As this implementation develops, we will eventually want to include JS in the integration test suite along with Java and C++ to ensure wire cross-compatibility.

Python Support for Apache Parquet on Windows

With the 1.1.0 C++ release of Apache Parquet, we have enabled the pyarrow.parquet extension on Windows for Python 3.5 and 3.6. This should appear in conda-forge packages and PyPI in the near future. Developers can follow the source build instructions.

Generalizing Arrow Streams

In the 0.2.0 release, we defined the first version of the Arrow streaming binary format for low-cost messaging with columnar data. These streams presume that the message components are written as a continuous byte stream over a socket or file.

We would like to be able to support other transport protocols, like gRPC, for the message components of Arrow streams. To that end, in C++ we defined an abstract stream reader interface, for which the current contiguous streaming format is one implementation:

class RecordBatchReader {
 public:
  virtual std::shared_ptr<Schema> schema() const = 0;
  virtual Status GetNextRecordBatch(std::shared_ptr<RecordBatch>* batch) = 0;
};

It would also be good to define abstract stream reader and writer interfaces in the Java implementation.
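For illustration, the same abstraction might look like this in Python; the class and method names below are invented for the sketch and are not part of the pyarrow API:

```python
from abc import ABC, abstractmethod

class RecordBatchReader(ABC):
    """Abstract source of record batches, independent of transport."""

    @abstractmethod
    def schema(self):
        """Return the schema shared by all batches in the stream."""

    @abstractmethod
    def get_next_record_batch(self):
        """Return the next batch, or None when the stream is exhausted."""

class InMemoryReader(RecordBatchReader):
    """Trivial implementation backed by a list of batches."""

    def __init__(self, schema, batches):
        self._schema = schema
        self._batches = iter(batches)

    def schema(self):
        return self._schema

    def get_next_record_batch(self):
        return next(self._batches, None)

reader = InMemoryReader('id: int64', ['batch0', 'batch1'])
print(reader.get_next_record_batch())  # batch0
```

A gRPC-backed implementation would satisfy the same interface while pulling each batch off the wire on demand.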

In an upcoming blog post, we will explain in more depth how Arrow streams work, but you can learn more about them by reading the IPC specification.

C++ and Cython API for Python Extensions

As other Python libraries with C or C++ extensions use Apache Arrow, they will need to be able to return Python objects wrapping the underlying C++ objects. In this release, we have implemented a prototype C++ API which enables Python wrapper objects to be constructed from C++ extension code:

#include "arrow/python/pyarrow.h"

if (!arrow::py::import_pyarrow()) {
  // Error
}
std::shared_ptr<arrow::RecordBatch> cpp_batch = GetData(...);
PyObject* py_batch = arrow::py::wrap_batch(cpp_batch);

This API is intended to be usable from Cython code as well:

cimport pyarrow

Python Wheel Installers on macOS

With this release, pip install pyarrow works on macOS (OS X) as well as Linux. We are working on providing binary wheel installers for Windows as well.

Apache Arrow 0.3.0 Release

Published 08 May 2017
By Wes McKinney (wesm)

Translations: 日本語

The Apache Arrow team is pleased to announce the 0.3.0 release of the project. It is the product of an intense 10 weeks of development since the 0.2.0 release from this past February. It includes 306 resolved JIRAs from 23 contributors.

While we have added many new features to the different Arrow implementations, one of the major development focuses in 2017 has been hardening the in-memory format, type metadata, and messaging protocol to provide a stable, production-ready foundation for big data applications. We are excited to be collaborating with the Apache Spark and GeoMesa communities on utilizing Arrow for high performance IO and in-memory data processing.

See the Install Page to learn how to get the libraries for your platform.

We will be publishing more information about the Apache Arrow roadmap as we forge ahead with using Arrow to accelerate big data systems.

We are looking for more contributors from within our existing communities and from other communities (such as Go, R, or Julia) to get involved in Arrow development.

File and Streaming Format Hardening

The 0.2.0 release brought with it the first iterations of the random access and streaming Arrow wire formats. See the IPC specification for implementation details and example blog post with some use cases. These provide low-overhead, zero-copy access to Arrow record batch payloads.

In 0.3.0 we have solidified a number of small details with the binary format and improved our integration and unit testing particularly in the Java, C++, and Python libraries. Using the Google Flatbuffers project has helped with adding new features to our metadata without breaking forward compatibility.

We are not yet ready to make a firm commitment to strong forward compatibility (in case we find something needs to change) in the binary format, but we will make efforts between major releases to not make unnecessary breakages. Contributions to the website and component user and API documentation would also be most welcome.

Dictionary Encoding Support

Emilio Lahr-Vivaz from the GeoMesa project contributed Java support for dictionary-encoded Arrow vectors. We followed up with C++ and Python support (and pandas.Categorical integration). We have not yet implemented full integration tests for dictionaries (for sending this data between C++ and Java), but hope to achieve this in the 0.4.0 Arrow release.

This common data representation technique for categorical data allows multiple record batches to share a common “dictionary”, with the values in the batches being represented as integers referencing the dictionary. This data is called “categorical” or “factor” in statistical languages, while in file formats like Apache Parquet it is strictly used for data compression.

Expanded Date, Time, and Fixed Size Types

A notable omission from the 0.2.0 release was complete and integration-tested support for the gamut of date and time types that occur in the wild. These are needed for Apache Parquet and Apache Spark integration.

  • Date: 32-bit (days unit) and 64-bit (milliseconds unit)
  • Time: 64-bit integer with unit (second, millisecond, microsecond, nanosecond)
  • Timestamp: 64-bit integer with unit, with or without timezone
  • Fixed Size Binary: Primitive values occupying certain number of bytes
  • Fixed Size List: List values with constant size (no separate offsets vector)

We have additionally added experimental support for exact decimals in C++ using Boost.Multiprecision, though we have not yet hardened the Decimal memory format between the Java and C++ implementations.

C++ and Python Support on Windows

We have made many general improvements to development and packaging for general C++ and Python development. 0.3.0 is the first release to bring full C++ and Python support for Windows on Visual Studio (MSVC) 2015 and 2017. In addition to adding Appveyor continuous integration for MSVC, we have also written guides for building from source on Windows: C++ and Python.

For the first time, you can install the Arrow Python library on Windows from conda-forge:

conda install pyarrow -c conda-forge

C (GLib) Bindings, with support for Ruby, Lua, and more

Kouhei Sutou is a new Apache Arrow contributor and has contributed GLib C bindings (to the C++ libraries) for Linux. Using a C middleware framework called GObject Introspection, it is possible to use these bindings seamlessly in Ruby, Lua, Go, and other programming languages. We will probably need to publish some follow-up blog posts explaining how these bindings work and how to use them.

Apache Spark Integration for PySpark

We have been collaborating with the Apache Spark community on SPARK-13534 to add support for using Arrow to accelerate DataFrame.toPandas in PySpark. We have observed over 40x speedup from the more efficient data serialization.

Using Arrow in PySpark opens the door to many other performance optimizations, particularly around UDF evaluation (e.g. map and filter operations with Python lambda functions).

New Python Feature: Memory Views, Feather, Apache Parquet support

Arrow’s Python library pyarrow is a Cython binding for the libarrow and libarrow_python C++ libraries, which handle interoperability with NumPy, pandas, and the Python standard library.

At the heart of Arrow’s C++ libraries is the arrow::Buffer object, which is a managed memory view supporting zero-copy reads and slices. Jeff Knupp contributed integration between Arrow buffers and the Python buffer protocol and memoryviews, so now code like this is possible:

In [6]: import pyarrow as pa

In [7]: buf = pa.frombuffer(b'foobarbaz')

In [8]: buf
Out[8]: <pyarrow._io.Buffer at 0x7f6c0a84b538>

In [9]: memoryview(buf)
Out[9]: <memory at 0x7f6c0a8c5e88>

In [10]: buf.to_pybytes()
Out[10]: b'foobarbaz'

We have significantly expanded Apache Parquet support via the C++ Parquet implementation parquet-cpp. This includes support for partitioned datasets on disk or in HDFS. We added initial Arrow-powered Parquet support in the Dask project, and look forward to more collaborations with the Dask developers on distributed processing of pandas data.

With Arrow’s support for pandas maturing, we were able to merge in the Feather format implementation, which is essentially a special case of the Arrow random access format. We’ll be continuing Feather development within the Arrow codebase. For example, Feather can now read and write with Python file objects using Arrow’s Python binding layer.

We also implemented more robust support for pandas-specific data types, like DatetimeTZ and Categorical.

Support for Tensors and beyond in C++ Library

There has been increased interest in using Apache Arrow as a tool for zero-copy shared memory management for machine learning applications. A flagship example is the Ray project from the UC Berkeley RISELab.

Machine learning deals in additional kinds of data structures beyond what the Arrow columnar format supports, like multidimensional arrays aka “tensors”. As such, we implemented the arrow::Tensor C++ type which can utilize the rest of Arrow’s zero-copy shared memory machinery (using arrow::Buffer for managing memory lifetime). In C++ in particular, we will want to provide for additional data structures utilizing common IO and memory management tools.

Start of JavaScript (TypeScript) Implementation

Brian Hulette started developing an Arrow implementation in TypeScript for use in NodeJS and browser-side applications. We are benefitting from Flatbuffers’ first class support for JavaScript.

Improved Website and Developer Documentation

Since 0.2.0 we have implemented a new website stack for publishing documentation and blogs based on Jekyll. Kouhei Sutou developed a Jekyll Jupyter Notebook plugin so that we can use Jupyter to author content for the Arrow website.

On the website, we have now published API documentation for the C, C++, Java, and Python subcomponents. Within these you will find easier-to-follow developer instructions for getting started.


Thanks to all who contributed patches to this release.

$ git shortlog -sn apache-arrow-0.2.0..apache-arrow-0.3.0
    119 Wes McKinney
     55 Kouhei Sutou
     18 Uwe L. Korn
     17 Julien Le Dem
      9 Phillip Cloud
      6 Bryan Cutler
      5 Philipp Moritz
      5 Emilio Lahr-Vivaz
      4 Max Risuhin
      4 Johan Mabille
      4 Jeff Knupp
      3 Steven Phillips
      3 Miki Tebeka
      2 Leif Walsh
      2 Jeff Reback
      2 Brian Hulette
      1 Tsuyoshi Ozawa
      1 rvernica
      1 Nong Li
      1 Julien Lafaye
      1 Itai Incze
      1 Holden Karau
      1 Deepak Majeti