Project News and Blog


Connecting Relational Databases to the Apache Arrow World with turbodbc

Published 16 Jun 2017
By Michael König (MathMagique)

Michael König is the lead developer of the turbodbc project.

The Apache Arrow project set out to become the universal data layer for column-oriented data processing systems without incurring serialization costs or compromising on overall performance. While relational databases still lag behind in Apache Arrow adoption, the Python database module turbodbc brings Apache Arrow support to these databases using a much older, more specialized data exchange layer: ODBC.

ODBC is a database interface that offers developers the option to transfer data either in row-wise or column-wise fashion. Previous Python ODBC modules typically use the row-wise approach, and often accept repeated database roundtrips in exchange for simplified buffer handling. This makes them less suited for data-intensive applications, particularly when interfacing with modern columnar analytical databases.

In contrast, turbodbc was designed to leverage columnar data processing from day one. Naturally, this implies using the columnar portion of the ODBC API. Equally important, however, is to find new ways of providing columnar data to Python users that exceed the capabilities of the row-wise API mandated by Python’s PEP 249. Turbodbc has adopted Apache Arrow for this very task with the recently released version 2.0.0:

>>> from turbodbc import connect
>>> connection = connect(dsn="My columnar database")
>>> cursor = connection.cursor()
>>> cursor.execute("SELECT some_integers, some_strings FROM my_table")
>>> cursor.fetchallarrow()
pyarrow.Table
some_integers: int64
some_strings: string

With this new addition, the data flow for a result set of a typical SELECT query is like this:

(Figure: Data flow from relational databases to Python with turbodbc and the Apache Arrow frontend)

In practice, it is possible to achieve the following ideal situation: A 64-bit integer column is stored as one contiguous block of memory in a columnar database. A huge chunk of 64-bit integers is transferred over the network and the ODBC driver directly writes it to a turbodbc buffer of 64-bit integers. The Arrow frontend accumulates these values by copying the entire 64-bit buffer into a free portion of an Arrow table’s 64-bit integer column.

Moving data from the database to an Arrow table and, thus, providing it to the Python user can be as simple as copying memory blocks around, megabytes equivalent to hundreds of thousands of rows at a time. The absence of serialization and conversion logic renders the process extremely efficient.

Once the data is stored in an Arrow table, Python users can continue to do some actual work. They can convert it into a Pandas DataFrame for data analysis (using a quick table.to_pandas()), pass it on to other data processing systems such as Apache Spark or Apache Impala (incubating), or store it in the Apache Parquet file format. This way, non-Python systems are efficiently connected with relational databases.
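For example, a minimal sketch of that handoff, assuming table is the pyarrow.Table returned by cursor.fetchallarrow() above (the output file name is illustrative):

import pyarrow.parquet as pq

# `table` is assumed to be the pyarrow.Table fetched via cursor.fetchallarrow()
df = table.to_pandas()                     # hand the result set to pandas for analysis
pq.write_table(table, "my_table.parquet")  # or persist it in the Apache Parquet format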

In the future, turbodbc’s Arrow support will be extended to use more sophisticated features such as dictionary-encoded string fields. We also plan to pick smaller than 64-bit data types where possible. Last but not least, Arrow support will be extended to cover the reverse direction of data flow, so that Python users can quickly insert Arrow tables into relational databases.

If you would like to learn more about turbodbc, check out the GitHub project and the project documentation. If you want to learn more about how turbodbc implements the nitty-gritty details, check out parts one and two of the “Making of turbodbc” series at Blue Yonder’s technology blog.

Apache Arrow 0.4.1 Release

Published 14 Jun 2017
By Wes McKinney (wesm)

The Apache Arrow team is pleased to announce the 0.4.1 release of the project. This is a bug fix release that addresses a regression with Decimal types in the Java implementation introduced in 0.4.0 (see ARROW-1091). There were a total of 31 resolved JIRAs.

See the Install Page to learn how to get the libraries for your platform.

Python Wheel Installers for Windows

Max Risuhin contributed fixes to enable binary wheel installers to be generated for Python 3.5 and 3.6. Thus, 0.4.1 is the first Arrow release for which PyArrow, including bundled Apache Parquet support, can be installed with either conda or pip across the three major platforms: Linux, macOS, and Windows. Use one of:

pip install pyarrow
conda install pyarrow -c conda-forge

Turbodbc 2.0.0 with Apache Arrow Support

Turbodbc, a fast C++ ODBC interface with Python bindings, released version 2.0.0, which adds support for reading SQL result sets as Arrow record batches. The team used the PyArrow C++ API introduced in version 0.4.0 to construct pyarrow.Table objects inside the turbodbc library. Learn more in their documentation and install with one of:

pip install turbodbc
conda install turbodbc -c conda-forge

Apache Arrow 0.4.0 Release

Published 23 May 2017
By Wes McKinney (wesm)

The Apache Arrow team is pleased to announce the 0.4.0 release of the project. While only 17 days have passed since the 0.3.0 release, it includes 77 resolved JIRAs with some important new features and bug fixes.

See the Install Page to learn how to get the libraries for your platform.

Expanded JavaScript Implementation

The TypeScript Arrow implementation has undergone some work since 0.3.0 and can now read a substantial portion of the Arrow streaming binary format. As this implementation develops, we will eventually want to include JS in the integration test suite along with Java and C++ to ensure wire cross-compatibility.

Python Support for Apache Parquet on Windows

With the 1.1.0 C++ release of Apache Parquet, we have enabled the pyarrow.parquet extension on Windows for Python 3.5 and 3.6. This should appear in conda-forge packages and PyPI in the near future. Developers can follow the source build instructions.

Generalizing Arrow Streams

In the 0.2.0 release, we defined the first version of the Arrow streaming binary format for low-cost messaging with columnar data. These streams presume that the message components are written as a continuous byte stream over a socket or file.

We would like to be able to support other transport protocols, like gRPC, for the message components of Arrow streams. To that end, in C++ we defined an abstract stream reader interface, for which the current contiguous streaming format is one implementation:

class RecordBatchReader {
 public:
  // The schema shared by all record batches in the stream
  virtual std::shared_ptr<Schema> schema() const = 0;

  // Read the next record batch from the stream
  virtual Status GetNextRecordBatch(std::shared_ptr<RecordBatch>* batch) = 0;
};

It would also be good to define abstract stream reader and writer interfaces in the Java implementation.

In an upcoming blog post, we will explain in more depth how Arrow streams work, but you can learn more about them by reading the IPC specification.
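For a quick taste from Python, the following sketch round-trips a record batch through the streaming format; note that it uses pyarrow IPC module names from much more recent releases than 0.4.0, so the exact functions shown are an assumption about the installed version:

import pyarrow as pa

# Build a small record batch to send through the streaming format
batch = pa.RecordBatch.from_pydict({"ints": [1, 2, 3], "strs": ["a", "b", "c"]})

# Write a stream (a schema message followed by record batch messages) into a buffer
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, batch.schema) as writer:
    writer.write_batch(batch)

# Read the stream back and iterate over the record batches it contains
reader = pa.ipc.open_stream(sink.getvalue())
for received in reader:
    print(received.num_rows, received.schema)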

C++ and Cython API for Python Extensions

As other Python libraries with C or C++ extensions use Apache Arrow, they will need to be able to return Python objects wrapping the underlying C++ objects. In this release, we have implemented a prototype C++ API which enables Python wrapper objects to be constructed from C++ extension code:

#include "arrow/python/pyarrow.h"

if (arrow::py::import_pyarrow() != 0) {
  // Error: pyarrow could not be imported
}

std::shared_ptr<arrow::RecordBatch> cpp_batch = GetData(...);
PyObject* py_batch = arrow::py::wrap_batch(cpp_batch);

This API is intended to be usable from Cython code as well:

cimport pyarrow
pyarrow.import_pyarrow()

Python Wheel Installers on macOS

With this release, pip install pyarrow works on macOS (OS X) as well as Linux. We are working on providing binary wheel installers for Windows as well.

Apache Arrow 0.3.0 Release

Published 08 May 2017
By Wes McKinney (wesm)

Translations: 日本語

The Apache Arrow team is pleased to announce the 0.3.0 release of the project. It is the product of an intense 10 weeks of development since the 0.2.0 release from this past February. It includes 306 resolved JIRAs from 23 contributors.

While we have added many new features to the different Arrow implementations, one of the major development focuses in 2017 has been hardening the in-memory format, type metadata, and messaging protocol to provide a stable, production-ready foundation for big data applications. We are excited to be collaborating with the Apache Spark and GeoMesa communities on utilizing Arrow for high performance IO and in-memory data processing.

See the Install Page to learn how to get the libraries for your platform.

We will be publishing more information about the Apache Arrow roadmap as we forge ahead with using Arrow to accelerate big data systems.

We are looking for more contributors from within our existing communities and from other communities (such as Go, R, or Julia) to get involved in Arrow development.

File and Streaming Format Hardening

The 0.2.0 release brought with it the first iterations of the random access and streaming Arrow wire formats. See the IPC specification for implementation details and the example blog post for some use cases. These formats provide low-overhead, zero-copy access to Arrow record batch payloads.

In 0.3.0 we have solidified a number of small details of the binary format and improved our integration and unit testing, particularly in the Java, C++, and Python libraries. Using the Google Flatbuffers project has helped with adding new features to our metadata without breaking forward compatibility.

We are not yet ready to make a firm commitment to strong forward compatibility in the binary format (in case we find something needs to change), but we will make efforts between major releases to avoid unnecessary breakages.

Dictionary Encoding Support

Emilio Lahr-Vivaz from the GeoMesa project contributed Java support for dictionary-encoded Arrow vectors. We followed up with C++ and Python support (and pandas.Categorical integration). We have not yet implemented full integration tests for dictionaries (for sending this data between C++ and Java), but hope to achieve this in the 0.4.0 Arrow release.

This common data representation technique for categorical data allows multiple record batches to share a common “dictionary”, with the values in the batches represented as integers referencing the dictionary. Such data is called “categorical” or “factor” in statistical languages, while in file formats like Apache Parquet dictionary encoding is used strictly for data compression.
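In pyarrow, the same idea looks roughly like the following sketch (the Python-level API shown follows later pyarrow releases, so treat the exact calls as illustrative):

import pyarrow as pa

# The shared dictionary of distinct values ...
dictionary = pa.array(["apple", "banana", "cherry"])

# ... and small integer indices referencing positions within it
indices = pa.array([0, 1, 0, 2, 1], type=pa.int8())

# A dictionary-encoded array stores only the indices plus a reference to the dictionary
dict_array = pa.DictionaryArray.from_arrays(indices, dictionary)
print(dict_array.indices)
print(dict_array.dictionary)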

Expanded Date, Time, and Fixed Size Types

A notable omission from the 0.2.0 release was complete and integration-tested support for the gamut of date and time types that occur in the wild. These are needed for Apache Parquet and Apache Spark integration.

We have additionally added experimental support for exact decimals in C++ using Boost.Multiprecision, though we have not yet hardened the Decimal memory format between the Java and C++ implementations.

C++ and Python Support on Windows

We have made many improvements to general C++ and Python development and packaging. 0.3.0 is the first release to bring full C++ and Python support for Windows on Visual Studio (MSVC) 2015 and 2017. In addition to adding Appveyor continuous integration for MSVC, we have also written guides for building from source on Windows: C++ and Python.

For the first time, you can install the Arrow Python library on Windows from conda-forge:

conda install pyarrow -c conda-forge

C (GLib) Bindings, with support for Ruby, Lua, and more

Kouhei Sutou is a new Apache Arrow contributor and has contributed GLib C bindings (to the C++ libraries) for Linux. Using a C middleware framework called GObject Introspection, it is possible to use these bindings seamlessly in Ruby, Lua, Go, and other programming languages. We will probably need to publish some follow-up blog posts explaining how these bindings work and how to use them.

Apache Spark Integration for PySpark

We have been collaborating with the Apache Spark community on SPARK-13534 to add support for using Arrow to accelerate DataFrame.toPandas in PySpark. We have observed over 40x speedup from the more efficient data serialization.

Using Arrow in PySpark opens the door to many other performance optimizations, particularly around UDF evaluation (e.g. map and filter operations with Python lambda functions).

New Python Features: Memory Views, Feather, Apache Parquet support

Arrow’s Python library pyarrow is a Cython binding for the libarrow and libarrow_python C++ libraries, which handle interoperability with NumPy, pandas, and the Python standard library.

At the heart of Arrow’s C++ libraries is the arrow::Buffer object, which is a managed memory view supporting zero-copy reads and slices. Jeff Knupp contributed integration between Arrow buffers and the Python buffer protocol and memoryviews, so now code like this is possible:

In [6]: import pyarrow as pa

In [7]: buf = pa.frombuffer(b'foobarbaz')

In [8]: buf
Out[8]: <pyarrow._io.Buffer at 0x7f6c0a84b538>

In [9]: memoryview(buf)
Out[9]: <memory at 0x7f6c0a8c5e88>

In [10]: buf.to_pybytes()
Out[10]: b'foobarbaz'
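
Because the buffer implements the buffer protocol, its memory can also be handed to NumPy without a copy. A small sketch continuing the session above (the In/Out numbering is illustrative):

In [11]: import numpy as np

In [12]: np.frombuffer(memoryview(buf), dtype=np.uint8)
Out[12]: array([102, 111, 111,  98,  97, 114,  98,  97, 122], dtype=uint8)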

We have significantly expanded Apache Parquet support via the C++ Parquet implementation parquet-cpp. This includes support for partitioned datasets on disk or in HDFS. We added initial Arrow-powered Parquet support in the Dask project, and look forward to more collaborations with the Dask developers on distributed processing of pandas data.
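To illustrate the dataset support from Python, reading a directory of Parquet files back as a single Arrow table might look like this sketch (the path is hypothetical and the API names follow later pyarrow releases):

import pyarrow.parquet as pq

# Read a (possibly partitioned) directory of Parquet files into one Arrow table
dataset = pq.ParquetDataset("/path/to/partitioned_dataset")
table = dataset.read()
print(table.num_rows)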

With Arrow’s support for pandas maturing, we were able to merge in the Feather format implementation, which is essentially a special case of the Arrow random access format. We’ll be continuing Feather development within the Arrow codebase. For example, Feather can now read and write with Python file objects using Arrow’s Python binding layer.
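A minimal Feather round trip through pyarrow might look like the following sketch (the pyarrow.feather module path and the file name are assumptions about the installed version):

import pandas as pd
import pyarrow.feather as feather

df = pd.DataFrame({"ints": [1, 2, 3], "strs": ["a", "b", "c"]})

# Write the DataFrame to a Feather file and read it back as a DataFrame
feather.write_feather(df, "example.feather")
print(feather.read_feather("example.feather"))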

We also implemented more robust support for pandas-specific data types, like DatetimeTZ and Categorical.

Support for Tensors and beyond in C++ Library

There has been increased interest in using Apache Arrow as a tool for zero-copy shared memory management for machine learning applications. A flagship example is the Ray project from the UC Berkeley RISELab.

Machine learning deals with additional kinds of data structures beyond what the Arrow columnar format supports, such as multidimensional arrays, also known as “tensors”. As such, we implemented the arrow::Tensor C++ type, which can utilize the rest of Arrow’s zero-copy shared memory machinery (using arrow::Buffer for managing memory lifetime). In C++ in particular, we will want to provide for additional data structures utilizing common IO and memory management tools.
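For Python users, this type is also surfaced as pyarrow.Tensor in later pyarrow releases; a minimal sketch, assuming such a release is installed:

import numpy as np
import pyarrow as pa

# Wrap a NumPy ndarray as an Arrow Tensor without copying its memory
ndarray = np.arange(12, dtype=np.int64).reshape(3, 4)
tensor = pa.Tensor.from_numpy(ndarray)
print(tensor.shape, tensor.type)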

Start of JavaScript (TypeScript) Implementation

Brian Hulette started developing an Arrow implementation in TypeScript for use in NodeJS and browser-side applications. We are benefiting from Flatbuffers’ first-class support for JavaScript.

Improved Website and Developer Documentation

Since 0.2.0 we have implemented a new website stack for publishing documentation and blogs based on Jekyll. Kouhei Sutou developed a Jekyll Jupyter Notebook plugin so that we can use Jupyter to author content for the Arrow website.

On the website, we have now published API documentation for the C, C++, Java, and Python subcomponents. Within these you will find easier-to-follow developer instructions for getting started. Contributions to the website and to the component user and API documentation would also be most welcome.

Contributors

Thanks to all who contributed patches to this release.

$ git shortlog -sn apache-arrow-0.2.0..apache-arrow-0.3.0
    119 Wes McKinney
     55 Kouhei Sutou
     18 Uwe L. Korn
     17 Julien Le Dem
      9 Phillip Cloud
      6 Bryan Cutler
      5 Philipp Moritz
      5 Emilio Lahr-Vivaz
      4 Max Risuhin
      4 Johan Mabille
      4 Jeff Knupp
      3 Steven Phillips
      3 Miki Tebeka
      2 Leif Walsh
      2 Jeff Reback
      2 Brian Hulette
      1 Tsuyoshi Ozawa
      1 rvernica
      1 Nong Li
      1 Julien Lafaye
      1 Itai Incze
      1 Holden Karau
      1 Deepak Majeti