Expanding Arrow's Reach with a JDBC Driver for Arrow Flight SQL
Published 01 Nov 2022
By The Apache Arrow PMC (pmc)
We’re excited to announce that as of version 10.0.0, the Arrow project now includes a JDBC driver implementation based on Arrow Flight SQL. This is courtesy of a software grant from Dremio, a data lakehouse platform. Contributors from Dremio developed and open-sourced this driver implementation, in addition to designing and contributing Flight SQL itself.
Flight SQL is a protocol for client-server database interactions. It defines how a client should talk to a server and execute queries, fetch result sets, and so on. Note that despite the name, Flight SQL is not a SQL dialect, or even specific to SQL itself. Underneath, it builds on Arrow Flight RPC, a framework for efficient transfer of Arrow data across the network. While Flight RPC is flexible and can be used in any type of application, from the beginning, it was designed with an eye towards the kinds of use cases that Flight SQL supports.
With this new JDBC driver, applications can talk to any database server implementing the Flight SQL protocol using familiar JDBC APIs. Underneath, the driver sends queries to the server via Flight SQL and adapts the Arrow result set to the JDBC interface, so that the database can support JDBC users without implementing additional APIs or its own JDBC driver.
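As a rough sketch of what this looks like from the application side, the following uses only the standard `java.sql` API. Several details are assumptions rather than something stated in this post: the `jdbc:arrow-flight-sql://` URL scheme and the `useEncryption` parameter (check the documentation for the driver version you use, since the scheme may differ between releases), the host, port, and credentials, and the class and helper names, which are purely illustrative. It also assumes the driver JAR is on the classpath and a Flight SQL server is listening at the given address.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; the class and method names are hypothetical.
final class FlightSqlJdbcSketch {

    // Builds a connection URL. The scheme and the useEncryption parameter
    // are assumptions; verify them against the driver documentation.
    static String buildUrl(String host, int port, boolean useEncryption) {
        return "jdbc:arrow-flight-sql://" + host + ":" + port
                + "/?useEncryption=" + useEncryption;
    }

    // Runs a query through plain JDBC and returns the first column as strings.
    // Nothing here is specific to Flight SQL: the driver hides the protocol.
    static List<String> firstColumn(Connection conn, String sql) throws SQLException {
        List<String> values = new ArrayList<>();
        try (Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            while (rs.next()) {
                values.add(rs.getString(1)); // JDBC columns are 1-based
            }
        }
        return values;
    }

    public static void main(String[] args) throws SQLException {
        // Placeholder host, port, and credentials for a local test server.
        String url = buildUrl("localhost", 32010, false);
        try (Connection conn = DriverManager.getConnection(url, "user", "password")) {
            for (String value : firstColumn(conn, "SELECT 1")) {
                System.out.println(value);
            }
        }
    }
}
```

Because the query code depends only on `java.sql`, the same `firstColumn` helper would work unchanged against any other JDBC driver; only the connection URL is specific to Flight SQL.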
Why use JDBC with Flight SQL?
JDBC offers a row-oriented API, which is the opposite of Arrow’s columnar structure. However, it is a popular and time-tested choice for many applications. For example, many business intelligence (BI) tools take advantage of JDBC to interoperate generically with multiple databases. An Arrow-native database may still wish to be accessible to all of this existing software, without having to implement multiple client drivers itself. Additionally, columnar data transfer alone can provide a significant speedup for analytical use cases, even when the client ultimately reads the results row by row.
This JDBC driver implementation demonstrates the generality of Arrow and Flight SQL, and increases the reach of Arrow-based applications. An ODBC driver implementation based on Flight SQL is also available courtesy of Dremio, though it is not yet part of the Arrow project due to dependency licensing issues.
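To make the row-versus-column contrast concrete, here is a minimal, self-contained sketch of a row-style cursor layered over data stored column by column. It is not the driver's actual implementation and does not use Arrow's vector classes; plain Java arrays stand in for columns, and the class name is hypothetical. The `next()` and 1-based `getObject()` methods only imitate the shape of `java.sql.ResultSet`.

```java
// Illustrative sketch only: a row cursor over columnar storage.
final class ColumnarRowCursorSketch {
    private final Object[][] columns; // columns[c][r]: one array per column
    private final int rowCount;
    private int row = -1;             // positioned before the first row

    ColumnarRowCursorSketch(Object[][] columns) {
        this.rowCount = columns.length == 0 ? 0 : columns[0].length;
        for (Object[] column : columns) {
            if (column.length != rowCount) {
                throw new IllegalArgumentException("columns must have equal length");
            }
        }
        this.columns = columns;
    }

    // Advances to the next row; returns false once the rows are exhausted.
    boolean next() {
        if (row < rowCount) {
            row++;
        }
        return row < rowCount;
    }

    // Reads one cell of the current row; columnIndex is 1-based, as in JDBC.
    Object getObject(int columnIndex) {
        if (row < 0 || row >= rowCount) {
            throw new IllegalStateException("cursor is not positioned on a row");
        }
        return columns[columnIndex - 1][row];
    }

    public static void main(String[] args) {
        Object[][] data = {
            {1, 2, 3},        // column 1: ids
            {"a", "b", "c"},  // column 2: labels
        };
        ColumnarRowCursorSketch cursor = new ColumnarRowCursorSketch(data);
        while (cursor.next()) {
            System.out.println(cursor.getObject(1) + " " + cursor.getObject(2));
        }
    }
}
```

The data itself never has to be rearranged into rows; the cursor just tracks a row index and reads the matching slot from each column on demand.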
Now, a database can support the vast body of existing code that uses JDBC or ODBC, as well as Arrow-native applications, just by implementing a single wire protocol: Flight SQL. Some projects instead reimplement an existing wire protocol, such as that of Postgres, to benefit from its existing drivers. But for Arrow-native databases, this gives up the benefits of columnar data. Flight SQL, on the other hand, is:
- Columnar and Arrow-native, using Arrow for result sets to avoid unnecessary data copies and transformations;
- Designed for implementation by multiple databases, with high-level C++ and Java libraries and a Protobuf protocol definition; and
- Usable both through APIs like JDBC and ODBC, thanks to this software grant, and directly (or via ADBC) for applications that want columnar data.
Getting Involved
The JDBC driver was merged for the Arrow 10.0.0 release, and the source code can be found in the Arrow repository. Official builds of the driver are available on Maven Central. Dremio is already making use of the driver, and we’re looking forward to seeing what else gets built on top. Of course, there are still improvements to be made. If you’re interested in contributing, or have feedback or questions, please reach out on the mailing list or GitHub.
To learn more about when to use the Flight SQL JDBC driver vs the Flight SQL native client libraries, see this section of Dremio’s presentation, “Apache Arrow Flight SQL: a universal standard for high performance data transfers from databases” (starting at 22:23). For more about how Dremio uses Apache Arrow, see their guide.