How Drivers and the Driver Manager Work Together
Note
This document focuses on drivers/applications that implement or consume the C API definitions in adbc.h. That includes C/C++, Python, and Ruby; and possibly C#, Go, and Rust (when implementing or consuming drivers via FFI).
When an application calls a function like AdbcStatementExecuteQuery(), how does it “know” what function in which driver to actually call?
This can happen in a few ways. In the simplest case, the application links to a single driver, and directly calls ADBC functions explicitly defined by the driver:
This doesn’t work with multiple drivers, or applications that don’t/can’t link
directly to drivers (think dynamic loading, perhaps in a language like
Python). For this case, ADBC provides a table of function pointers
(AdbcDriver), and a way to request this table from a driver.
Then, the application proceeds in two steps. First, it dynamically loads a
driver and calls an entrypoint function to get the function table:
Then, the application uses the driver by calling the functions in the table:
Dealing with the table, however, is messy. So the overall recommended approach is to use the ADBC driver manager. This is a library that pretends to be a single driver that can be linked to and used “like normal”. Internally, it loads the table of function pointers and tracks which database/connection/statement objects need which “actual” driver, making it easy to dynamically load drivers at runtime and use multiple drivers from the same application:
In More Detail
The adbc.h header ties everything together. It is the abstract API definition, akin to interface/trait/protocol definitions in other languages. C being C, however, all it consists of is a bunch of function prototypes and struct definitions without any implementation.
A driver, at its core, is just a library that implements those function prototypes in adbc.h. Those functions may be implemented in C, or they can be implemented in a different language and exported through language-specific FFI mechanisms. For example, the Go and C# implementations of ADBC can both export drivers to consumers who expect the C API definitions. As long as the definitions in adbc.h are implemented somehow, then the application is generally none the wiser when it comes to what’s actually underneath.
How does an application call these functions, though? Here, there are several options.
Again, the simplest case is as follows: if (1) the application links directly
to the driver, and (2) the driver exposes the ADBC functions under the same
name as in adbc.h, then the application can just #include <arrow-adbc/adbc.h>
and call AdbcStatementExecuteQuery(...) directly. Here, the application and
driver have a relationship no different than any other C library.
Unfortunately, this doesn’t work as well in other scenarios. For example, if an application wishes to use multiple ADBC drivers, this no longer works: both drivers define the same functions (the ones in adbc.h), and when the application links both of them, the linker has no way of telling which driver’s function is meant when the application calls an ADBC function. On top of that, this violates the One Definition Rule.
In this case, the driver can provide driver-specific aliases that applications
can use, say PostgresqlStatementExecuteQuery or FlightSqlStatementExecuteQuery.
Then, the application can link both drivers, ignore the Adbc… functions (and
ignore the technical violation of the One Definition Rule there), and use the
aliases instead.
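To make the alias approach concrete, here is a minimal sketch. The two alias names come from the text above, but the signatures, the ToyStatus type, and the run_on helper are simplified, hypothetical stand-ins and not the declarations in adbc.h. Both "drivers" live in one file here; in practice each would be a separate library.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy status code (assumption): adbc.h defines its own status type and
   much richer function signatures. */
typedef int ToyStatus;
#define TOY_OK 0

/* "Driver A": exposes its implementation under a driver-specific alias. */
ToyStatus PostgresqlStatementExecuteQuery(const char* query,
                                          const char** handled_by) {
  (void)query; /* a real driver would execute the query */
  *handled_by = "postgresql";
  return TOY_OK;
}

/* "Driver B": same operation, different alias. */
ToyStatus FlightSqlStatementExecuteQuery(const char* query,
                                         const char** handled_by) {
  (void)query;
  *handled_by = "flightsql";
  return TOY_OK;
}

/* Application code: the choice of driver is hardcoded into the call site,
   so every new driver means another branch here. */
const char* run_on(const char* backend, const char* query) {
  const char* handled_by = NULL;
  ToyStatus status;
  if (strcmp(backend, "postgresql") == 0) {
    status = PostgresqlStatementExecuteQuery(query, &handled_by);
  } else {
    status = FlightSqlStatementExecuteQuery(query, &handled_by);
  }
  return status == TOY_OK ? handled_by : NULL;
}
```

Note how run_on has to name each driver's function explicitly; nothing here lets the application pick a driver it did not know about at compile time.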
This is rather inconvenient for the application, though. Additionally, this sort of defeats the point of using ADBC, since now the application has a separate API for each driver, even if they’re technically all clones of the same API. And this doesn’t solve the problem for applications that want to load drivers dynamically. For example, a Python script would want to load the driver at runtime. In that case, it would need to know which functions from the driver correspond to which functions in the ADBC API definitions, without having to hardcode this knowledge.
ADBC anticipated this, and defined AdbcDriver. This is just a
table of function pointers with one entry per ADBC function. That way, an
application can dynamically load a driver and call an entrypoint function that
returns this table of function pointers. (It does have to hardcode or guess
the name of the entrypoint; the ADBC spec lists a set of names it can try,
based on the name of the driver library itself.)
Then, it can use the driver by calling functions in that table:
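The following is a minimal, self-contained sketch of that pattern. ToyDriver, ToyStatus, ToyDriverInit, and execute_via_table are hypothetical, simplified stand-ins for AdbcDriver and its entrypoint; the real table in adbc.h has many more entries and different signatures. To keep the example in one file, the entrypoint is called directly, whereas a real application would typically locate it in a dynamically loaded library (for example with dlopen and dlsym on POSIX systems).

```c
#include <assert.h>
#include <stddef.h>

typedef int ToyStatus; /* assumption: simplified status code */
#define TOY_OK 0
#define TOY_ERR 1

/* Simplified stand-in for AdbcDriver: one function pointer per API call. */
struct ToyDriver {
  ToyStatus (*StatementExecuteQuery)(const char* query, long* rows_affected);
};

/* Driver side: the actual implementation. */
static ToyStatus SqliteStatementExecuteQuery(const char* query,
                                             long* rows_affected) {
  (void)query;        /* a real driver would run the query */
  *rows_affected = 1; /* placeholder result */
  return TOY_OK;
}

/* Driver side: the entrypoint that fills in the table. */
ToyStatus ToyDriverInit(struct ToyDriver* driver) {
  if (driver == NULL) return TOY_ERR;
  driver->StatementExecuteQuery = SqliteStatementExecuteQuery;
  return TOY_OK;
}

/* Application side: request the table once, then call through it. */
long execute_via_table(const char* query) {
  struct ToyDriver driver;
  long rows = -1;
  if (ToyDriverInit(&driver) != TOY_OK) return -1;
  if (driver.StatementExecuteQuery(query, &rows) != TOY_OK) return -1;
  return rows;
}
```

The application never refers to the driver's function by name; the only symbol it needs to know is the entrypoint.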
Of course, calling all functions by jumping through a giant table of function pointers is inconvenient. So ADBC provides the “driver manager”, a library that _pretends_ to be a simple driver and implements all the ADBC functions. Internally, it loads drivers dynamically, requests the tables of function pointers, and keeps track of which connections are using which drivers. The application only needs to call the standard ADBC functions, just like in the simplest case we started out with:
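As a rough illustration of the dispatch idea (not the actual driver manager implementation), the sketch below keeps a pointer to the owning driver's table inside each statement object and forwards calls through it. All names prefixed with Toy or Manager are hypothetical, and the drivers are "loaded" by name from functions in the same file rather than from shared libraries.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef int ToyStatus; /* assumption: simplified status code */
#define TOY_OK 0
#define TOY_ERR 1

struct ToyStatement;

/* Simplified function table, standing in for AdbcDriver. */
struct ToyDriver {
  ToyStatus (*StatementExecuteQuery)(struct ToyStatement* stmt,
                                     const char** handled_by);
};

/* The manager records which driver owns each statement. */
struct ToyStatement {
  const struct ToyDriver* driver;
};

/* Two toy drivers, each with an implementation and an entrypoint. */
static ToyStatus SqliteExecute(struct ToyStatement* s, const char** h) {
  (void)s; *h = "sqlite"; return TOY_OK;
}
static ToyStatus PostgresExecute(struct ToyStatement* s, const char** h) {
  (void)s; *h = "postgresql"; return TOY_OK;
}
static ToyStatus SqliteInit(struct ToyDriver* d) {
  d->StatementExecuteQuery = SqliteExecute; return TOY_OK;
}
static ToyStatus PostgresInit(struct ToyDriver* d) {
  d->StatementExecuteQuery = PostgresExecute; return TOY_OK;
}

/* Manager: "load" a driver by name and bind the statement to its table.
   A real manager would load a shared library and look up the entrypoint. */
ToyStatus ManagerStatementNew(const char* name, struct ToyStatement* stmt) {
  static struct ToyDriver sqlite_table, postgres_table;
  if (strcmp(name, "sqlite") == 0) {
    if (SqliteInit(&sqlite_table) != TOY_OK) return TOY_ERR;
    stmt->driver = &sqlite_table;
  } else if (strcmp(name, "postgresql") == 0) {
    if (PostgresInit(&postgres_table) != TOY_OK) return TOY_ERR;
    stmt->driver = &postgres_table;
  } else {
    return TOY_ERR; /* unknown driver */
  }
  return TOY_OK;
}

/* Manager: the single function applications call, whatever the driver. */
ToyStatus ManagerStatementExecuteQuery(struct ToyStatement* stmt,
                                       const char** handled_by) {
  if (stmt == NULL || stmt->driver == NULL) return TOY_ERR;
  return stmt->driver->StatementExecuteQuery(stmt, handled_by);
}

/* Application: same calls for every driver; only the name differs. */
const char* demo_manager(const char* name) {
  struct ToyStatement stmt = {NULL};
  const char* handled_by = NULL;
  if (ManagerStatementNew(name, &stmt) != TOY_OK) return NULL;
  if (ManagerStatementExecuteQuery(&stmt, &handled_by) != TOY_OK) return NULL;
  return handled_by;
}
```

The driver name is ordinary runtime data here, which is what lets one application use several drivers without naming any driver-specific function.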
So to recap, a driver should implement these three things:
1. An implementation of each ADBC function,
2. A thin wrapper around each implementation function that exports the ADBC name for each function, and
3. An entrypoint function that returns an AdbcDriver table, containing the functions from (1).
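The three pieces can be sketched in a single toy driver as follows. The function names AdbcStatementExecuteQuery and SqliteStatementExecuteQuery mirror the ones used in this document, but the one-argument signature, the ToyDriver table, and the ToySqliteDriverInit entrypoint name are simplifying assumptions and do not match adbc.h. Marking the implementation static is just one possible way to keep it out of the dynamic symbol table.

```c
#include <assert.h>
#include <stddef.h>

typedef int ToyStatus; /* assumption: simplified status code */
#define TOY_OK 0

/* Simplified stand-in for AdbcDriver. */
struct ToyDriver {
  ToyStatus (*StatementExecuteQuery)(long* rows_affected);
};

/* (1) The implementation, under a driver-specific name. */
static ToyStatus SqliteStatementExecuteQuery(long* rows_affected) {
  *rows_affected = 1; /* placeholder result */
  return TOY_OK;
}

/* (2) A thin wrapper that exports the generic ADBC name, for applications
   that link this driver directly. */
ToyStatus AdbcStatementExecuteQuery(long* rows_affected) {
  return SqliteStatementExecuteQuery(rows_affected);
}

/* (3) The entrypoint. It fills the table with the functions from (1),
   not the exported wrappers from (2). */
ToyStatus ToySqliteDriverInit(struct ToyDriver* driver) {
  driver->StatementExecuteQuery = SqliteStatementExecuteQuery;
  return TOY_OK;
}
```

Both routes end up in the same implementation: a directly linked application goes through the wrapper, while a caller holding the table reaches the implementation without ever resolving the generic name.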
Then, an application has these choices of ways to use a driver:
1. Link the driver directly and call Adbc… functions (only in the simplest cases) using (2) above,
2. Link the driver directly/dynamically, load the AdbcDriver via (3) above, and call ADBC functions through function pointers (generally not recommended),
3. Link the ADBC driver manager, call Adbc… functions, and let the driver manager deal with (3) above (what most applications will want to do).
In other words, it’s usually easiest to just always use the driver manager. But the magic it pulls isn’t required or all that complex.
Note
You may ask: when we have AdbcDriver, why do we bother
defining both AdbcStatementExecuteQuery and SqliteStatementExecuteQuery
(i.e., why do both (1) and (2) above)? Can’t we just define the Adbc…
version, and put it into the function table when requested?
Here, implementation constraints come in. At runtime, when the
driver looks up the address of (say) AdbcStatementExecuteQuery
to put it into the table, the dynamic linker comes into play to
figure out where this function is. Unfortunately, it will probably
find the driver manager's copy, since the driver manager exports the
same names. This is a problem: the driver manager will then end up in
an infinite loop when it goes to call the “driver’s” version of the
function!
By having a seemingly redundant copy of the function, we can then hide the “real implementation” from the dynamic linker and avoid this behavior.
The driver manager could try to solve this by loading the drivers
with RTLD_DEEPBIND. This, however, is not portable, and causes
problems if we also want to use things like AddressSanitizer during
development. The driver could also build with flags like
-Bsymbolic-functions.