Getting Involved

Right now the primary audience for Apache Arrow is developers of data systems; most people will use Apache Arrow indirectly, through systems that rely on it for internal data handling and for interoperating with other Arrow-enabled systems.

Even if you do not plan to contribute to Apache Arrow itself or Arrow integrations in other projects, we’d be happy to have you involved:

PyArrow Architecture

PyArrow is, for the most part, a wrapper around the functionality that the Arrow C++ implementation provides. The library takes what is available in C++ and exposes it through a user experience that is more Pythonic and less complex to use. In some cases what exists in C++ maps directly to Python, but in many cases the C++ classes and methods serve as foundations on which easier-to-use entities are built.
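As a small illustration of the "more Pythonic" layer, the sketch below builds an Arrow array from a plain Python list and lets PyArrow infer the type, where Arrow C++ code would typically go through an explicit, typed builder. The helper name `build_array` is hypothetical and only used for this example; the import guard is there so the snippet does nothing if PyArrow is not installed.

```python
import importlib.util


def build_array(values):
    """Build an Arrow array from a Python list, letting PyArrow infer the type.

    Hypothetical helper for illustration; it only calls pyarrow.array().
    """
    import pyarrow as pa

    return pa.array(values)


if importlib.util.find_spec("pyarrow") is not None:
    arr = build_array([1, 2, None])
    # The Python None becomes an Arrow null; the integer type is inferred.
    print(arr.type, arr.null_count)
```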

Four layers of PyArrow architecture: .py, .pyx, .pxd and low level C++ code.
  • The *.py files in the pyarrow package are usually where the entities exposed to the user are declared. In some cases, those files might directly import the entities from inner implementation if they want to expose it as is without modification.

  • The lib.pyx file is where the majority of the core C++ libarrow capabilities are exposed to Python. Most of the implementation of this module relies on included *.pxi files, where the specific pieces are built. Although it is exposed to Python as pyarrow.lib, its content should be considered internal. The public classes are then exposed directly in other modules (such as pyarrow itself) by importing them from pyarrow.lib.

  • The _*.pyx files are where the glue code is usually created: they put the C++ capabilities together and turn them into Python classes and methods. They can be considered the internal implementation of the capabilities exposed by the *.py files.

  • The includes/*.pxd files are where the raw C++ library APIs are declared for use in Cython. The C++ classes and methods are declared here as they are, so that the other .pyx files can use them to implement Python classes, functions and helpers.

  • Apart from the Arrow C++ library, whose declarations are covered in the previous item, PyArrow also builds on PyArrow C++: dedicated code that lives in the python/pyarrow/src/arrow/python directory and provides the low-level support for capabilities such as converting to and from NumPy or pandas, as well as the classes that allow Python objects and callbacks to be used in C++.
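The re-export relationship between pyarrow and pyarrow.lib described in the list above can be observed from Python. The following is a minimal sketch, not part of the PyArrow codebase: `is_reexport` is a hypothetical helper, and the three class names are just examples of public classes assumed to be defined in pyarrow.lib. The import guard keeps the snippet inert when PyArrow is not installed.

```python
import importlib.util


def is_reexport(name):
    """Return True if pyarrow.<name> is the same object as pyarrow.lib.<name>.

    Hypothetical helper for illustration only.
    """
    import pyarrow as pa

    return getattr(pa, name) is getattr(pa.lib, name)


if importlib.util.find_spec("pyarrow") is not None:
    import pyarrow as pa

    for name in ("Array", "Table", "Schema"):
        # __module__ shows where the class is actually defined.
        print(name, is_reexport(name), getattr(pa, name).__module__)
```

Even though the classes are reachable through pyarrow.lib, user code should import them from pyarrow, since the content of pyarrow.lib is considered internal.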