Memory and IO Interfaces

This section will introduce you to the major concepts in PyArrow’s memory management and IO systems:

  • Buffers
  • File-like and stream-like objects
  • Memory pools


The Buffer object wraps the C++ arrow::Buffer type, the primary tool for memory management in Apache Arrow. It permits higher-level array classes to safely interact with memory which they may or may not own. arrow::Buffer can be zero-copy sliced, permitting Buffers to cheaply reference other Buffers while preserving memory lifetime and clean parent-child relationships.

There are many implementations of arrow::Buffer, but they all provide a standard interface: a data pointer and length. This is similar to Python’s built-in buffer protocol and memoryview objects.

A Buffer can be created from any Python object which implements the buffer protocol. Let’s consider a bytes object:

In [1]: import pyarrow as pa

In [2]: data = b'abcdefghijklmnopqrstuvwxyz'

In [3]: buf = pa.frombuffer(data)

In [4]: buf
Out[4]: <pyarrow.lib.Buffer at 0x2b9642a1ea40>

In [5]: buf.size
Out[5]: 26

Creating a Buffer in this way does not allocate any memory; it is a zero-copy view on the memory exported from the data bytes object.

The Buffer’s to_pybytes method can convert to a Python byte string:

In [6]: buf.to_pybytes()
Out[6]: b'abcdefghijklmnopqrstuvwxyz'

Buffers can be used in circumstances where a Python buffer or memoryview is required, and such conversions are also zero-copy:

In [7]: memoryview(buf)
Out[7]: <memory at 0x2b96417c6dc8>
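This zero-copy behavior mirrors Python's own buffer protocol. As a stdlib-only illustration, a memoryview slice references the same memory as its source rather than copying it, just as a sliced Buffer references its parent:

```python
# Stdlib illustration of zero-copy views: a memoryview slice shares
# memory with its source; no bytes are copied when slicing.
data = bytearray(b'abcdefghijklmnopqrstuvwxyz')
view = memoryview(data)
head = view[:4]           # zero-copy slice of the first four bytes
data[0:4] = b'ABCD'       # mutate the underlying storage
print(bytes(head))        # prints b'ABCD' -- the slice sees the change
```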

Native Files

The Arrow C++ libraries have several abstract interfaces for different kinds of IO objects:

  • Read-only streams
  • Read-only files supporting random access
  • Write-only streams
  • Write-only files supporting random access
  • File supporting reads, writes, and random access

In the interest of making these objects behave more like Python’s built-in file objects, we have defined a NativeFile base class which mimics Python files and can be used in functions where a Python file (such as file or BytesIO) is expected.

NativeFile has some important features which make it preferable to using Python files with PyArrow where possible:

  • Other Arrow classes can access the internal C++ IO objects natively, and do not need to acquire the Python GIL
  • Native C++ IO may be able to do zero-copy IO, such as with memory maps

There are several kinds of NativeFile options available:

  • OSFile, a native file that uses your operating system’s file descriptors
  • MemoryMappedFile, for reading (zero-copy) and writing with memory maps
  • BufferReader, for reading Buffer objects as a file
  • BufferOutputStream, for writing data in-memory, producing a Buffer at the end
  • HdfsFile, for reading and writing data to the Hadoop Filesystem
  • PythonFile, for interfacing with Python file objects in C++

We will discuss these in the following sections after explaining memory pools.

Memory Pools

All memory allocations and deallocations (like malloc and free in C) are tracked in an instance of arrow::MemoryPool. This means that we can precisely track the amount of memory that has been allocated:

In [8]: pa.total_allocated_bytes()
Out[8]: 13376

PyArrow uses a default built-in memory pool, but in the future there may be additional memory pools (and subpools) to choose from. Let’s consider a BufferOutputStream, which is like a BytesIO:

In [9]: stream = pa.BufferOutputStream()

In [10]: stream.write(b'foo')

In [11]: pa.total_allocated_bytes()
Out[11]: 13632

In [12]: for i in range(1024): stream.write(b'foo')

In [13]: pa.total_allocated_bytes()
Out[13]: 17472

The default allocator requests memory in a minimum increment of 64 bytes. If the stream is garbage-collected, all of the memory is freed:

In [14]: stream = None

In [15]: pa.total_allocated_bytes()
Out[15]: 13376
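The 64-byte rounding rule can be made concrete. This is an illustrative helper, not Arrow's actual allocator code:

```python
def padded_size(nbytes, increment=64):
    # Round a request up to the next multiple of the allocation increment,
    # mimicking an allocator with a minimum 64-byte granularity.
    return -(-nbytes // increment) * increment   # ceiling division

print(padded_size(3))     # 64
print(padded_size(64))    # 64
print(padded_size(65))    # 128
```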

Classes and functions that may allocate memory will often have an option to pass in a custom memory pool:

In [16]: my_pool = pa.jemalloc_memory_pool()

In [17]: my_pool.bytes_allocated()
Out[17]: 0

In [18]: stream = pa.BufferOutputStream(my_pool)

In [19]: stream.write(b'foo')

In [20]: my_pool.bytes_allocated()
Out[20]: 256

Note that pa.jemalloc_memory_pool is only available if your Arrow build was compiled with jemalloc support; otherwise calling it raises an AttributeError.

On-Disk and Memory Mapped Files

PyArrow includes two ways to interact with data on disk: standard operating system-level file APIs, and memory-mapped files. In regular Python we can write:

In [22]: with open('example.dat', 'wb') as f:
   ....:     f.write(b'some example data')

Using pyarrow’s OSFile class, you can write:

In [23]: with pa.OSFile('example2.dat', 'wb') as f:
   ....:     f.write(b'some example data')

For reading files, you can use OSFile or MemoryMappedFile. The difference is that OSFile allocates new memory on each read, like Python file objects, while reads from a memory map construct a buffer referencing the mapped memory without any memory allocation or copying:

In [24]: file_obj = pa.OSFile('example.dat')

In [25]: mmap = pa.memory_map('example.dat')

In [26]: file_obj.read(4)
Out[26]: b'some'

In [27]: mmap.read(4)
Out[27]: b'some'

The read method implements the standard Python file read API. To read into Arrow Buffer objects, use read_buffer:

In [28]: mmap.seek(0)

In [29]: buf = mmap.read_buffer(4)

In [30]: print(buf)
<pyarrow.lib.Buffer object at 0x2b96432dcab0>

In [31]: buf.to_pybytes()
Out[31]: b'some'
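The same no-copy property can be observed with Python's stdlib mmap module, which is conceptually what MemoryMappedFile wraps: slicing a memoryview of the map references the mapped pages directly, with no intermediate read buffer.

```python
import mmap
import os
import tempfile

# Write a small file, then read it back through a memory map.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as f:
    f.write(b'some example data')

with open(path, 'rb') as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    full = memoryview(mm)     # zero-copy view of the mapped pages
    sl = full[:4]             # zero-copy slice
    head = bytes(sl)          # only these 4 bytes are copied out
    sl.release()
    full.release()
    mm.close()

os.remove(path)
print(head)   # b'some'
```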

Many tools in PyArrow, particularly the Apache Parquet interface and the file and stream messaging tools, are more efficient when used with these NativeFile types than with normal Python file objects.

In-Memory Reading and Writing

To assist with serialization and deserialization of in-memory data, we have file interfaces that can read and write to Arrow Buffers.

In [32]: writer = pa.BufferOutputStream()

In [33]: writer.write(b'hello, friends')

In [34]: buf = writer.get_result()

In [35]: buf
Out[35]: <pyarrow.lib.Buffer at 0x2b96432c94c8>

In [36]: buf.size
Out[36]: 14

In [37]: reader = pa.BufferReader(buf)

In [38]: reader.seek(7)

In [39]: reader.read(7)
Out[39]: b'friends'

These have similar semantics to Python’s built-in io.BytesIO.
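For comparison, the equivalent round trip with the stdlib class looks like this (getvalue plays the role of get_result, returning bytes rather than a Buffer):

```python
import io

# io.BytesIO round trip mirroring the BufferOutputStream / BufferReader pair.
writer = io.BytesIO()
writer.write(b'hello, friends')
payload = writer.getvalue()   # bytes, analogous to a Buffer result

reader = io.BytesIO(payload)
reader.seek(7)
tail = reader.read(7)
print(tail)   # b'friends'
```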

Hadoop Filesystem

HdfsFile is an implementation of NativeFile that can read and write to the Hadoop filesystem. Read more in the Filesystems Section.