Memory and IO Interfaces¶
This section will introduce you to the major concepts in PyArrow’s memory management and IO systems:
File-like and stream-like objects
Referencing and Allocating Memory¶
Buffer object wraps the C++
which is the primary tool for memory management in Apache Arrow in C++. It permits
higher-level array classes to safely interact with memory which they may or may
arrow::Buffer can be zero-copy sliced to permit Buffers to cheaply
reference other Buffers, while preserving memory lifetime and clean
There are many implementations of
arrow::Buffer, but they all provide a
standard interface: a data pointer and length. This is similar to Python’s
built-in buffer protocol and
In : import pyarrow as pa In : data = b'abcdefghijklmnopqrstuvwxyz' In : buf = pa.py_buffer(data) In : buf Out: <pyarrow.lib.Buffer at 0x7f46c6127130> In : buf.size Out: 26
Creating a Buffer in this way does not allocate any memory; it is a zero-copy
view on the memory exported from the
data bytes object.
External memory, under the form of a raw pointer and size, can also be
referenced using the
Buffers can be used in circumstances where a Python buffer or memoryview is required, and such conversions are zero-copy:
In : memoryview(buf) Out: <memory at 0x7f46c6399b80>
to_pybytes() method converts the Buffer’s data to a
Python bytestring (thus making a copy of the data):
In : buf.to_pybytes() Out: b'abcdefghijklmnopqrstuvwxyz'
All memory allocations and deallocations (like
free in C)
are tracked in an instance of
MemoryPool. This means that we can
then precisely track amount of memory that has been allocated:
In : pa.total_allocated_bytes() Out: 0
Let’s allocate a resizable
Buffer from the default pool:
In : buf = pa.allocate_buffer(1024, resizable=True) In : pa.total_allocated_bytes() Out: 1024 In : buf.resize(2048) In : pa.total_allocated_bytes() Out: 2048
The default allocator requests memory in a minimum increment of 64 bytes. If the buffer is garbaged-collected, all of the memory is freed:
In : buf = None In : pa.total_allocated_bytes() Out: 0
Besides the default built-in memory pool, there may be additional memory pools to choose (such as mimalloc) from depending on how Arrow was built. One can get the backend name for a memory pool:
>>> pa.default_memory_pool().backend_name 'jemalloc'
On-GPU buffers using Arrow’s optional CUDA integration.
Input and Output¶
The Arrow C++ libraries have several abstract interfaces for different kinds of IO objects:
Read-only files supporting random access
Write-only files supporting random access
File supporting reads, writes, and random access
In the interest of making these objects behave more like Python’s built-in
file objects, we have defined a
NativeFile base class
which implements the same API as regular Python file objects.
NativeFile has some important features which make it
preferable to using Python files with PyArrow where possible:
Other Arrow classes can access the internal C++ IO objects natively, and do not need to acquire the Python GIL
Native C++ IO may be able to do zero-copy IO, such as with memory maps
There are several kinds of
NativeFile options available:
OSFile, a native file that uses your operating system’s file descriptors
MemoryMappedFile, for reading (zero-copy) and writing with memory maps
BufferOutputStream, for writing data in-memory, producing a Buffer at the end
FixedSizeBufferWriter, for writing data into an already allocated Buffer
HdfsFile, for reading and writing data to the Hadoop Filesystem
PythonFile, for interfacing with Python file objects in C++
There are also high-level APIs to make instantiating common kinds of streams easier.
In : buf = memoryview(b"some data") In : stream = pa.input_stream(buf) In : stream.read(4) Out: b'some'
If passed a string or file path, it will open the given file on disk for reading, creating a
OSFile. Optionally, the file can be compressed: if its filename ends with a recognized extension such as
.gz, its contents will automatically be decompressed on reading.
In : import gzip In : with gzip.open('example.gz', 'wb') as f: ....: f.write(b'some data\n' * 3) ....: In : stream = pa.input_stream('example.gz') In : stream.read() Out: b'some data\nsome data\nsome data\n'
If passed a Python file object, it will wrapped in a
PythonFilesuch that the Arrow C++ libraries can read data from it (at the expense of a slight overhead).
output_stream() is the equivalent function for output streams
and allows creating a writable
NativeFile. It has the same
features as explained above for
input_stream(), such as being
able to write to buffers or do on-the-fly compression.
In : with pa.output_stream('example1.dat') as stream: ....: stream.write(b'some data') ....: In : f = open('example1.dat', 'rb') In : f.read() Out: b'some data'
On-Disk and Memory Mapped Files¶
PyArrow includes two ways to interact with data on disk: standard operating system-level file APIs, and memory-mapped files. In regular Python we can write:
In : with open('example2.dat', 'wb') as f: ....: f.write(b'some example data') ....:
OSFile class, you can write:
In : with pa.OSFile('example3.dat', 'wb') as f: ....: f.write(b'some example data') ....:
For reading files, you can use
MemoryMappedFile. The difference between these is that
OSFile allocates new memory on each read, like Python file
objects. In reads from memory maps, the library constructs a buffer referencing
the mapped memory without any memory allocation or copying:
In : file_obj = pa.OSFile('example2.dat') In : mmap = pa.memory_map('example3.dat') In : file_obj.read(4) Out: b'some' In : mmap.read(4) Out: b'some'
read method implements the standard Python file
read API. To read
into Arrow Buffer objects, use
In : mmap.seek(0) Out: 0 In : buf = mmap.read_buffer(4) In : print(buf) <pyarrow.lib.Buffer object at 0x7f46c609de70> In : buf.to_pybytes() Out: b'some'
Many tools in PyArrow, particular the Apache Parquet interface and the file and
stream messaging tools, are more efficient when used with these
types than with normal Python file objects.
In-Memory Reading and Writing¶
To assist with serialization and deserialization of in-memory data, we have file interfaces that can read and write to Arrow Buffers.
In : writer = pa.BufferOutputStream() In : writer.write(b'hello, friends') Out: 14 In : buf = writer.getvalue() In : buf Out: <pyarrow.lib.Buffer at 0x7f46c60a0930> In : buf.size Out: 14 In : reader = pa.BufferReader(buf) In : reader.seek(7) Out: 7 In : reader.read(7) Out: b'friends'
These have similar semantics to Python’s built-in