API Reference

Type and Schema Factory Functions

null() Create instance of null type
bool_() Create instance of boolean type
int8() Create instance of signed int8 type
int16() Create instance of signed int16 type
int32() Create instance of signed int32 type
int64() Create instance of signed int64 type
uint8() Create instance of unsigned uint8 type
uint16() Create instance of unsigned uint16 type
uint32() Create instance of unsigned uint32 type
uint64() Create instance of unsigned uint64 type
float16() Create half-precision floating point type
float32() Create single-precision floating point type
float64() Create double-precision floating point type
time32(unit) Create instance of 32-bit time (time of day) type with unit resolution
time64(unit) Create instance of 64-bit time (time of day) type with unit resolution
timestamp(unit[, tz]) Create instance of timestamp type with resolution and optional time zone
date32() Create instance of 32-bit date (days since UNIX epoch 1970-01-01)
date64() Create instance of 64-bit date (milliseconds since UNIX epoch 1970-01-01)
binary(int length=-1) Create variable-length binary type (fixed-size if length >= 0)
string() Create UTF8 variable-length string type
decimal(int precision, int scale=0) Create decimal type with precision and scale
list_(value_type) Create ListType instance from child data type or field
struct(fields) Create StructType instance from fields
dictionary(DataType index_type, …) Dictionary (categorical, or simply encoded) type
field(name, DataType type, …) Create a pyarrow.Field instance
schema(fields) Construct pyarrow.Schema from collection of fields
from_numpy_dtype(dtype) Convert NumPy dtype to pyarrow.DataType

Tables and Record Batches

ChunkedArray Array backed by one or more memory chunks.
Column Named vector of elements of equal type.
RecordBatch Batch of rows of columns of equal length
Table A collection of top-level named, equal length Arrow arrays.

Tensor Type and Functions

Tensor An n-dimensional array of values of a single data type

Input / Output and Shared Memory

allocate_buffer(int64_t size, …) Allocate mutable fixed-size buffer
BufferReader Zero-copy reader from objects convertible to Arrow buffer
MemoryMappedFile Supports 'r', 'r+', 'w' modes
memory_map(path[, mode]) Open memory map at file path.
create_memory_map(path, size) Create memory map at indicated path of the given size, return opened MemoryMappedFile

File Systems

hdfs.connect([host, port, user, …]) Connect to an HDFS cluster.
HadoopFileSystem FileSystem interface for an HDFS cluster

Serialization and IPC

Message Container for an Arrow IPC message with metadata and optional body
MessageReader Interface for reading Message objects from some source (like an InputStream)
RecordBatchFileReader(source[, footer_offset]) Class for reading Arrow record batch data from the Arrow binary file format
RecordBatchFileWriter(sink, schema) Writer to create the Arrow binary file format
RecordBatchStreamReader(source) Reader for the Arrow streaming binary format
RecordBatchStreamWriter(sink, schema) Writer for the Arrow streaming binary format
open_file(source[, footer_offset]) Create reader for Arrow file format
open_stream(source) Create reader for Arrow streaming format
read_message(source) Read length-prefixed message from file or buffer-like object
read_record_batch(obj, Schema schema) Read RecordBatch from message, given a known schema
get_record_batch_size(RecordBatch batch) Return total size of serialized RecordBatch including metadata and padding
read_tensor(NativeFile source) Read pyarrow.Tensor from pyarrow.NativeFile object at its current position
write_tensor(Tensor tensor, NativeFile dest) Write pyarrow.Tensor to pyarrow.NativeFile object at its current position
get_tensor_size(Tensor tensor) Return total size of serialized Tensor including metadata and padding
serialize(value, …) EXPERIMENTAL: Serialize a Python sequence
serialize_to(value, sink, …) EXPERIMENTAL: Serialize a Python sequence to a file.
deserialize(obj, …) EXPERIMENTAL: Deserialize Python object from Buffer or other Python object
deserialize_from(source, base, …) EXPERIMENTAL: Deserialize a Python sequence from a file.
read_serialized(source[, base]) EXPERIMENTAL: Read serialized Python sequence from file-like object
SerializedPyObject Arrow-serialized representation of Python object

Feather Format

read_feather(source[, columns, nthreads]) Read a pandas.DataFrame from Feather format
write_feather(df, dest) Write a pandas.DataFrame to Feather format

Memory Pools

set_memory_pool(MemoryPool pool) Set the default memory pool
log_memory_allocations([enable]) Enable or disable memory allocator logging for debugging purposes

Type Classes

DataType Base type for Apache Arrow data type instances.
Field Represents a named field, with a data type, nullability, and optional metadata

In-Memory Object Store

ObjectID An ObjectID represents a string of bytes used to identify Plasma objects.
PlasmaClient The PlasmaClient is used to interface with a plasma store and manager.
PlasmaBuffer This is the type returned by calls to get with a PlasmaClient.

Apache Parquet

ParquetDataset(path_or_paths[, filesystem, …]) Encapsulates details of reading a complete Parquet dataset possibly consisting of multiple files and partitions in subdirectories
ParquetFile(source[, metadata, common_metadata]) Reader interface for a single Parquet file
read_table(source[, columns, nthreads, …]) Read a Table from Parquet format
read_metadata(where) Read FileMetadata from footer of a single Parquet file
read_pandas(source[, columns, nthreads, …]) Read a Table from Parquet format, also reading DataFrame index values if known in the file metadata
read_schema(where) Read effective Arrow schema from Parquet file metadata
write_metadata(schema, where[, version, …]) Write metadata-only Parquet file from schema
write_table(table, where[, row_group_size, …]) Write a Table to Parquet format