pyarrow.RecordBatch

class pyarrow.RecordBatch

Bases: object

A collection of equal-length Arrow arrays (columns), representing a batch of rows

Warning

Do not call this class’s constructor directly, use one of the RecordBatch.from_* functions instead.

__init__()

Initialize self. See help(type(self)) for accurate signature.

Methods

column(self, i) Select single column from record batch
equals(self, RecordBatch other) Check if the contents of two record batches are equal
from_arrays(list arrays, names, …) Construct a RecordBatch from multiple pyarrow.Arrays
from_pandas(type cls, df, …[, nthreads, …]) Convert pandas.DataFrame to an Arrow RecordBatch
replace_schema_metadata(self, dict metadata=None) EXPERIMENTAL: Create shallow copy of record batch by replacing schema key-value metadata with the indicated new metadata (which may be None, in which case any existing metadata is deleted)
serialize(self[, memory_pool]) Write RecordBatch to Buffer as encapsulated IPC message
slice(self[, offset, length]) Compute zero-copy slice of this RecordBatch
to_pandas(self, MemoryPool memory_pool=None) Convert the arrow::RecordBatch to a pandas DataFrame
to_pydict(self) Convert the arrow::RecordBatch to an OrderedDict

Attributes

columns List of all columns in numerical order
num_columns Number of columns
num_rows Number of rows
schema Schema of the RecordBatch and its columns
column(self, i)

Select single column from record batch

Returns:column (pyarrow.Array)
columns

List of all columns in numerical order

Returns:list of pa.Column
equals(self, RecordBatch other)

Check if the contents of two record batches are equal

Returns:bool
static from_arrays(list arrays, names, dict metadata=None)

Construct a RecordBatch from multiple pyarrow.Arrays

Parameters:
  • arrays (list of pyarrow.Array) – column-wise data vectors
  • names (pyarrow.Schema or list of str) – schema or list of labels for the columns
  • metadata (dict, default None) – optional key-value metadata to attach to the inferred schema
Returns:

pyarrow.RecordBatch

from_pandas(type cls, df, Schema schema=None, bool preserve_index=True, nthreads=None, columns=None)

Convert pandas.DataFrame to an Arrow RecordBatch

Parameters:
  • df (pandas.DataFrame) –
  • schema (pyarrow.Schema, optional) – The expected schema of the RecordBatch. This can be used to indicate the type of columns if we cannot infer it automatically.
  • preserve_index (bool, optional) – Whether to store the index as an additional column in the resulting RecordBatch.
  • nthreads (int, default None (may use up to system CPU count threads)) – If greater than 1, convert columns to Arrow in parallel using the indicated number of threads
  • columns (list, optional) – List of columns to be converted. If None, use all columns.
Returns:

pyarrow.RecordBatch

num_columns

Number of columns

Returns:int
num_rows

Number of rows

Due to the definition of a RecordBatch, all columns have the same number of rows.

Returns:int
replace_schema_metadata(self, dict metadata=None)

EXPERIMENTAL: Create shallow copy of record batch by replacing schema key-value metadata with the indicated new metadata (which may be None, in which case any existing metadata is deleted)

Parameters:metadata (dict, default None) –
Returns:shallow_copy (RecordBatch)
schema

Schema of the RecordBatch and its columns

Returns:pyarrow.Schema
serialize(self, memory_pool=None)

Write RecordBatch to Buffer as encapsulated IPC message

Parameters:memory_pool (MemoryPool, default None) – Uses default memory pool if not specified
Returns:serialized (Buffer)
slice(self, offset=0, length=None)

Compute zero-copy slice of this RecordBatch

Parameters:
  • offset (int, default 0) – Offset from start of record batch to slice
  • length (int, default None) – Length of slice (default is until end of batch starting from offset)
Returns:

sliced (RecordBatch)

to_pandas(self, MemoryPool memory_pool=None, categories=None, bool strings_to_categorical=False, bool zero_copy_only=False, bool integer_object_nulls=False, bool date_as_object=False, bool use_threads=True)

Convert the arrow::RecordBatch to a pandas DataFrame

Parameters:
  • memory_pool (MemoryPool, optional) – Specific memory pool to use to allocate casted columns
  • categories (list, default empty) – List of columns that should be returned as pandas.Categorical
  • strings_to_categorical (boolean, default False) – Encode string (UTF8) and binary types to pandas.Categorical
  • zero_copy_only (boolean, default False) – Raise an ArrowException if this function call would require copying the underlying data
  • integer_object_nulls (boolean, default False) – Cast integers with nulls to objects
  • date_as_object (boolean, default False) – Cast dates to objects
  • use_threads (boolean, default True) – Whether to parallelize the conversion using multiple threads
Returns:

pandas.DataFrame

to_pydict(self)

Convert the arrow::RecordBatch to an OrderedDict

Returns:OrderedDict