pyarrow.Table

class pyarrow.Table

Bases: object

A collection of top-level named, equal-length Arrow arrays.

Warning

Do not call this class’s constructor directly, use one of the from_* methods instead.

__init__()

Initialize self. See help(type(self)) for accurate signature.

Methods

add_column(self, int i, Column column) Add column to Table at position.
append_column(self, Column column) Append column at end of columns.
column(self, i) Select a column by its column name, or numeric index.
drop(self, columns) Drop one or more columns and return a new table.
equals(self, Table other) Check if contents of two tables are equal
flatten(self, MemoryPool memory_pool=None) Flatten this Table.
from_arrays(arrays[, names, schema]) Construct a Table from Arrow arrays or columns
from_batches(batches, Schema schema=None) Construct a Table from a sequence or iterator of Arrow RecordBatches
from_pandas(type cls, df, …[, nthreads, …]) Convert pandas.DataFrame to an Arrow Table.
itercolumns(self) Iterator over all columns in their numerical order
remove_column(self, int i) Create new Table with the indicated column removed
replace_schema_metadata(self, dict metadata=None) EXPERIMENTAL: Create shallow copy of table by replacing schema key-value metadata with the indicated new metadata (which may be None, which deletes any existing metadata)
set_column(self, int i, Column column) Replace column in Table at position.
to_batches(self[, chunksize]) Convert Table to list of (contiguous) RecordBatch objects, with optimal maximum chunk size
to_pandas(self[, nthreads, …]) Convert the arrow::Table to a pandas DataFrame
to_pydict(self) Convert the arrow::Table to an OrderedDict

Attributes

columns List of all columns in numerical order
num_columns Number of columns in this table
num_rows Number of rows in this table.
schema Schema of the table and its columns
shape Dimensions of the table – (#rows, #columns)
add_column(self, int i, Column column)

Add column to Table at position. Returns a new table.

append_column(self, Column column)

Append column at end of columns. Returns a new table.

column(self, i)

Select a column by its column name, or numeric index.

Parameters:i (int or str) – The numeric index or name of the column to select.
Returns:pyarrow.Column
columns

List of all columns in numerical order

Returns:list of pa.Column
drop(self, columns)

Drop one or more columns and return a new table.

Parameters:columns (list of str) – Names of the columns to drop.
Returns:pa.Table

equals(self, Table other)

Check if contents of two tables are equal

Parameters:other (pyarrow.Table) –
Returns:are_equal (boolean)
flatten(self, MemoryPool memory_pool=None)

Flatten this Table. Each column with a struct type is flattened into one column per struct field. Other columns are left unchanged.

Parameters:memory_pool (MemoryPool, default None) – For memory allocations, if required, otherwise use default pool
Returns:result (Table)
static from_arrays(arrays, names=None, schema=None, dict metadata=None)

Construct a Table from Arrow arrays or columns

Parameters:
  • arrays (list of pyarrow.Array or pyarrow.Column) – Equal-length arrays that should form the table.
  • names (list of str, optional) – Names for the table columns. If Columns are passed, names will be inferred. If Arrays are passed, this argument is required.
  • schema (Schema, default None) – If not passed, will be inferred from the arrays and names.
  • metadata (dict, default None) – Optional metadata for the schema (if inferred).
Returns:

pyarrow.Table

static from_batches(batches, Schema schema=None)

Construct a Table from a sequence or iterator of Arrow RecordBatches

Parameters:
  • batches (sequence or iterator of RecordBatch) – Sequence of RecordBatch to be converted, all schemas must be equal
  • schema (Schema, default None) – If not passed, will be inferred from the first RecordBatch
Returns:

table (Table)

from_pandas(type cls, df, Schema schema=None, bool preserve_index=True, nthreads=None, columns=None)

Convert pandas.DataFrame to an Arrow Table.

The column types in the resulting Arrow Table are inferred from the dtypes of the pandas.Series in the DataFrame. In the case of non-object Series, the NumPy dtype is translated to its Arrow equivalent. In the case of object, we need to guess the datatype by looking at the Python objects in this Series.

Be aware that Series of the object dtype don’t carry enough information to always lead to a meaningful Arrow type. In the case that we cannot infer a type, e.g. because the DataFrame is of length 0 or the Series only contains None/nan objects, the type is set to null. This behavior can be avoided by constructing an explicit schema and passing it to this function.

Parameters:
  • df (pandas.DataFrame) –
  • schema (pyarrow.Schema, optional) – The expected schema of the Arrow Table. This can be used to indicate the type of columns if we cannot infer it automatically.
  • preserve_index (bool, optional) – Whether to store the index as an additional column in the resulting Table.
  • nthreads (int, default None (may use up to system CPU count threads)) – If greater than 1, convert columns to Arrow in parallel using indicated number of threads
  • columns (list, optional) – List of column to be converted. If None, use all columns.
Returns:

pyarrow.Table

Examples

>>> import pandas as pd
>>> import pyarrow as pa
>>> df = pd.DataFrame({
...     'int': [1, 2],
...     'str': ['a', 'b']
... })
>>> pa.Table.from_pandas(df)
<pyarrow.lib.Table object at 0x7f05d1fb1b40>
itercolumns(self)

Iterator over all columns in their numerical order

num_columns

Number of columns in this table

Returns:int
num_rows

Number of rows in this table.

Due to the definition of a table, all columns have the same number of rows.

Returns:int
remove_column(self, int i)

Create new Table with the indicated column removed

replace_schema_metadata(self, dict metadata=None)

EXPERIMENTAL: Create shallow copy of table by replacing schema key-value metadata with the indicated new metadata (which may be None, which deletes any existing metadata).

Parameters:metadata (dict, default None) –
Returns:shallow_copy (Table)
schema

Schema of the table and its columns

Returns:pyarrow.Schema
set_column(self, int i, Column column)

Replace column in Table at position. Returns a new table.

shape

Dimensions of the table – (#rows, #columns)

Returns:(int, int)
to_batches(self, chunksize=None)

Convert Table to list of (contiguous) RecordBatch objects, with optimal maximum chunk size

Parameters:chunksize (int, default None) – Maximum size for RecordBatch chunks. Individual chunks may be smaller depending on the chunk layout of individual columns
Returns:batches (list of RecordBatch)
to_pandas(self, nthreads=None, strings_to_categorical=False, memory_pool=None, zero_copy_only=False, categories=None, integer_object_nulls=False, use_threads=False)

Convert the arrow::Table to a pandas DataFrame

Parameters:
  • nthreads (int, default None) – If greater than 1, convert columns to pandas in parallel using the indicated number of threads
  • strings_to_categorical (boolean, default False) – Encode string (UTF8) and binary types to pandas.Categorical
  • memory_pool (MemoryPool, optional) – Specific memory pool to use to allocate casted columns
  • zero_copy_only (boolean, default False) – Raise an ArrowException if this function call would require copying the underlying data
  • categories (list, default empty) – List of columns that should be returned as pandas.Categorical
  • integer_object_nulls (boolean, default False) – Cast integers with nulls to objects
  • use_threads (boolean, default False) – Whether to parallelize the conversion using multiple threads
Returns:

pandas.DataFrame

to_pydict(self)

Convert the arrow::Table to an OrderedDict

Returns:OrderedDict