pyarrow.Table

class pyarrow.Table

Bases: object

A collection of top-level, named, equal-length Arrow arrays.

Warning

Do not call this class’s constructor directly; use one of the from_* methods instead.

__init__()

Initialize self. See help(type(self)) for accurate signature.

Methods

add_column(self, int i, Column column) Add column to Table at position.
append_column(self, Column column) Append column at end of columns.
column(self, i) Select a column by its column name, or numeric index.
drop(self, columns) Drop one or more columns and return a new table.
equals(self, Table other) Check if contents of two tables are equal
flatten(self, MemoryPool memory_pool=None) Flatten this Table.
from_arrays(arrays[, names, schema]) Construct a Table from Arrow arrays or columns
from_batches(batches, Schema schema=None) Construct a Table from a list of Arrow RecordBatches
from_pandas(type cls, df, …[, nthreads, …]) Convert pandas.DataFrame to an Arrow Table
itercolumns(self) Iterator over all columns in their numerical order
remove_column(self, int i) Create new Table with the indicated column removed
replace_schema_metadata(self, dict metadata=None) EXPERIMENTAL: Create a shallow copy of the table, replacing the schema key-value metadata with the indicated new metadata. Passing None deletes any existing metadata.
to_batches(self[, chunksize]) Convert Table to list of (contiguous) RecordBatch objects, with optimal maximum chunk size
to_pandas(self[, nthreads, …]) Convert the arrow::Table to a pandas DataFrame
to_pydict(self) Convert the arrow::Table to an OrderedDict

Attributes

num_columns Number of columns in this table
num_rows Number of rows in this table.
schema Schema of the table and its columns
shape Dimensions of the table – (#rows, #columns)
add_column(self, int i, Column column)

Add a column to the Table at position i. Returns a new table.

append_column(self, Column column)

Append a column after the existing columns. Returns a new table.

column(self, i)

Select a column by its name or numeric index.

Parameters:i (int or string) – Column name or numeric index
Returns:pyarrow.Column
drop(self, columns)

Drop one or more columns and return a new table.

Parameters:columns (list of str) – Names of the columns to drop
Returns:pyarrow.Table

equals(self, Table other)

Check if the contents of two tables are equal.

Parameters:other (pyarrow.Table) – Table to compare against
Returns:are_equal (boolean)
flatten(self, MemoryPool memory_pool=None)

Flatten this Table. Each column with a struct type is flattened into one column per struct field. Other columns are left unchanged.

Parameters:memory_pool (MemoryPool, default None) – For memory allocations, if required, otherwise use default pool
Returns:result (Table)
static from_arrays(arrays, names=None, schema=None, dict metadata=None)

Construct a Table from Arrow arrays or columns

Parameters:
  • arrays (list of pyarrow.Array or pyarrow.Column) – Equal-length arrays that should form the table.
  • names (list of str, optional) – Names for the table columns. If Columns are passed, the names are inferred from them. If Arrays are passed, this argument is required
Returns:

pyarrow.Table

static from_batches(batches, Schema schema=None)

Construct a Table from a list of Arrow RecordBatches

Parameters:
  • batches (list of RecordBatch) – List of RecordBatches to convert; all schemas must be equal
  • schema (Schema, default None) – If not passed, will be inferred from the first RecordBatch
Returns:

table (Table)

from_pandas(type cls, df, Schema schema=None, bool preserve_index=True, nthreads=None, columns=None)

Convert pandas.DataFrame to an Arrow Table

Parameters:
  • df (pandas.DataFrame) –
  • schema (pyarrow.Schema, optional) – The expected schema of the Arrow Table. This can be used to indicate the type of columns if we cannot infer it automatically.
  • preserve_index (bool, optional) – Whether to store the index as an additional column in the resulting Table.
  • nthreads (int, default None (may use up to system CPU count threads)) – If greater than 1, convert columns to Arrow in parallel using indicated number of threads
  • columns (list, optional) – List of columns to be converted. If None, use all columns.
Returns:

pyarrow.Table

Examples

>>> import pandas as pd
>>> import pyarrow as pa
>>> df = pd.DataFrame({
...     'int': [1, 2],
...     'str': ['a', 'b']
... })
>>> pa.Table.from_pandas(df)
<pyarrow.lib.Table object at 0x7f05d1fb1b40>
itercolumns(self)

Iterator over all columns in their numerical order

num_columns

Number of columns in this table

Returns:int
num_rows

Number of rows in this table.

Due to the definition of a table, all columns have the same number of rows.

Returns:int
remove_column(self, int i)

Create new Table with the indicated column removed

replace_schema_metadata(self, dict metadata=None)

EXPERIMENTAL: Create a shallow copy of the table, replacing the schema key-value metadata with the indicated new metadata. Passing None deletes any existing metadata.

Parameters:metadata (dict, default None) –
Returns:shallow_copy (Table)
schema

Schema of the table and its columns

Returns:pyarrow.Schema
shape

Dimensions of the table – (#rows, #columns)

Returns:(int, int)
to_batches(self, chunksize=None)

Convert Table to list of (contiguous) RecordBatch objects, with optimal maximum chunk size

Parameters:chunksize (int, default None) – Maximum size for RecordBatch chunks. Individual chunks may be smaller depending on the chunk layout of individual columns
Returns:batches (list of RecordBatch)
to_pandas(self, nthreads=None, strings_to_categorical=False, memory_pool=None, zero_copy_only=False, categories=None, integer_object_nulls=False)

Convert the arrow::Table to a pandas DataFrame

Parameters:
  • nthreads (int, default max(1, multiprocessing.cpu_count() / 2)) – Number of threads to use for the conversion. The default divides the CPU count by 2 because most modern computers have hyperthreading turned on, so using more threads than there are physical cores does not help
  • strings_to_categorical (boolean, default False) – Encode string (UTF8) and binary types to pandas.Categorical
  • memory_pool (MemoryPool, optional) – Specific memory pool to use to allocate cast columns
  • zero_copy_only (boolean, default False) – Raise an ArrowException if this function call would require copying the underlying data
  • categories (list, default empty) – List of columns that should be returned as pandas.Categorical
  • integer_object_nulls (boolean, default False) – Cast integers with nulls to objects
Returns:

pandas.DataFrame

to_pydict(self)

Convert the arrow::Table to an OrderedDict

Returns:OrderedDict