Creating Arrow Objects

Recipes related to the creation of Arrays, Tables, Tensors and all other Arrow entities.

Creating Arrays

Arrow keeps data in continuous arrays optimised for memory footprint and SIMD analyses. In Python it’s possible to build pyarrow.Array starting from Python lists (or sequence types in general), numpy arrays and pandas Series.

import pyarrow as pa

array = pa.array([1, 2, 3, 4, 5])
print(array)
[
  1,
  2,
  3,
  4,
  5
]

Arrays can also provide a mask to specify which values should be considered nulls

import numpy as np

array = pa.array([1, 2, 3, 4, 5],
                 mask=np.array([True, False, True, False, True]))

print(array)
[
  null,
  2,
  null,
  4,
  null
]

When building arrays from numpy or pandas, Arrow will leverage optimized code paths that rely on the internal in-memory representation of the data by numpy and pandas

import numpy as np
import pandas as pd

array_from_numpy = pa.array(np.arange(5))
array_from_pandas = pa.array(pd.Series([1, 2, 3, 4, 5]))

Creating Tables

Arrow supports tabular data in pyarrow.Table: each column is represented by a pyarrow.ChunkedArray and tables can be created by pairing multiple arrays with names for their columns

import pyarrow as pa

table = pa.table([
    pa.array([1, 2, 3, 4, 5]),
    pa.array(["a", "b", "c", "d", "e"]),
    pa.array([1.0, 2.0, 3.0, 4.0, 5.0])
], names=["col1", "col2", "col3"])

print(table)
pyarrow.Table
col1: int64
col2: string
col3: double
----
col1: [[1,2,3,4,5]]
col2: [["a","b","c","d","e"]]
col3: [[1,2,3,4,5]]

Create Table from Plain Types

Arrow allows fast zero copy creation of arrow arrays from numpy and pandas arrays and series, but it’s also possible to create Arrow Arrays and Tables from plain Python structures.

The pyarrow.table() function allows creation of Tables from a variety of inputs, including plain python objects

import pyarrow as pa

table = pa.table({
    "col1": [1, 2, 3, 4, 5],
    "col2": ["a", "b", "c", "d", "e"]
})

print(table)
pyarrow.Table
col1: int64
col2: string
----
col1: [[1,2,3,4,5]]
col2: [["a","b","c","d","e"]]

Note

All values provided in the dictionary will be passed to pyarrow.array() for conversion to Arrow arrays, and will benefit from zero copy behaviour when possible.

The pyarrow.Table.from_pylist() method allows the creation of Tables from python lists of row dicts. Types are inferred if a schema is not explicitly passed.

import pyarrow as pa

table = pa.Table.from_pylist([
    {"col1": 1, "col2": "a"},
    {"col1": 2, "col2": "b"},
    {"col1": 3, "col2": "c"},
    {"col1": 4, "col2": "d"},
    {"col1": 5, "col2": "e"}
])

print(table)
pyarrow.Table
col1: int64
col2: string
----
col1: [[1,2,3,4,5]]
col2: [["a","b","c","d","e"]]

Creating Record Batches

Most I/O operations in Arrow happen when shipping batches of data to their destination. pyarrow.RecordBatch is the way Arrow represents batches of data. A RecordBatch can be seen as a slice of a table.

import pyarrow as pa

batch = pa.RecordBatch.from_arrays([
    pa.array([1, 3, 5, 7, 9]),
    pa.array([2, 4, 6, 8, 10])
], names=["odd", "even"])

Multiple batches can be combined into a table using pyarrow.Table.from_batches()

second_batch = pa.RecordBatch.from_arrays([
    pa.array([11, 13, 15, 17, 19]),
    pa.array([12, 14, 16, 18, 20])
], names=["odd", "even"])

table = pa.Table.from_batches([batch, second_batch])
print(table)
pyarrow.Table
odd: int64
even: int64
----
odd: [[1,3,5,7,9],[11,13,15,17,19]]
even: [[2,4,6,8,10],[12,14,16,18,20]]

Equally, pyarrow.Table can be converted to a list of pyarrow.RecordBatch using the pyarrow.Table.to_batches() method

record_batches = table.to_batches(max_chunksize=5)
print(len(record_batches))
2

Store Categorical Data

Arrow provides the pyarrow.DictionaryArray type to represent categorical data without the cost of storing and repeating the categories over and over. This can reduce memory use when columns might have large values (such as text).

If you have an array containing repeated categorical data, it is possible to convert it to a pyarrow.DictionaryArray using pyarrow.Array.dictionary_encode()

arr = pa.array(["red", "green", "blue", "blue", "green", "red"])

categorical = arr.dictionary_encode()
print(categorical)
...
-- dictionary:
  [
    "red",
    "green",
    "blue"
  ]
-- indices:
  [
    0,
    1,
    2,
    2,
    1,
    0
  ]

If you already know the categories and indices then you can skip the encode step and directly create the DictionaryArray using pyarrow.DictionaryArray.from_arrays()

categorical = pa.DictionaryArray.from_arrays(
    indices=[0, 1, 2, 2, 1, 0],
    dictionary=["red", "green", "blue"]
)
print(categorical)
...
-- dictionary:
  [
    "red",
    "green",
    "blue"
  ]
-- indices:
  [
    0,
    1,
    2,
    2,
    1,
    0
  ]