pyarrow.array

pyarrow.array(obj, type=None, mask=None, size=None, from_pandas=None, safe=True, memory_pool=None)

Create pyarrow.Array instance from a Python object.

Parameters
  • obj (sequence, iterable, ndarray or Series) – If both type and size are specified, obj may be a single-use iterable. If obj is not strongly typed, the Arrow type of the resulting array is inferred.

  • type (pyarrow.DataType) – Explicit type to attempt to coerce to. If not given, the type is inferred from the data.

  • mask (array[bool], optional) – Indicate which values are null (True) or not null (False).

  • size (int64, optional) – Number of elements to read. If the input is longer than size, conversion stops at this length. For iterators, if size is larger than the number of items the iterator yields, size is treated as a “max size”: size elements are allocated up front and the result is then resized to the actual length, so specifying the exact size, if you know it, gives better performance.

  • from_pandas (bool, default None) – Use pandas’s semantics for inferring nulls from values in ndarray-like data. If a mask is passed, the mask takes precedence, but a value that is unmasked (not null) and still null according to pandas semantics is treated as null. Defaults to False if not passed explicitly by the user, or True if a pandas object is passed in.

  • safe (bool, default True) – Check for overflows or other unsafe conversions.

  • memory_pool (pyarrow.MemoryPool, optional) – If not passed, will allocate memory from the currently-set default memory pool.

Returns

array (pyarrow.Array or pyarrow.ChunkedArray) – A ChunkedArray instead of an Array is returned if:

  • the object data overflowed binary storage.

  • the object’s __arrow_array__ protocol method returned a chunked array.

Notes

Localized timestamps will currently be returned as UTC (pandas’s native representation). Timezone-naive data will be implicitly interpreted as UTC.

By default, pandas’s DateOffsets and dateutil.relativedelta.relativedelta are converted to MonthDayNanoIntervalArray. The leapdays field of relativedelta is ignored, as are all absolute fields on both objects. datetime.timedelta can also be converted to MonthDayNanoIntervalArray, but this requires passing MonthDayNanoIntervalType explicitly.

Converting to a dictionary array promotes the indices to a wider integer type if the number of distinct values cannot be represented, even if the index type was explicitly set. For example, if there are more than 127 distinct values, the returned dictionary array’s index type will be at least pa.int16() even if pa.int8() was passed to the function. An explicit index type is not demoted, even if it is wider than required.

Examples

>>> import pandas as pd
>>> import pyarrow as pa
>>> pa.array(pd.Series([1, 2]))
<pyarrow.lib.Int64Array object at 0x7f674e4c0e10>
[
  1,
  2
]
>>> pa.array(["a", "b", "a"], type=pa.dictionary(pa.int8(), pa.string()))
<pyarrow.lib.DictionaryArray object at 0x7feb288d9040>
-- dictionary:
[
  "a",
  "b"
]
-- indices:
[
  0,
  1,
  0
]
>>> import numpy as np
>>> pa.array(pd.Series([1, 2]), mask=np.array([0, 1], dtype=bool))
<pyarrow.lib.Int64Array object at 0x7f9019e11208>
[
  1,
  null
]
>>> arr = pa.array(range(1024), type=pa.dictionary(pa.int8(), pa.int64()))
>>> arr.type.index_type
DataType(int16)