pyarrow.StringArray

class pyarrow.StringArray

Bases: pyarrow.lib.Array

__init__()

Initialize self. See help(type(self)) for accurate signature.

Methods

buffers(self) Return a list of Buffer objects pointing to this array’s physical storage.
cast(self, target_type, bool safe=True) Cast array values to another data type.
dictionary_encode(self) Compute dictionary-encoded representation of array.
equals(self, Array other)
format(self, int indent=0, int window=10)
from_buffers(int length, …) Construct a StringArray from value_offsets and data buffers.
from_pandas(obj[, mask, type]) Convert pandas.Series to an Arrow Array, using pandas’s semantics about what values indicate nulls.
isnull(self)
slice(self[, offset, length]) Compute zero-copy slice of this array.
to_numpy(self) Experimental: return a NumPy view of this array.
to_pandas(self, …) Convert to a NumPy array object suitable for use in pandas.
to_pylist(self) Convert to a list of native Python objects.
unique(self) Compute distinct elements in array.
validate(self) Perform any validation checks implemented by arrow::ValidateArray.

Attributes

null_count
offset A relative position into another array’s data, to enable zero-copy slicing.
type
buffers(self)

Return a list of Buffer objects pointing to this array’s physical storage.

To correctly interpret these buffers, you need to also apply the array's offset, multiplied by the size of the stored data type.

cast(self, target_type, bool safe=True)

Cast array values to another data type.

Example

>>> from datetime import datetime
>>> import pyarrow as pa
>>> arr = pa.array([datetime(2010, 1, 1), datetime(2015, 1, 1)])
>>> arr.type
TimestampType(timestamp[us])

You can use pyarrow.DataType objects to specify the target type:

>>> arr.cast(pa.timestamp('ms'))
<pyarrow.lib.TimestampArray object at 0x10420eb88>
[
  1262304000000,
  1420070400000
]
>>> arr.cast(pa.timestamp('ms')).type
TimestampType(timestamp[ms])

Alternatively, you can use the string aliases for these types:

>>> arr.cast('timestamp[ms]')
<pyarrow.lib.TimestampArray object at 0x10420eb88>
[
  1262304000000,
  1420070400000
]
>>> arr.cast('timestamp[ms]').type
TimestampType(timestamp[ms])
Parameters:
  • target_type (DataType) – Type to cast to
  • safe (boolean, default True) – Check for overflows or other unsafe conversions
Returns:

casted (Array)

dictionary_encode(self)

Compute dictionary-encoded representation of array.

equals(self, Array other)
format(self, int indent=0, int window=10)
static from_buffers(int length, Buffer value_offsets, Buffer data, Buffer null_bitmap=None, int null_count=-1, int offset=0)

Construct a StringArray from value_offsets and data buffers. If the data contains nulls, a null_bitmap and the matching null_count must also be passed.

Parameters:
  • length (int) –
  • value_offsets (Buffer) –
  • data (Buffer) –
  • null_bitmap (Buffer, optional) –
  • null_count (int, default -1) –
  • offset (int, default 0) –
Returns:

string_array (StringArray)

static from_pandas(obj, mask=None, type=None, bool safe=True, MemoryPool memory_pool=None)

Convert pandas.Series to an Arrow Array, using pandas’s semantics about what values indicate nulls. See pyarrow.array for more general conversion from arrays or sequences to Arrow arrays.

Parameters:
  • obj (ndarray, pandas.Series, array-like) –
  • mask (array (boolean), optional) – Indicate which values are null (True) or not null (False)
  • type (pyarrow.DataType) – Explicit type to attempt to coerce to, otherwise will be inferred from the data
  • safe (boolean, default True) – Check for overflows or other unsafe conversions
  • memory_pool (pyarrow.MemoryPool, optional) – If not passed, will allocate memory from the currently-set default memory pool

Notes

Localized timestamps will currently be returned as UTC (pandas’s native representation). Timezone-naive data will be implicitly interpreted as UTC.

Returns:
  array (pyarrow.Array or pyarrow.ChunkedArray) – A ChunkedArray is returned if the object data overflows the binary buffer.
isnull(self)
null_count
offset

A relative position into another array’s data, to enable zero-copy slicing. This value defaults to zero but must be applied on all operations with the physical storage buffers.

slice(self, offset=0, length=None)

Compute zero-copy slice of this array.

Parameters:
  • offset (int, default 0) – Offset from start of array to slice
  • length (int, default None) – Length of slice (default is until end of Array starting from offset)
Returns:

sliced (Array)

to_numpy(self)

Experimental: return a NumPy view of this array. Only primitive arrays with the same memory layout as NumPy (i.e. integers, floating point), without any nulls, are supported.

Returns:

array (numpy.ndarray)
to_pandas(self, bool strings_to_categorical=False, bool zero_copy_only=False, bool integer_object_nulls=False, bool date_as_object=False)

Convert to a NumPy array object suitable for use in pandas.

Parameters:
  • strings_to_categorical (boolean, default False) – Encode string (UTF8) and binary types to pandas.Categorical
  • zero_copy_only (boolean, default False) – Raise an ArrowException if this function call would require copying the underlying data
  • integer_object_nulls (boolean, default False) – Cast integers with nulls to objects
  • date_as_object (boolean, default False) – Cast dates to objects
to_pylist(self)

Convert to a list of native Python objects.

Returns:

lst (list)
type
unique(self)

Compute distinct elements in array.

validate(self)

Perform any validation checks implemented by arrow::ValidateArray. Raises an exception with an error message if the array does not validate.

Raises:

ArrowInvalid