Pandas Integration#

To interface with pandas, PyArrow provides various conversion routines to consume pandas structures and convert back to them.

Note

While pandas uses NumPy as a backend, it has enough peculiarities (such as a different type system, and support for null values) that this is a separate topic from NumPy Integration.

To follow examples in this document, make sure to run:

In [1]: import pandas as pd

In [2]: import pyarrow as pa

DataFrames#

The equivalent to a pandas DataFrame in Arrow is a Table. Both consist of a set of named columns of equal length. While pandas only supports flat columns, the Table also provides nested columns, thus it can represent more data than a DataFrame, so a full conversion is not always possible.

Conversion from a Table to a DataFrame is done by calling pyarrow.Table.to_pandas(). The inverse is then achieved by using pyarrow.Table.from_pandas().

import pyarrow as pa
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})
# Convert from pandas to Arrow
table = pa.Table.from_pandas(df)
# Convert back to pandas
df_new = table.to_pandas()

# Infer Arrow schema from pandas
schema = pa.Schema.from_pandas(df)

By default pyarrow tries to preserve and restore the .index data as accurately as possible. See the section below for more about this, and how to disable this logic.

Series#

In Arrow, the most similar structure to a pandas Series is an Array. It is a vector that contains data of the same type as linear memory. You can convert a pandas Series to an Arrow Array using pyarrow.Array.from_pandas(). As Arrow Arrays are always nullable, you can supply an optional mask using the mask parameter to mark all null-entries.

Handling pandas Indexes#

Methods like pyarrow.Table.from_pandas() have a preserve_index option which defines how to preserve (store) or not to preserve (to not store) the data in the index member of the corresponding pandas object. This data is tracked using schema-level metadata in the internal arrow::Schema object.

The default of preserve_index is None, which behaves as follows:

RangeIndex is stored as metadata-only, not requiring any extra storage.
Other index types are stored as one or more physical data columns in the resulting Table

To not store the index at all pass preserve_index=False. Since storing a RangeIndex can cause issues in some limited scenarios (such as storing multiple DataFrame objects in a Parquet file), to force all index data to be serialized in the resulting table, pass preserve_index=True.

Type differences#

With the current design of pandas and Arrow, it is not possible to convert all column types unmodified. One of the main issues here is that pandas has no support for nullable columns of arbitrary type. Also datetime64 is currently fixed to nanosecond resolution. On the other side, Arrow might be still missing support for some types.

pandas -> Arrow Conversion#

Source Type (pandas)	Destination Type (Arrow)
`bool`	`BOOL`
`(u)int{8,16,32,64}`	`(U)INT{8,16,32,64}`
`float32`	`FLOAT`
`float64`	`DOUBLE`
`str` / `unicode`	`STRING`
`pd.Categorical`	`DICTIONARY`
`pd.Timestamp`	`TIMESTAMP(unit=ns)`
`datetime.date`	`DATE`
`datetime.time`	`TIME64`

Arrow -> pandas Conversion#

Source Type (Arrow)	Destination Type (pandas)
`BOOL`	`bool`
`BOOL` with nulls	`object` (with values `True`, `False`, `None`)
`(U)INT{8,16,32,64}`	`(u)int{8,16,32,64}`
`(U)INT{8,16,32,64}` with nulls	`float64`
`FLOAT`	`float32`
`DOUBLE`	`float64`
`STRING`	`str`
`DICTIONARY`	`pd.Categorical`
`TIMESTAMP(unit=*)`	`pd.Timestamp` (`np.datetime64[ns]`)
`DATE`	`object` (with `datetime.date` objects)
`TIME64`	`object` (with `datetime.time` objects)

Categorical types#

Pandas categorical columns are converted to Arrow dictionary arrays, a special array type optimized to handle repeated and limited number of possible values.

In [3]: df = pd.DataFrame({"cat": pd.Categorical(["a", "b", "c", "a", "b", "c"])})

In [4]: df.cat.dtype.categories
Out[4]: Index(['a', 'b', 'c'], dtype='object')

In [5]: df
Out[5]: 
  cat
0   a
1   b
2   c
3   a
4   b
5   c

In [6]: table = pa.Table.from_pandas(df)

In [7]: table
Out[7]: 
pyarrow.Table
cat: dictionary<values=string, indices=int8, ordered=0>
----
cat: [  -- dictionary:
["a","b","c"]  -- indices:
[0,1,2,0,1,2]]

We can inspect the ChunkedArray of the created table and see the same categories of the Pandas DataFrame.

In [8]: column = table[0]

In [9]: chunk = column.chunk(0)

In [10]: chunk.dictionary
Out[10]: 
<pyarrow.lib.StringArray object at 0x7f9d813466e0>
[
  "a",
  "b",
  "c"
]

In [11]: chunk.indices
Out[11]: 
<pyarrow.lib.Int8Array object at 0x7f9d81346560>
[
  0,
  1,
  2,
  0,
  1,
  2
]

Datetime (Timestamp) types#

Pandas Timestamps use the datetime64[ns] type in Pandas and are converted to an Arrow TimestampArray.

In [12]: df = pd.DataFrame({"datetime": pd.date_range("2020-01-01T00:00:00Z", freq="h", periods=3)})

In [13]: df.dtypes
Out[13]: 
datetime    datetime64[ns, UTC]
dtype: object

In [14]: df
Out[14]: 
                   datetime
0 2020-01-01 00:00:00+00:00
1 2020-01-01 01:00:00+00:00
2 2020-01-01 02:00:00+00:00

In [15]: table = pa.Table.from_pandas(df)

In [16]: table
Out[16]: 
pyarrow.Table
datetime: timestamp[ns, tz=UTC]
----
datetime: [[2020-01-01 00:00:00.000000000Z,2020-01-01 01:00:00.000000000Z,2020-01-01 02:00:00.000000000Z]]

In this example the Pandas Timestamp is time zone aware (UTC on this case), and this information is used to create the Arrow TimestampArray.

Date types#

While dates can be handled using the datetime64[ns] type in pandas, some systems work with object arrays of Python’s built-in datetime.date object:

In [17]: from datetime import date

In [18]: s = pd.Series([date(2018, 12, 31), None, date(2000, 1, 1)])

In [19]: s
Out[19]: 
0    2018-12-31
1          None
2    2000-01-01
dtype: object

When converting to an Arrow array, the date32 type will be used by default:

In [20]: arr = pa.array(s)

In [21]: arr.type
Out[21]: DataType(date32[day])

In [22]: arr[0]
Out[22]: <pyarrow.Date32Scalar: datetime.date(2018, 12, 31)>

To use the 64-bit date64, specify this explicitly:

In [23]: arr = pa.array(s, type='date64')

In [24]: arr.type
Out[24]: DataType(date64[ms])

When converting back with to_pandas, object arrays of datetime.date objects are returned:

In [25]: arr.to_pandas()
Out[25]: 
0    2018-12-31
1          None
2    2000-01-01
dtype: object

If you want to use NumPy’s datetime64 dtype instead, pass date_as_object=False:

In [26]: s2 = pd.Series(arr.to_pandas(date_as_object=False))

In [27]: s2.dtype
Out[27]: dtype('<M8[ms]')

Warning

As of Arrow 0.13 the parameter date_as_object is True by default. Older versions must pass date_as_object=True to obtain this behavior

Time types#

The builtin datetime.time objects inside Pandas data structures will be converted to an Arrow time64 and Time64Array respectively.

In [28]: from datetime import time

In [29]: s = pd.Series([time(1, 1, 1), time(2, 2, 2)])

In [30]: s
Out[30]: 
0    01:01:01
1    02:02:02
dtype: object

In [31]: arr = pa.array(s)

In [32]: arr.type
Out[32]: Time64Type(time64[us])

In [33]: arr
Out[33]: 
<pyarrow.lib.Time64Array object at 0x7f9d83bae980>
[
  01:01:01.000000,
  02:02:02.000000
]

When converting to pandas, arrays of datetime.time objects are returned:

In [34]: arr.to_pandas()
Out[34]: 
0    01:01:01
1    02:02:02
dtype: object

Nullable types#

In Arrow all data types are nullable, meaning they support storing missing values. In pandas, however, not all data types have support for missing data. Most notably, the default integer data types do not, and will get casted to float when missing values are introduced. Therefore, when an Arrow array or table gets converted to pandas, integer columns will become float when missing values are present:

>>> arr = pa.array([1, 2, None])
>>> arr
<pyarrow.lib.Int64Array object at 0x7f07d467c640>
[
  1,
  2,
  null
]
>>> arr.to_pandas()
0    1.0
1    2.0
2    NaN
dtype: float64

Pandas has experimental nullable data types (https://pandas.pydata.org/docs/user_guide/integer_na.html). Arrows supports round trip conversion for those:

>>> df = pd.DataFrame({'a': pd.Series([1, 2, None], dtype="Int64")})
>>> df
      a
0     1
1     2
2  <NA>

>>> table = pa.table(df)
>>> table
Out[32]:
pyarrow.Table
a: int64
----
a: [[1,2,null]]

>>> table.to_pandas()
      a
0     1
1     2
2  <NA>

>>> table.to_pandas().dtypes
a    Int64
dtype: object

This roundtrip conversion works because metadata about the original pandas DataFrame gets stored in the Arrow table. However, if you have Arrow data (or e.g. a Parquet file) not originating from a pandas DataFrame with nullable data types, the default conversion to pandas will not use those nullable dtypes.

The pyarrow.Table.to_pandas() method has a types_mapper keyword that can be used to override the default data type used for the resulting pandas DataFrame. This way, you can instruct Arrow to create a pandas DataFrame using nullable dtypes.

>>> table = pa.table({"a": [1, 2, None]})
>>> table.to_pandas()
     a
0  1.0
1  2.0
2  NaN
>>> table.to_pandas(types_mapper={pa.int64(): pd.Int64Dtype()}.get)
      a
0     1
1     2
2  <NA>

The types_mapper keyword expects a function that will return the pandas data type to use given a pyarrow data type. By using the dict.get method, we can create such a function using a dictionary.

If you want to use all currently supported nullable dtypes by pandas, this dictionary becomes:

dtype_mapping = {
    pa.int8(): pd.Int8Dtype(),
    pa.int16(): pd.Int16Dtype(),
    pa.int32(): pd.Int32Dtype(),
    pa.int64(): pd.Int64Dtype(),
    pa.uint8(): pd.UInt8Dtype(),
    pa.uint16(): pd.UInt16Dtype(),
    pa.uint32(): pd.UInt32Dtype(),
    pa.uint64(): pd.UInt64Dtype(),
    pa.bool_(): pd.BooleanDtype(),
    pa.float32(): pd.Float32Dtype(),
    pa.float64(): pd.Float64Dtype(),
    pa.string(): pd.StringDtype(),
}

df = table.to_pandas(types_mapper=dtype_mapping.get)

When using the pandas API for reading Parquet files (pd.read_parquet(..)), this can also be achieved by passing use_nullable_dtypes:

df = pd.read_parquet(path, use_nullable_dtypes=True)

Memory Usage and Zero Copy#

When converting from Arrow data structures to pandas objects using various to_pandas methods, one must occasionally be mindful of issues related to performance and memory usage.

Since pandas’s internal data representation is generally different from the Arrow columnar format, zero copy conversions (where no memory allocation or computation is required) are only possible in certain limited cases.

In the worst case scenario, calling to_pandas will result in two versions of the data in memory, one for Arrow and one for pandas, yielding approximately twice the memory footprint. We have implement some mitigations for this case, particularly when creating large DataFrame objects, that we describe below.

Zero Copy Series Conversions#

Zero copy conversions from Array or ChunkedArray to NumPy arrays or pandas Series are possible in certain narrow cases:

The Arrow data is stored in an integer (signed or unsigned int8 through int64) or floating point type (float16 through float64). This includes many numeric types as well as timestamps.
The Arrow data has no null values (since these are represented using bitmaps which are not supported by pandas).
For ChunkedArray, the data consists of a single chunk, i.e. arr.num_chunks == 1. Multiple chunks will always require a copy because of pandas’s contiguousness requirement.

In these scenarios, to_pandas or to_numpy will be zero copy. In all other scenarios, a copy will be required.

Reducing Memory Use in `Table.to_pandas`#

As of this writing, pandas applies a data management strategy called “consolidation” to collect like-typed DataFrame columns in two-dimensional NumPy arrays, referred to internally as “blocks”. We have gone to great effort to construct the precise “consolidated” blocks so that pandas will not perform any further allocation or copies after we hand off the data to pandas.DataFrame. The obvious downside of this consolidation strategy is that it forces a “memory doubling”.

To try to limit the potential effects of “memory doubling” during Table.to_pandas, we provide a couple of options:

split_blocks=True, when enabled Table.to_pandas produces one internal DataFrame “block” for each column, skipping the “consolidation” step. Note that many pandas operations will trigger consolidation anyway, but the peak memory use may be less than the worst case scenario of a full memory doubling. As a result of this option, we are able to do zero copy conversions of columns in the same cases where we can do zero copy with Array and ChunkedArray.
self_destruct=True, this destroys the internal Arrow memory buffers in each column Table object as they are converted to the pandas-compatible representation, potentially releasing memory to the operating system as soon as a column is converted. Note that this renders the calling Table object unsafe for further use, and any further methods called will cause your Python process to crash.

Used together, the call

df = table.to_pandas(split_blocks=True, self_destruct=True)
del table  # not necessary, but a good practice

will yield significantly lower memory usage in some scenarios. Without these options, to_pandas will always double memory.

Note that self_destruct=True is not guaranteed to save memory. Since the conversion happens column by column, memory is also freed column by column. But if multiple columns share an underlying buffer, then no memory will be freed until all of those columns are converted. In particular, due to implementation details, data that comes from IPC or Flight is prone to this, as memory will be laid out as follows:

Record Batch 0: Allocation 0: array 0 chunk 0, array 1 chunk 0, ...
Record Batch 1: Allocation 1: array 0 chunk 1, array 1 chunk 1, ...
...

In this case, no memory can be freed until the entire table is converted, even with self_destruct=True.