Pandas Integration¶
To interface with pandas, PyArrow provides various conversion routines to consume pandas structures and convert back to them.
Note
While pandas uses NumPy as a backend, it has enough peculiarities (such as a different type system, and support for null values) that this is a separate topic from NumPy Integration.
To follow examples in this document, make sure to run:
In [1]: import pandas as pd
In [2]: import pyarrow as pa
DataFrames¶
The equivalent to a pandas DataFrame in Arrow is a Table. Both consist of a set of named columns of equal length. While pandas only supports flat columns, the Table also provides nested columns, thus it can represent more data than a DataFrame, so a full conversion is not always possible.
Conversion from a Table to a DataFrame is done by calling pyarrow.Table.to_pandas(). The inverse is achieved by using pyarrow.Table.from_pandas().
import pyarrow as pa
import pandas as pd
df = pd.DataFrame({"a": [1, 2, 3]})
# Convert from pandas to Arrow
table = pa.Table.from_pandas(df)
# Convert back to pandas
df_new = table.to_pandas()
# Infer Arrow schema from pandas
schema = pa.Schema.from_pandas(df)
By default pyarrow tries to preserve and restore the .index data as accurately as possible. See the section below for more about this, and how to disable this logic.
Series¶
In Arrow, the most similar structure to a pandas Series is an Array. It is a vector that contains data of the same type as linear memory. You can convert a pandas Series to an Arrow Array using pyarrow.Array.from_pandas(). As Arrow Arrays are always nullable, you can supply an optional mask using the mask parameter to mark all null entries.
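A minimal sketch of such a conversion; the sample data and mask values are illustrative:
import numpy as np
import pandas as pd
import pyarrow as pa

s = pd.Series([1.0, 2.0, 3.0])

# Convert the Series to an Arrow Array
arr = pa.Array.from_pandas(s)

# Mark the second entry as null via an explicit boolean mask
# (True means "treat this entry as null")
arr_masked = pa.Array.from_pandas(s, mask=np.array([False, True, False]))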
Handling pandas Indexes¶
Methods like pyarrow.Table.from_pandas() have a preserve_index option which defines how to preserve (store) or not to preserve (to not store) the data in the index member of the corresponding pandas object. This data is tracked using schema-level metadata in the internal arrow::Schema object.
The default of preserve_index is None, which behaves as follows:
- RangeIndex is stored as metadata-only, not requiring any extra storage.
- Other index types are stored as one or more physical data columns in the resulting Table.
To not store the index at all, pass preserve_index=False. Since storing a RangeIndex can cause issues in some limited scenarios (such as storing multiple DataFrame objects in a Parquet file), to force all index data to be serialized in the resulting table, pass preserve_index=True.
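A short sketch of these options; the column and index names are illustrative:
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"a": [1, 2, 3]}, index=pd.Index([10, 20, 30], name="idx"))

# Default (preserve_index=None): store the index, as metadata for a
# RangeIndex or as an extra column for other index types
table_default = pa.Table.from_pandas(df)

# Do not store the index at all
table_no_index = pa.Table.from_pandas(df, preserve_index=False)

# Force the index to be stored as a physical column
table_with_index = pa.Table.from_pandas(df, preserve_index=True)

table_no_index.column_names    # ['a']
table_with_index.column_names  # ['a', 'idx']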
Type differences¶
With the current design of pandas and Arrow, it is not possible to convert all column types unmodified. One of the main issues here is that pandas has no support for nullable columns of arbitrary type. Also, datetime64 is currently fixed to nanosecond resolution. On the other side, Arrow might still be missing support for some types.
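For example, the lack of nullable columns means that an integer column containing nulls cannot round-trip back to an integer dtype; the sample data is illustrative:
import pyarrow as pa

arr = pa.array([1, 2, None], type=pa.int64())

# NumPy-backed pandas columns cannot hold nullable integers,
# so the nulls force a conversion to float64 with NaN
arr.to_pandas().dtype   # dtype('float64')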
pandas -> Arrow Conversion¶
| Source Type (pandas) | Destination Type (Arrow) |
|---|---|
| bool | BOOL |
| (u)int{8, 16, 32, 64} | (U)INT{8, 16, 32, 64} |
| float32 | FLOAT |
| float64 | DOUBLE |
| str / unicode | STRING |
| pd.Categorical | DICTIONARY |
| pd.Timestamp | TIMESTAMP(unit=ns) |
| datetime.date | DATE |
Arrow -> pandas Conversion¶
| Source Type (Arrow) | Destination Type (pandas) |
|---|---|
| BOOL | bool |
| BOOL with nulls | object (with values True, False, None) |
| (U)INT{8, 16, 32, 64} | (u)int{8, 16, 32, 64} |
| (U)INT{8, 16, 32, 64} with nulls | float64 |
| FLOAT | float32 |
| DOUBLE | float64 |
| STRING | str |
| DICTIONARY | pd.Categorical |
| TIMESTAMP(unit=ns) | pd.Timestamp (np.datetime64[ns]) |
| DATE | object (with datetime.date objects) |
Categorical types¶
TODO
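Pending that write-up, a minimal sketch of the behaviour: pandas categorical data maps to Arrow's dictionary type and back. The sample data is illustrative, and the exact index width of the dictionary type may differ:
import pandas as pd
import pyarrow as pa

s = pd.Series(["a", "b", "a", None], dtype="category")

arr = pa.Array.from_pandas(s)
arr.type                 # dictionary type, e.g. dictionary<values=string, indices=int8>

# Converting back yields a categorical-backed Series again
arr.to_pandas().dtype    # CategoricalDtype(categories=['a', 'b'], ordered=False)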
Datetime (Timestamp) types¶
TODO
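Pending that write-up, a minimal sketch of the behaviour: pandas datetime64[ns] data maps to Arrow's nanosecond-resolution timestamp type. The sample data is illustrative:
import pandas as pd
import pyarrow as pa

s = pd.Series(pd.to_datetime(["2018-12-31 12:00:00", None]))
s.dtype          # dtype('<M8[ns]')

arr = pa.Array.from_pandas(s)
arr.type         # timestamp[ns]

# Converting back yields datetime64[ns] again, with the null as NaT
arr.to_pandas()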
Date types¶
While dates can be handled using the datetime64[ns] type in pandas, some systems work with object arrays of Python’s built-in datetime.date object:
In [3]: from datetime import date
In [4]: s = pd.Series([date(2018, 12, 31), None, date(2000, 1, 1)])
In [5]: s
Out[5]:
0 2018-12-31
1 None
2 2000-01-01
dtype: object
When converting to an Arrow array, the date32 type will be used by default:
In [6]: arr = pa.array(s)
In [7]: arr.type
Out[7]: DataType(date32[day])
In [8]: arr[0]
Out[8]: <pyarrow.Date32Scalar: datetime.date(2018, 12, 31)>
To use the 64-bit date64, specify this explicitly:
In [9]: arr = pa.array(s, type='date64')
In [10]: arr.type
Out[10]: DataType(date64[ms])
When converting back with to_pandas, object arrays of datetime.date objects are returned:
In [11]: arr.to_pandas()
Out[11]:
0 2018-12-31
1 None
2 2000-01-01
dtype: object
If you want to use NumPy’s datetime64 dtype instead, pass date_as_object=False:
In [12]: s2 = pd.Series(arr.to_pandas(date_as_object=False))
In [13]: s2.dtype
Out[13]: dtype('<M8[ns]')
Warning
As of Arrow 0.13 the parameter date_as_object is True by default. Older versions must pass date_as_object=True to obtain this behavior.
Time types¶
TODO
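Pending that write-up, a minimal sketch of the behaviour: object arrays of Python's datetime.time values map to an Arrow time type (time64[us] here, though the inferred resolution may differ). The sample data is illustrative:
from datetime import time

import pandas as pd
import pyarrow as pa

s = pd.Series([time(1, 30, 0), None, time(23, 59, 59)])

arr = pa.Array.from_pandas(s)
arr.type         # e.g. time64[us]

# Converting back yields an object Series of datetime.time values
arr.to_pandas()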
Memory Usage and Zero Copy¶
When converting from Arrow data structures to pandas objects using various to_pandas methods, one must occasionally be mindful of issues related to performance and memory usage.
Since pandas’s internal data representation is generally different from the Arrow columnar format, zero copy conversions (where no memory allocation or computation is required) are only possible in certain limited cases.
In the worst case scenario, calling to_pandas will result in two versions of the data in memory, one for Arrow and one for pandas, yielding approximately twice the memory footprint. We have implemented some mitigations for this case, particularly when creating large DataFrame objects, which we describe below.
Zero Copy Series Conversions¶
Zero copy conversions from Array or ChunkedArray to NumPy arrays or pandas Series are possible in certain narrow cases:
- The Arrow data is stored in an integer (signed or unsigned int8 through int64) or floating point type (float16 through float64). This includes many numeric types as well as timestamps.
- The Arrow data has no null values (since these are represented using bitmaps which are not supported by pandas).
- For ChunkedArray, the data consists of a single chunk, i.e. arr.num_chunks == 1. Multiple chunks will always require a copy because of pandas’s contiguousness requirement.
In these scenarios, to_pandas or to_numpy will be zero copy. In all other scenarios, a copy will be required.
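A minimal sketch of both outcomes, using to_numpy and its zero_copy_only flag; the sample data is illustrative:
import numpy as np
import pyarrow as pa

# Single chunk, no nulls: eligible for zero copy
arr = pa.array(np.arange(10, dtype=np.int64))
arr.to_numpy()                                  # no copy needed

# With nulls present, zero copy is impossible
arr_with_nulls = pa.array([1, 2, None], type=pa.int64())
arr_with_nulls.to_numpy(zero_copy_only=False)   # forces a copy
# arr_with_nulls.to_numpy()                     # default zero_copy_only=True raises ArrowInvalid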
Reducing Memory Use in Table.to_pandas¶
As of this writing, pandas applies a data management strategy called “consolidation” to collect like-typed DataFrame columns in two-dimensional NumPy arrays, referred to internally as “blocks”. We have gone to great effort to construct the precise “consolidated” blocks so that pandas will not perform any further allocation or copies after we hand off the data to pandas.DataFrame. The obvious downside of this consolidation strategy is that it forces a “memory doubling”.
To try to limit the potential effects of “memory doubling” during Table.to_pandas, we provide a couple of options:
- split_blocks=True: when enabled, Table.to_pandas produces one internal DataFrame “block” for each column, skipping the “consolidation” step. Note that many pandas operations will trigger consolidation anyway, but the peak memory use may be less than the worst case scenario of a full memory doubling. As a result of this option, we are able to do zero copy conversions of columns in the same cases where we can do zero copy with Array and ChunkedArray.
- self_destruct=True: this destroys the internal Arrow memory buffers in each column of the Table object as they are converted to the pandas-compatible representation, potentially releasing memory to the operating system as soon as a column is converted. Note that this renders the calling Table object unsafe for further use, and any further methods called will cause your Python process to crash.
Used together, the call
df = table.to_pandas(split_blocks=True, self_destruct=True)
del table # not necessary, but a good practice
will yield significantly lower memory usage in some scenarios. Without these options, to_pandas will always double memory.
Note that self_destruct=True is not guaranteed to save memory. Since the conversion happens column by column, memory is also freed column by column. But if multiple columns share an underlying buffer, then no memory will be freed until all of those columns are converted. In particular, due to implementation details, data that comes from IPC or Flight is prone to this, as memory will be laid out as follows:
Record Batch 0: Allocation 0: array 0 chunk 0, array 1 chunk 0, ...
Record Batch 1: Allocation 1: array 0 chunk 1, array 1 chunk 1, ...
...
In this case, no memory can be freed until the entire table is converted, even with self_destruct=True.
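A minimal sketch of this situation, writing a small stream with the pyarrow.ipc APIs and reading it back; the data is illustrative:
import pyarrow as pa

batch = pa.record_batch(
    [pa.array([1, 2, 3]), pa.array([4.0, 5.0, 6.0])],
    names=["a", "b"],
)

sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, batch.schema) as writer:
    writer.write_batch(batch)
buf = sink.getvalue()

# Every column of this table is a view into the single buffer `buf`,
# so self_destruct=True cannot release memory until all columns are converted
table = pa.ipc.open_stream(buf).read_all()
df = table.to_pandas(split_blocks=True, self_destruct=True)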