.. Licensed to the Apache Software Foundation (ASF) under one
.. or more contributor license agreements.  See the NOTICE file
.. distributed with this work for additional information
.. regarding copyright ownership.  The ASF licenses this file
.. to you under the Apache License, Version 2.0 (the
.. "License"); you may not use this file except in compliance
.. with the License.  You may obtain a copy of the License at

..   http://www.apache.org/licenses/LICENSE-2.0

.. Unless required by applicable law or agreed to in writing,
.. software distributed under the License is distributed on an
.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
.. KIND, either express or implied.  See the License for the
.. specific language governing permissions and limitations
.. under the License.

.. _pandas_interop:

Pandas Integration
==================

To interface with `pandas <https://pandas.pydata.org/>`_, PyArrow provides
various conversion routines to consume pandas structures and convert back
to them.

.. note::

   While pandas uses NumPy as a backend, it has enough peculiarities
   (such as a different type system, and support for null values) that this
   is a separate topic from :ref:`numpy_interop`.

To follow examples in this document, make sure to run:

.. ipython:: python

   import pandas as pd
   import pyarrow as pa

DataFrames
----------

The equivalent to a pandas DataFrame in Arrow is a :ref:`Table <data.table>`.
Both consist of a set of named columns of equal length. While pandas only
supports flat columns, the Table also provides nested columns, thus it can
represent more data than a DataFrame, so a full conversion is not always
possible.

Conversion from a Table to a DataFrame is done by calling
:meth:`pyarrow.Table.to_pandas`. The inverse is then achieved by using
:meth:`pyarrow.Table.from_pandas`.

.. code-block:: python

   import pyarrow as pa
   import pandas as pd

   df = pd.DataFrame({"a": [1, 2, 3]})

   # Convert from pandas to Arrow
   table = pa.Table.from_pandas(df)

   # Convert back to pandas
   df_new = table.to_pandas()

   # Infer Arrow schema from pandas
   schema = pa.Schema.from_pandas(df)

By default ``pyarrow`` tries to preserve and restore the ``.index``
data as accurately as possible. See the section below for more about
this, and how to disable this logic.

Series
------

In Arrow, the most similar structure to a pandas Series is an Array.
It is a vector that contains data of the same type as linear memory. You
can convert a pandas Series to an Arrow Array using
:meth:`pyarrow.Array.from_pandas`. As Arrow Arrays are always nullable,
you can supply an optional mask using the ``mask`` parameter to mark all
null entries; a short example follows the next section.

Handling pandas Indexes
-----------------------

Methods like :meth:`pyarrow.Table.from_pandas` have a
``preserve_index`` option which defines how to preserve (store) or not
to preserve (to not store) the data in the ``index`` member of the
corresponding pandas object. This data is tracked using schema-level
metadata in the internal ``arrow::Schema`` object.

The default of ``preserve_index`` is ``None``, which behaves as
follows:

* ``RangeIndex`` is stored as metadata-only, not requiring any extra
  storage.
* Other index types are stored as one or more physical data columns in
  the resulting :class:`Table`

To not store the index at all pass ``preserve_index=False``. Since
storing a ``RangeIndex`` can cause issues in some limited scenarios
(such as storing multiple DataFrame objects in a Parquet file), to
force all index data to be serialized in the resulting table, pass
``preserve_index=True``.
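The following sketch illustrates both the ``mask`` parameter described in the
Series section and the ``preserve_index`` options. It uses only the public
APIs named above; the ``__index_level_0__`` column name shown in the comments
is an internal naming convention and may change between versions:

.. code-block:: python

   import numpy as np
   import pandas as pd
   import pyarrow as pa

   # mask=True marks an entry as null, regardless of the value underneath
   s = pd.Series([1.0, 2.0, 3.0])
   arr = pa.Array.from_pandas(s, mask=np.array([False, True, False]))
   arr.null_count  # 1

   # A non-default index is stored as an extra physical column ...
   df = pd.DataFrame({"a": [1, 2, 3]}, index=["x", "y", "z"])
   pa.Table.from_pandas(df).column_names
   # ['a', '__index_level_0__']

   # ... unless we ask for the index not to be stored at all
   pa.Table.from_pandas(df, preserve_index=False).column_names
   # ['a']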
Type differences
----------------

With the current design of pandas and Arrow, it is not possible to convert all
column types unmodified. One of the main issues here is that pandas has no
support for nullable columns of arbitrary type. Also ``datetime64`` is
currently fixed to nanosecond resolution. On the other side, Arrow might
still be missing support for some types.

pandas -> Arrow Conversion
~~~~~~~~~~~~~~~~~~~~~~~~~~

+------------------------+--------------------------+
| Source Type (pandas)   | Destination Type (Arrow) |
+========================+==========================+
| ``bool``               | ``BOOL``                 |
+------------------------+--------------------------+
| ``(u)int{8,16,32,64}`` | ``(U)INT{8,16,32,64}``   |
+------------------------+--------------------------+
| ``float32``            | ``FLOAT``                |
+------------------------+--------------------------+
| ``float64``            | ``DOUBLE``               |
+------------------------+--------------------------+
| ``str`` / ``unicode``  | ``STRING``               |
+------------------------+--------------------------+
| ``pd.Categorical``     | ``DICTIONARY``           |
+------------------------+--------------------------+
| ``pd.Timestamp``       | ``TIMESTAMP(unit=ns)``   |
+------------------------+--------------------------+
| ``datetime.date``      | ``DATE``                 |
+------------------------+--------------------------+

Arrow -> pandas Conversion
~~~~~~~~~~~~~~~~~~~~~~~~~~

+-------------------------------------+--------------------------------------------------------+
| Source Type (Arrow)                 | Destination Type (pandas)                              |
+=====================================+========================================================+
| ``BOOL``                            | ``bool``                                               |
+-------------------------------------+--------------------------------------------------------+
| ``BOOL`` *with nulls*               | ``object`` (with values ``True``, ``False``, ``None``) |
+-------------------------------------+--------------------------------------------------------+
| ``(U)INT{8,16,32,64}``              | ``(u)int{8,16,32,64}``                                 |
+-------------------------------------+--------------------------------------------------------+
| ``(U)INT{8,16,32,64}`` *with nulls* | ``float64``                                            |
+-------------------------------------+--------------------------------------------------------+
| ``FLOAT``                           | ``float32``                                            |
+-------------------------------------+--------------------------------------------------------+
| ``DOUBLE``                          | ``float64``                                            |
+-------------------------------------+--------------------------------------------------------+
| ``STRING``                          | ``str``                                                |
+-------------------------------------+--------------------------------------------------------+
| ``DICTIONARY``                      | ``pd.Categorical``                                     |
+-------------------------------------+--------------------------------------------------------+
| ``TIMESTAMP(unit=*)``               | ``pd.Timestamp`` (``np.datetime64[ns]``)               |
+-------------------------------------+--------------------------------------------------------+
| ``DATE``                            | ``object`` (with ``datetime.date`` objects)            |
+-------------------------------------+--------------------------------------------------------+
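The *with nulls* rows of the table above can be observed directly. A minimal
sketch, using only the conversion APIs covered in this document:

.. code-block:: python

   import pyarrow as pa

   table = pa.Table.from_arrays(
       [pa.array([1, None, 3], type=pa.int64()),
        pa.array([True, None, False], type=pa.bool_())],
       names=["ints", "bools"],
   )

   df = table.to_pandas()
   df.dtypes
   # ints     float64   (nulls become NaN, forcing a float column)
   # bools     object   (values are True, None, False)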
Categorical types
~~~~~~~~~~~~~~~~~

TODO

Datetime (Timestamp) types
~~~~~~~~~~~~~~~~~~~~~~~~~~

TODO

Date types
~~~~~~~~~~

While dates can be handled using the ``datetime64[ns]`` type in
pandas, some systems work with object arrays of Python's built-in
``datetime.date`` object:

.. ipython:: python

   from datetime import date
   s = pd.Series([date(2018, 12, 31), None, date(2000, 1, 1)])
   s

When converting to an Arrow array, the ``date32`` type will be used by
default:

.. ipython:: python

   arr = pa.array(s)
   arr.type
   arr[0]

To use the 64-bit ``date64``, specify this explicitly:

.. ipython:: python

   arr = pa.array(s, type='date64')
   arr.type

When converting back with ``to_pandas``, object arrays of
``datetime.date`` objects are returned:

.. ipython:: python

   arr.to_pandas()

If you want to use NumPy's ``datetime64`` dtype instead, pass
``date_as_object=False``:

.. ipython:: python

   s2 = pd.Series(arr.to_pandas(date_as_object=False))
   s2.dtype

.. warning::

   As of Arrow ``0.13`` the parameter ``date_as_object`` is ``True`` by
   default. Older versions must pass ``date_as_object=True`` to obtain this
   behavior.

Time types
~~~~~~~~~~

TODO

Memory Usage and Zero Copy
--------------------------

When converting from Arrow data structures to pandas objects using various
``to_pandas`` methods, one must occasionally be mindful of issues related to
performance and memory usage.

Since pandas's internal data representation is generally different from the
Arrow columnar format, zero copy conversions (where no memory allocation or
computation is required) are only possible in certain limited cases.

In the worst case scenario, calling ``to_pandas`` will result in two versions
of the data in memory, one for Arrow and one for pandas, yielding
approximately twice the memory footprint. We have implemented some mitigations
for this case, particularly when creating large ``DataFrame`` objects, that
we describe below.

Zero Copy Series Conversions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Zero copy conversions from ``Array`` or ``ChunkedArray`` to NumPy arrays or
pandas Series are possible in certain narrow cases:

* The Arrow data is stored in an integer (signed or unsigned ``int8`` through
  ``int64``) or floating point type (``float16`` through ``float64``). This
  includes many numeric types as well as timestamps.
* The Arrow data has no null values (since these are represented using bitmaps
  which are not supported by pandas).
* For ``ChunkedArray``, the data consists of a single chunk,
  i.e. ``arr.num_chunks == 1``. Multiple chunks will always require a copy
  because of pandas's contiguousness requirement.

In these scenarios, ``to_pandas`` or ``to_numpy`` will be zero copy. In all
other scenarios, a copy will be required.
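Whether a given array qualifies can be tested with the ``zero_copy_only``
argument of ``to_numpy``, which raises an error rather than silently copying.
A short sketch:

.. code-block:: python

   import pyarrow as pa

   # Numeric type, no nulls, single chunk: zero copy is possible
   ok = pa.array([1, 2, 3], type=pa.int64())
   ok.to_numpy(zero_copy_only=True)  # succeeds without copying

   # The presence of a null forces a copy
   with_null = pa.array([1, None, 3], type=pa.int64())
   try:
       with_null.to_numpy(zero_copy_only=True)
   except pa.ArrowInvalid:
       print("a copy would be required for this data")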
Reducing Memory Use in ``Table.to_pandas``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

As of this writing, pandas applies a data management strategy called
"consolidation" to collect like-typed DataFrame columns in two-dimensional
NumPy arrays, referred to internally as "blocks". We have gone to great effort
to construct the precise "consolidated" blocks so that pandas will not perform
any further allocation or copies after we hand off the data to
``pandas.DataFrame``. The obvious downside of this consolidation strategy is
that it forces a "memory doubling".

To try to limit the potential effects of "memory doubling" during
``Table.to_pandas``, we provide a couple of options:

* ``split_blocks=True``, when enabled ``Table.to_pandas`` produces one
  internal DataFrame "block" for each column, skipping the "consolidation"
  step. Note that many pandas operations will trigger consolidation anyway,
  but the peak memory use may be less than the worst case scenario of a full
  memory doubling. As a result of this option, we are able to do zero copy
  conversions of columns in the same cases where we can do zero copy with
  ``Array`` and ``ChunkedArray``.
* ``self_destruct=True``, this destroys the internal Arrow memory buffers in
  each ``Table`` column as it is converted to the pandas-compatible
  representation, potentially releasing memory to the operating system as soon
  as a column is converted. Note that this renders the calling ``Table``
  object unsafe for further use, and any further methods called will cause
  your Python process to crash.

Used together, the call

.. code-block:: python

   df = table.to_pandas(split_blocks=True, self_destruct=True)
   del table  # not necessary, but a good practice

will yield significantly lower memory usage in some scenarios. Without these
options, ``to_pandas`` will always double memory.
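To observe the effect of these options, one can watch Arrow's default memory
pool via :func:`pyarrow.total_allocated_bytes` before and after the hand-off.
A hedged sketch; the reported numbers depend on the platform and on whether
the columns qualify for zero copy (buffers that are handed over to pandas
remain allocated):

.. code-block:: python

   import numpy as np
   import pyarrow as pa

   table = pa.Table.from_arrays([pa.array(np.random.randn(1_000_000))],
                                names=["x"])
   print("allocated before:", pa.total_allocated_bytes())

   # Hand each column to pandas and release its Arrow buffers eagerly;
   # `table` must not be touched again after this call.
   df = table.to_pandas(split_blocks=True, self_destruct=True)
   del table  # not necessary, but a good practice

   print("allocated after:", pa.total_allocated_bytes())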