pyarrow.record_batch

pyarrow.record_batch(data, names=None, schema=None, metadata=None)

Create a pyarrow.RecordBatch from another Python data structure or sequence of arrays.

Parameters
data : pandas.DataFrame, list

A DataFrame or list of arrays or chunked arrays.

names : list, default None

Column names, used when a list of arrays is passed as data. Mutually exclusive with the ‘schema’ argument.

schema : Schema, default None

The expected schema of the RecordBatch. If not passed, it is inferred from the data. Mutually exclusive with the ‘names’ argument.

metadata : dict or Mapping, default None

Optional metadata for the schema. Only used if schema is not passed.

Returns
RecordBatch

Examples

>>> import pyarrow as pa
>>> n_legs = pa.array([2, 2, 4, 4, 5, 100])
>>> animals = pa.array(["Flamingo", "Parrot", "Dog", "Horse", "Brittle stars", "Centipede"])
>>> names = ["n_legs", "animals"]

Creating a RecordBatch from a list of arrays with names:

>>> pa.record_batch([n_legs, animals], names=names)
pyarrow.RecordBatch
n_legs: int64
animals: string
>>> pa.record_batch([n_legs, animals], names=["n_legs", "animals"]).to_pandas()
   n_legs        animals
0       2       Flamingo
1       2         Parrot
2       4            Dog
3       4          Horse
4       5  Brittle stars
5     100      Centipede

Creating a RecordBatch from a list of arrays with names and metadata:

>>> my_metadata = {"n_legs": "How many legs does an animal have?"}
>>> pa.record_batch([n_legs, animals],
...                 names=names,
...                 metadata=my_metadata)
pyarrow.RecordBatch
n_legs: int64
animals: string
>>> pa.record_batch([n_legs, animals],
...                 names=names,
...                 metadata=my_metadata).schema
n_legs: int64
animals: string
-- schema metadata --
n_legs: 'How many legs does an animal have?'

Creating a RecordBatch from a pandas DataFrame:

>>> import pandas as pd
>>> df = pd.DataFrame({'year': [2020, 2022, 2021, 2022],
...                    'month': [3, 5, 7, 9],
...                    'day': [1, 5, 9, 13],
...                    'n_legs': [2, 4, 5, 100],
...                    'animals': ["Flamingo", "Horse", "Brittle stars", "Centipede"]})
>>> pa.record_batch(df)
pyarrow.RecordBatch
year: int64
month: int64
day: int64
n_legs: int64
animals: string
>>> pa.record_batch(df).to_pandas()
   year  month  day  n_legs        animals
0  2020      3    1       2       Flamingo
1  2022      5    5       4          Horse
2  2021      7    9       5  Brittle stars
3  2022      9   13     100      Centipede

Creating a RecordBatch from a pandas DataFrame with schema:

>>> my_schema = pa.schema([
...     pa.field('n_legs', pa.int64()),
...     pa.field('animals', pa.string())],
...     metadata={"n_legs": "Number of legs per animal"})
>>> pa.record_batch(df, schema=my_schema).schema
n_legs: int64
animals: string
-- schema metadata --
n_legs: 'Number of legs per animal'
pandas: ...
>>> pa.record_batch(df, schema=my_schema).to_pandas()
   n_legs        animals
0       2       Flamingo
1       4          Horse
2       5  Brittle stars
3     100      Centipede